Video Understanding

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

Molmo2 is a new family of open-source video-language models that achieve state-of-the-art performance through novel datasets and training methods, without relying on proprietary models, and that particularly excel at video grounding tasks.

Jun 3, 2026

VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition

VideoNet is a large-scale domain-specific action recognition benchmark and training dataset with 1,000 distinct actions across 37 domains, designed to revitalize action recognition evaluation for modern vision-language models.

Jun 2, 2026