Grounding

MolmoPoint: Better Pointing for VLMs with Grounding Tokens

MolmoPoint is a new VLM architecture that enables more precise and efficient visual grounding by using special tokens to directly select from the model's internal visual representation instead of generating text coordinates.

Jun 5, 2026

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

Molmo2 is a new family of open-source video-language models that achieve state-of-the-art performance, particularly in video grounding tasks, through novel datasets and training methods that do not rely on proprietary models.

Jun 3, 2026