MolmoPoint: Better Pointing for VLMs with Grounding Tokens

Jun 5, 2026

Christopher Clark

Yue Yang

Jae Sung Park

Zixian Ma

Jieyu Zhang

Rohun Tripathi

Mohammadreza Salehi

Sangho Lee

Taira Anderson

Winson Han

Ranjay Krishna



Abstract

Grounding has become a fundamental capability of vision-language models (VLMs). Most existing VLMs point by generating coordinates as part of their text output, which requires learning a complicated coordinate system and results in a high token count. Instead, we propose a more intuitive approach using grounding tokens that directly select visual tokens from the input video or image. MolmoPoint scores coarse-grained image patches using the LLM’s hidden states, then scores fine-grained subpatches from the highest-scoring patch using ViT image features, and then selects a point within the highest-scoring subpatch. We show that this approach is more efficient and leads to better performance on diverse pointing, counting, and tracking benchmarks across single-image, multi-image, and video tasks.
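
The coarse-to-fine selection described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `coarse_to_fine_point`, the use of plain dot products as scores, the reuse of a single query vector at both stages, the row-major patch layout, and the choice of the subpatch center as the final point are all assumptions made for illustration only.

```python
def dot(a, b):
    """Dot product of two equal-length sequences of numbers."""
    return sum(x * y for x, y in zip(a, b))


def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def coarse_to_fine_point(query, patch_states, subpatch_feats,
                         grid_w, patch_px, sub_grid):
    """Hypothetical sketch of three-step point selection.

    query:          vector standing in for the grounding token's hidden state
    patch_states:   one vector per coarse patch (stand-in for LLM hidden states),
                    assumed to be laid out row-major on a grid of width grid_w
    subpatch_feats: per patch, one vector per subpatch (stand-in for ViT
                    features), laid out row-major on a sub_grid x sub_grid grid
    patch_px:       side length of a coarse patch in pixels
    """
    # Step 1: score every coarse patch and keep the highest-scoring one.
    patch_scores = [dot(state, query) for state in patch_states]
    best_patch = argmax(patch_scores)

    # Step 2: score only the subpatches inside the winning patch.
    sub_scores = [dot(feat, query) for feat in subpatch_feats[best_patch]]
    best_sub = argmax(sub_scores)

    # Step 3: pick a point inside the winning subpatch.
    # Assumption: the center is used here as a placeholder for whatever
    # within-subpatch selection the real model performs.
    patch_row, patch_col = divmod(best_patch, grid_w)
    sub_row, sub_col = divmod(best_sub, sub_grid)
    sub_px = patch_px / sub_grid
    x = patch_col * patch_px + (sub_col + 0.5) * sub_px
    y = patch_row * patch_px + (sub_row + 0.5) * sub_px
    return best_patch, best_sub, (x, y)
```

Under these assumptions, a point is identified by a patch index, a subpatch index, and a location inside that subpatch, rather than by coordinate digits emitted as text, which is one way to read the abstract's claim about a lower token count.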

Type

Preprint

Publication

arXiv 2026

Last updated on Jun 5, 2026

Multimodal Vision-Language Models Grounding Pointing Interpretability MolmoPoint
