MolmoPoint: Better Pointing for VLMs with Grounding Tokens

Jun 5, 2026 · Christopher Clark, Yue Yang, Jae Sung Park, Zixian Ma, Jieyu Zhang, Rohun Tripathi, Mohammadreza Salehi, Sangho Lee, Taira Anderson, Winson Han, Ranjay Krishna
Abstract
Grounding has become a fundamental capability of vision-language models (VLMs). Most existing VLMs point by generating coordinates as part of their text output, which requires learning a complicated coordinate system and results in a high token count. Instead, we propose a more intuitive approach using grounding tokens that directly select visual tokens from the input video or image. MolmoPoint scores coarse-grained image patches using the LLM’s hidden states, then scores fine-grained subpatches from the highest scoring patch using ViT image features, and then selects a point within the highest scoring subpatch. We show that this approach is more efficient and leads to better performance on diverse pointing, counting, and tracking benchmarks across single image, multi-image, and video tasks.
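The coarse-to-fine selection described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the dot-product scoring, the reuse of a single query vector for both stages, the choice of the subpatch centre as the final point, and every name below (`select_point`, `patch_states`, `subpatch_features`, and so on) are assumptions made purely for illustration. The abstract only states that patches are scored from LLM hidden states, subpatches from ViT image features, and that a point is then chosen inside the best subpatch.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product, used here as a stand-in scoring function (assumption)."""
    return sum(x * y for x, y in zip(a, b))


def argmax(scores: Sequence[float]) -> int:
    """Index of the highest score (first index wins on ties)."""
    return max(range(len(scores)), key=scores.__getitem__)


def select_point(
    query: Vector,
    patch_states: List[Vector],
    subpatch_features: List[List[Vector]],
    patch_grid_width: int,
    patch_size: float,
    subpatches_per_side: int,
) -> Tuple[int, int, Tuple[float, float]]:
    """Hypothetical coarse-to-fine point selection.

    query:             vector associated with a grounding token (assumed)
    patch_states:      one vector per coarse patch, row-major over the image;
                       stands in for LLM hidden states
    subpatch_features: for each patch, one vector per subpatch, row-major
                       within the patch; stands in for ViT image features
    Returns (patch index, subpatch index, (x, y) in pixels).
    """
    # Stage 1: score every coarse patch and keep the best one.
    patch_idx = argmax([dot(query, h) for h in patch_states])

    # Stage 2: score only the subpatches inside the winning patch.
    sub_idx = argmax([dot(query, f) for f in subpatch_features[patch_idx]])

    # Stage 3: pick a point inside the winning subpatch.
    # Using the subpatch centre is an assumption of this sketch.
    patch_row, patch_col = divmod(patch_idx, patch_grid_width)
    sub_row, sub_col = divmod(sub_idx, subpatches_per_side)
    sub_size = patch_size / subpatches_per_side
    x = patch_col * patch_size + (sub_col + 0.5) * sub_size
    y = patch_row * patch_size + (sub_row + 0.5) * sub_size
    return patch_idx, sub_idx, (x, y)


if __name__ == "__main__":
    # Toy 2x2 grid of 28-pixel patches, each split into 2x2 subpatches.
    query = [1.0, 0.0]
    patch_states = [[0.1, 0.0], [0.9, 0.0], [0.3, 0.0], [0.2, 0.0]]
    zeros = [[0.0, 0.0]] * 4
    subpatch_features = [
        zeros,
        [[0.0, 0.0], [0.2, 0.0], [0.8, 0.0], [0.1, 0.0]],
        zeros,
        zeros,
    ]
    print(select_point(query, patch_states, subpatch_features, 2, 28.0, 2))
    # prints (1, 2, (35.0, 21.0))
```

In this toy setup a point costs two index selections rather than a string of coordinate digits, and the fine stage scores only the subpatches of one patch instead of every subpatch in the image.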
Type
Publication
arXiv 2026