MolmoPoint: Better Pointing for VLMs with Grounding Tokens
Jun 5, 2026
Christopher Clark
Yue Yang
Jae Sung Park
Zixian Ma
Jieyu Zhang
Rohun Tripathi
Mohammadreza Salehi
Sangho Lee
Taira Anderson
Winson Han
Ranjay Krishna
Abstract
Grounding has become a fundamental capability of vision-language models (VLMs). Most existing VLMs point by generating coordinates as part of their text output, which requires learning a complicated coordinate system and results in a high token count. Instead, we propose a more intuitive approach using grounding tokens that directly select visual tokens from the input video or image. MolmoPoint scores coarse-grained image patches using the LLM’s hidden states, then scores fine-grained subpatches from the highest-scoring patch using ViT image features, and then selects a point within the highest-scoring subpatch. We show that this approach is more efficient and leads to better performance on diverse pointing, counting, and tracking benchmarks across single-image, multi-image, and video tasks.
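The coarse-to-fine selection described in the abstract can be illustrated with a minimal sketch. This is not the MolmoPoint implementation: the dot-product scoring, the function name `select_point`, all argument names and shapes, and the choice of the subpatch center as the final point are assumptions made only for illustration. The abstract states only that patches are scored from LLM hidden states, subpatches from ViT features, and a point is chosen within the best subpatch.

```python
import numpy as np


def select_point(query, patch_keys, subpatch_keys, grid_w, sub_w, patch_size):
    """Hypothetical coarse-to-fine point selection (illustrative only).

    query:         (d,)      assumed hidden state of a grounding token
    patch_keys:    (P, d)    assumed per-patch LLM hidden states
    subpatch_keys: (P, S, d) assumed per-subpatch ViT features, projected to d
    grid_w:        number of patches per image row
    sub_w:         number of subpatches per row inside one patch
    patch_size:    side length of a patch in pixels
    """
    # Stage 1: score every coarse patch and keep the best one.
    # Dot-product scoring is an assumption, not taken from the paper.
    patch_scores = patch_keys @ query
    p = int(np.argmax(patch_scores))

    # Stage 2: score only the subpatches inside the winning patch.
    sub_scores = subpatch_keys[p] @ query
    s = int(np.argmax(sub_scores))

    # Stage 3: pick a point inside the winning subpatch.
    # Using the subpatch center is a simplifying assumption.
    py, px = divmod(p, grid_w)
    sy, sx = divmod(s, sub_w)
    sub_size = patch_size / sub_w
    x = px * patch_size + (sx + 0.5) * sub_size
    y = py * patch_size + (sy + 0.5) * sub_size
    return p, s, (x, y)
```

In this sketch a point costs two index selections rather than a string of coordinate digits, which is one way to read the abstract's claim about token count. The actual scoring functions and point parameterization should be taken from the paper itself.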
Type
Publication
arXiv 2026