MolmoB0T: Large-Scale Simulation Enables Zero-Shot Manipulation

Jun 4, 2026

Abhay Deshpande

Maya Guru

Rose Hendrix

Snehal Jauhri

Ainaz Eftekhar

Rohun Tripathi

Max Argus

Jordi Salvador

Haoquan Fang

Matthew Wallingford

Wilbert Pumacay

Yejin Kim

Quinn Pfeifer

Ying-Chun Lee

Piper Wolters

Omar Rayyan

Mingtong Zhang

Jiafei Duan

Karen Farley

Winson Han

Eli Vanderbilt

Dieter Fox

Ali Farhadi

Georgia Chalvatzaki

Dhruv Shah

Ranjay Krishna

Abstract

Procedural environment generation and large-scale simulation have shown promise for training robust robotic policies, but zero-shot transfer of these policies to diverse real-world tasks remains a challenge. We introduce MolmoB0T, a suite of general-purpose manipulation policies trained in simulation on a dataset of 2.5 million expert trajectories. To ensure broad coverage of manipulation tasks, we draw on more than 10,000 articulated objects and 500 procedurally generated environments. We train three policy classes: MolmoBot, a Molmo2-based multi-frame vision-language model with a flow-matching action head; MolmoBot-Pi0, which replicates the pi_0 architecture to enable direct comparison; and MolmoBot-SPOC, a lightweight policy suited to edge deployment and amenable to RL fine-tuning. We evaluate on two robotic platforms: the Franka FR3 for tabletop manipulation, and the Rainbow Robotics RB-Y1 mobile manipulator for door opening, drawer manipulation, cabinet interaction, and mobile pick-and-place. Without any real-world fine-tuning, our policies transfer zero-shot to unseen objects and environments. On tabletop pick-and-place, MolmoBot achieves a 79.2% success rate in real-world evaluations across 4 settings, compared with 39.2% for pi_0.5. Our results demonstrate that procedural environment generation combined with diverse articulated assets can produce robust manipulation policies that generalize broadly to the real world.
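For readers unfamiliar with the term, a flow-matching action head generally produces an action by starting from noise and integrating a learned velocity field over a short time interval. The sketch below illustrates only that general sampling idea and is not the authors' implementation: `toy_velocity`, `sample_action`, the 3-D `target`, and the step count are all hypothetical, and the toy field stands in for a network that would in practice be conditioned on images and language.

```python
import random

def toy_velocity(action, t, target):
    """Hypothetical stand-in for a learned velocity network.

    On a straight-line path from noise to `target`, the velocity that
    carries the current point to the target by t = 1 is
    (target - action) / (1 - t).
    """
    return [(g - a) / (1.0 - t) for a, g in zip(action, target)]

def sample_action(velocity_fn, dim, num_steps=10, seed=0):
    """Integrate da/dt = v(a, t) from t = 0 (noise) toward t = 1 with Euler steps."""
    rng = random.Random(seed)
    # Start from Gaussian noise, as flow-matching samplers typically do.
    action = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays below 1, so the toy field never divides by zero
        v = velocity_fn(action, t)
        action = [a + dt * vi for a, vi in zip(action, v)]
    return action

# Made-up 3-D action used purely for illustration.
target = [0.1, -0.2, 0.3]
result = sample_action(lambda a, t: toy_velocity(a, t, target), dim=len(target))
```

With this toy field the final Euler step lands on `target` up to floating-point error, which makes the sketch easy to check; a trained velocity network would only approximate such a path.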

Type

Preprint

Publication

ICRA 2026 SDRL Workshop

Last updated on Jun 4, 2026

Robotics, Manipulation, Vision-Language Models, Simulation, Zero-Shot Transfer, MolmoBot
