What they did
The authors constructed MolmoWebMix, a large training mixture that blends over 100K synthetic browser task trajectories generated through multiple complementary pipelines with more than 30K human-collected demonstrations, atomic web-skill trajectories, and GUI perception data (referring expression grounding and screenshot question answering). The synthetic data pipelines appear designed to cover diverse web environments and task types.
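To make the blend concrete, here is a minimal sketch of weighted sampling over the mixture's four source types. The mixture weights are invented for illustration; the paper's actual ratios are not given here.

```python
# Illustrative sketch of sampling from a blended training mixture.
# The source categories follow the summary above; the weights are
# ASSUMPTIONS for illustration, not the paper's actual ratios.
import random

MIXTURE = {
    "synthetic_trajectories": 0.55,  # >100K synthetic browser task trajectories
    "human_demonstrations":   0.20,  # >30K human-collected demos
    "atomic_web_skills":      0.15,  # atomic web-skill trajectories
    "gui_perception":         0.10,  # grounding + screenshot QA
}

def sample_source(rng: random.Random) -> str:
    """Pick a data source with probability proportional to its weight."""
    sources, weights = zip(*MIXTURE.items())
    return rng.choices(sources, weights=weights, k=1)[0]
```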
MolmoWeb agents are trained as instruction-conditioned visual-language action policies: at each step they receive a task description and a screenshot, then predict the next browser action (click coordinates, typing, scrolling, etc.). Crucially, they require no access to page HTML, accessibility trees, or browser-specific APIs; they operate purely from pixels. The authors trained 4B and 8B parameter versions and evaluated them on three established benchmarks: WebVoyager, Online-Mind2Web, and DeepShop.
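As a concrete illustration of this screenshot-only loop, here is a minimal sketch. The names (`Action`, `policy.predict`, `browser.capture_screenshot`, `browser.execute`) are hypothetical stand-ins, not the authors' actual interface.

```python
# Minimal sketch of a screenshot-only agent loop. All class and method
# names here are illustrative ASSUMPTIONS, not the authors' real API.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str               # "click", "type", "scroll", "stop", ...
    x: int | None = None    # click coordinates in screenshot pixels
    y: int | None = None
    text: str | None = None # text to type, if any

def run_episode(policy, browser, task: str, max_steps: int = 30) -> None:
    """Instruction-conditioned loop: screenshot in, browser action out.

    No HTML, accessibility tree, or browser API introspection is used;
    the policy conditions only on the task string and raw pixels.
    """
    for _ in range(max_steps):
        screenshot = browser.capture_screenshot()  # raw pixels only
        action: Action = policy.predict(task, screenshot)
        if action.kind == "stop":                  # policy signals completion
            break
        browser.execute(action)                    # click / type / scroll
```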
Key findings
- MolmoWeb-8B achieves state-of-the-art results among open-weight models on WebVoyager, Online-Mind2Web, and DeepShop, outperforming Fara-7B, UI-Tars-1.5-7B, and Holo1-7B.
- MolmoWeb-8B surpasses set-of-marks (SoM) agents built on GPT-4o, a much larger proprietary model, despite operating from raw screenshots alone.
- Test-time scaling via parallel rollouts with best-of-N selection yields large gains: pass@4 reaches 94.7% on WebVoyager (vs. 78.2% pass@1) and 60.5% on Online-Mind2Web (vs. 35.3% pass@1); see the sketch after this list.
- A 4B parameter variant is also provided, suggesting the approach scales down while remaining competitive.
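To make the best-of-N mechanism concrete, the sketch below runs N independent rollouts in parallel and keeps the one a judge scores highest. The rollout and scoring functions are passed in as hypothetical callables, since the selection mechanism the authors use is not detailed here.

```python
# Hedged sketch of best-of-N test-time scaling. `run_rollout` and
# `judge_score` are hypothetical stand-ins for whatever rollout and
# selection machinery the paper actually uses.
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable

def best_of_n(task: str,
              run_rollout: Callable[[str], Any],
              judge_score: Callable[[Any], float],
              n: int = 4) -> Any:
    """Return the highest-scoring of n independent attempts at `task`."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        # Each rollout is a full, independent episode on the same task.
        rollouts = list(pool.map(lambda _: run_rollout(task), range(n)))
    # pass@n succeeds if ANY rollout succeeds; best-of-n approximates that
    # by returning the rollout the judge rates most likely to be correct.
    return max(rollouts, key=judge_score)
```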
Why it matters
Web agents are a high-stakes application of multimodal AI, yet the field has been dominated by proprietary systems whose training data and methods are undisclosed. By releasing the full stack (data, model weights, training code, and a unified evaluation harness), the authors provide the research community with a reproducible baseline that is competitive with or superior to closed alternatives. The screenshot-only approach is also notable: it removes the dependency on fragile HTML parsing or accessibility tree extraction, making the agent more robust across diverse websites.
Caveats
The benchmarks used (WebVoyager, Online-Mind2Web, DeepShop) test specific task distributions that may not capture the full complexity of real-world web usage. The impressive pass@4 numbers require running four parallel rollouts and selecting the best, which roughly quadruples inference cost. The synthetic trajectory generation pipelines, while diverse, likely still have coverage gaps on unusual or adversarial web interfaces. Additionally, operating purely from screenshots means the agent cannot access information that is never visually rendered, which could limit performance on tasks requiring scrolling through long pages or interacting with non-visual page elements.