What they did
The authors constructed MolmoWebMix, a large training mixture that blends over 100K synthetic browser task trajectories generated through multiple complementary pipelines with more than 30K human-collected demonstrations, atomic web-skill trajectories, and GUI perception data (referring expression grounding and screenshot question answering). The synthetic data pipelines appear designed to cover diverse web environments and task types.
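To make the blend concrete, here is a minimal sketch of weighted sampling over the mixture's four source types. The mixture weights are invented for illustration; the paper's actual ratios are not given here.

```python
# Illustrative sketch of sampling from a blended training mixture.
# The source categories follow the summary above; the weights are
# ASSUMPTIONS for illustration, not the paper's actual ratios.
import random

MIXTURE = {
    "synthetic_trajectories": 0.55,  # >100K synthetic browser task trajectories
    "human_demonstrations":   0.20,  # >30K human-collected demos
    "atomic_web_skills":      0.15,  # atomic web-skill trajectories
    "gui_perception":         0.10,  # grounding + screenshot QA
}

def sample_source(rng: random.Random) -> str:
    """Pick a data source with probability proportional to its weight."""
    sources, weights = zip(*MIXTURE.items())
    return rng.choices(sources, weights=weights, k=1)[0]
```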
MolmoWeb agents are trained as instruction-conditioned visual-language action policies: at each step they receive a task description and a screenshot, then predict the next browser action (click coordinates, typing, scrolling, etc.). Crucially, they require no access to page HTML, accessibility trees, or browser-specific APIs; they operate purely from pixels. The authors trained 4B and 8B parameter versions and evaluated them on three established benchmarks: WebVoyager, Online-Mind2Web, and DeepShop.
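As a concrete illustration of this screenshot-only loop, here is a minimal sketch. The names (`Action`, `policy.predict`, `browser.capture_screenshot`, `browser.execute`) are hypothetical stand-ins, not the authors' actual interface.

```python
# Minimal sketch of a screenshot-only agent loop. All class and method
# names here are illustrative ASSUMPTIONS, not the authors' real API.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str               # "click", "type", "scroll", "stop", ...
    x: int | None = None    # click coordinates in screenshot pixels
    y: int | None = None
    text: str | None = None # text to type, if any

def run_episode(policy, browser, task: str, max_steps: int = 30) -> None:
    """Instruction-conditioned loop: screenshot in, browser action out.

    No HTML, accessibility tree, or browser API introspection is used;
    the policy conditions only on the task string and raw pixels.
    """
    for _ in range(max_steps):
        screenshot = browser.capture_screenshot()  # raw pixels only
        action: Action = policy.predict(task, screenshot)
        if action.kind == "stop":                  # policy signals completion
            break
        browser.execute(action)                    # click / type / scroll
```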
Key findings
- MolmoWeb-8B achieves state-of-the-art results among open-weight models on WebVoyager, Online-Mind2Web, and DeepShop, outperforming Fara-7B, UI-Tars-1.5-7B, and Holo1-7B.
- MolmoWeb-8B surpasses set-of-marks (SoM) agents built on GPT-4o, a much larger proprietary model, despite operating from raw screenshots alone.
- Test-time scaling via parallel rollouts with best-of-N selection yields large gains: pass@4 reaches 94.7% on WebVoyager (vs. 78.2% pass@1) and 60.5% on Online-Mind2Web (vs. 35.3% pass@1); see the sketch after this list.
- A 4B parameter variant is also provided, suggesting the approach scales down while remaining competitive.
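To make the best-of-N mechanism concrete, the sketch below runs N independent rollouts in parallel and keeps the one a judge scores highest. The rollout and scoring functions are passed in as hypothetical callables, since the selection mechanism the authors use is not detailed here.

```python
# Hedged sketch of best-of-N test-time scaling. `run_rollout` and
# `judge_score` are hypothetical stand-ins for whatever rollout and
# selection machinery the paper actually uses.
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable

def best_of_n(task: str,
              run_rollout: Callable[[str], Any],
              judge_score: Callable[[Any], float],
              n: int = 4) -> Any:
    """Return the highest-scoring of n independent attempts at `task`."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        # Each rollout is a full, independent episode on the same task.
        rollouts = list(pool.map(lambda _: run_rollout(task), range(n)))
    # pass@n succeeds if ANY rollout succeeds; best-of-n approximates that
    # by returning the rollout the judge rates most likely to be correct.
    return max(rollouts, key=judge_score)
```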
Why it matters
Web agents are a high-stakes application of multimodal AI, yet the field has been dominated by proprietary systems whose training data and methods are undisclosed. By releasing the full stack (data, model weights, training code, and a unified evaluation harness), the authors provide the research community with a reproducible baseline that is competitive with or superior to closed alternatives. The screenshot-only approach is also notable: it removes the dependency on fragile HTML parsing or accessibility tree extraction, making the agent more robust across diverse websites.
Caveats
The benchmarks used (WebVoyager, Online-Mind2Web, DeepShop) test specific task distributions that may not capture the full complexity of real-world web usage. The impressive pass@4 numbers require running four parallel rollouts and selecting the best, which roughly quadruples inference cost. The synthetic trajectory generation pipelines, while diverse, likely still have coverage gaps on unusual or adversarial web interfaces. Additionally, operating purely from screenshots means the agent cannot access information that is never visually rendered, which could limit performance on tasks requiring scrolling through long pages or interacting with non-visual page elements.