AI agents struggle with everyday online tasks on live websites

New benchmark using 153 real-world tasks across 144 platforms shows even top models complete only one-third successfully.

Yuxuan Zhang · Yubo Wang · Yipeng Zhu · Penghui Du · Junwen Miao · Xuan Lu · +15 more
Research Digest · 2 min read
Zhang et al. · AI-generated illustration · Zotpaper
Zhang et al. created ClawBench, a benchmark testing AI agents on 153 everyday online tasks like booking appointments and submitting applications across 144 live websites. Even the best model, Claude Sonnet 4.6, completed only 33.3% of tasks, revealing significant gaps in current AI capabilities for real-world web automation.

What they did

The researchers developed ClawBench, an evaluation framework testing AI agents on 153 common online tasks across 15 categories, including e-commerce purchases, appointment booking, and job applications. Unlike previous benchmarks using static offline pages, ClawBench operates on live production websites with a lightweight interception layer that captures and blocks final submissions to prevent real-world effects while preserving authentic complexity.
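
To make the interception idea concrete, here is a minimal sketch of how such a layer could work, assuming a Playwright-driven browser. The paper does not publish ClawBench's implementation, so the blocked verbs, URL pattern, and mock success response below are illustrative assumptions, not the benchmark's actual code.

```python
# Minimal sketch of a submission-blocking interception layer (assumed
# Playwright-based; not ClawBench's published implementation).
from playwright.sync_api import sync_playwright

BLOCKED_METHODS = {"POST", "PUT", "PATCH"}  # verbs that typically commit a submission
captured_submissions = []  # would-be submissions, recorded for later grading

def intercept(route):
    request = route.request
    if request.method in BLOCKED_METHODS:
        # Record the final payload, then answer with a fake success so the
        # agent sees a completed flow without any real-world effect.
        captured_submissions.append(
            {"url": request.url, "method": request.method, "body": request.post_data}
        )
        route.fulfill(status=200, content_type="application/json", body='{"ok": true}')
    else:
        route.continue_()  # reads pass through to the live site unchanged

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.route("**/*", intercept)  # apply the interceptor to all network traffic
    page.goto("https://example.com")  # an agent would drive the page from here
    browser.close()
```

In practice such a harness would need to be more selective: live sites issue POST requests for logins, search, and analytics long before the final submission, so blocking every write verb is a deliberate simplification here.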

They evaluated seven frontier models (both proprietary and open-source) on tasks spanning 144 platforms that required multi-step navigation, form completion, and extracting information from user-provided documents.
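
For concreteness, this is roughly the evaluation loop such a setup implies: run each model on each task and report a per-model completion rate. The Task shape, the stub function, and the model names are hypothetical stand-ins, not ClawBench's published API.

```python
# Hedged sketch of a benchmark evaluation loop; Task, run_agent_on_task,
# and the model names are hypothetical stand-ins, not ClawBench's API.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    category: str  # one of the 15 task categories (e-commerce, booking, ...)

def run_agent_on_task(model: str, task: Task) -> bool:
    """Hypothetical stub: drive the intercepted browser with `model` and
    grade the captured submission against the task's success criteria."""
    return False  # placeholder result

def completion_rates(models: list[str], tasks: list[Task]) -> dict[str, float]:
    results: dict[str, list[bool]] = defaultdict(list)
    for model in models:
        for task in tasks:
            results[model].append(run_agent_on_task(model, task))
    return {m: sum(r) / len(r) for m, r in results.items()}

# 153 tasks, one completion rate per model (33.3% would mean ~51 passes).
tasks = [Task(f"task-{i}", "e-commerce") for i in range(153)]
print(completion_rates(["model-a", "model-b"], tasks))
```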

Key findings

• Claude Sonnet 4.6, the top performer, achieved only 33.3% task completion
• All tested models, both proprietary and open-source, completed only a small fraction of tasks
• Tasks requiring document information extraction, multi-step workflows, and detailed form completion proved particularly challenging
• Real-world web complexity significantly exceeds what current AI agents can handle reliably

Why it matters

ClawBench exposes a substantial gap between AI agents' performance on controlled benchmarks versus real-world utility. The results indicate that despite advances in language models, reliable AI assistants for everyday web tasks remain distant. This benchmark provides a concrete target for measuring progress toward practically useful AI automation.

Caveats

The study doesn't detail which specific task categories proved most difficult or provide error analysis explaining failure modes. The interception approach, while safe, may not fully capture real-world complexity like dynamic page changes or authentication challenges that occur during actual task completion.


Analysis

ClawBench represents a significant shift toward evaluating AI agents in realistic conditions rather than sanitized environments. The stark performance gaps highlight that current language models, despite impressive capabilities in controlled settings, lack the robustness needed for reliable real-world automation. This work establishes a critical benchmark for the field's transition from research demonstrations to practical applications.
