What they did
The researchers developed ClawBench, an evaluation framework testing AI agents on 153 common online tasks across 15 categories, including e-commerce purchases, appointment booking, and job applications. Unlike previous benchmarks using static offline pages, ClawBench operates on live production websites with a lightweight interception layer that captures and blocks final submissions to prevent real-world effects while preserving authentic complexity.
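To make the interception idea concrete, below is a minimal sketch (not the authors' implementation) of how such a layer could work, assuming a Playwright-driven browser agent. The target URL is a placeholder, and the blanket rule of aborting every state-changing request is an illustrative assumption; a real harness would presumably match only the final submission endpoints of each task.

```python
from playwright.sync_api import sync_playwright

captured_submissions = []

def intercept(route, request):
    # Block state-changing requests (the "final submission") but record their
    # payloads so task completion can still be judged without touching the site.
    if request.method in {"POST", "PUT", "DELETE"}:
        captured_submissions.append(
            {"url": request.url, "method": request.method, "body": request.post_data}
        )
        route.abort()
    else:
        # Let ordinary page traffic through so the site's real complexity is preserved.
        route.continue_()

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.route("**/*", intercept)              # route every network request through the hook
    page.goto("https://example.com/checkout")  # placeholder target, not a ClawBench task site
    # ... the agent would drive the page here; its final submit is captured, never sent ...
    browser.close()
```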
They evaluated 7 frontier models (both proprietary and open-source) across 144 different platforms, on tasks requiring multi-step navigation, form completion, and extraction of information from user documents.
Key findings
• Claude Sonnet 4.6, the top performer, achieved only 33.3% task completion
• All tested models, both proprietary and open-source, completed only a small fraction of tasks
• Tasks requiring document information extraction, multi-step workflows, and detailed form completion proved particularly challenging
• Real-world web complexity significantly exceeds what current AI agents can handle reliably
Why it matters
ClawBench exposes a substantial gap between AI agents' performance on controlled benchmarks and their real-world utility. The results indicate that despite advances in language models, reliable AI assistants for everyday web tasks remain distant. This benchmark provides a concrete target for measuring progress toward practically useful AI automation.
Caveats
The study doesn't detail which specific task categories proved most difficult, nor does it provide an error analysis explaining failure modes. The interception approach, while safe, may not fully capture real-world complexity such as dynamic page changes or authentication challenges that arise during actual task completion.