A cluster of new research frameworks published this week reveals that AI agents, despite rapid progress, still struggle to complete complex professional tasks autonomously — and that the most productive path forward may be collaboration with humans rather than full automation.
Researchers across multiple institutions have released three independent evaluation frameworks that together show where AI agents succeed, where they fall short, and how their performance changes when they work alongside human professionals.
CollabSkill: Humans and AI Together
A team from Stanford and Carnegie Mellon introduced CollabSkill, a framework designed to evaluate AI agents not in isolation, but in partnership with real human workers. The study draws on more than 1,500 prompts collected across 386 working sessions with 93 human participants, each matched to tasks aligned with their professional background.
The results challenge assumptions baked into conventional AI benchmarks. Claude Code ranked first among agents tested in the collaborative setting — a notable divergence from fully autonomous benchmarks, where OpenAI's Codex leads. The finding suggests that rankings produced by standard leaderboards may not predict which AI tools are most useful to real workers.
On the human side, the study found that practical, hands-on experience with AI tools — rather than formal education or general literacy — was the primary driver of collaboration skill. Workers who actively engaged with AI agents improved their AI literacy over the course of the study, pointing to collaboration itself as a form of professional development.
Workflow-GYM: Professional Software Remains a Barrier
A separate team introduced Workflow-GYM, a benchmark focused on long-horizon tasks in professional software environments — tools used in specialised fields rather than general-purpose applications. The results were sobering: even the strongest models tested achieved success rates only slightly above 30%.
Researchers identified several recurring failure modes: agents frequently skipped workflow stages, allowed errors to cascade through multi-step processes, drifted from their original objectives, and demonstrated insufficient understanding of domain-specific software interfaces. The findings suggest that AI agents optimised for general web browsing or simple task completion are not yet ready for the end-to-end automation of economically significant professional workflows.
PathoSage: Tackling Conflicting Evidence in Pathology
A third study addressed a specific and high-stakes domain: computational pathology. The PathoSage framework targets a known weakness in multimodal AI systems: their tendency to hallucinate morphological features or make unreliable decisions when faced with conflicting evidence from multiple sources.
PathoSage separates the processes of knowledge retrieval, evidence collection, and final judgment into distinct stages, preventing earlier outputs from contaminating later reasoning. The system also incorporates a reliability-tracking mechanism that adjusts how much weight it gives to individual tools based on their historical performance. In testing, PathoSage outperformed both standard pathology AI models and existing agentic systems on visual question-answering tasks, reducing hallucinations and handling disagreement between classifiers more robustly.
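To make the reliability-tracking idea concrete, here is a minimal sketch of one way such weighting could work. It is not PathoSage's actual implementation, which the summary above does not detail. The class name `ReliabilityTracker`, the function `adjudicate`, the tool names, and the smoothed-accuracy weighting are all assumptions chosen for illustration.

```python
from collections import defaultdict


class ReliabilityTracker:
    """Hypothetical per-tool reliability record (not from the paper)."""

    def __init__(self):
        self.correct = defaultdict(int)
        self.total = defaultdict(int)

    def update(self, tool, was_correct):
        # Record one past outcome for this tool.
        self.total[tool] += 1
        if was_correct:
            self.correct[tool] += 1

    def weight(self, tool):
        # Smoothed historical accuracy; a tool with no history gets 0.5.
        return (self.correct[tool] + 1) / (self.total[tool] + 2)


def adjudicate(votes, tracker):
    """Pick the label whose supporting tools carry the most total weight.

    votes maps a tool name to the label that tool proposed.
    """
    scores = defaultdict(float)
    for tool, label in votes.items():
        scores[label] += tracker.weight(tool)
    return max(scores, key=scores.get)


# Usage: one historically reliable tool disagrees with two unreliable ones.
tracker = ReliabilityTracker()
for i in range(10):
    tracker.update("classifier_a", was_correct=i < 8)  # 8 of 10 correct
    tracker.update("classifier_b", was_correct=i < 2)  # 2 of 10 correct
    tracker.update("classifier_c", was_correct=i < 2)  # 2 of 10 correct

votes = {
    "classifier_a": "tumor",
    "classifier_b": "benign",
    "classifier_c": "benign",
}
print(adjudicate(votes, tracker))  # prints "tumor"
```

In this toy case a simple majority vote would return "benign", while the weighted vote sides with the single tool that has the stronger track record. That is the kind of behaviour a reliability-tracking mechanism is meant to produce when classifiers disagree.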
A Shared Message
Taken together, the three papers converge on a common theme: AI agents are advancing quickly, but standardised benchmarks may overstate their real-world readiness. Evaluation frameworks that reflect the messiness of professional environments (conflicting information, multi-step dependencies, human variability) reveal gaps that simpler tests obscure. The researchers collectively argue that the field needs richer, more realistic evaluation infrastructure to guide development of agents that genuinely augment human workers rather than merely simulate competence in controlled settings.
Analysis
Why This Matters
- These studies signal that widely used AI benchmarks may be misleading investors, employers, and policymakers about how ready AI agents actually are for professional deployment — with real consequences for workforce planning and automation investment.
- The CollabSkill finding that hands-on collaboration improves workers' AI literacy suggests organisations should design workflows that build human skill alongside AI capability, not just replace human effort.
- The success rate of just over 30% that even the strongest models reached on professional software tasks in Workflow-GYM sets a concrete, sobering baseline against which future model improvements can be measured.
Background
The past two years have seen a proliferation of AI "agent" systems — AI that can browse the web, write and execute code, operate software interfaces, and chain together multi-step tasks without continuous human input. Alongside these capabilities, a competitive ecosystem of benchmarks has emerged to rank agents against one another.
However, critics have long argued that popular benchmarks such as WebArena, OSWorld, and various coding leaderboards test narrow, artificial task variants that don't reflect the complexity of real occupational workflows. Agents trained or fine-tuned to perform well on these benchmarks may not transfer their capabilities to genuine professional environments involving domain-specific software, conflicting data, or human collaboration.
The medical AI field has faced a parallel challenge: multimodal models applied to pathology imaging have demonstrated high accuracy on curated datasets while exhibiting alarming hallucination rates in less controlled conditions. PathoSage joins a growing body of work attempting to build more reliable clinical AI by restructuring how evidence is handled internally, rather than simply scaling model size.
Key Perspectives
AI Researchers and Benchmark Designers: The authors of all three papers argue that the field has under-invested in realistic, human-centred evaluation. They present their frameworks as infrastructure contributions — tools the broader community can use to develop and compare agents more honestly.
Enterprise and Workforce Planners: Organisations considering AI-driven automation of professional workflows will find the Workflow-GYM results cautionary. A task success rate of just over 30% for the strongest models in professional software environments suggests that full automation of complex white-collar work remains a distant prospect, and that human oversight will remain essential for the foreseeable future.
Critics and Skeptics: Some researchers argue that benchmark proliferation itself is becoming a problem — each new framework risks being gamed or cherry-picked just as its predecessors were. Others note that studies drawing on relatively small samples of human workers (93 in CollabSkill's case) may not generalise across industries, skill levels, or cultural contexts. The gap between academic evaluation and real-world deployment conditions remains wide.
What to Watch
- Whether major AI labs adopt CollabSkill or Workflow-GYM as standard evaluation tools alongside their existing autonomous benchmarks — adoption would signal a genuine shift toward human-centred assessment.
- Replication studies that test CollabSkill's findings across larger and more diverse worker populations, particularly in non-English-speaking or non-Western professional contexts.
- Clinical trials or hospital pilots of PathoSage-style adjudication frameworks, which would be the next step toward real-world pathology deployment and regulatory scrutiny.