Monday 30 March 2026, Afternoon Edition

ZOTPAPER

News without the noise


AI & Machine Learning

New AI Benchmark Suggests AGI Is Not Even Close as Top Models Score Below One Percent

ARC-AGI-3 dropped the same week Jensen Huang declared AGI achieved, with Gemini scoring 0.37 percent, GPT-5.4 hitting 0.26 percent, and humans scoring 100

Zotpaper · 2 min read
The latest ARC-AGI benchmark, version 3, has delivered a stark counterpoint to industry claims about artificial general intelligence. Gemini scored 0.37 percent. GPT-5.4 managed 0.26 percent. Humans scored 100 percent. The benchmark dropped the same week Nvidia CEO Jensen Huang declared that AGI had been achieved.

ARC-AGI-3 tests the kind of abstract reasoning and novel problem-solving that current AI systems consistently struggle with. Unlike benchmarks that can be gamed through training data contamination or pattern matching, ARC tasks require genuine generalisation: the ability to understand a new problem and solve it without having seen similar examples.

The timing of the release was pointed. Jensen Huang's declaration that AGI had arrived was widely covered and debated. The ARC-AGI-3 results suggest the gap between current AI capabilities and human-level general intelligence remains enormous, at least on tasks that require flexible reasoning rather than pattern completion.

The sub-one-percent scores from frontier models like Gemini and GPT-5.4 are particularly notable because these represent the most capable systems available. The benchmark is not testing obscure edge cases but fundamental cognitive abilities that any human can demonstrate.

Analysis

Why This Matters

The gap between AI industry marketing and measured capability has never been more visible. When the CEO of the world's most valuable company declares AGI achieved in the same week a rigorous benchmark shows frontier models scoring below one percent on general reasoning tasks, something in the narrative does not add up.

Background

ARC-AGI was created by Francois Chollet, the author of Keras, specifically to measure progress toward general intelligence in a way that resists gaming. Previous versions showed steady improvement from AI systems, but version 3 appears to have raised the bar significantly.

What to Watch

Whether AI labs respond with targeted improvements on ARC-style reasoning, and whether the benchmark influences the public conversation about AGI timelines.
