New AI Benchmark Suggests AGI Is Not Even Close as Top Models Score Below One Percent
ARC-AGI-3 dropped the same week Jensen Huang declared AGI achieved, with Gemini scoring 0.37 percent and GPT-5.4 hitting 0.26 percent while humans score 100
ARC-AGI-3 tests the kind of abstract reasoning and novel problem-solving that current AI systems consistently struggle with. Unlike benchmarks that can be gamed through training data contamination or pattern matching, ARC tasks require genuine generalisation: the ability to understand a new problem and solve it without having seen similar examples.
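To make that concrete, the sketch below shows the general shape of an ARC-style task: a solver sees a handful of input/output grid pairs, must infer the hidden transformation, and then apply it to a fresh input. The grids and the "mirror" rule here are invented for illustration; they are not an actual ARC-AGI-3 puzzle.

```python
# Illustrative sketch of an ARC-style task (not a real ARC-AGI-3 puzzle).
# Each task presents a few input/output grid pairs; the solver must infer
# the hidden transformation and apply it to an input it has never seen.

def mirror_horizontally(grid):
    """The hidden rule for this toy task: flip each row left-to-right."""
    return [row[::-1] for row in grid]

# Training pairs the solver is shown (small integers stand in for colours).
train_pairs = [
    ([[1, 0], [0, 2]], [[0, 1], [2, 0]]),
    ([[3, 3, 0], [0, 1, 1]], [[0, 3, 3], [1, 1, 0]]),
]

# A correct solver generalises: its inferred rule must reproduce every
# training output, then solve a fresh test input on the first attempt.
assert all(mirror_horizontally(i) == o for i, o in train_pairs)

test_input = [[5, 0, 0], [0, 5, 0]]
print(mirror_horizontally(test_input))  # [[0, 0, 5], [0, 5, 0]]
```

The point of the format is that memorisation does not help: each task uses a different hidden rule, so a system that merely pattern-matches against training data scores near zero.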
The timing of the release was pointed. Jensen Huang's declaration that AGI had arrived was widely covered and debated. The ARC-AGI-3 results suggest the gap between current AI capabilities and human-level general intelligence remains enormous, at least on tasks that require flexible reasoning rather than pattern completion.
The sub-one-percent scores from frontier models like Gemini and GPT-5.4 are particularly notable because they come from the most capable systems available. The benchmark is not testing obscure edge cases but fundamental cognitive abilities that virtually any human can demonstrate.
Analysis
Why This Matters
The gap between AI industry marketing and measured capability has never been more visible. When the CEO of the world's most valuable company declares AGI achieved in the same week a rigorous benchmark shows frontier models scoring below one percent on general reasoning tasks, something in the narrative does not add up.
Background
ARC-AGI was designed by François Chollet, creator of the Keras deep learning library, specifically to measure progress toward general intelligence in a way that resists gaming. Previous versions showed steady improvement from AI systems, but version 3 appears to have raised the bar significantly.
What to Watch
Whether AI labs respond with targeted improvements on ARC-style reasoning, and whether the benchmark influences the public conversation about AGI timelines.