Monday 30 March 2026, Afternoon Edition

ZOTPAPER

News without the noise


AI & Machine Learning

The Most Trusted AI Coding Benchmark Was Deeply Flawed All Along

OpenAI audit finds 59 per cent of hardest SWE-bench problems had broken answer keys while top models scored within one point of each other

Zotpaper · 2 min read
The benchmark that the entire AI industry relied on to measure coding ability has been exposed as fundamentally broken. An audit by OpenAI's Frontier Evals team found that 59.4 per cent of the hardest problems in SWE-bench Verified had flawed test cases, meaning models were being scored against incorrect answers.

SWE-bench Verified is considered the gold standard for evaluating how well AI can write code. It presents 500 real GitHub issues to AI models and checks whether their patches fix the bugs by running automated test suites.
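The audit's findings hinge on how that scoring works: a model's patch is applied to a checked-out copy of the repository and the issue's tests are run, with credit given only if the tests pass. As a rough illustration, not the official harness, a single scoring step might look like the sketch below; the repository path, patch file, and test identifiers are placeholders.

    # Minimal sketch of a SWE-bench-style scoring step (illustrative only,
    # not the official harness). Paths and test identifiers are placeholders.
    import subprocess

    def evaluate_patch(repo_dir: str, patch_file: str, test_ids: list[str]) -> bool:
        """Apply a model-generated patch and run the issue's tests."""
        # Apply the candidate patch produced by the model.
        applied = subprocess.run(
            ["git", "apply", patch_file], cwd=repo_dir, capture_output=True
        )
        if applied.returncode != 0:
            return False  # the patch does not even apply cleanly

        # Run the fail-to-pass tests that define success for this issue.
        result = subprocess.run(
            ["python", "-m", "pytest", *test_ids], cwd=repo_dir, capture_output=True
        )
        return result.returncode == 0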

The audit revealed multiple categories of failure. Some test cases were too narrow, marking valid fixes as wrong because they did not match the exact expected code. Others were too broad, letting any change pass. Most damning was evidence of benchmark contamination — AI models reproducing gold-standard patches word for word, suggesting they had memorised answers from training data.
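To make those failure modes concrete, consider a toy example that is not drawn from SWE-bench itself: a test that inspects the exact source text rejects any correct fix written differently, while a test that never exercises the reported bug lets almost any change pass.

    # Toy illustration of the failure modes described above; parse_date is a
    # stand-in for a function a patch is meant to fix, not SWE-bench code.
    import inspect

    def parse_date(value):
        # Patched behaviour: the reported bug was a crash on None input.
        if value is None:
            return None
        return value.strip()

    def test_too_narrow():
        # Ties success to the exact source text, so an equally correct fix
        # written differently is marked wrong.
        assert "if value is None:" in inspect.getsource(parse_date)

    def test_too_broad():
        # Only checks a happy-path input, so almost any change passes even
        # if the reported bug is untouched.
        assert parse_date("2026-03-30") == "2026-03-30"

    def test_behavioural():
        # A sounder test: checks observable behaviour on the reported case.
        assert parse_date(None) is None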

The top five models scored between 80.0 and 80.9 per cent on the existing benchmark, a spread of just 0.9 points. This clustering suggests the benchmark had lost its ability to differentiate between models. On a new, harder version called SWE-bench Pro, the top score, achieved by GPT-5.4, fell to 57.7 per cent, restoring meaningful separation between competitors.

The findings come from a comprehensive review covering 21 major AI research findings from February and March 2026, which also identified similar trust issues across other benchmarks used to evaluate AI capabilities.

Analysis

Why This Matters

Companies and researchers have been making product decisions, investment choices, and capability claims based on SWE-bench scores for over a year. If the benchmark itself was broken, those decisions were built on sand.

Background

Benchmark integrity is a recurring problem in AI. Models are increasingly trained on data that includes benchmark answers, creating a feedback loop where scores go up but real-world ability does not. The contamination issue found here — models reproducing exact patches — is particularly concerning.
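Contamination of this kind is straightforward to screen for in principle: compare each model-generated patch against the benchmark's gold patch and flag exact or near-exact reproductions. The sketch below shows one illustrative way to do that; the similarity threshold is an assumption, not a figure from the audit.

    # Rough contamination screen: flag model patches that reproduce the gold
    # patch verbatim or nearly so. The threshold is illustrative only.
    import difflib

    def normalise(patch: str) -> str:
        # Ignore trailing-whitespace differences between diffs.
        return "\n".join(line.rstrip() for line in patch.strip().splitlines())

    def similarity(model_patch: str, gold_patch: str) -> float:
        """0-1 similarity; 1.0 means the gold patch was reproduced word for word."""
        return difflib.SequenceMatcher(
            None, normalise(model_patch), normalise(gold_patch)
        ).ratio()

    def looks_memorised(model_patch: str, gold_patch: str, threshold: float = 0.98) -> bool:
        return similarity(model_patch, gold_patch) >= threshold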

Key Perspectives

The near-identical scores at the top suggest a ceiling effect rather than genuine capability convergence. When the hardest problems have broken answer keys, the benchmark effectively becomes a test of memorisation rather than engineering skill.

What to Watch

SWE-bench Pro aims to fix these issues with harder problems and better test validation. Whether the industry adopts it quickly or continues citing the old numbers will say a lot about how seriously the field takes evaluation integrity.
