New Benchmarks Reveal AI Systems Struggle with Research-Level Mathematics Despite Olympiad Success

Two independent studies find frontier AI models score below 15% on research-level mathematical problems

By Zotpaper
Despite recent headline-grabbing achievements in competition mathematics, leading artificial intelligence systems are scoring below 15% on newly developed benchmarks designed to test genuine research-level mathematical reasoning, according to two independent studies published this month.

Artificial intelligence systems may have conquered the International Mathematical Olympiad, but two new benchmarks published in June 2026 suggest that frontier AI models still fall well short of the mathematical reasoning required for genuine academic research.

The first benchmark, Riemann-Bench, was developed by researchers Suhaas Garre, Erik Knutsen, Sushant Mehta, and Edwin Chen. It comprises a private set of problems authored by Ivy League mathematics professors, graduate students, and PhD-holding IMO medalists — problems that routinely took their own authors weeks to solve. Every problem undergoes double-blind verification by two independent domain experts, and each yields a unique, closed-form solution assessed by programmatic verifiers. Evaluated as unconstrained research agents with access to coding tools, search engines, and open-ended reasoning, all frontier AI models scored below 10%.
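To illustrate what "assessed by programmatic verifiers" can mean for problems with a unique closed-form answer, here is a minimal sketch in Python. It is not the Riemann-Bench verifier, whose implementation the article does not describe. The function names `verify_answer` and `score` are hypothetical, and the sketch assumes answers are rational numbers written as strings such as "3/6" or "0.5".

```python
from fractions import Fraction


def verify_answer(submitted: str, reference: str) -> bool:
    """Return True when both strings denote the same rational number.

    Assumption: answers are rationals; real benchmarks would also need
    to handle symbolic expressions, which this sketch does not attempt.
    """
    try:
        return Fraction(submitted) == Fraction(reference)
    except (ValueError, ZeroDivisionError):
        # Unparseable or undefined input counts as a wrong answer.
        return False


def score(submissions: dict[str, str], answer_key: dict[str, str]) -> float:
    """Fraction of problems in the key that were answered correctly."""
    correct = sum(
        verify_answer(submissions.get(pid, ""), ref)
        for pid, ref in answer_key.items()
    )
    return correct / len(answer_key)
```

Comparing exact fractions rather than floating-point values means "3/6" and "0.5" both match a reference of "1/2", while a rounded "0.3333" does not match "1/3". A missing submission is scored as incorrect.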

The second study, LemmaBench, takes a distinct but complementary approach. Developed by Antoine Peyronnet, Fabian Gloeckle, and Amaury Hayat, it creates an automatically updatable benchmark by extracting lemmas from recent arXiv mathematics preprints and rewriting them into self-contained, formally verifiable statements. Because the problems are drawn continuously from the latest published research, the benchmark remains resistant to contamination from AI training data. Current state-of-the-art models achieved roughly 10–15% accuracy on theorem proving tasks under this framework.

Both studies draw attention to what they describe as a fundamental gap between competition-style problem solving and genuine mathematical research. The Riemann-Bench authors argue that olympiad problems are drawn from limited domains, require minimal advanced machinery, and often reward insightful shortcuts rather than deep theoretical knowledge. Research-level mathematics, by contrast, demands sustained engagement with complex, multi-layered theory.

The LemmaBench team highlights an additional methodological concern: most existing AI math benchmarks rely on static, hand-curated sets of contest or textbook problems, which risk being inadvertently absorbed into training data over time — a phenomenon known as benchmark contamination. By updating LemmaBench regularly with newly published results, its authors aim to ensure that scores reflect current capability rather than memorisation.
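One way to picture the contamination argument is a date filter: a model is tested only on lemmas from preprints posted after its training data was collected. The sketch below is an illustration under that assumption, not LemmaBench's actual pipeline. The `Lemma` record, its fields, and the `fresh_lemmas` function are hypothetical names introduced here.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Lemma:
    arxiv_id: str      # hypothetical identifier for the source preprint
    submitted: date    # date the source preprint was posted
    statement: str     # self-contained statement of the lemma


def fresh_lemmas(pool: list[Lemma], training_cutoff: date) -> list[Lemma]:
    """Keep only lemmas whose preprint appeared strictly after the cutoff."""
    return [lemma for lemma in pool if lemma.submitted > training_cutoff]
```

Under this scheme, each refresh of the pool extends the set of problems that postdate a given model's training data, which is why regular updating matters more than the size of any single release.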

The findings arrive at a moment of considerable public enthusiasm about AI's mathematical potential. Several AI laboratories have publicised systems achieving gold-medal-equivalent performance at the IMO, fuelling speculation about AI's readiness to assist or even lead mathematical discovery. These new benchmarks suggest that such enthusiasm may need to be tempered.

Neither research team argues that progress has stalled. Both note that performance at even the 10–15% level on genuinely hard problems is non-trivial, and that rapid improvement in AI capabilities means these figures could shift substantially within a year or two. The benchmarks are designed precisely to track that progress in a rigorous and contamination-resistant way.

§

Analysis

Why This Matters

  • AI performance claims in mathematics have significant downstream effects on research funding, educational policy, and public trust in AI systems; accurate benchmarking is therefore essential.
  • The gap between competition-level and research-level mathematical performance suggests AI tools are not yet ready to meaningfully accelerate frontier mathematics research, despite widespread assumptions to the contrary.
  • The two benchmarks introduce complementary methodological innovations, a private problem set in Riemann-Bench and continuous updating in LemmaBench, that could influence how the broader AI evaluation community designs future assessments.

Background

For several years, competition mathematics has served as a high-profile proxy for AI reasoning ability. Benchmarks such as MATH and GSM8K tracked progress on school and competition problems, and by 2024–2025, multiple AI systems had achieved scores competitive with top human performers on the AMC, AIME, and eventually the IMO itself.

These achievements generated significant media attention and were cited as evidence of rapid progress toward artificial general intelligence. However, critics within the mathematics community noted that olympiad problems, while challenging, occupy a narrow and well-defined problem space that differs substantially from the open-ended, theory-heavy work of professional mathematicians.

The issue of benchmark contamination — where problems appear in AI training data, inflating apparent performance — has grown more pressing as AI training datasets have expanded to encompass much of the publicly available internet, including mathematical problem sets and their solutions.

Key Perspectives

AI Researchers and Developers: Many acknowledge the legitimacy of the critique and welcome harder benchmarks as a more meaningful signal of progress. Scores below 15% are seen not as failure but as a marker of the next frontier for capability development.

Mathematics Community: Academic mathematicians have long been sceptical that competition success translates to research utility. These benchmarks give quantitative weight to that scepticism, suggesting that AI systems capable of generating a proof sketch are still far from the sustained, creative reasoning required for novel mathematical discovery.

Critics and Sceptics: Some researchers caution that even private or continuously updated benchmarks carry limitations: problems written by humans may still carry stylistic patterns that AI systems could eventually learn to exploit, and "accuracy" metrics may not fully capture the quality or depth of mathematical reasoning being demonstrated.

What to Watch

  • Score improvements on both Riemann-Bench and LemmaBench over the next 12–18 months, which will indicate whether AI mathematical reasoning is genuinely advancing at the research level.
  • Whether major AI laboratories incorporate these or similar benchmarks into their own internal evaluation suites and public capability disclosures.
  • The potential for benchmark contamination in LemmaBench as arXiv preprints continue to be ingested into AI training pipelines, and how the authors respond to that challenge.

Sources

Zotpaper

Articles published under the Zotpaper byline are synthesized from multiple source publications by our AI editor and reviewed by our editorial process. Each story combines reporting from credible outlets to give readers a balanced, comprehensive view.