Two research teams have published independent advances in test-time scaling for large language models. One system reports gold-medal-level scores on elite mathematics competitions; the other cuts the token cost of generating reliable answers by 25 to 47 percent. Together, the results point toward AI reasoning systems that are both more capable and more efficient.
Two Approaches to a Shared Challenge
Large language models have become increasingly capable at reasoning tasks, but unlocking their best performance often requires running many parallel attempts and aggregating the results, one form of a technique known as test-time scaling. While effective, this approach carries a steep computational price. Two papers published this week on arXiv address that tension from different angles.
MARS: Smarter Stopping, Lower Costs
A team including Wenbo Chen, Weijie Su, and colleagues introduced MARS (Margin-Adversarial Risk-controlled Stopping), a method designed to cut the token cost of self-consistency inference without sacrificing accuracy.
In standard parallel scaling, a model generates many independent reasoning traces and takes a majority vote of their final answers. MARS instead monitors those traces at intermediate checkpoints, examining whether the emerging vote leader is stable enough to declare a winner early.
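As a rough illustration of the baseline being improved on, majority voting over finished traces can be sketched in a few lines of Python. The function name and the margin it returns are illustrative choices, not taken from the paper.

```python
from collections import Counter

def majority_vote(answers):
    """Return (leader, margin) for a list of final answers.

    margin is the leader's vote count minus the runner-up's count.
    Hypothetical helper for illustration; not code from the MARS paper.
    """
    counts = Counter(answers).most_common()
    leader, top = counts[0]
    runner_up = counts[1][1] if len(counts) > 1 else 0
    return leader, top - runner_up

# Five traces, three of which agree on "42".
leader, margin = majority_vote(["42", "42", "17", "42", "9"])
```

The margin is the quantity an early-stopping rule would watch: the wider the lead, the less likely unfinished traces are to overturn it.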
The system separates two distinct sources of uncertainty: how likely individual traces are to change their final answer as they continue generating, and where those changing traces might land. The first is estimated with a compact five-feature logistic model; the second is handled with a deliberately conservative adversarial bound derived from a small set of warm-up traces.
The result is a stopping rule that, with high probability, produces the same answer as running every trace to completion. Tested across three reasoning models and three competition mathematics benchmarks, MARS reduced token consumption by 25 to 47 percent in standard self-consistency settings. Even applied on top of DeepConf Online — a strong existing baseline that already filters and truncates weaker traces — MARS delivered an additional 14 to 29 percent reduction with no measurable accuracy loss.
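The general shape of such a rule can be sketched as follows. This is a simplified stand-in, not the authors' algorithm: it assumes traces flip independently, replaces the paper's warm-up-derived bound with a textbook Hoeffding bound, and uses hypothetical feature weights.

```python
import math
from collections import Counter

def flip_probability(features, weights, bias):
    """Logistic estimate that a trace will change its final answer.

    The features and weights are hypothetical placeholders.
    """
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def can_stop_early(answers, flip_probs, delta=0.05):
    """Conservative check on whether the current vote leader is safe.

    Assumption (not from the paper): flips are independent, so the number
    of flips exceeds max_flips with probability at most delta (Hoeffding).
    """
    counts = Counter(answers).most_common()
    top = counts[0][1]
    runner_up = counts[1][1] if len(counts) > 1 else 0
    margin = top - runner_up
    n = len(answers)
    max_flips = sum(flip_probs) + math.sqrt(n * math.log(1.0 / delta) / 2.0)
    # Adversarial case: every flip leaves the leader and joins the
    # runner-up, so each flip costs two votes of margin.
    return margin > 2 * max_flips

# Sixteen traces, fourteen currently agreeing, each with a 2% flip estimate.
stop = can_stop_early(["A"] * 14 + ["B"] * 2, [0.02] * 16)
```

Sending every flip to the runner-up is deliberately pessimistic: the rule stops later than strictly necessary but rarely stops on the wrong answer, which matches the conservative spirit of the approach described above.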
MaxProof: Gold-Medal Mathematics
A separate team at MiniMax introduced MaxProof, a framework aimed not at efficiency but at pushing the ceiling of what AI can prove. Rather than simply sampling answers, MaxProof organises a model's capabilities into a structured pipeline: generating candidate proofs, verifying them with a low false-positive-rate checker, repairing flawed attempts using targeted critique, and finally selecting a winner through tournament ranking.
The underlying MiniMax-M3 model was trained specifically to support all four roles simultaneously. At test time, MaxProof searches across a population of candidate proofs before returning a single, verified result.
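The four stages can be pictured as a simple search loop. The sketch below is an assumption-laden outline, not MiniMax's implementation: the four callables stand in for model roles, and the single-elimination bracket is one possible reading of "tournament ranking".

```python
from typing import Callable, List, Optional

def proof_search(
    problem: str,
    generate: Callable[[str], str],
    verify: Callable[[str, str], Optional[str]],  # None = accepted, else a critique
    repair: Callable[[str, str, str], str],
    better: Callable[[str, str], bool],  # True if the first proof is preferred
    population: int = 4,
    repair_rounds: int = 2,
) -> Optional[str]:
    """Generate, verify, repair, then rank. All four callables are hypothetical."""
    accepted: List[str] = []
    for _ in range(population):
        proof = generate(problem)
        for attempt in range(repair_rounds + 1):
            critique = verify(problem, proof)
            if critique is None:
                accepted.append(proof)
                break
            if attempt < repair_rounds:
                proof = repair(problem, proof, critique)
    if not accepted:
        return None  # nothing passed the checker
    # Single-elimination tournament over the verified proofs.
    while len(accepted) > 1:
        next_round = []
        for i in range(0, len(accepted) - 1, 2):
            a, b = accepted[i], accepted[i + 1]
            next_round.append(a if better(a, b) else b)
        if len(accepted) % 2 == 1:
            next_round.append(accepted[-1])  # odd one out advances unopposed
        accepted = next_round
    return accepted[0]
```

Only proofs that pass the checker enter the tournament, so a verifier with a low false-positive rate limits how often a flawed proof can win the final selection.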
The performance figures are striking. With MaxProof scaling applied, the M3 model scored 35 out of 42 on the 2025 International Mathematical Olympiad and 36 out of 42 on the 2026 United States of America Mathematical Olympiad — results the researchers say exceed the human gold-medal threshold on both competitions.
Complementary Directions
The two systems reflect complementary pressures on AI development. MARS addresses the real-world deployment concern that scaled inference is expensive, offering a principled way to spend less compute while preserving quality. MaxProof pursues capability benchmarks that until recently seemed firmly out of reach for automated systems.
Neither paper has yet undergone formal peer review, as both were posted as preprints. Independent verification of the claimed results — particularly MaxProof's olympiad scores — will be an important next step before the broader research community draws firm conclusions.
Analysis
Why This Matters
- Deployment economics: MARS directly addresses one of the most practical barriers to deploying capable AI systems at scale — inference costs. Reductions of 25–47% in token usage could meaningfully lower the cost of AI-assisted reasoning tools in education, research, and enterprise applications.
- Benchmark significance: Exceeding the human gold-medal threshold on the IMO and USAMO would represent a qualitative shift in AI mathematical capability, with potential implications for automated theorem proving, scientific discovery, and mathematical research assistance.
- Methodological influence: Both papers introduce techniques (adversarial stopping bounds in MARS, a generate-verify-repair pipeline in MaxProof) that could shape how the broader research community approaches test-time scaling.
Background
Test-time scaling emerged as a major focus of AI research following observations that allowing models to "think longer" — generating more tokens or more parallel attempts — reliably improves accuracy on difficult reasoning tasks. OpenAI's o1 and o3 series, Google DeepMind's Gemini reasoning models, and Meta's work on chain-of-thought inference all reflect this trend. The approach trades inference-time compute for accuracy, raising questions about cost and latency for real-world use.
Mathematical olympiad problems have long served as a demanding benchmark for AI reasoning. Early systems struggled with even elementary competition problems. Progress accelerated notably from 2023 onward, with models from DeepMind, OpenAI, and Chinese labs progressively improving on IMO problems. However, producing complete, formally verifiable proofs — rather than sketched solutions — has remained substantially harder than producing correct numerical answers.
The two techniques presented this week build on distinct research threads: MARS extends work on conformal prediction and uncertainty quantification in inference, while MaxProof draws on a growing literature combining neural proof generation with formal verification and reinforcement learning from verifier feedback.
Key Perspectives
- AI capability researchers: MaxProof's olympiad results, if independently confirmed, would mark a new high-water mark for automated mathematical reasoning and lend credibility to claims that AI is approaching research-level mathematical ability in narrow domains.
- AI deployment and infrastructure teams: MARS offers a practical improvement. Because it relies on a simple five-feature logistic classifier rather than a complex learned model, it should be relatively straightforward to integrate into existing inference pipelines without retraining the underlying models.
- Critics and independent reviewers: Both papers are preprints and have not been peer reviewed. The olympiad scoring methodology for MaxProof warrants scrutiny, specifically how proofs were evaluated and whether the scoring mirrors official competition standards. For MARS, questions remain about how well the warm-up calibration generalises to domains outside competition mathematics.
What to Watch
- Independent replication of MaxProof's IMO and USAMO scores, particularly whether grading by mathematicians or formal verification in proof assistants such as Lean confirms the results.
- Peer review outcomes for both papers, which will subject their methodology and empirical claims to structured external scrutiny.
- Adoption by major AI labs of MARS-style adaptive stopping as a cost-reduction technique in production inference systems, which would signal that the approach generalises beyond the benchmarks tested.