Autonomous AI Trains Itself — and Catches Its Own Blind Spots
Researchers at NVIDIA have reported what they describe as the first publicly documented autonomous post-training run at frontier scale. The system, called A-Evolve-Training, independently managed weeks of training on a 30-billion-parameter Nemotron model, iterating through data selection, recipe changes, evaluation, and policy revision — all without human intervention.
The result placed 8th out of roughly 4,000 submissions on the public NVIDIA Nemotron-Reasoning Challenge leaderboard, reaching a score of 0.86 against the top human submission's 0.87. The authors are careful not to claim the system outperformed human researchers, but note the scale far exceeds prior autonomous ML demonstrations, which were confined to GPT-2-class models of around 124 million parameters.
Perhaps more notable than the ranking was a specific behaviour the system exhibited mid-run. It detected that its own internal evaluation metric had become misleading: candidate models were improving the proxy score without improving actual performance on the target domain. It then revised its search policy accordingly, seeking interventions that lowered the proxy while improving the real target. The authors argue this constitutes evidence of "discovery, not only optimisation," and frame it as a meaningful step toward what they regard as the bar for "recursive self-improvement."
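To make the idea of a misleading proxy concrete, here is a minimal sketch of one way such a check could work. It is not the A-Evolve-Training method, which the report does not spell out here; the function names and the 0.5 threshold are illustrative assumptions.

```python
def divergence_rate(proxy, target):
    """Fraction of consecutive steps where the proxy rose but the target did not."""
    if len(proxy) != len(target) or len(proxy) < 2:
        raise ValueError("need two equal-length score histories with at least 2 points")
    diverging = 0
    for i in range(1, len(proxy)):
        proxy_up = proxy[i] > proxy[i - 1]
        target_up = target[i] > target[i - 1]
        if proxy_up and not target_up:
            diverging += 1
    return diverging / (len(proxy) - 1)


def proxy_is_misleading(proxy, target, max_rate=0.5):
    """Hypothetical rule: distrust the proxy once most of its gains fail to transfer."""
    return divergence_rate(proxy, target) > max_rate


# Made-up score histories for successive candidate models (not data from the paper).
proxy_history = [0.10, 0.20, 0.30, 0.40]
target_history = [0.50, 0.50, 0.49, 0.48]
flagged = proxy_is_misleading(proxy_history, target_history)
```

With these invented histories, every proxy gain coincides with a flat or falling target score, so the check flags the proxy. A real system would also need to account for noise in both metrics.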
The same loop was also applied to 120B and 550B Nemotron models, though the authors note the absence of a human baseline at those scales means effectiveness there remains unverified.
Security Researchers Flag Deep Vulnerabilities in Local AI Agents
On the security front, two independent research groups have identified significant risks in the infrastructure surrounding AI agents — findings that arrive as these systems are increasingly deployed on personal machines and in enterprise settings.
One team introduced CLAWAUDIT, a static analysis framework for auditing local LLM agents such as OpenClaw and Nanobot, tools that run directly on end-user hardware and have access to shells, file systems, browsers, credentials, and messaging applications. The researchers wrote 47 custom Semgrep rules and 30 CodeQL queries derived from a five-category vulnerability taxonomy, then evaluated them on a benchmark of 446 known advisories. On held-out test cases, their tool raised recall from 21.7% to 66.8% with Semgrep and from 13.8% to 75.1% with CodeQL, a substantial improvement over off-the-shelf commercial tools.
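For readers weighing the recall figures, the short sketch below works through the arithmetic. Recall is the share of known vulnerabilities a tool flags. The helper names are illustrative, and the only inputs are the percentages reported above.

```python
def recall(true_positives, false_negatives):
    """Share of known vulnerabilities that the analyser actually flagged."""
    known = true_positives + false_negatives
    if known == 0:
        raise ValueError("recall is undefined when there are no known positives")
    return true_positives / known


def gain(before_pct, after_pct):
    """Return (percentage-point gain, relative multiple), rounded for display."""
    return round(after_pct - before_pct, 1), round(after_pct / before_pct, 2)


# Reported recall before and after the custom rules, in percent.
REPORTED = {"Semgrep": (21.7, 66.8), "CodeQL": (13.8, 75.1)}
gains = {tool: gain(before, after) for tool, (before, after) in REPORTED.items()}
# gains == {"Semgrep": (45.1, 3.08), "CodeQL": (61.3, 5.44)}
```

In other words, recall roughly tripled with Semgrep and rose more than fivefold with CodeQL.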
A separate paper proposed AgentRiskBOM, a structured "security bill of materials" artifact specifically designed for agentic AI. The framework adds a machine-readable layer over existing software transparency standards (SBOM, AIBOM, MLBOM), capturing runtime authority fields including tool permissions, memory scope, credential access, approval gates, and inter-agent communication capabilities. Evaluated against 13 open-source agents and 52 risk scenarios, AgentRiskBOM achieved 100% risk-category visibility compared to roughly 10–21% for existing frameworks. The authors stress the artifact is designed to be generated before incidents, not in response to them.
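As a rough illustration of what a machine-readable record of runtime authority might look like, the sketch below models the fields named above as a Python dataclass. This is not the AgentRiskBOM schema: the class name, field types, default values, and the `ungated_tools` helper are all assumptions made for the example.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class AgentRiskRecord:
    """Hypothetical record of what an agent is allowed to do at runtime."""
    agent_name: str
    tool_permissions: list = field(default_factory=list)   # tools the agent may call
    memory_scope: str = "session"                          # assumed label for memory lifetime
    credential_access: list = field(default_factory=list)  # secrets the agent can read
    approval_gates: list = field(default_factory=list)     # tools that need human sign-off
    inter_agent_communication: bool = False

    def ungated_tools(self):
        """Tools the agent can invoke without any approval gate."""
        return sorted(set(self.tool_permissions) - set(self.approval_gates))

    def to_json(self):
        """Serialise the record so other tooling can consume it."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


record = AgentRiskRecord(
    agent_name="demo-agent",
    tool_permissions=["shell", "browser"],
    approval_gates=["shell"],
)
```

Here `record.ungated_tools()` returns `["browser"]`, the kind of gap a reviewer would want to see before deployment rather than after an incident.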
Smarter Memory for Long-Running AI Agents
Two additional papers tackled the challenge of memory in AI agents tasked with long-horizon work — a practical bottleneck as agents increasingly operate across extended sessions with limited context windows.
The OSL-MR framework, from a team that includes Huawei researchers, formulates memory retention as a constrained stochastic optimisation problem and demonstrates that simple recency-based or heuristic approaches underperform when agents face tight memory budgets. Evaluated on the LoCoMo and LongMemEval benchmarks, OSL-MR consistently came close to the optimal solutions found by dynamic programming.
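A toy example shows why recency alone can fall short under a tight budget. The sketch below is a deterministic 0/1 knapsack, not the OSL-MR formulation, which is stochastic. The item sizes, values, and function names are invented for illustration.

```python
def dp_optimal_value(items, budget):
    """Best total value of memories that fit the budget (classic 0/1 knapsack DP).

    items: list of (size, value) pairs with positive integer sizes, oldest first.
    """
    best = [0.0] * (budget + 1)
    for size, value in items:
        # Walk capacities downwards so each memory is kept at most once.
        for cap in range(budget, size - 1, -1):
            best[cap] = max(best[cap], best[cap - size] + value)
    return best[budget]


def recency_value(items, budget):
    """Baseline heuristic: keep the newest memories that still fit."""
    total, used = 0.0, 0
    for size, value in reversed(items):
        if used + size <= budget:
            used += size
            total += value
    return total


# One old but valuable memory, followed by two recent low-value ones.
memories = [(3, 9.0), (2, 1.0), (2, 1.0)]
optimal = dp_optimal_value(memories, budget=4)   # 9.0
heuristic = recency_value(memories, budget=4)    # 2.0
```

The recency rule fills the budget with the two newest items and evicts the one memory that mattered most, which is the kind of gap the dynamic-programming reference exposes.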
A separate system called Nous takes a more philosophical approach, treating memory not as storage but as prediction. Rather than recording facts, Nous maintains probability distributions for entity-attribute pairs, updating beliefs via Bayesian inference and allowing memories to "decay" as entropy increases. On the LoCoMo benchmark, Nous achieved competitive F1 scores across four question categories using GPT-4o-mini as a backbone, requiring no external vector database.
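The memory-as-prediction idea can be sketched in a few lines. This is a minimal illustration, assuming a categorical belief over the possible values of one entity-attribute pair and modelling decay as mixing toward a uniform distribution. The summary above does not specify the update rule Nous actually uses, so every name and number here is hypothetical.

```python
import math


def bayes_update(belief, likelihood):
    """Posterior is proportional to prior times likelihood of the new observation."""
    unnormalised = {v: p * likelihood.get(v, 0.0) for v, p in belief.items()}
    total = sum(unnormalised.values())
    if total == 0:
        raise ValueError("observation is impossible under the current belief")
    return {v: p / total for v, p in unnormalised.items()}


def decay(belief, rate):
    """Assumed forgetting rule: blend the belief with a uniform distribution."""
    uniform = 1.0 / len(belief)
    return {v: (1.0 - rate) * p + rate * uniform for v, p in belief.items()}


def entropy(belief):
    """Shannon entropy in bits; higher means the memory is less certain."""
    return -sum(p * math.log2(p) for p in belief.values() if p > 0)


# Belief about a hypothetical pair (user, home_city), starting undecided.
prior = {"Paris": 0.5, "Berlin": 0.5}
posterior = bayes_update(prior, {"Paris": 0.9, "Berlin": 0.1})
faded = decay(posterior, rate=0.5)
```

After the update the belief leans strongly toward one value. After decay it drifts back toward uncertainty, so `entropy(faded)` exceeds `entropy(posterior)`.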
Rounding out the week's publications, researchers introduced Anything2Skill, a framework that compiles external knowledge (manuals, logs, examples, and trajectories) into reusable procedural "skills" for agents. It achieved success rates above 94% on tested command-line tasks, outperforming retrieval-only approaches.