Reinforcement learning has become a cornerstone technique for training advanced AI agents, but researchers continue to grapple with fundamental limitations in how these systems learn, generalise, and avoid critical errors. Three new papers from academic and industry researchers aim to address some of these challenges from different angles.
ReSkill: Making Skills Evolve With the Policy
A team including researchers from Penn State and Amazon proposed ReSkill, a framework designed to ensure that modular skills — reusable strategies an AI agent can call upon — evolve in step with the policy being trained, rather than lagging behind or conflicting with it.
Existing approaches often treat skill creation and policy optimisation as separate processes, which can result in an agent holding onto outdated or counterproductive skills. ReSkill, inspired by Anthropic's Skill Creator concept, embeds skill management directly into the Group Relative Policy Optimization (GRPO) training loop. The system uses failure diagnosis to propose skill revisions, controlled within-group comparisons to test which version of a skill best supports learning, and Thompson Sampling to balance trying new skills against sticking with proven ones.
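To make the explore-versus-exploit step concrete, here is a minimal sketch of Thompson Sampling over competing versions of a single skill. It assumes a Beta-Bernoulli model in which each rollout using a skill version either succeeds or fails. The class name, method names and success criterion are hypothetical and are not taken from the ReSkill paper.

```python
import random


class SkillVersionBandit:
    """Beta-Bernoulli Thompson Sampling over candidate versions of one skill.

    Illustrative only: the names and the success/failure signal are assumptions.
    """

    def __init__(self, versions, seed=None):
        self.rng = random.Random(seed)
        # One [alpha, beta] pair per version, starting from a uniform Beta(1, 1) prior.
        self.posterior = {v: [1, 1] for v in versions}

    def select(self):
        # Draw one plausible success rate per version and use the highest draw.
        draws = {
            v: self.rng.betavariate(alpha, beta)
            for v, (alpha, beta) in self.posterior.items()
        }
        return max(draws, key=draws.get)

    def update(self, version, success):
        # A successful rollout increments alpha; a failed one increments beta.
        self.posterior[version][0 if success else 1] += 1


bandit = SkillVersionBandit(["skill_v1", "skill_v2"], seed=0)
chosen = bandit.select()             # version to use for the next rollout
bandit.update(chosen, success=True)  # feed the rollout outcome back in
```

Because a version with few trials has a wide posterior, it still wins some draws and gets tested, while a version with a long record of successes wins most of them.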
In tests across several domains, ReSkill outperformed existing memory- and skill-based RL methods. The gains were most pronounced on tasks the model had not seen during training, which suggests generalisation rather than memorisation.
T²-GRPO: Teaching AI to Care for Dementia Patients
A separate team from UC Irvine and partner institutions tackled the specific challenge of training caregiver AI agents to support people with dementia — a domain where balancing immediate emotional responses against long-term care goals is critical and mistakes can carry real consequences.
Their Turn-Trajectory GRPO (T²-GRPO) framework separates rewards into two time horizons: dense turn-level signals derived from a frozen dementia patient simulator, which measure changes in patient distress and resistance in real time, and sparser trajectory-level evaluations of overall care outcomes. A binary hard veto enforces safety constraints throughout.
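The two-horizon reward with a veto can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's formulation: the `PatientState` fields, the unweighted sum of distress and resistance changes, the placement of the trajectory score on the final turn, and the rule that any violation replaces every reward with a fixed penalty are all hypothetical choices.

```python
from dataclasses import dataclass


@dataclass
class PatientState:
    distress: float    # hypothetical simulator scale, higher is worse
    resistance: float  # hypothetical simulator scale, higher is worse


def turn_reward(before, after):
    # Dense signal: reward any drop in distress and resistance across one turn.
    return (before.distress - after.distress) + (before.resistance - after.resistance)


def episode_rewards(states, trajectory_score, violations,
                    veto_penalty=-1.0, trajectory_weight=1.0):
    """Return one reward per turn for a single care episode.

    states: simulator states s_0 .. s_T (one more than the number of turns).
    trajectory_score: sparse score for the overall care outcome.
    violations: one bool per turn, True if a safety constraint was broken.
    """
    if len(states) < 2 or len(violations) != len(states) - 1:
        raise ValueError("need at least one turn and one violation flag per turn")
    if any(violations):
        # Assumed veto rule: a single violation overrides all other reward.
        return [veto_penalty] * len(violations)
    rewards = [turn_reward(a, b) for a, b in zip(states, states[1:])]
    # Assumed placement: the trajectory-level score is added on the final turn.
    rewards[-1] += trajectory_weight * trajectory_score
    return rewards
```

Nothing here calls an external LLM judge. Every term is computed from simulator state or a rule check, which is the dependency the approach aims to remove.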
The researchers argue that existing approaches relying on external LLM-based evaluators are both expensive and prone to misreading indirect or fragmented patient responses. By grounding rewards directly in environment state changes, T²-GRPO avoids this dependency while still achieving strong results on benchmark caregiver tasks.
ISPO: Fixing Structural Failures in Reasoning Models
A third team identified two specific failure modes they say undermine GRPO-based training for mathematical reasoning. The first, which they call Zero-Advantage Collapse, occurs when all outputs in a training group receive the same reward: because GRPO scores each output relative to its group's average, every advantage is then zero, the gradient vanishes, and learning stalls. The second, Hallucinated Certainty, describes a model becoming increasingly confident in wrong answers late in training.
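The first failure mode follows directly from how group-relative advantages are computed. The sketch below uses the common mean-and-standard-deviation normalisation. The choice of population standard deviation and the `eps` value are assumptions, and implementations differ on both.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalise each reward against its own group's mean and spread."""
    mean = fmean(rewards)
    std = pstdev(rewards)  # population standard deviation (an assumption)
    return [(r - mean) / (std + eps) for r in rewards]


# Mixed outcomes: correct answers get positive advantages, wrong ones negative.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Identical outcomes (all correct or all wrong): every advantage is 0.0,
# so the group contributes no policy gradient.
collapsed = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
print(mixed, collapsed)
```

With a binary correctness reward, very hard problems (all samples wrong) and very easy ones (all samples right) both fall into the second case, which is why an additional signal that varies within the group can restore a gradient.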
Their proposed solution, Intrinsic Signal Policy Optimization (ISPO), enriches the reward signal using the model's own internal probability distributions — without requiring any external verifier. A sequence-level signal measures how informative a model's reasoning chain is for its final answer, while a token-level component penalises confident errors at key decision points.
Tested across three base models and five mathematical reasoning benchmarks, ISPO consistently outperformed competitive baselines, with the largest improvements on the hardest problems where collapse is most common.
Together, the three papers reflect a broader push in the research community to move beyond binary outcome rewards and toward richer, more structurally sound training signals — a shift that could meaningfully improve the reliability and adaptability of next-generation AI agents.