On-policy distillation has become a popular post-training method for large language models (LLMs). The student model generates its own outputs and is supervised by a teacher on those outputs, which lets developers transfer capabilities from a high-performing teacher model to a smaller, more deployable student model. Two recent papers, however, identify persistent problems with how these techniques handle nuanced tasks, and each proposes a refinement.
Addressing the Safety-Helpfulness Trade-off
A team from Fudan University led by Ming Wen tackled a particularly thorny problem: using distillation to instil safety behaviour in AI models. Their paper introduces Constitutional On-Policy Safe Distillation (COPSD), which addresses what the authors call 'geometric leakage under safety boundaries.'
The core issue, they found, is that when a teacher model is conditioned on safety guidelines, known as 'constitutions', it tends to produce short, overly cautious responses. Reverse KL divergence, the training objective that pulls the student's output distribution toward the teacher's, then amplifies this conservatism, causing the student model to become less expressive overall. In other words, making a model safer was inadvertently making it less useful.
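A toy calculation can make the amplification effect concrete. The sketch below is not taken from the COPSD paper. It only illustrates a general property of reverse KL, which penalises a student heavily for placing probability where the teacher places little. The three-outcome distributions and the names `teacher`, `collapsed` and `expressive` are hypothetical, chosen purely for illustration.

```python
import math


def kl(a, b):
    """KL(a || b) for two discrete distributions over the same outcomes."""
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)


def reverse_kl(student, teacher):
    # Expectation is taken under the student: KL(student || teacher).
    return kl(student, teacher)


def forward_kl(student, teacher):
    # Expectation is taken under the teacher: KL(teacher || student).
    return kl(teacher, student)


# Hypothetical next-token distribution of a constitution-conditioned teacher.
# Index 0 stands for a cautious, refusal-style continuation; indices 1 and 2
# stand for more substantive continuations. These numbers are assumptions.
teacher = [0.8, 0.1, 0.1]

# Two hypothetical students: one collapses onto the cautious option,
# the other keeps more probability on the substantive options.
collapsed = [0.98, 0.01, 0.01]
expressive = [0.5, 0.25, 0.25]

for name, student in [("collapsed", collapsed), ("expressive", expressive)]:
    print(
        f"{name:<10} "
        f"reverse KL = {reverse_kl(student, teacher):.3f}  "
        f"forward KL = {forward_kl(student, teacher):.3f}"
    )
```

In this toy setting, reverse KL scores the collapsed student as closer to the teacher, while forward KL favours the expressive one. A student trained with reverse KL against an already cautious teacher is therefore nudged toward even narrower outputs, which is consistent with the loss of expressiveness described above.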
COPSD counters this through a two-stage process: first recalibrating the teacher model using a technique called Cross-SFT cold-start, then performing constitution-conditioned on-policy distillation. Tested across 12 benchmarks, COPSD reportedly outperforms baseline methods on both safety and general reasoning, while reducing what the team terms the 'safety tax' — the performance penalty typically incurred when safety constraints are added.
Improving Granularity in Training Signal Selection
A separate team led by Yuying Li approached distillation from a different angle: the quality and weighting of training signals. Their method, FiRe-OPD (Filter, then Reweight), addresses the tendency of existing distillation methods to treat all training data equally.
FiRe-OPD works in two steps. First, it filters out low-quality sample trajectories — sequences of model outputs used during training — before they can introduce noise. Second, it applies a 'soft reweighting' mechanism to emphasise the most informative individual tokens within the retained data, rather than simply discarding tokens outright as some rival methods do.
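The two steps can be sketched in a few lines of Python. This is a minimal illustration of the filter-then-reweight idea under stated assumptions, not the authors' implementation: the article does not say how trajectory quality or token informativeness is measured, so the `quality` score, the `min_quality` threshold, the per-token `token_losses` signal and the softmax-based weighting are all hypothetical stand-ins.

```python
import math


def filter_trajectories(trajectories, min_quality):
    """Step 1: drop whole trajectories whose quality score is too low."""
    return [t for t in trajectories if t["quality"] >= min_quality]


def soft_token_weights(token_signals, temperature=1.0):
    """Step 2: softmax over per-token signals, rescaled to average 1.

    Every token keeps a strictly positive weight, so nothing is discarded;
    tokens with a larger signal simply count for more.
    """
    peak = max(token_signals)  # subtract the max for numerical stability
    exps = [math.exp((s - peak) / temperature) for s in token_signals]
    total = sum(exps)
    n = len(token_signals)
    return [n * e / total for e in exps]


def reweighted_loss(trajectories, min_quality=0.5, temperature=1.0):
    """Mean over kept trajectories of the weighted mean per-token loss."""
    kept = filter_trajectories(trajectories, min_quality)
    if not kept:
        return 0.0
    per_trajectory = []
    for t in kept:
        losses = t["token_losses"]
        weights = soft_token_weights(losses, temperature)
        weighted = sum(w * l for w, l in zip(weights, losses))
        per_trajectory.append(weighted / len(losses))
    return sum(per_trajectory) / len(per_trajectory)


# Hypothetical data: one high-quality and one low-quality trajectory.
demo = [
    {"quality": 0.9, "token_losses": [0.1, 2.0, 0.3]},
    {"quality": 0.2, "token_losses": [5.0, 5.0]},
]

print(len(filter_trajectories(demo, 0.5)))    # number of trajectories kept
print(soft_token_weights([0.1, 2.0, 0.3]))    # per-token weights, all > 0
print(reweighted_loss(demo))
```

In a real training loop the weights would normally be treated as constants during backpropagation, and the temperature would control how sharply the weighting favours high-signal tokens. Both are design choices of this sketch rather than details reported for FiRe-OPD.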
The researchers argue this finer-grained approach preserves more useful information while improving training stability. In benchmark testing, FiRe-OPD showed notable gains over competing token-level distillation methods, including a 6.25-point improvement on the AIME 2024 mathematics benchmark in strong-to-weak distillation settings, and an 18.81-point improvement on the Miner benchmark in multi-teacher settings. The code has been made publicly available on GitHub.
Complementary Directions
While the two papers address different aspects of on-policy distillation, they reflect a shared recognition that the field is moving beyond blunt, full-trace supervision toward more selective and targeted approaches. COPSD prioritises the integrity of safety behaviour during compression, while FiRe-OPD focuses on the mechanical quality of the training signal itself. Together, they suggest that future LLM training pipelines may need to address both concerns simultaneously to produce models that are efficient, capable, and reliably safe.