Researchers Advance AI Training Efficiency With New Knowledge Distillation Techniques

Two independent studies tackle key weaknesses in on-policy distillation, aiming to make AI models safer and smarter without sacrificing reasoning ability

By Zotpaper
Sources: 10 outlets
Two research teams have published independent studies this week proposing new methods to improve on-policy distillation (OPD), a technique used to train large language models more efficiently by having a smaller 'student' model learn from a more capable 'teacher' model. Both papers, released on arXiv on June 3, 2026, identify fundamental flaws in existing distillation approaches and offer targeted solutions — one focused on safety alignment and the other on optimising the quality of training signals.

On-policy distillation has become a popular post-training method for large language models (LLMs), allowing developers to transfer capabilities from a high-performing teacher model to a smaller, more deployable student model. However, both teams identify persistent problems with how existing techniques handle nuanced tasks such as safety alignment and the selection of training signals.

Addressing the Safety-Helpfulness Trade-off

A team from Fudan University led by Ming Wen tackled a particularly thorny problem: using distillation to instil safety behaviour in AI models. Their paper introduces Constitutional On-Policy Safe Distillation (COPSD), which addresses what the authors call 'geometric leakage under safety boundaries.'

The core issue, they found, is that when a teacher model is conditioned on safety guidelines (known as 'constitutions'), it tends to produce short, overly cautious responses. Reverse KL divergence, a training objective that penalises the student for placing probability on outputs the teacher considers unlikely, then amplifies this conservatism, causing the student model to become less expressive overall. In other words, making a model safer was inadvertently making it less useful.
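
To make the reverse KL point concrete, here is a minimal sketch in Python. The three-token vocabulary, the probabilities, and the function name are illustrative assumptions and are not taken from the COPSD paper. It compares a student that spreads probability across fuller answers with one that has collapsed onto the teacher's cautious token.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical next-token distributions over a 3-token vocabulary:
# index 0 = a refusal-style token, indices 1 and 2 = tokens that start a fuller answer.
teacher = [0.90, 0.05, 0.05]            # constitution-conditioned, heavily cautious
student_varied = [0.40, 0.30, 0.30]     # still spreads mass over fuller answers
student_collapsed = [0.98, 0.01, 0.01]  # even more cautious than the teacher

# Reverse KL puts the student first: KL(student || teacher).
reverse_varied = kl_divergence(student_varied, teacher)
reverse_collapsed = kl_divergence(student_collapsed, teacher)

# Forward KL puts the teacher first: KL(teacher || student).
forward_collapsed = kl_divergence(teacher, student_collapsed)

print(f"reverse KL, varied student:    {reverse_varied:.4f}")
print(f"reverse KL, collapsed student: {reverse_collapsed:.4f}")
print(f"forward KL, collapsed student: {forward_collapsed:.4f}")
```

In this toy case the collapsed student scores a much lower reverse KL than the varied one, and dropping the minority tokens costs less under reverse KL than under forward KL. That asymmetry is the general mode-seeking behaviour of reverse KL; whether it matches the paper's 'geometric leakage' analysis in detail is not something this sketch can show.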

COPSD counters this through a two-stage process: first recalibrating the teacher model using a technique called Cross-SFT cold-start, then performing constitution-conditioned on-policy distillation. Tested across 12 benchmarks, COPSD reportedly outperforms baseline methods on both safety and general reasoning, while reducing what the team terms the 'safety tax' — the performance penalty typically incurred when safety constraints are added.

Improving Granularity in Training Signal Selection

A separate team led by Yuying Li approached distillation from a different angle: the quality and weighting of training signals. Their method, FiRe-OPD (Filter, then Reweight), addresses the tendency of existing distillation methods to treat all training data equally.

FiRe-OPD works in two steps. First, it filters out low-quality sample trajectories — sequences of model outputs used during training — before they can introduce noise. Second, it applies a 'soft reweighting' mechanism to emphasise the most informative individual tokens within the retained data, rather than simply discarding tokens outright as some rival methods do.
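
The two steps can be sketched roughly as follows. This is a hedged illustration only: the per-trajectory quality score, the threshold, the per-token 'informativeness' signal, and the softmax-based weighting are all assumptions chosen for the example, since the article does not describe the criteria FiRe-OPD actually uses.

```python
import math

def filter_trajectories(trajectories, min_score):
    """Step 1: keep only trajectories whose quality score meets the threshold."""
    return [t for t in trajectories if t["score"] >= min_score]

def soft_token_weights(token_signals, temperature=1.0):
    """Step 2: map per-token signals to soft weights (assumes a non-empty list).

    A softmax is used so that every token keeps a non-zero weight;
    the weights are rescaled so they average 1.0 over the trajectory.
    """
    scaled = [s / temperature for s in token_signals]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [len(exps) * e / total for e in exps]

def weighted_loss(token_losses, weights):
    """Mean of the per-token losses after applying the soft weights."""
    return sum(w * l for w, l in zip(weights, token_losses)) / len(token_losses)

# Hypothetical sampled trajectories with made-up scores, signals, and losses.
trajectories = [
    {"id": "a", "score": 0.9, "token_signals": [0.1, 2.0, 0.3], "token_losses": [0.2, 1.5, 0.4]},
    {"id": "b", "score": 0.2, "token_signals": [0.5, 0.5, 0.5], "token_losses": [0.9, 0.8, 0.7]},
]

kept = filter_trajectories(trajectories, min_score=0.5)
weights_a = soft_token_weights(kept[0]["token_signals"])
loss_a = weighted_loss(kept[0]["token_losses"], weights_a)
print([t["id"] for t in kept], [round(w, 3) for w in weights_a], round(loss_a, 3))
```

The softmax is one simple way to emphasise some tokens without zeroing out the rest, which mirrors the contrast the researchers draw with methods that discard tokens outright.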

The researchers argue this finer-grained approach preserves more useful information while improving training stability. In benchmark testing, FiRe-OPD showed notable gains over competing token-level distillation methods, including a 6.25-point improvement on the AIME 2024 mathematics benchmark in strong-to-weak distillation settings, and an 18.81-point improvement on the Miner benchmark in multi-teacher settings. The code has been made publicly available on GitHub.

Complementary Directions

While the two papers address different aspects of on-policy distillation, they reflect a shared recognition that the field is moving beyond blunt, full-trace supervision toward more selective and targeted approaches. COPSD prioritises the integrity of safety behaviour during compression, while FiRe-OPD focuses on the mechanical quality of the training signal itself. Together, they suggest that future LLM training pipelines may need to address both concerns simultaneously to produce models that are efficient, capable, and reliably safe.

§

Analysis

Why This Matters

  • On-policy distillation is central to how leading AI labs compress large, expensive models into smaller, deployable ones — improvements here could meaningfully reduce the cost and energy footprint of AI deployment at scale.
  • The safety-helpfulness tension addressed by COPSD is one of the most debated issues in AI alignment; evidence that it can be reduced through better training methodology has implications for how regulators and developers think about safe AI deployment.
  • FiRe-OPD's open-source release means independent researchers and smaller organisations can immediately test and build on these findings, potentially accelerating progress across the field.

Background

Knowledge distillation — training a smaller model to mimic a larger one — has been a staple of machine learning since at least 2015, when Geoffrey Hinton and colleagues formalised the approach. The 'on-policy' variant, which generates training data dynamically from the student model's own outputs rather than a fixed dataset, gained traction as a way to improve training efficiency and model alignment in the era of large language models.
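
As a rough illustration of what 'on-policy' means here, the toy sketch below has the student sample its own sequence and then scores each sampled token against the teacher. The vocabulary, the probabilities, and the function names are hypothetical, and the distributions are fixed rather than conditioned on the preceding tokens, which real language models would do.

```python
import math
import random

# Hypothetical next-token distributions over a tiny vocabulary.
VOCAB = ["yes", "no", "maybe"]
STUDENT = {"yes": 0.5, "no": 0.3, "maybe": 0.2}
TEACHER = {"yes": 0.7, "no": 0.2, "maybe": 0.1}

def sample_from_student(rng, length):
    """On-policy step: the student generates its own training sequence."""
    return rng.choices(VOCAB, weights=[STUDENT[t] for t in VOCAB], k=length)

def per_token_signal(tokens):
    """Teacher feedback on each sampled token: log p_student - log p_teacher."""
    return [math.log(STUDENT[t]) - math.log(TEACHER[t]) for t in tokens]

rng = random.Random(0)  # seeded so the rollout is reproducible
rollout = sample_from_student(rng, length=5)
signals = per_token_signal(rollout)

for token, signal in zip(rollout, signals):
    print(f"{token:>5}: {signal:+.3f}")
```

Averaged over many student samples, this per-token quantity estimates the reverse KL divergence between student and teacher. An off-policy setup would instead score tokens drawn from a fixed dataset or from the teacher.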

However, on-policy distillation has repeatedly surfaced a core tension: the teacher model's guidance can sometimes constrain rather than expand the student's capabilities. This is particularly acute in safety-oriented training, where constitutional AI approaches — pioneered by Anthropic and others — use written principles to shape model behaviour but can inadvertently narrow the range of acceptable responses.

The challenge of balancing safety with capability has grown more urgent as LLMs are deployed in commercial and public-sector contexts, where both unhelpful over-caution and genuinely unsafe outputs carry real consequences.

Key Perspectives

AI Safety Researchers: The COPSD findings are significant because they offer a concrete mechanism — geometric leakage — to explain why safety training degrades model expressiveness, moving the field from empirical observation toward a more principled understanding.

AI Efficiency and Deployment Engineers: FiRe-OPD's trajectory filtering and soft token reweighting address practical training instability, and its strong benchmark gains in multi-teacher settings suggest it could be useful in complex, real-world training pipelines where multiple model sources are combined.

Critics and Skeptics: Both papers rely heavily on benchmark performance, and critics may note that gains on datasets like AIME 2024 or Miner do not necessarily translate to real-world deployment scenarios. The 'safety tax' metric used in COPSD also requires further community scrutiny to determine whether it captures genuine safety trade-offs or reflects benchmark-specific artefacts.

What to Watch

  • Whether COPSD's 'geometric leakage' framework is independently validated by other research groups, which would strengthen its theoretical standing and influence future safety training methodology.
  • Community uptake of the FiRe-OPD codebase on GitHub — significant adoption or forks by major labs would signal practical endorsement beyond the paper's own results.
  • Whether upcoming LLM releases from major labs cite or incorporate these techniques, which would indicate industry-level validation of both approaches.

Sources

Zotpaper

Articles published under the Zotpaper byline are synthesized from multiple source publications by our AI editor and reviewed by our editorial process. Each story combines reporting from credible outlets to give readers a balanced, comprehensive view.