Language Models Learn Skills in a Predictable, Compositional Order During Training

Across four model families and varying sizes, the authors find that capabilities emerge in a strikingly consistent sequence, with composite skills appearing after their component parts.

Emmy Liu · Kaiser Sun · Millicent Li · Isabelle Lee · Lindia Tjuatja · Jen-tse Huang · +1 more
Research Digest · 3 min read
Liu et al. · AI-generated illustration · Zotpaper
Liu et al. propose and test the "Implicit Curriculum Hypothesis": that models acquire skills during pretraining in a predictable, compositional order rather than an arbitrary one. By tracking when specific capabilities emerge across four model families (410M–13B parameters), they find highly consistent orderings (Spearman ρ = .81 across 45 model pairs) and show that composite tasks reliably emerge after their constituent subtasks.

What they did

The authors designed a suite of simple, composable tasks spanning retrieval, morphological transformations, coreference resolution, logical reasoning, and mathematics. These tasks were deliberately chosen to be compositional — meaning complex tasks could be decomposed into identifiable component skills. They then tracked "emergence points" (when models reach fixed accuracy thresholds) across checkpoints during pretraining for four model families ranging from 410M to 13B parameters, encompassing different architectures and data mixtures.
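
To make the emergence-point measurement concrete, here is a minimal sketch in Python. The 0.5 threshold, the checkpoint spacing, and the `emergence_point` helper are illustrative assumptions, not the paper's exact protocol.

```python
# Minimal sketch of an "emergence point": the first checkpoint at which
# task accuracy crosses a fixed threshold. The 0.5 threshold and the
# checkpoint spacing are illustrative, not the paper's exact protocol.

from typing import Optional, Sequence

def emergence_point(
    steps: Sequence[int],          # training step of each saved checkpoint
    accuracies: Sequence[float],   # task accuracy at each checkpoint
    threshold: float = 0.5,        # hypothetical fixed accuracy threshold
) -> Optional[int]:
    """Return the first training step where accuracy reaches the threshold."""
    for step, acc in zip(steps, accuracies):
        if acc >= threshold:
            return step
    return None  # the skill never emerged within the evaluated checkpoints

# Toy example: accuracy rises through training; emergence at step 3000.
print(emergence_point([1000, 2000, 3000, 4000], [0.12, 0.31, 0.58, 0.74]))
```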

Beyond behavioral evaluation, they also analyzed internal model representations using "function vectors" — compact representations of how a model encodes a particular task — to see whether the structure of skill acquisition was readable from model internals.
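
For intuition, here is a rough sketch of how a per-task function vector might be computed and compared. The `encode_last_token` stub is a hypothetical stand-in: in the paper's setting the vector would come from the model's internal activations, not a hash-seeded pseudo-embedding.

```python
# Rough sketch of a per-task "function vector": average an internal
# activation at the final prompt token over several few-shot prompts for
# the task, then compare tasks by cosine similarity. `encode_last_token`
# is a hypothetical stand-in for reading a real model's hidden state.

import hashlib
import numpy as np

def encode_last_token(prompt: str, dim: int = 64) -> np.ndarray:
    """Stand-in pseudo-activation; a real implementation would read a
    hidden-layer activation from the model at the last prompt token."""
    seed = int(hashlib.md5(prompt.encode()).hexdigest()[:8], 16)
    return np.random.default_rng(seed).standard_normal(dim)

def function_vector(task_prompts: list[str]) -> np.ndarray:
    """Mean final-token activation across few-shot prompts for one task."""
    return np.mean([encode_last_token(p) for p in task_prompts], axis=0)

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

fv_plural = function_vector(["cat -> cats", "dog -> dogs"])
fv_past = function_vector(["walk -> walked", "jump -> jumped"])
print(cosine(fv_plural, fv_past))  # similarity between two task encodings
```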

Key findings

  • Emergence orderings are highly consistent across models: Spearman rank correlation of ρ = .81 across 45 model pairs, spanning different sizes, families, and data mixtures (see the consistency sketch after this list).
  • Composite tasks most often emerge after their component subtasks, supporting the compositional structure of the implicit curriculum.
  • Tasks with similar function vector representations follow similar learning trajectories during training, suggesting the curriculum is encoded in the model's internal geometry.
  • The representational structure of the task suite can predict training trajectories of held-out compositional tasks with R² = .68–.84, without ever having evaluated those tasks during training.
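
On the first finding: 45 is exactly the number of unordered pairings of ten models (10 × 9 / 2), so the statistic averages over every pair of evaluated models. A toy sketch of that consistency computation, with fabricated emergence ranks standing in for the paper's data:

```python
# Toy sketch of the cross-model consistency check: compute Spearman's
# rank correlation of emergence orderings for every model pair, then
# average. The emergence ranks below are fabricated, not the paper's data.

from itertools import combinations
from scipy.stats import spearmanr

# Hypothetical rank at which each of six tasks emerged, per model.
emergence_ranks = {
    "model_a": [1, 2, 3, 4, 5, 6],
    "model_b": [1, 3, 2, 4, 5, 6],
    "model_c": [2, 1, 3, 5, 4, 6],
}

pair_rhos = []
for m1, m2 in combinations(emergence_ranks, 2):
    rho, _ = spearmanr(emergence_ranks[m1], emergence_ranks[m2])
    pair_rhos.append(rho)

print(sum(pair_rhos) / len(pair_rhos))  # mean pairwise Spearman rho
```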

Why it matters

Scaling laws tell us that loss decreases predictably with compute, but they say little about which capabilities a model has at any given point. This work provides evidence that capability acquisition is far more structured than aggregate loss suggests. If confirmed at larger scales, this could inform curriculum design for pretraining, enable more efficient evaluation strategies (predicting capabilities without exhaustive benchmarking), and deepen theoretical understanding of how compositional knowledge is built up in neural networks.

Caveats

The task suite, while carefully designed, consists of relatively simple and synthetic compositional tasks. It remains unclear how well these findings generalize to the messy, entangled capabilities required by real-world benchmarks. The largest model studied is 13B parameters — far smaller than frontier models — and the authors acknowledge that emergence patterns could shift at larger scales or with different training regimes. The R² values for trajectory prediction, while respectable, leave meaningful variance unexplained, and the function vector approach may not capture all relevant aspects of how models encode skills.


Analysis

This paper connects to a growing body of work on understanding training dynamics beyond aggregate loss, including prior studies on phase transitions, grokking, and emergent abilities. The compositional framing is particularly interesting: rather than treating emergence as a mysterious threshold phenomenon, the authors decompose it into structured, predictable sequences. The finding that internal representations encode curriculum structure echoes work on probing and mechanistic interpretability, but extends it to the temporal dimension of training. A key open question is whether the implicit curriculum is primarily a property of the data distribution (frequent patterns learned first), the architecture, or some interaction between the two. The predictive framework — using function vectors to forecast learning trajectories for unseen tasks — is a practical contribution that could scale to more complex evaluation settings if the methodology proves robust.
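
To give a mechanical sense of what such a predictive framework might look like, here is a toy ridge-regression sketch: fit a linear map from function vectors to full accuracy trajectories on seen tasks, then forecast a held-out task's curve from its function vector alone. All data, dimensions, and the closed-form estimator are illustrative assumptions, not the authors' exact method.

```python
# Toy ridge-regression sketch of trajectory forecasting: fit a linear map
# from function vectors to full accuracy curves on seen tasks, then predict
# a held-out task's curve from its function vector alone. All data and the
# estimator are illustrative assumptions, not the authors' exact method.

import numpy as np

rng = np.random.default_rng(0)
n_seen, n_ckpts, dim = 20, 8, 16

# Hypothetical training data: one function vector per seen task, plus its
# accuracy at each of n_ckpts checkpoints.
fvs = rng.standard_normal((n_seen, dim))
true_map = rng.standard_normal((dim, n_ckpts))
trajectories = fvs @ true_map + 0.05 * rng.standard_normal((n_seen, n_ckpts))

# Closed-form ridge regression: W = (X^T X + lam * I)^-1 X^T Y.
lam = 1.0
W = np.linalg.solve(fvs.T @ fvs + lam * np.eye(dim), fvs.T @ trajectories)

# Forecast the learning curve of an unseen task from its function vector.
held_out_fv = rng.standard_normal(dim)
print((held_out_fv @ W).round(2))  # predicted accuracy at each checkpoint
```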
