Training Techniques
Curriculum Learning
Train on easy examples first and gradually increase difficulty. Curriculum learning can speed convergence and improve generalization, but defining difficulty is the hard part. This page covers self-paced learning, anti-curriculum, and the connection to importance sampling.
Why This Matters
Humans learn arithmetic before calculus. We learn letters before words, words before sentences. The order in which examples are presented affects learning speed and final performance.
Curriculum learning applies this principle to machine learning: present training examples in order of increasing difficulty. Bengio et al. (2009) showed that this can both speed up convergence and lead to better local optima. The idea is simple. The difficulty is defining "easy" and implementing the schedule.
Formal Setup
Curriculum
A curriculum is a sequence of distributions $Q_1, Q_2, \dots, Q_T$ over training examples, where the support of $Q_t$ is typically a subset of the full training set and difficulty increases with $t$. The final distribution $Q_T$ is the uniform distribution over all training examples.
Formally, let $w_t(i) \ge 0$ be a weighting function over examples at step $t$. The curriculum defines a weighted empirical risk:

$$R_t(\theta) = \frac{\sum_{i=1}^{n} w_t(i)\,\ell(f_\theta(x_i), y_i)}{\sum_{i=1}^{n} w_t(i)}$$

A curriculum schedules $w_t$ so that easy examples have higher weight early in training.
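A minimal NumPy sketch of the weighted risk and one possible pacing schedule. The linear schedule and the 10% starting fraction are illustrative assumptions, not taken from any paper cited here; difficulty scores are defined in the next subsection.

```python
import numpy as np

def weighted_risk(losses, weights):
    """Weighted empirical risk: sum_i w_t(i) * loss_i / sum_i w_t(i)."""
    losses = np.asarray(losses, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(weights * losses) / np.sum(weights))

def linear_pacing_weights(difficulty, t, T, start_frac=0.1):
    """0/1 weights admitting the easiest examples first.

    At t = 0 only the easiest `start_frac` of examples carry weight;
    by t = T every example has weight 1, recovering the uniform risk.
    """
    frac = start_frac + (1.0 - start_frac) * min(t / T, 1.0)
    threshold = np.quantile(difficulty, frac)
    return (np.asarray(difficulty) <= threshold).astype(float)
```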
Difficulty Score
A difficulty score $d(x_i, y_i)$ assigns a scalar to each training example measuring how hard it is to learn. Common choices (a minimal scoring sketch follows the list):
- Loss-based: $d(x_i, y_i) = \ell(f_{\theta_0}(x_i), y_i)$, where $f_{\theta_0}$ is an initial or pretrained model
- Confidence-based: $d(x_i, y_i) = 1 - p_{\theta_0}(y_i \mid x_i)$, the model's lack of confidence in the true label
- Human-defined: annotator agreement, label noise estimates
- Data complexity: input length, number of objects in an image
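Hedged sketches of three of these scores. The predictions, class-probability array, and `loss_fn` are placeholders for whatever initial model you have; the length-based score is one simple data-complexity proxy.

```python
import numpy as np

def loss_based_difficulty(predictions, targets, loss_fn):
    """Per-example loss under an initial or pretrained model f_theta0."""
    return np.array([loss_fn(p, t) for p, t in zip(predictions, targets)])

def confidence_based_difficulty(probs, y):
    """1 - p_theta0(y_i | x_i): low confidence on the true label = hard.

    probs: (n, num_classes) predicted class probabilities
    y:     (n,) integer true labels
    """
    return 1.0 - probs[np.arange(len(y)), y]

def length_based_difficulty(texts):
    """Data-complexity proxy: longer inputs are treated as harder."""
    return np.array([len(t.split()) for t in texts], dtype=float)
```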
Why Curricula Can Help
Heuristic: Curriculum as Continuation Method
Bengio et al. (2009) sketches an informal analogy between curriculum learning and continuation (homotopy) methods for non-convex optimization. This is a heuristic argument, not a theorem. No assumption-to-conclusion guarantee is proved.
The analogy: consider a family of objectives $\{C_\lambda\}_{\lambda \in [0,1]}$, where $C_0$ is a smoothed version of the full loss using only easy examples and $C_1$ is the full training objective. Increase $\lambda$ from 0 to 1. If each $C_\lambda$ has a simpler landscape than $C_1$, and the minimizer of $C_\lambda$ lies in the basin of attraction of a good minimizer of the next objective in the sequence, tracking the trajectory of minimizers may land in a better region of $C_1$ than random initialization.
Two things to note. First, the continuous-trajectory claim is only suggestive: curricula in practice use discrete weight changes, stochastic optimization, and non-smooth losses, so homotopy continuation is a picture rather than a proof. Second, the assumption that easy-example basins contain full-loss optima is not generally true. If easy and hard examples require qualitatively different features, the curriculum can guide the model to a minimum that is good for easy examples but poor overall. Empirical results on curricula vary across tasks; treat the continuation-method framing as motivation, not as an existence result.
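The picture is easy to operationalize as a sketch. Here $C_\lambda$ is taken to be the convex blend $(1-\lambda)\,C_{\text{easy}} + \lambda\,C_{\text{full}}$, which is one concrete choice of the family; Bengio et al. (2009) leave it unspecified, so treat this as illustration only.

```python
import numpy as np

def continuation_descent(theta, grad_easy, grad_full,
                         n_stages=11, steps_per_stage=100, lr=0.01):
    """Gradient descent on C_lam = (1 - lam) * C_easy + lam * C_full,
    sweeping lam from 0 to 1 and warm-starting each stage from the last."""
    for lam in np.linspace(0.0, 1.0, n_stages):
        for _ in range(steps_per_stage):
            g = (1.0 - lam) * grad_easy(theta) + lam * grad_full(theta)
            theta = theta - lr * g
    return theta
```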
Self-Paced Learning
Kumar, Packer, and Koller (2010): instead of fixing the curriculum in advance, let the model decide what is easy. At each step, include examples that the current model can handle (low loss) and exclude examples it cannot.
Self-Paced Learning as Joint Optimization
Statement
Kumar, Packer, Koller (2010) define self-paced learning (binary variant) as

$$\min_{\theta,\; v \in \{0,1\}^n} \;\sum_{i=1}^{n} v_i\,\ell_i(\theta) \;-\; \lambda \sum_{i=1}^{n} v_i, \qquad \ell_i(\theta) = \ell(f_\theta(x_i), y_i),$$

where $v_i \in \{0,1\}$ are hard selection indicators and $\lambda > 0$ is a pace parameter. For fixed $\theta$, the closed form is $v_i^* = \mathbf{1}[\ell_i(\theta) \le \lambda]$. Examples with loss at or below $\lambda$ are included; the rest are excluded. Increasing $\lambda$ over training admits harder examples.
Jiang et al. (2014, 2015) propose soft variants that relax $v_i \in \{0,1\}$ to $v_i \in [0,1]$ and replace the linear reward $-\lambda \sum_i v_i$ with a self-paced regularizer $g(v; \lambda)$. Two common choices:
- Linear soft regularizer $g(v;\lambda) = \frac{\lambda}{2}\sum_i (v_i^2 - 2v_i)$ gives $v_i^* = \max(0,\, 1 - \ell_i/\lambda)$.
- Logarithmic regularizer $g(v;\lambda) = \sum_i \big(\zeta v_i - \zeta^{v_i}/\log\zeta\big)$ (with $\zeta = 1 - \lambda$, $0 < \lambda < 1$) yields a smoother weight $v_i^* = \log(\ell_i + \zeta)/\log\zeta$ for $\ell_i < \lambda$, and $0$ otherwise.
Soft variants interpolate between full inclusion and exclusion and track per-example confidence rather than a hard threshold.
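The three weighting schemes as functions of the per-example loss, following the closed forms stated above (the logarithmic one per Jiang et al. 2015):

```python
import numpy as np

def spl_binary(losses, lam):
    """Kumar et al. (2010): v_i = 1 if loss_i <= lam, else 0."""
    return (losses <= lam).astype(float)

def spl_linear(losses, lam):
    """Linear soft variant: v_i = max(0, 1 - loss_i / lam)."""
    return np.clip(1.0 - losses / lam, 0.0, 1.0)

def spl_log(losses, lam):
    """Logarithmic variant, zeta = 1 - lam with 0 < lam < 1:
    v_i = log(loss_i + zeta) / log(zeta) for loss_i < lam, else 0."""
    zeta = 1.0 - lam
    v = np.log(losses + zeta) / np.log(zeta)
    return np.where(losses < lam, np.clip(v, 0.0, 1.0), 0.0)
```

For example, with losses $(0.1, 0.4, 0.5, 0.9)$ and $\lambda = 0.5$, the binary rule includes the first three points with weight 1, while the linear rule assigns graded weights $0.8, 0.2, 0, 0$.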
Intuition
The term $-\lambda \sum_i v_i$ rewards inclusion (without it, $v = 0$ trivially minimizes the objective). In the binary case this yields a hard threshold at $\ell_i = \lambda$. In the soft case the optimal weight decreases smoothly from $1$ at $\ell_i = 0$ to $0$ at $\ell_i = \lambda$, giving graded confidence.
Proof Sketch
Binary case: for fixed $\theta$, the objective decomposes across $i$ as $\sum_i v_i(\ell_i - \lambda)$. On $v_i \in \{0,1\}$, the minimum is $v_i = 1$ when $\ell_i \le \lambda$ and $v_i = 0$ otherwise.
Linear soft case: for fixed $\theta$, minimize $v_i \ell_i + \frac{\lambda}{2}(v_i^2 - 2v_i)$ over $v_i \in [0,1]$. The first-order condition gives $v_i = 1 - \ell_i/\lambda$, clipped to $[0,1]$.
Why It Matters
Self-paced learning removes the need for external difficulty labels. The model's own loss serves as the difficulty measure. This makes curriculum learning applicable even when human difficulty labels are unavailable.
Failure Mode
Self-paced learning can ignore hard but important examples indefinitely if increases too slowly. It can also create a feedback loop: the model never trains on hard examples, so they always have high loss, so they are never included.
Anti-Curriculum: Hard Examples First
Bengio et al. (2009) found curricula generally help, but subsequent work showed exceptions. Presenting hard or diverse examples first can sometimes work better, especially when:
- Hard examples contain the most information about decision boundaries.
- Easy examples are redundant (the model learns them quickly regardless).
- The difficulty ordering is inaccurate.
This connects to importance sampling: upweighting examples with high loss (hard examples) reduces the variance of the gradient estimator. The optimal importance sampling distribution is proportional to the per-example gradient norm, which correlates with difficulty.
Connection to Importance Sampling
The gradient of the empirical risk is $\nabla R(\theta) = \frac{1}{n}\sum_{i=1}^{n} \nabla_\theta\, \ell_i(\theta)$. If we sample example $i$ with probability $p_i \propto \|\nabla_\theta\, \ell_i(\theta)\|$ instead of uniformly, and reweight each sampled gradient by $1/(n p_i)$ to keep the estimator unbiased, the variance of the stochastic gradient is minimized over all sampling distributions.
This is the opposite of curriculum learning: importance sampling upweights hard examples (high gradient norm) while curriculum learning downweights them. The resolution is that they solve different problems. Curriculum learning addresses non-convex optimization (finding good basins), while importance sampling addresses stochastic gradient variance (faster convergence within a basin).
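A sketch of gradient-norm importance sampling, assuming you can materialize per-example gradients as an (n, d) array; in practice Katharopoulos and Fleuret (2018) use an upper bound on the norm because exact per-example gradients are expensive.

```python
import numpy as np

def importance_sampled_gradient(per_example_grads, batch_size=32, seed=0):
    """Unbiased minibatch gradient with p_i proportional to ||grad_i||."""
    rng = np.random.default_rng(seed)
    grads = np.asarray(per_example_grads, dtype=float)
    norms = np.linalg.norm(grads, axis=1)
    p = norms / norms.sum()                      # optimal sampling distribution
    n = len(p)
    idx = rng.choice(n, size=batch_size, p=p)
    weights = 1.0 / (n * p[idx])                 # reweight to stay unbiased
    return (weights[:, None] * grads[idx]).mean(axis=0)
```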
Common Confusions
Curriculum learning is not the same as data augmentation
Data augmentation adds modified copies of examples. Curriculum learning changes the order or weighting of existing examples. They are complementary and can be combined.
Shuffling is not a curriculum
Standard training shuffles data randomly each epoch. A curriculum is a deliberate ordering. Random shuffling is the control against which curriculum benefits are measured.
Modern Directions
RHO-Loss (Mindermann et al. 2022). Prioritizes points that are learnable, worth learning, and not yet learnt. The selection score uses the reducible holdout loss: training loss minus a held-out irreducible loss estimate. This skips noisy and already-learned points and outperforms uniform sampling on large web-scale datasets.
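A sketch of the selection rule, assuming you already have per-point training losses and an irreducible-loss estimate from a model trained on held-out data; the estimation details of the full method are simplified away here.

```python
import numpy as np

def rho_loss_select(train_losses, irreducible_losses, k):
    """Pick the k points with the largest reducible holdout loss.

    reducible = training loss - irreducible loss. Noisy points have high
    irreducible loss; already-learned points have low training loss;
    both end up with low scores and are skipped.
    """
    reducible = np.asarray(train_losses) - np.asarray(irreducible_losses)
    return np.argsort(-reducible)[:k]
```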
Automated curricula (Graves et al. 2017). Treat task or data-slice selection as a multi-armed bandit where the reward is learning progress (loss decrease or complexity gain). A policy over curriculum tasks is learned online, avoiding a fixed hand-tuned schedule.
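A minimal sketch in the spirit of this idea: an EXP3-style bandit over tasks with loss decrease as the reward. Graves et al. use EXP3.S with several progress signals; `train_step` is a hypothetical callback that runs one update on a task and returns its learning progress.

```python
import numpy as np

def automated_curriculum(n_tasks, train_step, n_rounds=1000, eta=0.1, seed=0):
    """Sample tasks from a softmax over importance-weighted rewards."""
    rng = np.random.default_rng(seed)
    logits = np.zeros(n_tasks)
    for _ in range(n_rounds):
        p = np.exp(logits - logits.max())
        p /= p.sum()
        arm = rng.choice(n_tasks, p=p)
        reward = train_step(arm)                 # learning progress on this task
        logits[arm] += eta * reward / p[arm]     # importance-weighted update
    return p                                     # final task distribution
```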
Data Shapley (Ghorbani and Zou 2019). Assigns each training point a Shapley value that measures its marginal contribution to model performance. Negative-value points are noisy or harmful; removing them acts as an offline curriculum. Cost is exponential in the number of points without approximation.
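A Monte Carlo sketch of the Shapley estimate over random permutations, assuming a `utility(indices)` callable that trains on the given subset and returns validation performance; Ghorbani and Zou add truncation and other approximations to make this tractable.

```python
import numpy as np

def mc_data_shapley(n_points, utility, n_perms=100, seed=0):
    """Average marginal contribution of each point over random orderings."""
    rng = np.random.default_rng(seed)
    values = np.zeros(n_points)
    for _ in range(n_perms):
        perm = rng.permutation(n_points)
        prev = utility(perm[:0])                 # utility of the empty set
        for j in range(n_points):
            cur = utility(perm[:j + 1])
            values[perm[j]] += cur - prev        # marginal contribution
            prev = cur
    return values / n_perms
```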
LLM pretraining curricula. Modern pretraining uses data-mix curricula rather than per-example ordering. Source weights, quality filters, and annealing schedules shape what the model sees. DeepSeek-V3 details multi-stage data mixes; Phi models (Gunasekar et al. 2023) train on small synthetic "textbook" data; DataComp-LM (Li et al. 2024) benchmarks curation strategies. These treat the full pretraining corpus as the unit of curriculum design.
Focal loss and hard-example mining. Focal loss (Lin et al. 2017) uses the modulating factor $(1 - p_t)^\gamma$ to down-weight easy, high-confidence examples, which is an inverse-curriculum principle: train more on hard points, not easy ones. Hard negative mining in contrastive learning (Robinson et al. 2021) similarly reweights toward hard negatives. The inverse-curriculum intuition aligns with the importance sampling view discussed above; a minimal sketch of the loss follows.
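A sketch of the per-example loss; $\gamma = 2$ is the paper's default, and the $\alpha$ class-balancing term from the paper is omitted here.

```python
import numpy as np

def focal_loss(probs, y, gamma=2.0, eps=1e-12):
    """Focal loss: -(1 - p_t)^gamma * log(p_t), per example.

    probs: (n, num_classes) predicted probabilities; y: (n,) true labels.
    gamma = 0 recovers plain cross-entropy.
    """
    p_t = probs[np.arange(len(y)), y]
    return -((1.0 - p_t) ** gamma) * np.log(p_t + eps)
```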
Why Curriculum Learning Is Underused
Despite positive results, curriculum learning is rare in practice because:
- Difficulty is hard to define: loss-based difficulty changes during training, human annotations are expensive, data complexity metrics are domain-specific.
- Hyperparameter sensitivity: the pace schedule (how fast to add harder examples) requires tuning.
- Modern regularization works well: dropout, data augmentation, and large batch training often provide sufficient generalization without curricula.
- Conflicting evidence: results vary across tasks and architectures, with no clear consensus on when curricula help.
Exercises
Problem
In self-paced learning (binary variant) with pace parameter $\lambda$, which examples are included given per-example losses $\ell_1, \dots, \ell_n$? State the selection rule.
Problem
Explain why curriculum learning and importance sampling give opposite prescriptions for example weighting. Under what conditions would you prefer each approach?
References
Canonical:
- Bengio, Louradour, Collobert, Weston, "Curriculum Learning", ICML 2009 (Section 4 for continuation-method analogy, informal)
- Kumar, Packer, Koller, "Self-Paced Learning for Latent Variable Models", NeurIPS 2010 (binary variant, Equation 4)
- Jiang, Meng, Zhao, Shan, Hauptmann, "Self-Paced Learning with Diversity", NeurIPS 2014 (soft variant with diversity regularizer)
- Jiang, Zhou, Leung, Li, Hauptmann, "Self-Paced Curriculum Learning", AAAI 2015 (general self-paced regularizer family)
Modern methods:
- Mindermann, Brauner, Razzak, Sharma, Kirsch, Xu, Holtgen, Gomez, Morisot, Farquhar, Gal, "Prioritized Training on Points that are Learnable, Worth Learning, and Not Yet Learnt" (RHO-Loss), ICML 2022
- Graves, Bellemare, Menick, Munos, Kavukcuoglu, "Automated Curriculum Learning for Neural Networks", ICML 2017
- Ghorbani, Zou, "Data Shapley: Equitable Valuation of Data for Machine Learning", ICML 2019
- Lin, Goyal, Girshick, He, Dollar, "Focal Loss for Dense Object Detection", ICCV 2017 (inverse-curriculum via weighting)
- Robinson, Chuang, Sra, Jegelka, "Contrastive Learning with Hard Negative Samples", ICLR 2021
LLM pretraining curricula:
- Gunasekar et al., "Textbooks Are All You Need" (Phi-1), 2023
- Li et al., "DataComp-LM: In Search of the Next Generation of Training Sets for Language Models", 2024
- DeepSeek-AI, "DeepSeek-V3 Technical Report", 2024 (data-mix annealing, Section 3)
Surveys:
- Soviany, Ionescu, Rota, Sebe, "Curriculum Learning: A Survey", IJCV 2022
- Katharopoulos, Fleuret, "Not All Samples Are Created Equal: Deep Learning with Importance Sampling", ICML 2018
Last reviewed: April 18, 2026