LLM Construction
Scaling Laws
Power-law scaling of LLM loss in parameters, data, and compute: Kaplan, Chinchilla, the Muennighoff data-constrained law for repetition, the Schaeffer metric-induced-emergence proposition, MoE and muP extensions, and the test-time compute axis.
Scaling Law Board
Smooth power law, expensive mistakes: the interactive board plots the clean compute-loss frontier, where loss falls predictably with compute, against a run that is compute-heavy but data-poor and therefore lands above the frontier (about 3.01 nats versus the 2.85 the fitted law predicts at the same budget; the penalty is exaggerated for illustration). A straight line on log-log axes means one exponent controls the trend.
What to learn
Scaling laws are not magic forecasts. They assume the experiment stays on the same recipe frontier.
Scaling laws are empirical power-law fits that relate a model's loss to the number of parameters $N$, the training tokens $D$, and the total compute $C$. They are regressions, not physical laws: they hold within the compute regime where they were fit, and they can break under architectural changes, data repetition, or distribution shift.
Within their regime they are useful. They have guided training decisions worth billions of dollars: how large to make a model, how much data to collect, and how to allocate a fixed compute budget between model size and training duration. They connect directly to compute-optimal training and to the transformer architecture that all modern LLMs share. The Kaplan et al. paper breakdown walks through the original 2020 power-law fits and the Chinchilla correction.
Mental Model
Imagine you have a fixed compute budget measured in FLOPs. You must decide: train a large model for fewer steps, or a smaller model for more steps? Scaling laws answer this question precisely. They tell you that loss decreases as a power law in each resource, and that there is an optimal way to split your budget between model size and data.
The key surprise: loss follows smooth, predictable power laws over many orders of magnitude. A model with 10x more parameters trained on the same data will have predictably lower loss. This predictability is what makes scaling laws practically useful: one can extrapolate from small experiments to forecast the performance of much larger runs.
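The extrapolation workflow is simple enough to sketch. The snippet below fits a single power law to a handful of pilot runs and extrapolates it; the compute and loss values are made up for illustration (generated from an exponent near Kaplan's compute exponent), not measured data.

```python
# A minimal sketch: fit a power law L = a * C**(-alpha) to pilot runs and
# extrapolate. The compute/loss points below are hypothetical.
import numpy as np

compute = np.array([1e17, 1e18, 1e19, 1e20])   # FLOPs of hypothetical pilot runs
loss = np.array([3.90, 3.47, 3.10, 2.76])      # hypothetical validation losses

# A power law is a straight line in log-log space: log L = log a - alpha * log C
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
alpha, a = -slope, np.exp(intercept)
print(f"fitted exponent alpha ~ {alpha:.3f}")

predicted = a * 1e22 ** (-alpha)               # extrapolate 100x past the largest pilot
print(f"predicted loss at 1e22 FLOPs ~ {predicted:.2f}")
```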
The Kaplan Scaling Laws (2020)
Kaplan Power-Law Scaling
Kaplan et al. (2020) empirically observed that cross-entropy loss on language modeling scales as power laws in parameters $N$, data $D$, and compute $C$, when each is varied independently with the others held sufficient:

$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}$$

where $N_c$, $D_c$, $C_c$ are scaling constants and the exponents are approximately:
- $\alpha_N \approx 0.076$ (loss vs parameters)
- $\alpha_D \approx 0.095$ (loss vs data)
- $\alpha_C \approx 0.050$ (loss vs compute)
These power laws hold over at least 7 orders of magnitude in compute.
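As a concrete reading of the isolated laws, the sketch below evaluates the parameter and data power laws with the constants reported in Kaplan et al. (2020) ($N_c \approx 8.8 \times 10^{13}$ non-embedding parameters, $D_c \approx 5.4 \times 10^{13}$ tokens). These constants are specific to their corpus and tokenizer, so the absolute losses are illustrative only.

```python
# Isolated Kaplan power laws with the constants reported in the 2020 paper.
# Corpus- and tokenizer-specific; treat outputs as illustrative.

def loss_vs_params(n, n_c=8.8e13, alpha_n=0.076):
    """Loss when data is not the bottleneck."""
    return (n_c / n) ** alpha_n

def loss_vs_tokens(d, d_c=5.4e13, alpha_d=0.095):
    """Loss when model capacity is not the bottleneck."""
    return (d_c / d) ** alpha_d

for n in (1e8, 1e9, 1e10, 1e11):
    print(f"N = {n:.0e}  ->  L(N) ~ {loss_vs_params(n):.2f} nats")
print(f"D = 1e11   ->  L(D) ~ {loss_vs_tokens(1e11):.2f} nats")
```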
Power-Law Scaling of Language Model Loss
Statement
When parameters and data are both potentially limiting, the loss follows an approximate decomposition:

$$L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\alpha_N/\alpha_D} + \frac{D_c}{D}\right]^{\alpha_D}$$
for fitted constants. In the regime where one factor dominates, this reduces to the individual power laws above.
Reading the exponents directly: a larger exponent means loss falls faster as that resource grows. Since $\alpha_D > \alpha_N$, in the isolated power laws loss is more sensitive to data than to model size. Kaplan's recommendation to favor model size at a fixed compute budget came from the joint compute-allocation analysis (using $C \approx 6ND$ and the joint fit for $L(N, D)$), not from the individual exponents. Chinchilla later revisited that allocation analysis and reached the opposite conclusion.
Intuition
A power law means that each 10x increase in a resource gives a fixed percentage reduction in loss, and the exponent determines how large that percentage is. With $\alpha_D > \alpha_N$, a 10x increase in data reduces loss more than a 10x increase in parameters does, in the regime where the other factor is not the bottleneck. The compute-allocation story is separate: at fixed $C$, spending a unit of compute on more parameters vs more tokens depends on the joint fit, and that is the analysis Kaplan used to recommend scaling $N$ faster than $D$.
Why It Matters
The Kaplan scaling laws directly influenced the training of GPT-3 (175B parameters trained on 300B tokens). The recommendation to favor large models with moderate data was the dominant paradigm from 2020 to 2022. It was overturned by the Chinchilla analysis.
Failure Mode
The Kaplan analysis trained all models for a relatively small number of tokens and extrapolated. Crucially, it did not train smaller models to full convergence—biasing the analysis toward large models. The Chinchilla paper corrected this methodological issue and reached the opposite conclusion about optimal allocation.
Chinchilla Scaling (2022)
Chinchilla-Optimal Compute Allocation
Statement
Hoffmann et al. (2022) showed that for a fixed compute budget $C$, the optimal number of parameters $N^*$ and training tokens $D^*$ both scale approximately as the square root of compute:

$$N^* \propto C^{0.5}, \qquad D^* \propto C^{0.5}$$

with $D = C/(6N)$ from the FLOP constraint $C \approx 6ND$. Hoffmann et al.'s parametric fit gives $N^* \propto C^{0.46}$ and $D^* \propto C^{0.54}$ (their Approach 3); the two exponents are close but not equal, so $N$ and $D$ scale together with compute but at slightly different rates rather than being strictly proportional.
Parameters and training tokens should scale roughly together with compute. At Chinchilla's compute budget, the empirical compute-optimal allocation is about 20 tokens per parameter. Because the data exponent is slightly larger, this ratio drifts upward slowly with compute (as roughly $C^{0.08}$) rather than staying fixed. The "doubling parameters means doubling tokens" slogan is the idealization, accurate to leading order but not exact.
The "20 tokens per parameter" figure is an empirical artifact of those specific exponents on Hoffmann et al.'s corpus, not a universal constant. Refits on different model families and corpora yield different ratios, and most modern open models intentionally over-train relative to this ratio because inference cost is dominated by alone.
Intuition
Kaplan said: make the model as large as possible. Chinchilla said: balance model size and data. The difference comes from how you define "optimal." If you fix compute and ask "what achieves the lowest loss?", Chinchilla shows that an overparameterized, undertrained model wastes compute. Training a smaller model on more data reaches the same loss with the same compute budget.
Proof Sketch
The analysis fits a parametric loss function:

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

where $E$ is the irreducible entropy of natural language. Subject to the constraint $C = 6ND$ (total compute), minimize over $N$ and $D$.
Using Lagrange multipliers: at the optimum, $\alpha A N^{-\alpha} = \beta B D^{-\beta}$, which gives $N^* \propto C^{\beta/(\alpha+\beta)}$ and $D^* \propto C^{\alpha/(\alpha+\beta)}$. Hoffmann et al. fit $\alpha \approx 0.34$ and $\beta \approx 0.28$ (close, but not equal). Because the two exponents are nearly equal, the optimum scales as $N^* \propto C^{0.46}$ and $D^* \propto C^{0.54}$, with both exponents close to $0.5$. The exact exponents depend on the ratio of $\alpha$ to $\beta$; the widely-quoted $\sqrt{C}$ result is what you get in the $\alpha = \beta$ limit.
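A quick numeric check of the allocation exponents, assuming the Hoffmann et al. fitted $\alpha$ and $\beta$:

```python
# The allocation exponents depend only on alpha and beta, not on A, B, or E.
alpha, beta = 0.34, 0.28                 # Hoffmann et al. parametric fit
a = beta / (alpha + beta)                # N* ~ C^a
b = alpha / (alpha + beta)               # D* ~ C^b
print(f"N* scales as C^{a:.2f}, D* scales as C^{b:.2f}, a + b = {a + b:.1f}")
# Both exponents are close to 0.5: doubling compute means roughly sqrt(2) more
# parameters and sqrt(2) more tokens.
```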
Why It Matters
Chinchilla (70B parameters, 1.4T tokens) matched the performance of Gopher (280B parameters, 300B tokens) with 4x fewer parameters and the same compute budget. This result reshaped the industry: Llama, DeepSeek, Mistral, and most subsequent models are "Chinchilla-optimal" or even "over-trained" (using more data than Chinchilla-optimal for a given size, because inference cost depends on while training cost depends on ).
Failure Mode
Chinchilla-optimal minimizes loss per FLOP of training. But inference cost depends on , not . If you plan to serve a model to millions of users, it is cheaper to over-train a small model (use more data than Chinchilla recommends) than to serve a Chinchilla-optimal larger model. Llama 3 8B was trained on 15T tokens—roughly 1875 tokens per parameter, nearly 100x the Chinchilla ratio—because the inference savings from a smaller model outweigh the training compute cost.
The Compute Constraint: $C \approx 6ND$
Training Compute Estimate
For a decoder-only transformer with $N$ parameters, one forward pass on one token requires approximately $2N$ FLOPs (one multiply-add per parameter). The backward pass requires approximately $4N$ FLOPs (roughly 2x the forward pass). Training on $D$ tokens requires:

$$C \approx 6ND \ \text{FLOPs}$$

This is the "6ND rule." It ignores attention cost (which grows with context length, per layer) but is accurate within a factor of 2 for typical model sizes and context lengths.
This formula is the bridge between the abstract scaling laws and concrete training decisions. Given a GPU cluster with a known FLOP budget, you can directly compute the Chinchilla-optimal $N$ and $D$.
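A back-of-the-envelope planner based on the 6ND rule, with the tokens-per-parameter ratio as a free input (defaulting to the Chinchilla 20):

```python
import math

def plan_run(flop_budget, tokens_per_param=20.0):
    """Solve C = 6 * N * D with D = ratio * N for the model and data sizes."""
    n_params = math.sqrt(flop_budget / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

for budget in (1e21, 1e23, 1e25):        # hypothetical FLOP budgets
    n, d = plan_run(budget)
    print(f"C = {budget:.0e}: N ~ {n:.1e} params, D ~ {d:.1e} tokens")
```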
Emergent Abilities at Scale
Emergent Abilities
An ability is described as emergent when it is absent in smaller models but appears in larger models. Wei et al. (2022) documented several tasks where model performance was near random below a critical scale and then sharply improved:
- Few-shot arithmetic: near zero below ~10B parameters, then rapid improvement
- Multi-step reasoning: absent in small models, present in large ones
- Code generation: qualitative jump at sufficient scale
The claim: some capabilities emerge discontinuously as a function of scale, rather than improving smoothly.
Schaeffer, Miranda, and Koyejo (2023) argued that the apparent discontinuity is a property of the evaluation metric, not the model. Their argument is sharp and worth stating as a proposition.
Metric-Induced Emergence (Schaeffer-Miranda-Koyejo)
Statement
Let $p(s)$ be the per-token probability of producing the correct next token, viewed as a smooth function of model scale $s$ (for example, improving as a power law in $s$). The probability of an exact-match correct answer on a length-$L$ sequence is approximately

$$P(\text{exact match}) \approx p(s)^L$$

For any fixed $L > 1$ this curve has an inflection point that becomes arbitrarily sharp as $L$ grows. With $L$ on the order of tens of tokens (typical for arithmetic and multi-hop reasoning), the curve transitions from near 0 to near 1 within roughly one order of magnitude in scale, even though $p(s)$ itself varies smoothly across many orders of magnitude.
If the same data is rescored with a continuous metric (token edit distance, Brier score, log-probability of the gold answer), the apparent discontinuity disappears and the curve recovers the smooth scaling of $p(s)$.
Empirically, Schaeffer et al. (2023) showed that on the Big-Bench tasks exhibiting "emergence" under exact match, switching to token edit distance or Brier score eliminated the sharp transition. They reproduced the mechanism on a controlled InstructGPT integer arithmetic experiment and even induced apparent emergence on a vision benchmark by changing only the metric.
Intuition
A long-string exact-match metric is a many-to-one nonlinearity: scoring $p^L$ with large $L$ acts roughly like the indicator $\mathbf{1}[p > \tau]$ for some threshold $\tau$ near 1, which in expectation behaves like a step function of $p$. Smooth improvements in $p$ get compressed into a narrow band of values. The model is not undergoing a phase transition; the metric is.
Proof Sketch
For independent per-token correctness with probability $p$, the sequence-level accuracy is $\mathrm{Acc}(p) = p^L$. Differentiating, $\frac{d}{dp} p^L = L\,p^{L-1}$, which peaks near the inflection and grows linearly with $L$. So the slope of the accuracy curve in $\log p$ is $L\,p^L$, scaling with $L$. For large $L$ the slope concentrates at $p$ near 1, producing the visual impression of a sudden jump. The controlled experiment in Schaeffer et al. (2023) is the InstructGPT family on multi-digit integer arithmetic: per-token log-prob of the gold digits scales smoothly, sequence-level accuracy shows the canonical "emergent" curve, and Brier-score rescoring recovers the smooth trend.
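The mechanism is easy to reproduce numerically. The sketch below uses a made-up smooth per-token accuracy curve $p(s)$ and scores it under exact match at several answer lengths; only the shape matters, not the specific function.

```python
import numpy as np

scales = np.logspace(0, 6, 7)            # arbitrary "model scale" axis, 1 to 1e6
p = 1.0 - 0.5 * scales ** -0.5           # made-up smooth per-token accuracy

for L in (1, 10, 30):                    # answer length in tokens
    exact_match = p ** L                 # independent-token approximation
    row = " ".join(f"{a:.2f}" for a in exact_match)
    print(f"L = {L:>2}: {row}")
# L = 1 rises gently across the whole range; L = 30 sits near 0 and then jumps
# toward 1 over a narrow slice of the scale axis, even though p is smooth.
```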
Why It Matters
The proposition does not say emergent abilities never happen. It says the default evaluation pipeline manufactures the appearance of emergence even when the underlying capability is improving smoothly. Two consequences for practice. (i) Forecasting: pretraining loss extrapolates well, but exact-match downstream benchmarks do not, even when the underlying capability is following the same smooth trend. (ii) Safety: a narrative of "sharp, unpredictable capability jumps" should not be supported by evidence that is generated entirely by a discontinuous metric.
Failure Mode
The model is an idealization. (a) Per-token errors are not independent: a wrong token early in a chain biases later tokens, so the true sequence accuracy deviates from $p^L$. (b) Some tasks have a genuine threshold structure not captured by independent-token accuracy (e.g., a model that either knows an algorithm or does not). For these, smooth metrics may also show a visible kink. (c) The proposition rules out one explanation for emergence; it does not prove the negation. Whether a given claimed emergent ability survives a continuous-metric audit is a per-task empirical question.
The empirical picture after Schaeffer et al.: on BIG-Bench, the majority of tasks formerly cited as exhibiting emergence become smooth under continuous metrics. A residual fraction (some compositional reasoning tasks in particular) retains a kink even under smooth metrics. The "are emergent abilities real?" question splits cleanly: metric-induced emergence is widespread and well-explained; capability-driven threshold behavior exists but is narrower than originally claimed.
The Decomposed Scaling Law
Decomposed Scaling Law (functional form)
Assume a transformer language model trained near convergence, with compute allocated compute-optimally per Hoffmann et al. (2022). The cross-entropy loss is modeled as:

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

where:
- $N$ is the parameter count, $D$ is the training token count.
- $E$ is a loss floor for the given data distribution and tokenizer: the loss of an infinite model trained on infinite data.
- $A$, $B$, $\alpha$, and $\beta$ are fitted constants.
- $A/N^{\alpha}$ captures limited model capacity.
- $B/D^{\beta}$ captures limited dataset information.
As $N \to \infty$ or $D \to \infty$, the corresponding term vanishes, but the loss cannot go below $E$.
Empirical fits (Chinchilla, Kaplan)
Hoffmann et al. (2022) fit $E \approx 1.69$ nats, $A \approx 406.4$, $B \approx 410.7$, $\alpha \approx 0.34$, $\beta \approx 0.28$ on their corpus. Kaplan et al. (2020) report a different irreducible floor and different exponents on their WebText2-style corpus. These values are observations, not universal constants: they vary with data distribution, language, tokenization, architecture, and fitting procedure. Treat them as calibrated priors for a given setup, not as fundamental parameters of nature.
Even so, the decomposition is practically predictive. Once $E$, $A$, $B$, $\alpha$, and $\beta$ are fit on small-scale runs, the extrapolated loss of much larger runs (Chinchilla, Llama, GPT-4 class) has matched measured loss within a few percent. This is what justifies the capital expenditure on large training runs.
The predictive power is restricted to pretraining loss. Downstream task accuracy can be nonlinear in loss: small loss improvements may produce large capability jumps or none at all.
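A minimal evaluation of the decomposed law with the Hoffmann et al. fitted constants, reproducing the Chinchilla-versus-Gopher comparison from the previous section; the absolute values are corpus-specific and illustrative.

```python
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28   # Hoffmann et al. fit

def predicted_loss(n_params, n_tokens):
    return E + A / n_params**alpha + B / n_tokens**beta

for name, n, d in [("Gopher", 280e9, 300e9), ("Chinchilla", 70e9, 1.4e12)]:
    print(f"{name:>10}: C ~ {6 * n * d:.1e} FLOPs, predicted loss ~ {predicted_loss(n, d):.3f}")
# The fit predicts the 70B/1.4T configuration beats the 280B/300B configuration
# at roughly the same training compute, matching the published comparison.
```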
Compute-Optimal Allocation under the Decomposed Form
Statement
Minimize $L(N, D) = E + A N^{-\alpha} + B D^{-\beta}$ subject to the compute constraint $C = 6ND$. The optimal allocation satisfies:

$$N^* \propto C^{\frac{\beta}{\alpha+\beta}}, \qquad D^* \propto C^{\frac{\alpha}{\alpha+\beta}}$$

The two exponents sum to 1, consistent with $N^* D^* \propto C$.
Intuition
Each term $A N^{-\alpha}$ and $B D^{-\beta}$ is convex in its variable. The constraint $6ND = C$ is a hyperbola in the $(N, D)$ plane. The minimum is where the marginal loss reduction per unit compute is equal across $N$ and $D$. Allocating more compute to the resource with the faster-decaying term gives diminishing returns sooner, so the optimum balances the two exponents.
Proof Sketch
Form the Lagrangian $\mathcal{L} = E + A N^{-\alpha} + B D^{-\beta} + \lambda (6ND - C)$. First-order conditions:

$$-\alpha A N^{-\alpha - 1} + 6\lambda D = 0, \qquad -\beta B D^{-\beta - 1} + 6\lambda N = 0$$

Eliminating $\lambda$ gives $\alpha A N^{-\alpha} = \beta B D^{-\beta}$, so $D \propto N^{\alpha/\beta}$. Combined with $6ND = C$, solve for $N$:

$$N^* \propto C^{\frac{\beta}{\alpha+\beta}}$$

and symmetrically $D^* \propto C^{\frac{\alpha}{\alpha+\beta}}$.
Why It Matters
This is the provable skeleton behind Chinchilla-style allocation rules. Whatever the fitted $\alpha$ and $\beta$, the Lagrangian argument fixes the functional form of optimal allocation. Only the specific exponent ratio depends on the empirical fit.
Failure Mode
The result assumes the functional form holds globally. If the true loss deviates (data repetition, data-constrained regimes, curriculum effects, architectural phase transitions), the allocation prescription breaks. The FLOP estimate also ignores attention cost, which grows with context length and can dominate in long-context training.
Data-Constrained Scaling
Chinchilla and the decomposed law assume each training token is unique. Frontier runs have moved past the easy-to-curate web: Llama 3 trained on 15T tokens, Qwen and DeepSeek on similar scales, and the supply of high-quality unique text in major languages is finite. The relevant question becomes: when you must repeat data, how much does it cost?
Muennighoff et al. (2023) ran the experiment (400 model variants, up to 9B parameters, 900B training tokens) and fit a clean modification of the Chinchilla form.
Data-Constrained Scaling Law (Muennighoff)
Statement
Let $U_D$ denote the unique tokens in the corpus and $U_N$ the compute-optimal Chinchilla parameter count for those tokens. With $R_D$ repetitions (epochs beyond the first) and $R_N$ excess parameters, the loss is

$$L = E + \frac{A}{(N')^{\alpha}} + \frac{B}{(D')^{\beta}}$$

with effective counts that exponentially saturate:

$$D' = U_D + U_D R_D^* \left(1 - e^{-R_D / R_D^*}\right), \qquad N' = U_N + U_N R_N^* \left(1 - e^{-R_N / R_N^*}\right)$$

The fitted decay constant is $R_D^* \approx 15$, so each additional epoch of repetition contributes geometrically less, and the marginal value of repeats has largely decayed after a few tens of epochs. The corresponding $R_N^*$ is smaller, so excess parameters saturate faster than repeated data. Empirically:
- Up to about 4 epochs of repetition produce loss within noise of training on fresh data.
- Well beyond that, the marginal value of additional repetition decays to zero; spending the same compute on more parameters or stopping training becomes preferable.
- Code data tolerates more repetition than natural-language text. Filtering augmentations (e.g., perplexity-filtered repeats) extend the effective half-life modestly but do not change the qualitative shape.
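A sketch of the effective-token formula, using the approximate decay constant above ($R_D^* \approx 15$; the exact value is corpus-dependent):

```python
import math

def effective_tokens(unique_tokens, epochs, r_star=15.0):
    """D' = U_D + U_D * R* * (1 - exp(-R_D / R*)), with R_D = epochs - 1 repetitions."""
    repetitions = max(epochs - 1, 0)
    return unique_tokens * (1.0 + r_star * (1.0 - math.exp(-repetitions / r_star)))

unique = 1e11                                     # hypothetical 100B unique tokens
for epochs in (1, 2, 4, 8, 16, 64):
    d_eff = effective_tokens(unique, epochs)
    seen = unique * epochs
    print(f"{epochs:>2} epochs: {seen:.1e} tokens seen, {d_eff:.1e} effective "
          f"({d_eff / seen:.0%} of fresh-data value)")
```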
Intuition
Repeating a token that the model has already memorized provides almost no new information. The first few epochs still extract residual structure (rare n-grams, long-range dependencies). After enough repetitions, the loss term saturates because $D'$ stops growing, and additional compute is wasted unless reallocated to capacity. The exponential form is empirical, but it has a clean reading: each new epoch contributes geometrically less effective data than the previous one.
Proof Sketch
The functional form is fitted, not derived. A heuristic for why it is roughly the right form: assume each repeated pass extracts a fixed fraction $\delta$ of the residual information. The total information extracted after $R$ passes is the geometric sum

$$U_D \sum_{k=0}^{R} (1 - \delta)^k = U_D \, \frac{1 - (1-\delta)^{R+1}}{\delta}$$

This has the same saturating shape as $1 - e^{-R/R^*}$ with $R^* \approx 1/\delta$. Muennighoff et al. fit the exponential rather than the geometric form because it generalized better across model sizes and corpora.
Why It Matters
The data-constrained law turns "are we running out of data?" into a precise quantity. Given a unique-token budget $U_D$ and a compute budget $C$, you can solve for the $(N, R_D)$ allocation that minimizes loss subject to the constraint $C \approx 6 N U_D (1 + R_D)$. The result: in data-constrained regimes the optimal $N$ shifts upward relative to fresh-data Chinchilla (more parameters, fewer effective tokens). It also bounds how much synthetic-data and augmentation-only strategies can buy you: if augmented samples behave like repeats, the $R_D^*$ saturation is the binding constraint. Recent practice (DeepSeek, Llama 3, Qwen) is consistent with the 4-epoch finding.
Failure Mode
Three caveats. (a) The fit was done on web-scale natural-language corpora with a particular filtering pipeline; the effective $R_D^*$ depends on data diversity and filtering. High-quality curated repeats (textbook-style) behave better than uniform random repeats. (b) The model assumes the same loss objective and tokenizer throughout. Curriculum changes (different data mixes per epoch) violate the additive form. (c) The law speaks to validation loss. Downstream task performance has its own dependence on data freshness and may degrade more or less sharply than loss does.
Repetition is not literally free up to 4 epochs
The " epochs is fine" result is about loss on a held-out distribution, fit at the granularity of % noise. Specific downstream benchmarks can degrade earlier under repetition (especially memorization- sensitive evaluations like exact-string recall) or later (compositional benchmarks where the model needs more passes to learn rare structures). Use as a useful default, not as a guarantee.
Scaling for Downstream Tasks vs Pretraining Loss
Pretraining loss scales smoothly and predictably. Downstream task performance does not always follow the same smooth curve:
- Smooth scaling: Tasks well-correlated with language modeling (text generation quality, perplexity) scale smoothly with loss.
- Threshold scaling: Tasks requiring specific capabilities (multi-step reasoning, tool use) may show sharp transitions at particular loss levels.
- Saturation: Some tasks saturate quickly (sentiment analysis) while others continue improving with scale (complex reasoning).
This means you cannot simply extrapolate a scaling curve for an individual task. You can predict the loss of a 10x larger model with high confidence, but predicting whether it will pass a specific evaluation benchmark requires understanding the relationship between loss and that specific capability.
Scaling Beyond Dense Transformers
Kaplan and Chinchilla fit dense decoder-only transformers. Two extensions matter in current practice: mixture-of-experts (MoE) scaling, and width-depth parametrization.
MoE scaling laws. Mixture-of-experts models activate only a subset of parameters per token. The active parameter count governs inference cost and largely governs loss, but total parameters also contribute via increased capacity. Krajewski et al. (2024) fit a joint scaling law in active and total parameters and find that the compute-optimal allocation shifts: under a serving-cost constraint that penalizes active parameters, a given loss is reached more cheaply by a sparse model with more total parameters and fewer active parameters. DeepSeek MoE (Dai et al. 2024) and DeepSeek-V3 pushed this regime with fine-grained expert segmentation (many small experts, high routing capacity) and reported competitive loss at roughly an order of magnitude fewer active FLOPs than dense baselines at the same quality.
The practical takeaway: Chinchilla's 20-tokens-per-parameter ratio was derived for dense models. MoE runs often operate at much higher token-to-active-parameter ratios because capacity can be added through total (mostly inactive) parameters without increasing the training-time FLOP cost per token in the same way.
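A toy FLOP comparison of why MoE shifts the ratio: training FLOPs per token track active parameters, so at equal training compute a sparse model sees proportionally more tokens per active parameter. The model sizes below are hypothetical.

```python
def train_flops(active_params, tokens):
    """Training cost tracks active parameters per token (6 * N_active * D)."""
    return 6.0 * active_params * tokens

dense_flops = train_flops(70e9, 1.4e12)                # dense 70B, Chinchilla-style
moe_tokens = 1.4e12 * (70 / 21)                        # same budget, 21B active params
moe_flops = train_flops(21e9, moe_tokens)
print(f"dense 70B       : {dense_flops:.2e} FLOPs, 20 tokens/param")
print(f"MoE, 21B active : {moe_flops:.2e} FLOPs, {moe_tokens / 21e9:.0f} tokens/active param")
# Equal training FLOPs, but the sparse model sees ~3.3x more tokens, which is
# why MoE runs sit at much higher tokens-per-active-parameter ratios.
```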
muP (maximal update parametrization). Yang and Hu (2022) showed that with a specific per-layer rescaling of initialization variance, learning rates, and output multipliers, the optimal hyperparameters (learning rate, weight decay, initialization scale) become approximately invariant to width. This means a small proxy model can be tuned cheaply, and the same hyperparameters transfer to a much larger target model without re-tuning. Tensor Programs V (Yang et al. 2022) extends this, largely empirically, to depth transfer and more general architectures. GPT-4 scale training runs reportedly use muP-style transfer to avoid re-sweeping hyperparameters at every width. This is a scaling-law story in a different sense: the tuning cost no longer scales with model size, which changes the economics of large runs. See also convex tinkering for the broader principle of bounded-downside scale-up experiments.
Test-Time Compute Scaling
Kaplan and Chinchilla describe how loss scales with compute spent at training time. A separate axis has become central since 2024: how performance scales with compute spent at inference time on a fixed trained model. OpenAI o1 and DeepSeek R1 are the canonical examples: models trained with reinforcement learning to produce long chains of thought that consume far more tokens per query than standard decoding.
The headline empirical result: on reasoning-heavy tasks, adding inference compute to a smaller model can match or beat the accuracy of a much larger model decoded once. The training-compute-versus-inference-compute trade is not fixed by the Chinchilla analysis, which only accounts for pretraining loss.
Best-of-N Sampling
Given a prompt $x$, draw $k$ independent completions $y_1, \dots, y_k$ from the model at temperature $T > 0$, then select one using a scoring rule $r$:

$$\hat{y} = \arg\max_{1 \le i \le k} r(x, y_i)$$

The score can be a learned verifier (process reward model, outcome reward model), a ground-truth checker for tasks with verifiable answers (math, code unit tests), or majority vote over final answers (self-consistency). Inference FLOPs scale as roughly $k$ times the cost of a single completion.
Coverage scales predictably with samples. Brown et al. (2024), "Large Language Monkeys," measured pass@$k$ (the fraction of problems solved by at least one of $k$ samples) across math and code benchmarks. Coverage follows an approximate power law in $k$ over several orders of magnitude: log pass@$k$ is roughly linear in $\log k$. This is a different object from pass@1: it isolates what the model can produce from what a selector does select. A large gap between pass@$k$ and best-of-$k$ accuracy indicates a verifier bottleneck, not a generation bottleneck.
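A toy model of the coverage-versus-selection gap, under the simplifying assumption that samples are independent and a verifier picks a correct sample (when one exists) with fixed probability; the numbers are illustrative, not measurements from Brown et al. or Snell et al.

```python
def pass_at_k(p, k):
    """Probability at least one of k independent samples is correct."""
    return 1.0 - (1.0 - p) ** k

def best_of_k(p, k, verifier_recall):
    """Coverage discounted by how often the selector picks a correct sample."""
    return pass_at_k(p, k) * verifier_recall

p = 0.05                                   # hard task: 5% single-sample success
for k in (1, 10, 100, 1000):
    print(f"k = {k:>4}: pass@k = {pass_at_k(p, k):.2f}, "
          f"best-of-k with a weak verifier = {best_of_k(p, k, 0.7):.2f}")
# Coverage climbs toward 1 with more samples, but deployable accuracy is capped
# by the verifier: the gap between the columns is the verifier bottleneck.
```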
Test-time compute can substitute for parameters. Snell et al. (2024), "Scaling LLM Test-Time Compute Optimally," studied how a fixed compute budget should be spent across pretraining and inference: one large model decoded once, a small model sampled many times with a verifier, or tree search guided by a process reward model. On MATH-class problems, a smaller model with an optimal test-time strategy matched the accuracy of a roughly 14x larger pretrained model decoded greedily once, when total compute (pretraining FLOPs plus inference FLOPs aggregated across the expected number of queries) is held equal. The optimal strategy depends on problem difficulty: easy problems favor more parallel samples, hard problems favor sequential revision and search. The small-model advantage shrinks or reverses on the hardest problems, where a single large-model rollout reasons over modes that no amount of small-model sampling can cover.
Reasoning models as RL over chains of thought. OpenAI o1 and DeepSeek R1 are trained with RL against verifiable rewards to produce long internal reasoning traces before answering. The scaling axis here is not samples but tokens per answer. Reported curves show accuracy rising smoothly with the thinking-token budget, analogous to a scaling law but with inference tokens rather than training tokens on the x-axis. The mechanism is not fully characterized: candidates include genuine search inside the trace, error correction, and in-context amortization of otherwise parametric knowledge.
Inference budget trade-off
A team has a fixed per-query inference budget of $B$ FLOPs and wants to maximize accuracy on a reasoning benchmark. Two options:
- Deploy one large model with $N$ parameters and decode once. Cost per query is roughly $2NT$ FLOPs for an output of length $T$ tokens.
- Deploy a smaller model with $n \ll N$ parameters and sample $k$ candidates, then select with a verifier of comparable cost. Cost per query is roughly $2nkT$ FLOPs, also about $B$ when $nk \approx N$.
Under Snell et al.'s empirical curves on MATH, the small-model-with-verifier option wins at equal inference FLOPs when the verifier is well-calibrated and problems are not at the extreme tail of difficulty. At the hardest problems, the large model's single-shot reasoning can dominate because no amount of sampling from the weak model covers the correct solution.
The lesson is not "sampling always wins." It is that training compute and inference compute are two independent knobs, and Chinchilla only pins down the first. The optimal serving configuration depends on query distribution, verifier quality, and latency constraints.
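A rough per-query FLOP tally for the two options, using the $2N$ FLOPs-per-output-token estimate from the 6ND discussion; the model sizes, sample count, and output length are hypothetical.

```python
def decode_flops(n_params, output_tokens):
    """~2 FLOPs per parameter per generated token (forward pass only)."""
    return 2.0 * n_params * output_tokens

large_once = decode_flops(70e9, 1000)           # one 70B rollout, 1000 tokens
small_sampled = 8 * decode_flops(8e9, 1000)     # eight 8B rollouts (verifier cost ignored)
print(f"70B x 1 : {large_once:.1e} FLOPs per query")
print(f" 8B x 8 : {small_sampled:.1e} FLOPs per query")
# Roughly comparable budgets (1.4e14 vs 1.3e14); which option wins depends on
# verifier quality and how hard the queries are.
```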
Test-time scaling does not repeal training scaling
Reasoning models still benefit from larger base models. The RL-over-reasoning regime composes with pretraining scale rather than replacing it. A small model with a huge inference budget has a ceiling set by what its policy can represent. Test-time compute shifts the Pareto frontier of (training FLOPs, inference FLOPs, accuracy); it does not collapse it to a single axis.
Coverage is not accuracy
pass@$k$ measures whether any sample is correct. Best-of-$k$ accuracy measures whether the selected sample is correct. The two can diverge sharply when the verifier is weak, the reward model is miscalibrated, or the task has many plausible-looking wrong answers. Reporting pass@$k$ without a selector overstates deployable performance.
Common Confusions
Chinchilla-optimal is not always practically optimal
Chinchilla minimizes loss per training FLOP. But if you will serve the model to billions of requests, inference cost (which scales with ) dominates total cost. The practical optimum trains a smaller model on more data than Chinchilla suggests. Llama 3 70B was trained on 15T tokens (roughly 214 tokens per parameter), far beyond the Chinchilla ratio of 20. This is intentional: over-training reduces for a given quality level, cutting inference costs.
Power laws do not mean linear improvement
Under a power law, each fixed multiplicative increase in a resource buys a fixed percentage reduction in loss, not a fixed absolute amount. With Kaplan's $\alpha_N \approx 0.076$, 10x more parameters cuts loss by roughly 16%; with $\alpha_C \approx 0.05$, 10x more compute cuts it by roughly 11%. Going from 3.0 to 2.5 nats and then from 2.5 to 2.0 nats each demand another large multiple of compute, and the second step demands more than the first because it is a larger relative reduction. The returns are diminishing in absolute terms, though constant in relative (percentage) terms. This is why training frontier models costs hundreds of millions of dollars for incremental improvements.
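The arithmetic, using the Kaplan compute exponent as an illustrative stand-in (a pure power law with no loss floor, so the multipliers are rough):

```python
alpha_c = 0.05                              # Kaplan-style compute exponent

def compute_multiplier(loss_from, loss_to):
    """Multiple of compute needed to move loss under L ~ C**(-alpha_c)."""
    return (loss_from / loss_to) ** (1.0 / alpha_c)

print(f"3.0 -> 2.5 nats: ~{compute_multiplier(3.0, 2.5):.0f}x compute")
print(f"2.5 -> 2.0 nats: ~{compute_multiplier(2.5, 2.0):.0f}x compute")
```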
Scaling exponents are not universal constants
The exponents ($\alpha_N$, $\alpha_D$, $\alpha$, $\beta$) depend on the architecture, data distribution, and tokenizer. Different studies report different values. The qualitative result, power-law scaling with a roughly balanced allocation of compute between parameters and data, is robust, but the exact exponents should not be treated as fundamental constants of nature.
Summary
- Loss scales as power laws in $N$, $D$, and $C$: $L(N) = (N_c/N)^{\alpha_N}$, $L(D) = (D_c/D)^{\alpha_D}$, $L(C) = (C_c/C)^{\alpha_C}$.
- Kaplan (2020): favor large models with moderate data; the under-converged smaller models in the fit biased the conclusion toward large $N$.
- Chinchilla (2022): scale $N$ and $D$ roughly equally with compute; about 20 tokens per parameter for the fitted exponents.
- Compute estimate: $C \approx 6ND$ FLOPs for training; the Lagrangian gives $N^* \propto C^{\beta/(\alpha+\beta)}$ and $D^* \propto C^{\alpha/(\alpha+\beta)}$ for any fitted exponents.
- Decomposed form: $L(N, D) = E + A N^{-\alpha} + B D^{-\beta}$ with irreducible entropy $E$.
- Data-constrained scaling (Muennighoff 2023): up to about 4 epochs of repetition behave like fresh data; effective tokens saturate exponentially with decay constant $R_D^* \approx 15$.
- Metric-induced emergence (Schaeffer 2023): sharp downstream "emergence" curves are mostly artifacts of exact-match scoring of length-$L$ outputs; smooth metrics recover smooth scaling on the same model families.
- Inference cost favors over-training smaller models beyond Chinchilla-optimal; knowledge distillation offers another path to efficient smaller models.
- Test-time compute is a second scaling axis: best-of-$k$ sampling and RL-trained reasoning traces can substitute inference FLOPs for parameters on reasoning tasks (Snell et al. 2024, Brown et al. 2024).
Exercises
Problem
You have a compute budget of $C$ FLOPs. Using the Chinchilla-optimal ratio of 20 tokens per parameter and the $C \approx 6ND$ rule, what are the optimal model size $N$ and data size $D$?
Problem
Suppose the loss function is $L(N, D) = E + A N^{-\alpha} + B D^{-\beta}$ with fitted constants $E$, $A$, $B$, $\alpha$, $\beta$. Subject to the constraint $C = 6ND$, use Lagrange multipliers to show that the optimal allocation satisfies $N^* \propto C^{a}$ and find $a$ in terms of $\alpha$ and $\beta$.
Problem
Suppose you have $U_D$ unique tokens and a compute budget of $C$ FLOPs. The fresh-data Chinchilla recipe asks for $D = 20N > U_D$. Using the Muennighoff data-constrained scaling law with $R_D^* \approx 15$, qualitatively describe how the optimal $N$ shifts when you must reuse data, and compute the rough number of epochs you would run if you set $D$ to the fresh-data Chinchilla value but had only $U_D$ unique tokens.
Problem
The emergent abilities debate centers on whether discontinuities in benchmark performance are real or artifacts of the evaluation metric. Using the Schaeffer-Miranda-Koyejo model $\mathrm{Acc} \approx p(s)^L$, derive the location of the apparent inflection point in model scale $s$ as a function of task length $L$ (assume a smooth form for $p(s)$). Then design a controlled experiment that distinguishes metric-induced emergence from a genuine capability threshold.
Frequently Asked Questions
- What does Chinchilla-optimal mean?
- Hoffmann et al. (2022) ran a finer-grained scaling sweep than Kaplan et al. (2020) and found compute-optimal training requires roughly 20 tokens per parameter. Pre-Chinchilla flagship models (GPT-3 included) were drastically undertrained: at the same compute, halving parameters and doubling tokens yields lower loss.
- Why did Kaplan and Chinchilla disagree?
- Kaplan's experimental design fixed a learning-rate schedule that did not scale appropriately with training horizon, biasing the conclusion toward parameter scaling. Chinchilla varied both N and D simultaneously with proper schedule scaling and found the loss surface had a different shape. The disagreement was experimental design, not a deeper theoretical conflict.
- Are emergent abilities real?
- Schaeffer et al. (2023) showed many 'emergent' phase transitions disappear under smooth metrics: nonlinear scoring (exact-match accuracy on a long answer) created the appearance of discontinuity, while smooth metrics (per-token log-prob) showed continuous scaling. Some behaviors (chain-of-thought helping at all, code generation past a quality threshold) still appear meaningfully discontinuous to users. The phenomenon is partly a measurement artifact, partly real.
- What changes when training data is constrained?
- Muennighoff et al. (2023): training on repeated tokens has diminishing but real returns. Up to roughly 4-5 epochs of repetition matches unique-token training; beyond that, returns flatten but stay positive for several more epochs. The 'data wall' as a hard cliff is too pessimistic; the wall is a gradient with non-trivial slope.
- Do scaling laws apply to MoE models?
- Yes, with extensions. Active parameters drive loss similar to dense scaling; total parameters add a separate axis with smaller exponent (sparser models capture less of the dense return). Krajewski et al. (2024), DeepSeekMoE, and the DeepSeek-V3 technical report extend the dense Chinchilla picture to sparse MoE regimes with explicit active-parameter and total-parameter tradeoffs.
References
Core neural scaling:
- Kaplan et al., Scaling Laws for Neural Language Models (2020). arXiv:2001.08361
- Hoffmann et al., Training Compute-Optimal Large Language Models (Chinchilla, 2022). arXiv:2203.15556
- Henighan et al., Scaling Laws for Autoregressive Generative Modeling (2020). arXiv:2010.14701
Data limits and repeated data:
- Muennighoff et al., Scaling Data-Constrained Language Models (NeurIPS 2023). arXiv:2305.16264
- Villalobos et al., Will We Run Out of Data? An Analysis of the Limits of Scaling Datasets in Machine Learning (2022). arXiv:2211.04325
- Hernandez et al., Scaling Laws and Interpretability of Learning from Repeated Data (2022). arXiv:2205.10487
Emergence and evaluation:
- Wei et al., Emergent Abilities of Large Language Models (TMLR 2022). arXiv:2206.07682
- Schaeffer, Miranda, Koyejo, Are Emergent Abilities of Large Language Models a Mirage? (NeurIPS 2023). arXiv:2304.15004
- Srivastava et al., Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models (2022). arXiv:2206.04615
MoE, hyperparameter transfer, and test-time compute:
- Krajewski et al., Scaling Laws for Fine-Grained Mixture of Experts (2024). arXiv:2402.07871
- Dai et al., DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models (2024). arXiv:2401.06066
- DeepSeek-AI, DeepSeek-V3 Technical Report (2024). arXiv:2412.19437
- Yang et al., Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer (muP, 2022). arXiv:2203.03466
- Snell et al., Scaling LLM Test-Time Compute Optimally (2024). arXiv:2408.03314
- Brown et al., Large Language Monkeys: Scaling Inference Compute with Repeated Sampling (2024). arXiv:2407.21787
Next Topics
- Chain-of-Thought and Reasoning: the test-time compute axis and the Merrill-Sabharwal expressiveness ceiling
- Compute-Optimal Training: Chinchilla in practice
- KV cache: the inference-side cost that makes Chinchilla-optimal models expensive to serve
Last reviewed: April 26, 2026