
Methodology

Synthetic Data Generation

Using models to generate training data: LLM-generated instructions, diffusion-based image augmentation, code synthesis. When synthetic data helps (low-resource, privacy) and when it hurts (model collapse).

Advanced · Tier 2 · Frontier · Supporting · ~45 min

Why This Matters

Real training data is expensive, private, or scarce. Synthetic data generation uses existing models to produce new training examples. This approach has become central to modern ML: post-training pipelines for instruction-tuned models (Alpaca, Vicuna, WizardLM) rely heavily on LLM-generated instruction data; the Phi series was trained on curated "textbook-quality" synthetic data (Gunasekar et al. 2023, "Textbooks Are All You Need"); and diffusion models are used to augment training sets for downstream classifiers. The exact composition of frontier pretraining corpora for models like GPT-4 and the Llama family is not public, so we avoid specific claims about them here.

But synthetic data is not a free lunch. Training on model-generated data can degrade model quality in predictable ways. Understanding when synthetic data helps and when it hurts requires understanding the distributional relationship between real and generated data.

Core Definitions

Definition

Synthetic Data

Synthetic data is training data generated by a model rather than collected from the real world. Let $p_{\text{real}}$ be the true data distribution and $p_\theta$ be a generative model. Synthetic data consists of samples $\tilde{x} \sim p_\theta$. The quality of synthetic data depends on how well $p_\theta$ approximates $p_{\text{real}}$.

Definition

Self-Instruct Pipeline

A method for generating instruction-following training data using an LLM. Starting from a small seed set of human-written instructions, the LLM generates new instructions, inputs, and outputs. These are filtered (by heuristics or a separate model) and added to the training set. The process can iterate.

Text: LLM-Generated Instruction Data

The self-instruct method (Wang et al. 2023, ACL) and Stanford Alpaca (Taori et al. 2023, GitHub release, not a peer-reviewed paper) demonstrated that a strong LLM can generate instruction-following data that, when used for fine-tuning, transfers the instruction-following capability to weaker models.

The pipeline (a minimal code sketch follows the list):

  1. Start with a seed set of $S$ human-written (instruction, input, output) triples
  2. Prompt the teacher LLM to generate new instructions similar to the seeds
  3. For each new instruction, generate input-output pairs
  4. Filter for quality, diversity, and correctness
  5. Fine-tune the student model on the combined dataset
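
A minimal sketch of this loop in Python, with the teacher-LLM prompting and filtering steps abstracted as injected callables (these are placeholders for illustration, not a real API):

```python
import random

def self_instruct(seed_tasks, generate_instructions, generate_io_pairs,
                  passes_filters, rounds=3, per_round=100):
    """Minimal self-instruct loop (sketch).

    seed_tasks: list of (instruction, input, output) triples.
    The three callables wrap the teacher-LLM prompting and the
    quality/diversity filters; they are hypothetical placeholders.
    """
    pool = list(seed_tasks)
    for _ in range(rounds):
        # Step 2: show the teacher a few in-context examples from the pool
        demos = random.sample(pool, k=min(8, len(pool)))
        for instruction in generate_instructions(demos, n=per_round):
            # Step 3: generate (input, output) pairs for the new instruction
            for inp, out in generate_io_pairs(instruction):
                # Step 4: keep only examples passing quality/diversity filters
                if passes_filters(instruction, inp, out, existing=pool):
                    pool.append((instruction, inp, out))
    return pool  # Step 5: fine-tune the student on this pool
```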

Key empirical finding (with caveats): for vanilla imitation learning on teacher outputs, student quality is usually bounded by teacher quality, and a student fine-tuned on GPT-4 outputs typically outperforms one fine-tuned on GPT-3.5 outputs. This is a strong baseline, not a hard theoretical ceiling. Two lines of work show the ceiling can be exceeded: Evol-Instruct (Xu et al. 2023, WizardLM) uses iterative complexification of prompts to elicit harder teacher responses; weak-to-strong generalization (Burns et al. 2023, OpenAI) shows a strong student supervised by a weaker teacher can recover much of the strong-student ceiling on some tasks. Constitutional AI / RLAIF (Bai et al. 2022) and process reward models (Lightman et al. 2023, "Let's Verify Step by Step") use AI-generated preferences or per-step labels rather than copying teacher outputs, and can in principle exceed the teacher on the target objective.

Images: Diffusion-Based Augmentation

Diffusion models generate photorealistic images conditioned on text prompts. For training data augmentation:

  • Generate additional examples for rare classes (e.g., rare medical conditions)
  • Create variations of existing images with controlled modifications
  • Produce images for classes where real data collection is impractical

He et al. (2023) showed that classifiers trained on a mix of real and diffusion-generated images outperform those trained on real data alone, particularly when real data is scarce (fewer than 100 examples per class).
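
A minimal sketch of rare-class augmentation with the `diffusers` library (the model ID, prompt template, and counts are illustrative choices, not prescribed by He et al.):

```python
import torch
from diffusers import StableDiffusionPipeline

# Sketch: top up rare classes with diffusion-generated images,
# then train the classifier on the real + synthetic mix.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def augment_class(class_name: str, n_images: int, out_dir: str):
    """Generate n_images examples for one underrepresented class."""
    for i in range(n_images):
        image = pipe(f"a photo of {class_name}").images[0]
        image.save(f"{out_dir}/{class_name}_{i:04d}.png")
```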

Code: Model-Generated Programming Problems

LLMs generate programming problems with solutions, test cases, and explanations. This is used to train code models (Code Llama, StarCoder). The generation process can include automated verification: run the generated code against the generated tests, and keep only examples that pass.

Verification is a key advantage of code synthesis: correctness is partially checkable, unlike natural language where quality is subjective.
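
A minimal sketch of the verify-then-keep step, running each generated solution against its generated tests in a subprocess (real pipelines add sandboxing and resource limits; this sketch only uses a timeout):

```python
import subprocess
import sys
import tempfile

def passes_tests(solution_code: str, test_code: str,
                 timeout: float = 5.0) -> bool:
    """Execute a generated solution against its generated tests;
    keep the example only if the process exits cleanly."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

# Retain only (problem, solution, tests) triples where passes_tests is True.
```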

When Synthetic Data Helps

Low-resource domains. When real data has fewer than 1000 examples, synthetic augmentation can substantially improve performance. The synthetic data provides distributional coverage that the small real dataset lacks.

Privacy preservation. Instead of sharing real patient records across hospitals, generate synthetic records that preserve statistical properties without revealing individual data points. Differential privacy guarantees can be applied to the generation process.

Diversity expansion. A small seed dataset may lack coverage of edge cases. Targeted generation can fill gaps: generate examples for underrepresented subgroups, rare events, or boundary conditions.

When Synthetic Data Hurts: Model Collapse

The failure mode discussed here is the single-generation view; see model collapse and data quality for the multi-generation distributional analysis, variance decay rates, and mitigation strategies.

Theorem

Model Collapse from Recursive Training

Statement

Consider a sequence of generative models $p_{\theta_0}, p_{\theta_1}, p_{\theta_2}, \ldots$ where $p_{\theta_0}$ is trained on real data from $p_{\text{real}}$, and each subsequent model $p_{\theta_{k+1}}$ is trained on samples from $p_{\theta_k}$. As $k \to \infty$, the distribution $p_{\theta_k}$ converges to a distribution with lower variance than $p_{\text{real}}$. Specifically, for Gaussian data with true distribution $\mathcal{N}(\mu, \sigma^2)$ and $n$ samples per generation:

$$\text{Var}(p_{\theta_k}) \to 0 \text{ as } k \to \infty$$

The tails of the distribution are progressively lost, and the distribution collapses toward a point mass.

Intuition

Each generation of training introduces estimation error. Finite samples from $p_{\theta_k}$ underrepresent the tails of the distribution. The next model, trained on these samples, learns a distribution with thinner tails. Over iterations, the tails vanish entirely. The distribution "forgets" its variance.

Proof Sketch

For the Gaussian case, suppose each generation fits $p_{\theta_k}$ to $n$ samples drawn from $p_{\theta_{k-1}}$ using the maximum-likelihood variance estimator $\hat{\sigma}^2_{\text{MLE}} = \frac{1}{n}\sum_i (x_i - \bar{x})^2$, which has $\mathbb{E}[\hat{\sigma}^2_{\text{MLE}}] = \sigma^2 \cdot (n-1)/n$. Iterating gives $\mathbb{E}[\hat{\sigma}_k^2] = \sigma^2 \cdot ((n-1)/n)^k \to 0$. Note the unbiased estimator $s^2 = \frac{1}{n-1}\sum_i (x_i - \bar{x})^2$ does not exhibit this deterministic contraction in expectation. Collapse in the unbiased case is driven instead by the random walk of $\log \hat{\sigma}^2$, whose variance accumulates over generations, eventually hitting arbitrarily small values with probability 1. Either way, the mechanism is sampling noise combined with recursive refitting, not a specific estimator choice. For non-Gaussian distributions the argument extends via the finite capacity of the model, which cannot represent the full complexity of the distribution from finite samples.
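
A small simulation of this contraction, assuming the same Gaussian setup (the sample size and generation count here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def recursive_fit(n=1000, generations=2000, mu=0.0, sigma2=1.0):
    """Refit a Gaussian each generation to n samples drawn from the
    previous generation's fit, using the MLE variance estimator."""
    for _ in range(generations):
        x = rng.normal(mu, np.sqrt(sigma2), size=n)
        mu, sigma2 = x.mean(), x.var()  # np.var defaults to ddof=0 (MLE)
    return sigma2

k, n = 2000, 1000
print(recursive_fit(n=n, generations=k))  # empirical variance: scatters
                                          # around the expectation due to
                                          # the log-variance random walk
print(((n - 1) / n) ** k)                 # expected decay factor, ~0.135
```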

Why It Matters

Model collapse is a central risk of synthetic data. If the web increasingly contains LLM-generated text, and LLMs are trained on web data, a recursive loop could degrade model quality over generations. Shumailov et al., "AI models collapse when trained on recursively generated data" (Nature 2024; earlier arXiv 2023 version titled "The Curse of Recursion") give the clean theoretical demonstration.

Failure Mode

The strong "replace" framing (each generation sees only the previous generation's synthetic data) is the worst case, and not how real pipelines work. Gerstgrasser et al., "Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data" (2024) show that if each generation accumulates real and synthetic data rather than replacing real with synthetic, variance stays bounded and collapse is avoided. This matches the empirical rule of thumb that mixing even a modest fraction of real data (often quoted as 10-20%, though the exact threshold is distribution-dependent) prevents collapse in practice. A toy comparison of the two regimes is sketched below.
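
A minimal Gaussian toy model of the two regimes (a sketch; the sample counts are illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

def run(mode, generations=1000, n_real=1000, m_synth=500):
    """Compare 'replace' vs 'accumulate' in the Gaussian toy model."""
    data = rng.normal(0.0, 1.0, size=n_real)  # fixed real dataset
    for _ in range(generations):
        mu, sigma = data.mean(), data.std()          # fit this generation
        synth = rng.normal(mu, sigma, size=m_synth)  # sample from the fit
        if mode == "replace":
            data = synth                     # discard everything prior
        else:
            # accumulate: real data plus all prior synthetic batches
            data = np.concatenate([data, synth])
    return data.var()

print(run("replace"))     # decays geometrically toward 0 (collapse)
print(run("accumulate"))  # stays bounded near the real variance of 1
```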

Decontamination

When using synthetic data for training, benchmark contamination is a risk: the generating model may produce examples that overlap with evaluation benchmarks. If synthetic training data contains test set answers, reported performance is inflated.

Decontamination procedures (the first is sketched in code after the list):

  1. N-gram overlap filtering: remove synthetic examples with high n-gram overlap with known benchmarks
  2. Embedding similarity filtering: remove examples whose embeddings are close to benchmark items
  3. Provenance tracking: record which model generated each example and what prompt was used
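
A minimal sketch of n-gram overlap filtering (the 13-gram window follows a common convention for contamination checks, but the right value depends on tokenizer and domain; tokenization here is naive whitespace splitting):

```python
def ngrams(text: str, n: int = 13) -> set:
    """Set of whitespace-token n-grams in a string."""
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def decontaminate(synthetic, benchmarks, n: int = 13):
    """Drop synthetic examples sharing any n-gram with a benchmark item."""
    bench_grams = set()
    for item in benchmarks:
        bench_grams |= ngrams(item, n)
    return [ex for ex in synthetic if not (ngrams(ex, n) & bench_grams)]
```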

Modality and Verifiability

The cost of synthetic-data quality control scales with how cheap it is to check the output. This table summarizes the three regimes covered above:

| Modality | Generator | Verification signal | Failure mode if skipped |
| --- | --- | --- | --- |
| Code | Code LLM (Code Llama, StarCoder) | Unit tests, compilers, type checks | Plausible but wrong; hides bugs under passing edge cases |
| Images | Diffusion model (Stable Diffusion, SDXL, Imagen) | Classifier filter, expert review, near-duplicate check | Spurious artifacts the classifier learns as class features (He et al. 2023) |
| Text | Instruction LLM (GPT-4, Claude, Llama-Instruct) | N-gram overlap, embedding similarity, process reward model, judge LLM | Benchmark contamination, mode collapse onto teacher's style |

Verifiability is the asymmetry that makes code-synthesis pipelines harder to abuse than text-synthesis pipelines: a unit test either passes or it does not. For text, judge-LLM filtering is itself subject to the same biases as the generator, which is why Lightman et al. (2023) argue for per-step process rewards rather than outcome-only scoring.

Common Confusions

Watch Out

The imitation ceiling is a tight baseline, not a theoretical bound

Vanilla supervised fine-tuning on teacher outputs usually produces a student bounded by the teacher. This is a strong empirical baseline, not a hard information-theoretic ceiling. Two mechanisms are known to exceed it. (1) Weak-to-strong generalization (Burns et al. 2023, OpenAI): a strong pretrained student, supervised by a weak teacher, can recover a large fraction of the gap to strong-teacher supervision because the student's pretraining already encodes much of the target capability and the weak signal only has to point at it. (2) Process reward models and verifier-augmented training (Lightman et al. 2023, "Let's Verify Step by Step"; Uesato et al. 2022, arXiv:2211.14275): rewarding per-step correctness rather than copying the teacher's final answer lets the student outperform the labeler on the target metric. RLAIF / Constitutional AI (Bai et al. 2022) uses AI preferences rather than imitation, which is a third route. When a paper claims a "student exceeds teacher" result, check whether the training signal is imitation (unlikely to exceed) or preference/process-reward (can exceed).

Watch Out

Curated synthetic pretraining is a different regime from self-distillation

Self-instruct (Wang et al. 2023) and "textbook-quality" pretraining (Gunasekar et al. 2023, Phi) are often lumped together as "synthetic data". The two are not in the same regime. Self-instruct takes a fixed teacher and generates many cheap examples; it saves labor and transfers capability downward. Textbook-quality synthetic pretraining is aggressively filtered: the teacher generates, and a separate classifier keeps only the small fraction that matches a target educational style. The effective training distribution is the filter's output, not the generator's. Phi results come from the filter doing most of the work, not from the teacher being magical. Analyze a synthetic-data pipeline by asking which distribution ends up in training: the generator, the filter, or a curated intersection.

Watch Out

Synthetic data is not the same as data augmentation

Data augmentation applies label-preserving transformations to real data (rotation, cropping, noise). Synthetic data generation creates entirely new examples, potentially for new classes or scenarios not present in the real data. Augmentation does not literally preserve the input distribution — a flipped or cropped image is itself a sample from a transformed distribution — but it is anchored to the real data through transformations chosen to preserve the label. Synthetic generation is not anchored in the same way: it relies on the generator's modeled distribution, which can drift further from the real one, especially under repeated use.

Watch Out

Model collapse requires recursive training, not one-time generation

Generating synthetic data once from a model trained on real data does not cause model collapse. The problem arises when model outputs are used to train the next model, whose outputs train the next, and so on. A single generation step introduces bounded error. Recursive application amplifies it.

Watch Out

MAD vs model collapse

Alemohammad et al. (2023), "Self-Consuming Generative Models Go MAD", use MAD ("Model Autophagy Disorder") for the same family of phenomena in self-consuming generative loops (images in particular). Their framework distinguishes a fully synthetic loop (collapse) from a fresh-data loop (real data injected each generation, bounded degradation) and a synthetic-augmentation loop (real + synthetic mix). The "curse of recursion" result is the fully-synthetic case in their taxonomy. Use "collapse" and "MAD" interchangeably; prefer MAD when the taxonomy distinction matters.

Canonical Examples

Example

Alpaca: self-instruct at scale

Stanford Alpaca fine-tuned Llama-7B on 52k instruction-following examples generated by GPT-3.5 (text-davinci-003). The cost of generating the training data was under 500 USD. The resulting model showed instruction-following ability comparable to GPT-3.5 on simple tasks, demonstrating that synthetic data can cheaply transfer capabilities between models.

Summary

  • Synthetic data is generated by models, not collected from the real world
  • Self-instruct pipelines use a teacher LLM to generate training data for student models
  • Synthetic data is most valuable in low-resource, privacy-sensitive, or diversity-limited settings
  • Model collapse occurs under pure recursive training on synthetic outputs; accumulating real and synthetic data (Gerstgrasser 2024) avoids it
  • Vanilla imitation bounds the student at the teacher, but Evol-Instruct, weak-to-strong, and RLAIF-style objectives can exceed the teacher on targeted axes
  • Decontamination is necessary to avoid benchmark inflation

Exercises

Exercise · Core

Problem

You have 50 labeled examples of a rare disease in medical images and access to a diffusion model. Describe a pipeline for generating synthetic training data and identify the main risk.

Exercise · Core

Problem

A training pipeline runs in two modes. Replace: at generation $k$, the model trains on $N = 10^6$ samples drawn entirely from the previous generation's model. Accumulate: at generation $k$, the model trains on $N$ real samples (fixed from generation 0) plus all synthetic samples produced at each prior generation. Suppose each generation produces $M = 10^5$ synthetic samples. (a) Under accumulate, what fraction of the training data at generation $k = 10$ is synthetic? (b) Qualitatively, which mode does Gerstgrasser et al. (2024) predict will have bounded loss across generations, and what mechanism keeps the accumulate mode stable?

Exercise · Advanced

Problem

The model collapse theorem shows variance contracts by a factor of $(n-1)/n$ per generation for Gaussian data. If the original variance is $\sigma^2 = 1$ and you use $n = 1000$ samples per generation, after how many generations does the variance drop below $0.5$? Below $0.01$?

References

Canonical:

  • Wang et al., "Self-Instruct: Aligning Language Models with Self-Generated Instructions" (ACL 2023)
  • Taori et al., "Stanford Alpaca: An Instruction-following LLaMA Model" (Stanford CRFM, GitHub release, 2023)

Current:

  • Shumailov et al., "AI models collapse when trained on recursively generated data" (Nature 2024; earlier arXiv 2023 "The Curse of Recursion")
  • Gerstgrasser et al., "Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data" (2024)
  • He et al., "Is Synthetic Data from Generative Models Ready for Image Recognition?" (ICLR 2023)
  • Alemohammad et al., "Self-Consuming Generative Models Go MAD" (ICLR 2024)
  • Gunasekar et al., "Textbooks Are All You Need" (2023, Phi series)
  • Xu et al., "WizardLM: Empowering Large Language Models to Follow Complex Instructions" (2023, Evol-Instruct)
  • Burns et al., "Weak-to-Strong Generalization: Eliciting Strong Capabilities with Weak Supervision" (2023, OpenAI)
  • Bai et al., "Constitutional AI: Harmlessness from AI Feedback" (2022, Anthropic; RLAIF)
  • Lightman et al., "Let's Verify Step by Step" (2023, arXiv:2305.20050, process reward models)
  • Uesato et al., "Solving math word problems with process- and outcome-based feedback" (2022, arXiv:2211.14275). Earlier process vs outcome comparison; complements Lightman et al.
