LLM Construction
Reasoning Data Curation
How to build training data for reasoning models: math with verified solutions, code with test cases, rejection sampling, external verification, self-play for problem generation, and the connection to RLVR.
Why This Matters
The models that reason well (o1, DeepSeek-R1, Claude extended-thinking models) were not trained on generic internet text alone. Their post-training mixes include curated reasoning traces, and in some publicly documented cases (notably DeepSeek-R1, 2025) the pipeline explicitly filters for verified-correct solutions. OpenAI and Anthropic system cards describe broader data mixes rather than a verified-only pipeline, so the strong claim should not be projected onto the entire class. The quality of reasoning training data still determines the ceiling for reasoning capability. Scaling incorrect or unverified reasoning data makes models more fluent at being wrong, not better at being right.
Mental Model
The core principle: for reasoning tasks, you can verify correctness even when you cannot generate correct solutions. A math proof checker can verify a proof without being able to produce one. A test suite can verify code without writing it. This asymmetry between generation and verification is what makes reasoning data curation possible at scale.
The pipeline: (1) source or generate problems, (2) generate many candidate solutions, (3) verify solutions with external tools, (4) keep only verified correct solutions, (5) train on the verified data.
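A minimal sketch of this funnel, assuming hypothetical `generate` (samples one candidate solution) and `verify` (external correctness check) callables supplied by the pipeline:

```python
import random

def curate_verified_data(problems, generate, verify, n_samples=64):
    """Verified-funnel sketch: sample many candidates per problem and keep
    only solutions that an external verifier accepts."""
    kept = []
    for problem in problems:
        candidates = [generate(problem) for _ in range(n_samples)]   # step 2
        correct = [c for c in candidates if verify(problem, c)]      # steps 3-4
        if correct:
            kept.append((problem, random.choice(correct)))           # keep one verified trace
    return kept                                                      # step 5: train on `kept`
```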
Curate reasoning data by turning generation into a verified funnel
The model does not need to be reliable on every attempt. It only needs to produce enough candidates that an external verifier can separate good reasoning traces from fluent garbage.
Why verification changes the game
Reasoning curation works because correctness can be checked more cheaply than it can be generated. A model may fail often, but best-of-N plus a trustworthy verifier can still yield excellent training data.
What still goes wrong
Final-answer filtering can keep logically bad chains that accidentally land on the right answer. Good pipelines mix outcome verification with stronger process checks, hidden tests, or formal proof tools where possible.
Production reality
The expensive step is not storing the kept traces. The expensive step is generating enough candidates and running enough external checks that the retained set is genuinely clean.
Types of Reasoning Data
Verifiable Reasoning Data
Verifiable reasoning data consists of (problem, solution) pairs where the solution's correctness can be checked by an external system:
- Math with ground truth: competition problems with known numerical answers, theorem proving with formal proof assistants (Lean, Coq, Isabelle)
- Code with test cases: programming problems where solutions are checked by running test suites, including hidden test cases that the model never sees during training
- Science with known answers: physics, chemistry, and biology problems where the answer can be derived from first principles or looked up
- Logic puzzles: constraint satisfaction problems where a solution can be verified by checking all constraints
The key property: verification is cheap and reliable, even when generation is hard.
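For math with ground truth, outcome verification can be as simple as normalizing and comparing final answers. The sketch below uses SymPy to check symbolic equivalence; the "Answer:" extraction format is an assumption for illustration, not a standard.

```python
import re
import sympy

def extract_final_answer(solution_text: str) -> str:
    # Assumes the solution ends with a line like "Answer: <expression>".
    match = re.search(r"Answer:\s*(.+)", solution_text)
    return match.group(1).strip() if match else ""

def verify_math_answer(solution_text: str, ground_truth: str) -> bool:
    """Outcome-only check: the candidate's final answer must equal the
    known ground truth after symbolic simplification."""
    try:
        candidate = sympy.sympify(extract_final_answer(solution_text))
        target = sympy.sympify(ground_truth)
        return sympy.simplify(candidate - target) == 0
    except (sympy.SympifyError, TypeError):
        return False
```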
Rejection Sampling
Rejection Sampling Improves Solution Quality
Statement
If a model generates $n$ independent solutions to a problem and each solution is correct with probability $p$, then the probability that at least one correct solution exists among the $n$ samples is:

$$P(\text{at least one correct}) = 1 - (1 - p)^n$$

The expected number of correct solutions is $np$. If we select uniformly at random from the correct solutions, the resulting training example is guaranteed correct (given a perfect verifier). To achieve $P(\text{at least one correct}) \geq 1 - \delta$:

$$n \geq \frac{\ln(1/\delta)}{-\ln(1 - p)} \approx \frac{\ln(1/\delta)}{p} \quad \text{for small } p$$
Intuition
Generate many attempts, keep the ones that work. If the model solves a problem 10% of the time, generating 50 attempts gives a 99.5% chance of at least one correct solution. The verified correct solutions become high-quality training data. The model is effectively distilling its own best-case behavior into reliable behavior.
Proof Sketch
Each sample is correct independently with probability $p$. The probability that all $n$ samples fail is $(1 - p)^n$. So $P(\text{at least one correct}) = 1 - (1 - p)^n$. Setting this $\geq 1 - \delta$ and solving: $(1 - p)^n \leq \delta$, so $n \geq \ln(\delta) / \ln(1 - p)$. Using $\ln(1 - p) \approx -p$ for small $p$ gives the approximation $n \gtrsim \ln(1/\delta) / p$.
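A quick numeric check of the bound, matching the 10% pass rate / 50-sample example above:

```python
import math

p, delta = 0.10, 0.005                         # per-sample pass rate, allowed failure probability
n_exact = math.log(delta) / math.log(1 - p)    # solves (1 - p)^n = delta
n_approx = math.log(1 / delta) / p             # small-p approximation
print(round(n_exact, 1), round(n_approx, 1))   # ~50.3 vs ~53.0 samples
print(1 - (1 - p) ** 50)                       # ~0.995: 50 samples give ~99.5% coverage
```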
Why It Matters
Rejection sampling is a primary method for generating reasoning training data at scale. DeepSeek-R1 (2025) documents it explicitly as part of its pipeline. Other reasoning model releases describe curated reasoning data without specifying the exact filtering procedure. The method works even when the base model's pass rate is low (say 1-5%), as long as you can afford to generate enough samples and verify them.
Failure Mode
If the verifier has false positives (accepts incorrect solutions), the training data is contaminated. If the base model's pass rate is extremely low (below 0.1%), the cost of generating enough samples becomes prohibitive. Also, rejection sampling only selects for final-answer correctness; it does not guarantee that the reasoning chain itself is valid.
Verification Methods
External Verification
External verification uses tools outside the model to check solution correctness:
- Code execution: run the generated code against test cases (both visible and hidden). Check for correctness, not just compilation
- Formal proof assistants: Lean 4, Coq, Isabelle can type-check proofs for mathematical theorems. If the proof compiles, it is correct by construction
- Symbolic math checkers: computer algebra systems (SymPy, Mathematica) can verify numerical answers and simplify expressions
- Unit tests for reasoning: for word problems, verify the final answer against known ground truth
External verifiers are the gold standard because they do not rely on the model's own judgment, which is unreliable for hard problems.
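A sketch of code-execution verification under simplifying assumptions: each candidate defines a self-contained `solve` function, tests are (arguments, expected output) pairs, and running the untrusted code in a subprocess with a timeout stands in for real sandboxing.

```python
import json
import subprocess
import sys

def verify_code(candidate_src: str, test_cases, timeout_s: float = 5.0) -> bool:
    """Run a candidate `solve` function against test cases (visible and hidden)
    in a separate process. A sketch only: production pipelines need real
    sandboxing, resource limits, and non-determinism checks."""
    harness = (
        candidate_src
        + "\nimport json, sys\n"
        + "for args, expected in json.loads(sys.stdin.read()):\n"
        + "    assert solve(*args) == expected\n"
        + "print('OK')\n"
    )
    try:
        proc = subprocess.run(
            [sys.executable, "-c", harness],
            input=json.dumps(test_cases),
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return proc.returncode == 0 and "OK" in proc.stdout
```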
Process Reward Models (PRMs) vs Outcome Reward Models (ORMs)
When external verification is unavailable, learned verifiers substitute:
- ORMs score the final answer only. Binary signal: correct or incorrect
- PRMs score each step in the reasoning chain. Richer signal but requires step-level correctness labels, which are expensive to collect
PRMs produce better training signal because they can identify where reasoning went wrong, enabling the model to learn which steps to avoid. The cost is that PRM training data requires human annotation of individual reasoning steps.
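The difference is easiest to see in how the two scorers are applied at selection time. A schematic sketch, where `orm` and `prm` stand in for hypothetical trained scoring models:

```python
def orm_score_solution(orm, problem, solution_text) -> float:
    # Outcome reward model: a single score for the complete solution.
    return orm(problem, solution_text)

def prm_score_solution(prm, problem, steps) -> float:
    # Process reward model: score each step prefix, then aggregate.
    # min() reflects that one bad step can invalidate the whole chain;
    # the product of per-step probabilities is another common choice.
    return min(prm(problem, steps[: i + 1]) for i in range(len(steps)))

def best_of_n(candidates, problem, prm):
    # Rerank sampled chains (each a list of steps) by their weakest step.
    return max(candidates, key=lambda steps: prm_score_solution(prm, problem, steps))
```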
Self-Play for Problem Generation
Self-Play Problem Generation
Self-play generates new training problems using the model itself:
- The model generates a problem (e.g., a math question)
- The model generates a solution to the problem
- An external verifier checks the solution
- If verified, the (problem, solution) pair becomes training data
- If the model cannot solve its own problem (low pass rate), the problem is "hard" and particularly valuable for training
This creates a curriculum: the model generates problems at the frontier of its own capability. Problems it solves easily are not useful for training. Problems it fails on entirely cannot produce verified solutions. The sweet spot is problems with pass rate between 1% and 50%.
The self-play loop is inspired by AlphaZero, where the system improves by playing against itself. For reasoning, "playing against yourself" means generating problems you can barely solve, then training on the solutions you do find.
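A sketch of the self-play curriculum filter, assuming hypothetical `propose_problem`, `generate_solution`, and `verify` callables; the 1%-50% pass-rate band comes from the discussion above.

```python
def self_play_round(propose_problem, generate_solution, verify,
                    n_attempts=100, min_rate=0.01, max_rate=0.5):
    """One self-play round: propose a problem, attempt it many times, and
    keep verified solutions only if the problem sits at the frontier
    (pass rate roughly between 1% and 50%)."""
    problem = propose_problem()
    attempts = [generate_solution(problem) for _ in range(n_attempts)]
    correct = [a for a in attempts if verify(problem, a)]
    pass_rate = len(correct) / n_attempts
    if min_rate <= pass_rate <= max_rate:
        return [(problem, sol) for sol in correct]   # frontier problem: keep
    return []                                        # too easy or unsolved: discard
```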
Iterated Self-Training: STaR and ReST
Both STaR and ReST formalize the rejection-sampling-then-SFT recipe as an explicit iterated loop. Neither uses on-policy RL. Both reduce to repeated supervised fine-tuning on filtered self-generated data.
STaR (Zelikman et al., 2022)
STaR (Self-Taught Reasoner) bootstraps a chain-of-thought model from a small seed set of labeled problems. The loop, on a dataset of problems with known final answers:
- Sample a chain of thought and final answer from the current model for each problem
- Filter to chains whose final answer matches the ground-truth label
- For problems where no sampled chain gives the correct answer, run a rationalization step: feed the correct answer into the prompt as a hint, sample a chain that "explains" the known answer, then strip the hint from the training example
- Fine-tune the base model on the union of filtered chains and rationalized chains
- Repeat with the fine-tuned model
The critical trick is rationalization. Without it, the training set is biased toward easy problems the model already solves, and hard problems contribute nothing. By providing the answer as a hint only during data generation, STaR recovers training signal from problems the model cannot yet solve unaided, while the final training example still asks the model to produce the chain without the hint.
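A compact sketch of one STaR iteration, with the rationalization hint stripped before the example enters the training set; `sample_chain` and `fine_tune` are hypothetical stand-ins for the paper's components.

```python
def star_iteration(current_model, base_model, dataset, sample_chain, fine_tune):
    """One STaR pass: sample chains with the current model, keep those that
    reach the known answer; for failures, hint the answer during generation
    (rationalization) but store the example without the hint. Fine-tuning
    restarts from the base model each iteration, per the paper's loop."""
    examples = []
    for problem, answer in dataset:
        chain, predicted = sample_chain(current_model, problem)
        if predicted == answer:
            examples.append((problem, chain, answer))
            continue
        # Rationalization: condition generation on the known answer...
        hinted = f"{problem}\n(Hint: the answer is {answer}.)"
        chain, predicted = sample_chain(current_model, hinted)
        if predicted == answer:
            # ...but the stored example still uses the unhinted problem.
            examples.append((problem, chain, answer))
    return fine_tune(base_model, examples)   # returns the next-iteration model
```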
ReST (Gulcehre et al., 2023)
ReST (Reinforced Self-Training) replaces the correctness filter with a learned reward model and runs a Grow / Improve loop:
- Grow: sample many completions from the current policy on a fixed prompt set
- Improve: score completions with a reward model, keep the top-$k$ or those above a threshold, fine-tune the policy on the kept set
- Optionally run multiple Improve passes (with increasing thresholds) per Grow pass
- Repeat
ReST is strictly iterated supervised fine-tuning on reward-filtered data. There is no policy gradient, no importance sampling, and no on-policy KL constraint during the SFT step. Compared to RLHF, the infrastructure is simpler and the training is more stable; the cost is that off-policy data from earlier Grow steps becomes stale as the policy drifts.
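The Grow / Improve loop as iterated filtered SFT, sketched under assumptions (`sample`, `reward_model`, and `fine_tune` are hypothetical components; the thresholds are illustrative):

```python
def rest_training(policy, prompts, sample, reward_model, fine_tune,
                  n_grow=3, samples_per_prompt=32, thresholds=(0.5, 0.7, 0.9)):
    """ReST sketch. Grow: one large off-policy sampling pass over a fixed
    prompt set. Improve: keep completions above a rising reward threshold
    and fine-tune on them. Every update is supervised; no policy gradient."""
    for _ in range(n_grow):
        pool = [(p, sample(policy, p))
                for p in prompts for _ in range(samples_per_prompt)]   # Grow
        for tau in thresholds:                                         # Improve passes
            kept = [(p, y) for p, y in pool if reward_model(p, y) >= tau]
            policy = fine_tune(policy, kept)
    return policy
```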
ReST$^{EM}$ (Singh et al., 2023) reinterprets this loop as expectation-maximization on a latent-variable model where the latent is the reasoning chain. The E-step samples chains and filters by binary correctness (treating the verifier as an indicator function on the marginal likelihood). The M-step fits the policy to the filtered chains. Applied to MATH and APPS, ReST$^{EM}$ matches or exceeds human-data fine-tuning while using only model-generated chains.
Curation Regimes at a Glance
| Regime | Feedback source | Online? | What becomes supervision | Main risk |
|---|---|---|---|---|
| Rejection sampling | final-answer verifier | no | accepted complete solutions | only improves behaviors the current policy can already occasionally produce |
| STaR | known answer plus rationalized trace | no, iterative | filtered and rationalized chains | answer-conditioned traces can look cleaner than the model's unaided reasoning |
| ReST / ReST$^{EM}$ | reward model or correctness filter | no, iterative | reward-filtered self-generated traces | stale off-policy data and reward-model bias |
| PRM / process supervision | step-level labels or rollout-derived scores | usually no | step annotations or step scorers | process labels are expensive or proxy-like |
| RLVR | verifier reward during policy updates | yes | updated policy rather than a fixed dataset | sparse rewards, weak test coverage, verifier hacking |
Process Supervision and Synthetic Reasoning Data
PRM800K (Lightman et al., 2023)
"Let's Verify Step by Step" released PRM800K, a dataset of roughly 800,000 step-level correctness labels over solutions to MATH problems. Each step of each solution is annotated as correct, neutral, or incorrect. The dataset is used to train process reward models that score partial reasoning chains. On the MATH benchmark, process reward models trained on PRM800K outperform outcome-only reward models at selecting correct solutions from a pool of samples, particularly as the pool size grows. The paper established step-level supervision as a measurable improvement over outcome-only supervision on hard reasoning.
Math-Shepherd (Wang et al., 2023)
Math-Shepherd removes the human annotator from the PRM loop. Each step of a partial solution is scored by running tree-search rollouts from that step and estimating the probability that a completion from this prefix reaches a correct final answer. Steps with high estimated completion probability are labeled correct; low-probability steps are labeled incorrect. This yields automatic step-level labels at the cost of a large rollout budget, with no human annotation required.
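A sketch of the rollout-based step labeling idea, assuming hypothetical `complete_from_prefix` and `is_correct` callables; the rollout count and 0.5 threshold are illustrative choices, not the paper's exact settings.

```python
def label_steps_by_rollout(problem, steps, complete_from_prefix, is_correct,
                           n_rollouts=8, threshold=0.5):
    """Monte-Carlo step labels in the spirit of Math-Shepherd: a step is
    labeled correct when completions sampled from its prefix frequently
    reach a verified-correct final answer."""
    labels = []
    for i in range(1, len(steps) + 1):
        prefix = steps[:i]
        hits = sum(
            is_correct(problem, complete_from_prefix(problem, prefix))
            for _ in range(n_rollouts)
        )
        labels.append(1 if hits / n_rollouts >= threshold else 0)
    return labels
```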
AlphaProof (Google DeepMind, 2024)
AlphaProof combines a language model with Lean 4 as an external proof checker. Candidate proof steps are generated by the model, type-checked by Lean, and successful proofs are fed back as training data in an AlphaZero-style loop. AlphaProof, together with AlphaGeometry 2, reached silver-medal performance on the 2024 International Mathematical Olympiad, solving four of the six problems. The system illustrates verification-as-training-signal in its strongest form: formal proof assistants are zero-false-positive verifiers, so every accepted proof is guaranteed correct by construction.
Synthetic reasoning data at scale (Phi-4, Phi-4-reasoning, Orca)
Phi-4, Phi-4-reasoning, and the Orca series rely heavily on synthetic reasoning data generated by stronger teacher models, then filtered and rewritten before being used to train smaller students. The bet is that curated synthetic reasoning can substitute for scarce high-quality human reasoning data, provided the synthesis and filtering pipelines are themselves high quality.
Connection to RLVR
Verification as RL Reward Signal
Statement
RL with verifiable rewards (RLVR) uses the verifier output as a sparse reward signal:

$$r(x, y) = \begin{cases} 1 & \text{if the verifier accepts } y \text{ as a correct solution to } x \\ 0 & \text{otherwise} \end{cases}$$

The RL objective is:

$$\max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big] \;-\; \beta \, \mathrm{KL}\big( \pi \,\|\, \pi_{\text{ref}} \big)$$

where $\pi_{\text{ref}}$ is a reference policy (typically the SFT model) and $\beta$ controls the KL penalty. This is equivalent to rejection sampling followed by a KL-regularized policy update.
Intuition
Instead of training on pre-collected verified solutions (rejection sampling then SFT), RLVR trains the model online: generate, verify, update policy. The RL framework handles credit assignment automatically through the reward signal. The KL penalty prevents the model from collapsing to a narrow set of solution strategies.
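A minimal sketch of the verifier-as-reward interface for this online loop; `sample`, `verify`, and `update_policy` are hypothetical, and the update call stands in for whatever RL algorithm (PPO, GRPO, ...) the pipeline actually uses.

```python
def verifier_reward(problem, completion, verify) -> float:
    # Sparse binary reward: 1 if the external verifier accepts, else 0.
    return 1.0 if verify(problem, completion) else 0.0

def rlvr_step(policy, problems, sample, verify, update_policy, group_size=8):
    """One online RLVR step: sample, verify, update. `update_policy` stands
    in for the KL-regularized policy-gradient update against the frozen
    reference model."""
    batch = []
    for problem in problems:
        for _ in range(group_size):
            completion = sample(policy, problem)
            reward = verifier_reward(problem, completion, verify)
            batch.append((problem, completion, reward))
    return update_policy(policy, batch)
```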
Why It Matters
RLVR is the approach documented for DeepSeek-R1 (2025). Other reasoning systems use related but not always identical recipes; public information on o1 and Claude extended-thinking models does not fully specify the pipeline. The advantage over rejection sampling plus SFT is that the model can explore and discover new solution strategies during training, rather than being limited to strategies it already knew. The verifier provides a reliable training signal without requiring human preference labels.
Failure Mode
Sparse binary rewards make RL optimization difficult. If the model's pass rate is very low, the reward signal is almost always zero and learning stalls. Reward hacking is possible if the verifier has exploitable weaknesses (e.g., a test suite with insufficient coverage). The KL penalty is crucial: without it, the model collapses to generating only the simplest correct solutions.
Data Quality vs Data Quantity
For reasoning, quality dominates quantity. Key empirical findings:
- Correct > plentiful: training on 10,000 verified-correct math solutions outperforms training on 100,000 unverified solutions that include errors
- Hard > easy: training on problems the model finds difficult (pass rate 1-20%) produces more improvement than training on easy problems (pass rate > 80%)
- Diverse > repetitive: solutions that use different reasoning strategies for the same problem type produce more robust reasoning than many solutions using the same approach
- Process > outcome: when available, step-by-step verified chains produce better reasoners than outcome-only verification
Audit Questions For A Reasoning Dataset
Before trusting a reasoning-data pipeline, ask:
- What exactly is verified: the final answer, each intermediate step, or only a learned score?
- Can the generator see the test suite, answer format, or proof checker interface well enough to game it?
- Are the kept solutions diverse, or are they near-duplicates of one template?
- Is the frontier difficulty calibrated, or are you mostly collecting already-solved easy problems?
- Does the kept trace justify the answer, or does it only coincide with a correct final output?
Common Confusions
Rejection sampling is not cherry-picking results
Rejection sampling generates training data, not evaluation results. You generate many solutions, keep the correct ones, and train on them. This is the training pipeline, not the evaluation protocol. At evaluation time, the model generates a single solution (or uses majority voting), and that solution is either correct or not.
A perfect verifier does not guarantee perfect reasoning
Even with a zero-false-positive verifier, the training data only contains solutions that happen to arrive at the correct final answer. The reasoning chain may contain errors that cancel out (e.g., two sign errors that compensate). This is why process reward models, which verify each step, produce stronger reasoners than outcome-only verification.
ReST is not policy-gradient RL
ReST samples outputs, filters them with a reward model or threshold, and then fine-tunes on the kept set. That is iterated supervised learning on filtered data, not PPO or REINFORCE. The infrastructure is simpler, but the data becomes stale as the policy drifts.
More compute for rejection sampling has diminishing returns
Generating 50 samples when the pass rate is 10% gives a roughly 99.5% chance of finding a correct solution. But generating 10,000 samples barely helps further. The marginal value of additional samples decreases rapidly once $n$ is large relative to $1/p$. The bottleneck shifts to problem diversity, not sample count.
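The saturation is easy to see numerically:

```python
p = 0.10
for n in (10, 50, 100, 1_000, 10_000):
    print(n, round(1 - (1 - p) ** n, 6))
# 10 -> 0.651, 50 -> 0.995, 100 -> 0.99997; past that, essentially 1.0
```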
Summary
- Reasoning data requires verified correctness, not just fluency
- Rejection sampling: generate many solutions, keep verified correct ones
- External verifiers (code execution, proof assistants, symbolic checkers) are the gold standard
- Self-play generates problems at the model's capability frontier
- RLVR uses verifier output as RL reward, enabling online exploration
- Data quality matters more than quantity: correct, hard, diverse solutions
- Process verification (step-by-step) is stronger than outcome verification (final answer only)
Exercises
Problem
A model has a 5% pass rate on competition math problems. How many samples per problem do you need to generate to have at least a 95% chance of finding one correct solution?
Problem
You are designing a self-play loop for math reasoning. The model currently solves 30% of problems at difficulty level $d$ and 2% at level $d+1$. Should you primarily train on level $d$ problems (high pass rate) or level $d+1$ problems (low pass rate)? Justify quantitatively by considering the cost per verified correct solution.
References
Canonical:
- DeepSeek-AI, "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" (2025)
- Cobbe et al., "Training Verifiers to Solve Math Word Problems" (2021)
- Lightman et al., "Let's Verify Step by Step" (2023). PRM800K dataset and process reward models
- Zelikman et al., "STaR: Bootstrapping Reasoning With Reasoning" (2022)
- Gulcehre et al., "Reinforced Self-Training (ReST) for Language Modeling" (2023)
Current:
- Singh et al., "Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models" (2023). ReST$^{EM}$ expectation-maximization interpretation
- Wang et al., "Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations" (2023)
- Google DeepMind, "AI achieves silver-medal standard solving International Mathematical Olympiad problems" (2024). AlphaProof and AlphaGeometry 2 public report
- Abdin et al., "Phi-4-reasoning Technical Report" (2025). Curated teachable prompts, teacher-generated traces, and a short RL phase
- Polu and Sutskever, "Generative Language Modeling for Automated Theorem Proving" (2020)
Next Topics
- DPO vs GRPO vs RL reasoning: how preference optimization methods use reasoning data
- Reward models and verifiers: the systems that score and verify model outputs
Last reviewed: April 23, 2026