
Paper breakdown

Training Language Models to Follow Instructions with Human Feedback

Long Ouyang et al. · 2022 · NeurIPS 2022

Three-stage recipe — supervised demonstrations, a reward model fit on pairwise human preferences, then PPO against that reward with a KL penalty to the supervised model — that turned a 1.3B base model into one human raters preferred to the 100x-larger raw GPT-3. Standardised RLHF as the post-training pipeline used by ChatGPT and the open-weight model ecosystem.

Overview

Ouyang et al. (2022) wrote down the recipe behind ChatGPT in full. The starting point is GPT-3, a 175B-parameter base model that completes text but does not follow instructions. The output is InstructGPT, a model human raters prefer to GPT-3 even when InstructGPT is 100× smaller (1.3B vs 175B). The mechanism is three sequential training stages on top of the same base.

Stage 1 is supervised fine-tuning (SFT) on a small set (~13k prompts) of human-written demonstrations of how to respond. Stage 2 trains a reward model: pick a prompt, sample several completions from the SFT model, ask human labelers to rank them, and fit a scalar reward function that scores any prompt-completion pair. Stage 3 fine-tunes the SFT model with proximal policy optimisation against that reward, with a per-token KL penalty back to the SFT model so the policy does not drift far from the data distribution the reward model was trained on.

No single technical ingredient is new. SFT was standard, Bradley-Terry reward modelling goes back to Christiano et al. (2017), and PPO to Schulman et al. (2017). What this paper did was put the three together at GPT-3 scale, document the labelling protocol with enough rigour that other groups could reproduce it, and report the full set of tradeoffs (the "alignment tax", reward hacking, the gain from KL regularisation). It is the operating manual the open-weight ecosystem (LLaMA-2-Chat, DeepSeek-Chat, Qwen-Chat, Mistral-Instruct) used through 2023 and 2024.

Mathematical Contributions

The three-stage pipeline

| Stage | Data | Loss | Output |
| --- | --- | --- | --- |
| 1. SFT | Human demonstrations (x, y) | Cross-entropy on y \mid x | Policy \pi^{\text{SFT}} |
| 2. Reward model | Pairwise comparisons (x, y_w, y_l) | Bradley-Terry log-likelihood | Reward r_\phi(x, y) \in \mathbb{R} |
| 3. PPO | Prompts x from the same distribution | KL-regularised reward maximisation | Policy \pi^{\text{RL}}_\theta |

The three datasets are roughly 13k SFT, 33k comparisons, 31k PPO prompts (Table 6 of the paper). Labelers are a hired team of ~40 contractors, with detailed instructions covering tone, factuality, harm avoidance, and edge cases — most of the paper's appendices document this protocol.
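
For concreteness, here is a minimal sketch of the stage-1 objective under the usual decoder-only setup: next-token cross-entropy on the demonstration y, with prompt tokens masked out. The paper does not spell out whether prompt positions are masked, so the mask here reflects common practice; the function name, tensor names, and shapes are illustrative, not the paper's code.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, input_ids, prompt_lens):
    """Stage-1 objective: cross-entropy on y | x, averaged over demonstration tokens.

    logits:      (batch, seq, vocab) model outputs on the concatenation [prompt; demonstration]
    input_ids:   (batch, seq) token ids of that concatenation
    prompt_lens: (batch,) number of prompt tokens per row; loss on these is masked out
    """
    # Shift so that position t predicts token t+1.
    shift_logits = logits[:, :-1, :]
    targets = input_ids[:, 1:]

    # A target token belongs to the demonstration iff its index >= prompt_len,
    # i.e. the predicting position t satisfies t >= prompt_len - 1.
    positions = torch.arange(targets.size(1), device=targets.device).unsqueeze(0)
    demo_mask = (positions >= (prompt_lens.unsqueeze(1) - 1)).float()

    token_loss = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).reshape(targets.shape)
    return (token_loss * demo_mask).sum() / demo_mask.sum()

# Toy usage: batch of 2, sequence length 6, vocabulary of 10.
logits = torch.randn(2, 6, 10)
input_ids = torch.randint(0, 10, (2, 6))
print(sft_loss(logits, input_ids, prompt_lens=torch.tensor([3, 2])))
```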

Reward modelling

For a prompt x and two candidate completions y_w (preferred by the labeler) and y_l (not preferred), the reward model r_\phi is fit under the Bradley-Terry assumption: the probability the labeler picks y_w over y_l is

\Pr(y_w \succ y_l \mid x) = \sigma\!\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)

with \sigma the logistic function. The training loss (Equation 1) is the negative log-likelihood under this model, averaged over all \binom{K}{2} pairs from a K-way ranking (K = 4 to 9 in the paper):

\mathcal{L}(\phi) = -\frac{1}{\binom{K}{2}}\, \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\!\big[ \log \sigma(r_\phi(x, y_w) - r_\phi(x, y_l)) \big].

The reward model is initialised from the 6B SFT model, with the language-modelling head replaced by a scalar head that reads the final-token hidden state. Importantly, all \binom{K}{2} pairs from a ranking go into one minibatch — this prevents the overfitting that occurs when the same completion appears in many gradient updates as a single comparison.
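
A minimal sketch of the Equation-1 loss for a single prompt, assuming the K completions have already been scored by the scalar reward head and sorted from most to least preferred; the helper name and tensor layout are illustrative, not the paper's code.

```python
import itertools
import torch
import torch.nn.functional as F

def bradley_terry_loss(rewards_ranked):
    """Negative log-likelihood over all C(K,2) pairs from one K-way ranking.

    rewards_ranked: (K,) scalar rewards r_phi(x, y) for the K completions of a
    single prompt, ordered from most to least preferred by the labeler.
    All pairs from the ranking contribute to the same loss term (one minibatch),
    as the paper recommends, so a completion is never reused across updates.
    """
    k = rewards_ranked.size(0)
    losses = [
        -F.logsigmoid(rewards_ranked[w] - rewards_ranked[l])  # -log P(y_w preferred over y_l)
        for w, l in itertools.combinations(range(k), 2)       # w is ranked above l
    ]
    return torch.stack(losses).mean()

# Toy usage: K = 4 completions for one prompt, scored by the reward head.
rewards = torch.tensor([1.3, 0.9, 0.2, -0.5], requires_grad=True)
bradley_terry_loss(rewards).backward()
print(rewards.grad)
```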

KL-regularised PPO

Stage 3 maximises expected reward minus a per-token KL penalty back to the SFT policy. The objective is

J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi^{\mathrm{RL}}_\theta(\cdot \mid x)}\!\Big[ r_\phi(x, y) - \beta \log\!\frac{\pi^{\mathrm{RL}}_\theta(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)} \Big] + \gamma\, \mathbb{E}_{x \sim \mathcal{D}_{\mathrm{pretrain}}}\!\big[\log \pi^{\mathrm{RL}}_\theta(x)\big]

(this is the "PPO-ptx" variant; the paper also reports plain PPO with \gamma = 0). Three pieces matter:

  1. Reward signal. The first term is the score from the trained reward model.
  2. KL anchor. The second term penalises deviation from the SFT policy; \beta is the KL coefficient. Without it, PPO finds adversarial completions that score high on the reward model but are off-distribution.
  3. Pretraining mixture. The third term mixes in gradients on standard pretraining data; this offsets capability regressions on benchmarks the SFT data does not cover.

The KL term is approximated as \log(\pi^{\mathrm{RL}}/\pi^{\mathrm{SFT}}) averaged over tokens sampled from the policy, which is a single-sample Monte Carlo estimate of \mathrm{KL}(\pi^{\mathrm{RL}} \| \pi^{\mathrm{SFT}}). The optimisation uses PPO with clipped importance ratios; the clip range is the standard 0.2.
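
A sketch of one common way to turn this objective into per-token PPO rewards for a single sampled completion: the KL penalty is applied at every token and the reward-model score is added at the final token, with the sampled-token log-ratio as the single-sample KL estimate. The \beta value is illustrative (the paper sweeps it), and the pretraining-mixture (\gamma) term — an ordinary language-modelling gradient on pretraining batches — is omitted here.

```python
import torch

def shaped_token_rewards(logprobs_rl, logprobs_sft, rm_score, beta=0.02):
    """Per-token rewards fed to PPO for one completion sampled from pi_RL.

    logprobs_rl:  (T,) log pi_RL(y_t | x, y_<t) under the current policy
    logprobs_sft: (T,) log pi_SFT(y_t | x, y_<t) under the frozen SFT policy
    rm_score:     scalar r_phi(x, y) from the frozen reward model
    beta:         KL coefficient (illustrative value; the paper sweeps it)
    """
    # Single-sample estimate of the per-token KL(pi_RL || pi_SFT),
    # evaluated on the tokens actually sampled from pi_RL.
    kl_per_token = logprobs_rl - logprobs_sft

    rewards = -beta * kl_per_token        # KL penalty at every token
    rewards[-1] = rewards[-1] + rm_score  # reward-model score at the final token
    return rewards

# Toy usage for a 5-token completion.
lp_rl = torch.log(torch.tensor([0.5, 0.4, 0.3, 0.6, 0.2]))
lp_sft = torch.log(torch.tensor([0.4, 0.5, 0.3, 0.5, 0.3]))
print(shaped_token_rewards(lp_rl, lp_sft, rm_score=1.7))
```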

Why the KL penalty is necessary

Section 4.5 of the paper documents what happens without it: the policy quickly drives the reward model into out-of-distribution regions where its score is high but its calibration with human preference has collapsed. This is the canonical reward-hacking failure mode. The KL penalty bounds the divergence so the reward model is evaluated only on inputs near the data it was trained on. Choosing β\beta too small lets the policy drift; choosing it too large keeps the policy at the SFT solution. The paper sweeps it.

Reported headline results

InstructGPT-1.3B is preferred to GPT-3-175B by labelers 71.4% of the time (Section 4.1, Figure 1). A separate held-out evaluation finds InstructGPT hallucinates less, follows explicit constraints more often, and produces less toxic outputs on the RealToxicityPrompts benchmark, while losing a modest amount on a few public NLP benchmarks — the "alignment tax" the paper names. The pretraining mixture (PPO-ptx) cuts most of the tax without losing the preference advantage.

What It Gets Right

The first thing is the labelling protocol. Most of Appendices A–C describes how labelers are screened, trained, and instructed, and how disagreement is handled. Appendix B reports inter-annotator agreement (~73%), which sets a ceiling on what any preference-trained model can match. Almost all subsequent RLHF work cargo-culted the InstructGPT label format, instruction set, and rating scale; the paper is the standard reference for what a serious labelling pipeline looks like.

The second is the explicit naming of failure modes. Section 4.5 ("Limitations") and the discussion of the alignment tax are unusual for an industry paper of this period — they document where the method falls short rather than burying it. The alignment tax is now a recognised term and the pretraining mixture is the canonical mitigation.

The third is the size of the preference signal relative to scale. A 1.3B model trained with this recipe beats a 175B model trained without it. This is the empirical fact that pushed every public lab to invest in post-training infrastructure rather than only scaling pretraining.

Common Misconceptions

InstructGPT is not a model release. It is a recipe paper that describes a family of models (1.3B, 6B, 175B) trained with the same procedure. ChatGPT is a sibling of InstructGPT trained with a similar pipeline on different data, not the model in this paper.

The reward model is not the policy. Two networks are involved at PPO time: the policy \pi^{\mathrm{RL}}_\theta that is being trained, and the frozen reward model r_\phi that scores its outputs. Confusing the two leads to questions like "why doesn't the policy just output the highest-scoring tokens directly?" — the answer is that, in effect, it does: PPO pushes the policy toward high-reward outputs, and the KL penalty is what keeps that pressure from collapsing into reward hacking.

PPO is not the only choice. The paper used PPO because it was the most mature on-policy RL algorithm at the time. Later work showed that simpler REINFORCE-style estimators (RLOO, GRPO) reach similar quality without a learned value function, and that direct preference optimisation (DPO) does so without on-policy rollouts at all. The recipe is "fine-tune on demonstrations, then optimise against preferences" — the specific preference-optimisation algorithm is interchangeable.


Further Reading

  • Christiano, P. F., Leike, J., Brown, T. B., Martic, M., Legg, S., & Amodei, D. (2017). "Deep Reinforcement Learning from Human Preferences." NeurIPS. arXiv:1706.03741. The paper that proposed Bradley-Terry reward learning from pairwise human comparisons in an RL setting; InstructGPT is the LLM-scale version of this idea.
  • Stiennon, N. et al. (2020). "Learning to Summarize with Human Feedback." NeurIPS. arXiv:2009.01325. Direct precursor at OpenAI; same three-stage pipeline applied to summarisation, and the source of most of the InstructGPT design.
  • Bai, Y. et al. (2022). "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback." arXiv:2204.05862. Anthropic's parallel implementation; documents the helpful/harmless tradeoff and the use of red-teamed adversarial prompts.
  • Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2023). "Direct Preference Optimization." NeurIPS. arXiv:2305.18290. Re-derives stages 2 and 3 as a single supervised loss, removing PPO from the pipeline.

References

Canonical:

  • Ouyang, L. et al. (2022). "Training language models to follow instructions with human feedback." NeurIPS. arXiv:2203.02155.

Direct precursors:

  • Christiano, P. F. et al. (2017). "Deep Reinforcement Learning from Human Preferences." NeurIPS. arXiv:1706.03741.
  • Stiennon, N. et al. (2020). "Learning to Summarize with Human Feedback." NeurIPS. arXiv:2009.01325.
  • Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). "Proximal Policy Optimization Algorithms." arXiv:1707.06347. The PPO algorithm used in stage 3.
  • Bradley, R. A., & Terry, M. E. (1952). "Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons." Biometrika, 39(3/4). The pairwise-comparison model used for reward fitting.

Parallel and follow-on work:

  • Bai, Y. et al. (2022). "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback." arXiv:2204.05862. The Anthropic version.
  • Rafailov, R. et al. (2023). "Direct Preference Optimization: Your Language Model is Secretly a Reward Model." NeurIPS. arXiv:2305.18290.
  • Casper, S. et al. (2023). "Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback." TMLR. arXiv:2307.15217. A careful catalogue of where the InstructGPT recipe is fragile or overclaims.
  • Gao, L., Schulman, J., & Hilton, J. (2023). "Scaling Laws for Reward Model Overoptimization." ICML. arXiv:2210.10760. Quantifies how reward-model accuracy degrades as the policy drifts; the empirical case for the KL penalty.

Standard textbook:

  • Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction, 2nd edition. MIT Press. Chapter 13 (policy gradient) for the RL background; the InstructGPT post-training pipeline reduces to a policy-gradient method with a learned reward.


Last reviewed: May 7, 2026