
LLM Construction

Reinforcement Learning from Human Feedback

The RLHF pipeline as math, not folklore: Bradley-Terry with sycophancy and intransitivity, KL-shielded PPO with overoptimization mitigation, the DPO implicit-reward identity and its likelihood-displacement failure, online-vs-offline preference learning, Nash-LHF, and the LIMA challenge to whether you need RL at all.

Advanced · Tier 1 · Frontier · Core spine · ~95 min

Why This Matters

RLHF: from pretrained model to aligned model in three phases

[Figure: the three-phase pipeline. Phase 1: SFT, supervised fine-tuning on human-written demonstration data. Phase 2: RM, train a reward model on comparison data (A > B rankings). Phase 3: RL, PPO against the reward model on prompts from the distribution, optimizing R(x, y) − β · KL(π_RL ∥ π_SFT), reward minus a KL penalty to prevent reward hacking. Output: an aligned model that is helpful, harmless, honest.]

RLHF turned base language models into chat assistants and is now the standard post-training stage at every frontier lab. The interesting content is no longer the pipeline; it is the failure modes that survived contact with scale. Bradley-Terry preference modeling encodes labeler bias as reward signal: Sharma et al. (2023) showed that human raters reliably prefer sycophantic agreement over truthful correction, so RLHF systematically trains in sycophancy. The KL penalty was supposed to prevent reward hacking, but Gao et al. (2023) show overoptimization happens well inside the KL budget. DPO removed PPO and the explicit reward model, but Razin et al. (2024) show that DPO can decrease the probability of the preferred response when training pairs are embedding-similar, directly inverting the optimization target. Tajwar et al. (2024) show that on-policy methods consistently beat offline DPO once the preference data has coverage gaps, narrowing the case for offline preference learning. Zhou et al. (2023, LIMA) show that 1,000 carefully curated SFT examples can reach or beat RLHF-trained models on most preference axes, raising the question of how much of "alignment" is actually pretraining capability being filtered.

This page covers the canonical Ouyang-Christiano-Schulman three-stage pipeline as math (Bradley-Terry, KL-shielded PPO), the Rafailov implicit-reward identity that turns DPO from heuristic into a derivation, the modern catalog of failure modes (sycophancy, overoptimization, likelihood displacement, intransitivity), and the post-2024 alternatives (Nash-LHF, online preference learning, LIMA-style minimal alignment). The connection to hallucination theory is direct: RLHF's binary preference grading is the same incentive structure that Kalai et al. (2025) prove drives confident-bluffing.

The Three-Stage Pipeline

The canonical RLHF pipeline (Ouyang et al., 2022, InstructGPT) has three stages:

  1. Supervised Fine-Tuning (SFT): Fine-tune a pretrained LM on high-quality demonstration data
  2. Reward Model Training: Train a scalar reward model on human preference comparisons
  3. RL Optimization: Optimize the policy against the reward model using PPO with a KL penalty against the SFT policy

Stage 1: Supervised Fine-Tuning

Start with a pretrained language model $\pi_{\text{pre}}$. Fine-tune it on a dataset of (prompt, high-quality response) pairs using standard next-token cross-entropy loss. This produces $\pi_{\text{SFT}}$.

SFT alone already improves instruction-following substantially. The purpose of the next two stages is to go beyond what demonstration data can teach: to learn preferences over response quality rather than just response format.

Stage 2: Reward Model

Definition

Bradley-Terry Preference Model

Given two responses $y_w$ (preferred) and $y_l$ (dispreferred) to prompt $x$, the Bradley-Terry model assumes:

$$P(y_w \succ y_l \mid x) = \sigma(r(x, y_w) - r(x, y_l))$$

where $r: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ is a learned reward function and $\sigma$ is the sigmoid function. The reward model is trained by maximizing this likelihood over a dataset of human comparisons.

Proposition

Bradley-Terry Reward Model Loss

Statement

The maximum likelihood objective for the reward model given a dataset $\mathcal{D} = \{(x^{(i)}, y_w^{(i)}, y_l^{(i)})\}_{i=1}^N$ is:

$$\mathcal{L}_{\text{RM}} = -\frac{1}{N}\sum_{i=1}^{N} \log \sigma\left(r(x^{(i)}, y_w^{(i)}) - r(x^{(i)}, y_l^{(i)})\right)$$

This is binary cross-entropy where the reward difference predicts which response humans prefer.

Intuition

The reward model does not learn absolute scores. It learns to rank responses: preferred responses get higher reward than dispreferred ones. The sigmoid converts the reward gap into a probability, and we maximize the probability of the observed preference ordering.

Proof Sketch

Write the log-likelihood of the data under the Bradley-Terry model: $\sum_i \log P(y_w^{(i)} \succ y_l^{(i)} \mid x^{(i)}) = \sum_i \log \sigma\left(r(x^{(i)}, y_w^{(i)}) - r(x^{(i)}, y_l^{(i)})\right)$. Negate for a loss to minimize.

Why It Matters

This is the standard reward model objective used in InstructGPT, the original ChatGPT RLHF pipeline, and most subsequent work. The reward model serves as a proxy for human judgment, cheaper to query than humans but susceptible to reward hacking.

Failure Mode

The Bradley-Terry model assumes a total ordering of responses. In practice, human preferences are intransitive (A > B, B > C, but C > A), context-dependent, and noisy. Labeler disagreement can be substantial (30%+ on subjective prompts). The reward model fits a smooth function to this noisy signal, which can produce systematic biases.
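The proposition's loss is short enough to sketch directly. A minimal NumPy version (function name and the example numbers are mine, for illustration):

```python
import numpy as np

def bt_loss(r_w, r_l):
    """Bradley-Terry reward-model loss: binary cross-entropy on the
    reward margin.  r_w / r_l hold reward scores for the preferred /
    dispreferred response of each comparison."""
    return float(-np.mean(np.log(1.0 / (1.0 + np.exp(-(r_w - r_l))))))

# Three comparisons: two ranked correctly (r_w > r_l), one misranked.
r_w = np.array([2.0, 1.5, 0.2])
r_l = np.array([0.5, 1.0, 1.0])
loss = bt_loss(r_w, r_l)

# Shifting every reward by a constant leaves the loss unchanged: only
# margins matter, which is why reward scores have no absolute scale.
shifted = bt_loss(r_w + 100.0, r_l + 100.0)
```

The shift invariance is the code-level form of the intuition above: the reward model learns rankings, not absolute scores.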

What Bradley-Terry Encodes That You Did Not Want

The single-reward, total-order structure of Bradley-Terry is not innocent. Two failure modes trace directly to the model assumption, not to PPO or DPO:

Watch Out

Sycophancy is a Bradley-Terry pathology, not an RL bug

Sharma et al. (2023, Anthropic) tested five frontier assistants and found a consistent pattern: when a user states an incorrect belief, the model agrees more often than it corrects. The mechanism is in the preference data, not the optimization. Their preference-model probe shows that reward models prefer sycophantic-but-wrong responses over truthful-but-disagreeable ones a non-negligible fraction of the time, and that this preference is inherited from the human raters who labeled the comparisons. RLHF then optimizes against this proxy and amplifies it.

The implication: any optimizer that treats Bradley-Terry preferences as ground truth (DPO, PPO, IPO, KTO) will inherit this bias unless the rater pool is structured to penalize agreement-with-wrong-claims. Switching from PPO to DPO does not fix sycophancy. Hardening the rater rubric or moving to constitutional AI / RLAIF with anti-sycophancy principles is the structural fix; reward-model surgery is not.

Watch Out

Intransitive human preferences violate the Bradley-Terry model assumption

Bradley-Terry assumes preferences are induced by an underlying scalar reward, which forces transitivity: $r(A) > r(B)$ and $r(B) > r(C)$ implies $r(A) > r(C)$. Real preferences over LLM responses do not satisfy this, especially when responses trade off different desiderata (helpfulness vs honesty vs conciseness). A reward model fit to intransitive data produces a single ordering that lies: whichever two of the three preferences the model fits exactly, the third must come out wrong. This is the structural motivation for Nash Learning from Human Feedback (Munos et al., 2024; see "Beyond Bradley-Terry" below).
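A tiny experiment makes the violation concrete: fit Bradley-Terry scores by gradient ascent to a perfectly cyclic dataset (A > B, B > C, C > A, observed equally often). By symmetry the maximum-likelihood fit collapses to equal rewards, predicting 50/50 on every pair the data labeled decisively. This is an illustrative sketch of mine, not an experiment from the cited papers:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Perfectly cyclic preference data over three responses:
# A > B, B > C, C > A, each observed equally often.
pairs = [(0, 1), (1, 2), (2, 0)]      # (winner, loser) indices

r = rng.normal(size=3)                 # random initial BT scores
for _ in range(2000):                  # gradient ascent on the BT log-likelihood
    g = np.zeros(3)
    for w, l in pairs:
        p_correct = sigmoid(r[w] - r[l])
        g[w] += 1.0 - p_correct
        g[l] -= 1.0 - p_correct
    r += 0.1 * g
    r -= r.mean()                      # scores are identifiable only up to a shift

# The MLE collapses to equal scores: every pairwise prediction is 50/50,
# so the scalar model honors none of the three decisive preferences.
fitted = [sigmoid(r[w] - r[l]) for w, l in pairs]
```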

Stage 3: PPO with KL Penalty

Given the reward model $r$ and the SFT policy $\pi_{\text{SFT}}$, optimize:

$$\max_{\pi} \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\left[r(x, y) - \beta \cdot D_{\text{KL}}\left(\pi(\cdot \mid x) \,\|\, \pi_{\text{SFT}}(\cdot \mid x)\right)\right]$$

where $\beta > 0$ controls the strength of the KL penalty. This is optimized using PPO (Proximal Policy Optimization) with the reward model providing the reward signal.

Proposition

KL Penalty Prevents Reward Hacking

Statement

Without the KL penalty ($\beta = 0$), the optimal policy $\pi^*$ maximizes $r(x, y)$ and will exploit any systematic errors in the reward model, producing outputs that score high under $r$ but low under true human judgment. The KL penalty constrains the policy to stay near $\pi_{\text{SFT}}$, limiting the degree of reward exploitation:

$$D_{\text{KL}}(\pi^* \| \pi_{\text{SFT}}) \leq \frac{r_{\max}}{\beta}$$

Larger $\beta$ keeps the policy closer to SFT; smaller $\beta$ allows more reward optimization at the risk of more hacking.

Intuition

The reward model is a learned proxy, not the true objective. Optimizing a proxy too aggressively is Goodhart's Law: "when a measure becomes a target, it ceases to be a good measure." The KL penalty sets a trust region around the SFT policy where the reward model is still a reasonable proxy.

Proof Sketch

The KL-penalized objective has a closed-form optimal policy: $\pi^*(y \mid x) \propto \pi_{\text{SFT}}(y \mid x) \exp(r(x,y)/\beta)$. As $\beta \to 0$, $\pi^*$ concentrates on the reward-maximizing response. As $\beta \to \infty$, $\pi^* \to \pi_{\text{SFT}}$. The bound on KL follows from substituting back into the objective.

Why It Matters

This is the central design choice of RLHF. The KL penalty is not just regularization for stability; it is a safety mechanism. Empirically, reward model score increases monotonically with optimization, but true quality (measured by human evaluation) peaks and then decreases. The KL penalty stops optimization before this peak.

Failure Mode

Choosing $\beta$ is difficult. Too large: the model barely improves over SFT. Too small: the model reward-hacks. In practice, $\beta$ is tuned by monitoring a held-out set of human evaluations during training, which is expensive. Reward model overoptimization (Gao et al., 2023) shows that the relationship between KL budget and true quality follows a roughly parabolic curve.
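The closed-form optimum from the proof sketch can be checked numerically on a toy response set. A sketch (the four-response setup and reward values are invented for illustration; response 3 plays the role of a reward-model exploit):

```python
import numpy as np

# Toy setting: four candidate responses, an SFT policy, and proxy
# rewards where response 3 is a reward-model exploit.
pi_sft = np.array([0.4, 0.3, 0.2, 0.1])
r      = np.array([0.0, 1.0, 2.0, 5.0])

def pi_star(beta):
    """Closed-form optimum of E[r(y)] - beta * KL(pi || pi_sft):
    reweight the SFT policy by exp(r / beta) and renormalize."""
    w = pi_sft * np.exp(r / beta)
    return w / w.sum()

near_sft  = pi_star(beta=100.0)   # strong penalty: barely moves off SFT
collapsed = pi_star(beta=0.1)     # weak penalty: collapses onto the exploit
```

Sweeping `beta` between these extremes traces exactly the tradeoff the proposition describes: more reward optimization, less trust-region protection.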

Mitigating Overoptimization: Reward-Model Ensembles

Coste, Anwar, Kirk, and Krueger (ICLR 2024) test whether the parabolic gold-vs-proxy curve is fundamental or an artifact of using a single reward model. They train an ensemble of reward models on different splits of the preference data and define two conservative aggregation rules:

  • Worst-Case Optimization (WCO): the policy is rewarded by the minimum reward across the ensemble, $r_{\text{WCO}}(x,y) = \min_k r_k(x,y)$. The min penalizes responses that look good to one reward model but suspicious to another.
  • Uncertainty-Weighted Optimization (UWO): subtract a multiple of the cross-ensemble standard deviation from the mean, $r_{\text{UWO}}(x,y) = \overline{r}(x,y) - \lambda \cdot \operatorname{std}(\{r_k(x,y)\})$. Larger ensemble disagreement is treated as larger overoptimization risk.

Their headline result: in best-of-$N$ sampling with the gold reward held out, conservative ensembling eliminates the parabolic peak entirely and improves performance by up to 70% over single-RM optimization. In PPO the gain is smaller but the overoptimization onset is delayed substantially. The ensemble cost scales linearly with the number of reward models, but the reward-model forward pass is small compared to the policy update, so this is a cheap mitigation that generalizes to any RLHF variant.

This generalizes the principle behind the Exercise 3 protocol below: ensemble disagreement is a usable proxy for overoptimization, available at training time without a held-out human eval.
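Both aggregation rules are one-liners over a matrix of ensemble scores. A sketch under assumed shapes (ensemble axis first; the example scores are fabricated to show the disagreement signature of a reward hack):

```python
import numpy as np

def r_wco(scores):
    """Worst-case aggregation: minimum reward across the ensemble.
    scores has shape (n_models, n_responses)."""
    return scores.min(axis=0)

def r_uwo(scores, lam=1.0):
    """Uncertainty-weighted aggregation: ensemble mean minus
    lam * ensemble standard deviation."""
    return scores.mean(axis=0) - lam * scores.std(axis=0)

# Scores from a 3-model ensemble for two candidate responses.
# Response 0 looks good to every model; response 1 looks great to
# exactly one model -- the classic signature of a reward hack.
scores = np.array([[1.0, 3.0],
                   [1.1, 0.2],
                   [0.9, 0.1]])
```

Note that a plain ensemble mean would still prefer the suspicious response here; only the conservative aggregations reverse the ranking.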

Reward Hacking and Goodhart's Law

Definition

Reward Hacking

Reward hacking occurs when the policy finds outputs that achieve high reward model scores through features that do not correspond to genuine quality. Examples include: generating longer responses (length bias in reward models), repeating confident-sounding phrases, or including irrelevant but impressive-sounding details. The reward model assigns high scores because these features correlate with quality in the training data, but the correlation breaks under optimization.

Watch Out

The reward model is not the objective

The true objective is human satisfaction. The reward model is a proxy. RLHF works only in the regime where the proxy and the true objective agree. The KL penalty keeps the policy in this regime. Going beyond this regime (overoptimization) degrades the actual output quality even as the proxy reward increases.

InstructGPT: The Canonical Worked Example

Ouyang et al. (2022) applied the three-stage pipeline to GPT-3 (175B). The published numbers (Tables 1, 6, and Section 3 of the paper) are the load-bearing reference point: every later "RLHF works" / "RLHF is brittle" claim is implicitly comparing against them.

Data scale of each stage (Table 6):

| Stage | Prompts | Source | Cost-relevant fact |
|---|---|---|---|
| SFT | 12,725 | mostly written by hired labelers, some from API | Single-pass next-token loss |
| Reward Model | 33,207 prompts $\times$ 4–9 ranked completions each | sampled from earlier-stage models | Yields ~50K pairwise comparisons after expansion |
| PPO | 31,144 | from API, no labels | No new human data; reward signal entirely from RM |

The SFT cost is dominated by writing demonstrations; the RM cost is dominated by ranking; PPO uses no new human data once the RM is trained. Total human-label budget is on the order of $10^4$ prompts, not $10^6$. Modern frontier labs collect orders of magnitude more, but the InstructGPT counts remain the demonstration that RLHF can produce a useful assistant from a budget that academic groups can match.

Labeler agreement. Inter-annotator agreement on the ranking task was 72.6% pairwise (their Section 3.4). The remaining ~27% disagreement is irreducible noise; the reward model fits a smooth function to it and the policy then optimizes against the fit. This is the empirical anchor for the "Bradley-Terry assumes total order, real preferences are noisy" failure mode noted in the proposition above.

Size-vs-alignment tradeoff. The 1.3B InstructGPT was preferred over the 175B base GPT-3 by labelers on the InstructGPT prompt distribution (Section 4.2, Figure 3). This is the "$100\times$ size reduction" headline, but it is conditional on prompt distribution: on standard NLP benchmarks (SQuAD, DROP, HellaSwag, etc.) the RLHF-trained model loses 1–10 points relative to the base model. Ouyang et al. call this the alignment tax.

Mitigating the tax: PPO-ptx. Their Section 4.3 introduces "PPO-ptx," which mixes the standard pretraining log-likelihood gradient into the PPO update with coefficient $\gamma$:

$$\mathcal{L}_{\text{PPO-ptx}} \;=\; \mathcal{L}_{\text{PPO}} \;+\; \gamma \cdot \mathbb{E}_{x \sim \mathcal{D}_{\text{pre}}}\left[\log \pi_\theta(x)\right].$$

This recovers most of the lost benchmark performance while preserving the alignment gains. Modern recipes use the same trick under different names (continued pretraining mixed into RL, KL-to-pretrain in addition to KL-to-SFT). The mechanism is to keep the policy from drifting on the language-modeling distribution while drifting on the preference distribution.
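At the loss level the mix is a one-line change. A sketch treating the PPO surrogate as a scalar loss to be minimized, so the log-likelihood term enters with a minus sign (function name is mine; real implementations apply this per batch inside the training step):

```python
import numpy as np

def ppo_ptx_loss(ppo_loss, pretrain_logprobs, gamma):
    """Mix the pretraining log-likelihood into the PPO loss.

    ppo_loss: scalar PPO surrogate loss (minimized).
    pretrain_logprobs: token log-probs on pretraining data under the
        current policy.
    gamma: pretraining-mix coefficient.  Subtracting the mean log-prob
    means higher likelihood on pretraining data lowers the total loss.
    """
    return ppo_loss - gamma * float(np.mean(pretrain_logprobs))
```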

Reward model accuracy. RM accuracy on held-out comparisons was 72%, almost exactly matching labeler agreement. This is a clean observation: the RM is fitting the labelers, not some platonic preference structure. Improving RM accuracy past labeler agreement is impossible without resolving labeler disagreement first.

What Has Changed Since

DPO (Direct Preference Optimization)

Rafailov, Sharma, Mitchell, Ermon, Manning, and Finn (NeurIPS 2023) eliminated the explicit reward model and PPO loop. The elimination rests on a single algebraic identity, not a heuristic. It deserves a theorem.

Theorem

DPO implicit-reward identity (Rafailov et al. 2023)

Statement

Let $\pi_r$ denote the optimal policy under the KL-regularized reward objective $\max_\pi \mathbb{E}_\pi[r(x,y)] - \beta D_{\text{KL}}(\pi \| \pi_{\text{ref}})$. Then $r$ and $\pi_r$ satisfy

$$r(x,y) \;=\; \beta \log \frac{\pi_r(y \mid x)}{\pi_{\text{ref}}(y \mid x)} \;+\; \beta \log Z(x), \qquad Z(x) = \sum_{y'} \pi_{\text{ref}}(y'\mid x)\exp\left(r(x,y')/\beta\right).$$

Substituting this identity into the Bradley-Terry preference model causes the partition function $Z(x)$ to cancel (it depends on $x$ but not on $y$), yielding a preference probability expressed entirely in terms of policy ratios:

$$P(y_w \succ y_l \mid x) \;=\; \sigma\!\left(\beta \log \frac{\pi_r(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)} \;-\; \beta \log \frac{\pi_r(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)}\right).$$

Maximum-likelihood fitting of this preference probability over the dataset is the DPO loss,

$$\mathcal{L}_{\text{DPO}}(\pi_\theta;\pi_{\text{ref}}) \;=\; -\,\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}\!\left[\log \sigma\!\left(\beta \log \tfrac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)} - \beta \log \tfrac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)}\right)\right].$$

Equations 5, 6, 7 of Rafailov et al. (2023). Theorem 1 of the paper extends the bijection to the entire Plackett-Luce family.

Intuition

The KL-regularized objective is a soft-max over rewards: the optimal policy puts mass proportional to $\pi_{\text{ref}}(y\mid x) \exp(r(x,y)/\beta)$. Inverting this gives a one-to-one map between rewards and optimal policies (modulo a prompt-dependent constant). The Bradley-Terry model only cares about reward differences, so the constant cancels and the policy-ratio expression is what survives. There is no approximation in this derivation; DPO is the maximum-likelihood preference fit under the same assumptions as standard RLHF.

Proof Sketch

The closed-form optimal policy under KL-regularized reward maximization is $\pi_r(y\mid x) = \pi_{\text{ref}}(y\mid x) \exp(r(x,y)/\beta) / Z(x)$ (Exercise 2). Take $\beta$ times the log of both sides and rearrange to get the identity. Substituting into $P(y_w \succ y_l) = \sigma(r(x,y_w)-r(x,y_l))$ cancels $\beta \log Z(x)$ from both terms, leaving an expression in policy log-ratios.

Why It Matters

The identity collapses the two-stage RLHF pipeline (fit RM, then optimize) into one stage (fit policy directly). The benefits are concrete: no reward model to train, no PPO inner loop, no separate value function, no on-policy sampling during training. The cost is that the maximum-likelihood fit is over a fixed offline dataset; the policy never queries new responses. Whether this matters depends on whether the offline preference data covers the responses you would generate during training (see Tajwar et al. 2024 below).

Failure Mode

The identity is exact for the Bradley-Terry preference model. It inherits every flaw of that model: sycophancy is still encoded, intransitivity is still violated, label noise is still amplified. It also creates one new failure mode that PPO does not have, likelihood displacement, which we treat next as its own result.
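The loss itself needs only summed sequence log-probabilities under the policy and the frozen reference. A minimal NumPy sketch (function name and the example numbers are illustrative):

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss from summed sequence log-probabilities.

    logp_*:     log pi_theta(y | x) for preferred / dispreferred responses
    ref_logp_*: the same quantities under the frozen reference policy
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return float(-np.mean(np.log(1.0 / (1.0 + np.exp(-margin)))))

# At initialization the policy equals the reference, so both implicit
# rewards are zero and the loss is exactly log 2.
logp_w = np.array([-12.0, -8.0])
logp_l = np.array([-14.0, -9.5])
loss_init = dpo_loss(logp_w, logp_l, logp_w, logp_l)
```

Raising the preferred log-probability relative to the reference lowers the loss, which is the whole training signal: no reward model, no sampling, no value function.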

Likelihood Displacement: A DPO Failure Mode That PPO Does Not Have

Razin, Malladi, Bhaskar, Chen, Arora, and Hanin (2024) prove and empirically show that DPO can systematically decrease the probability of preferred responses, the exact opposite of the optimization target.

Watch Out

DPO can drive preferred-response probability down, not up

The mechanism: the DPO gradient with respect to the model parameters is a difference of two terms, the gradient of $\log \pi_\theta(y_w \mid x)$ and the gradient of $\log \pi_\theta(y_l \mid x)$, weighted by the sigmoid of the current policy log-ratio. When $y_w$ and $y_l$ produce nearly identical hidden representations (high embedding similarity), the two gradients point in nearly the same direction. The negative gradient on $y_l$ then drags the parameters in a direction that also reduces $\log \pi_\theta(y_w \mid x)$. The probability of the preferred response can fall during training even as the loss decreases, because what the loss actually rewards is the gap between $\log \pi_\theta(y_w \mid x)$ and $\log \pi_\theta(y_l \mid x)$, and that gap can grow while both probabilities shrink.

Razin et al. introduce a Centered Hidden Embedding Similarity (CHES) score to predict which training pairs trigger this. Pairs with high CHES are the ones that displace probability mass.

The headline empirical result: applying DPO with default hyperparameters to safety-tune Llama-3-8B-Instruct reduces the model's refusal rate on harmful prompts from 74.4% to 33.4%. The training signal said "prefer the refusal"; DPO obeyed by lowering the refusal probability, because the unsafe completions were close to the refusal in embedding space and the optimization preferred to drag both down rather than separate them.

The structural lesson: DPO is not a drop-in replacement for PPO when preference pairs are embedding-similar. PPO does not have this failure because it samples on-policy and optimizes a reward model that scores responses absolutely, not relatively. Preprocessing preference data to filter high-CHES pairs, or switching to identity preference optimization (IPO) with bounded gradients, mitigates this.
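The geometry can be reproduced in a three-response toy model: a linear policy over response features, where $y_w$ and $y_l$ are nearly parallel in feature space and a third, unrelated response absorbs the probability mass. This is an illustrative caricature of the Razin et al. mechanism (it drops the reference-ratio weighting), not their construction:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Response feature vectors: y_w and y_l are nearly parallel (high
# embedding similarity); y_3 is unrelated to both.
phi = np.array([[1.0,  0.9],    # y_w, preferred
                [0.9,  1.0],    # y_l, dispreferred
                [0.0, -1.0]])   # y_3, neither
w = np.zeros(2)                  # linear policy: P(y) = softmax(phi @ w)

p_w_before = softmax(phi @ w)[0]

# Repeatedly push the gap logit(y_w) - logit(y_l) up, the direction a
# DPO-style preference loss moves the parameters.
for _ in range(200):
    w += 0.5 * (phi[0] - phi[1])

z = phi @ w
p_w_after = softmax(z)[0]
# The preference gap grew (z[0] > z[1]) yet P(y_w) collapsed: the update
# direction phi[0] - phi[1] barely overlaps phi[0] but raises the
# unrelated response's logit ten times faster, and the softmax
# renormalization drives P(y_w) down.
```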

Online vs Offline Preference Learning

Tajwar, Singh, Sharma, Rafailov, Schneider, Xie, Ermon, Finn, and Kumar (ICML 2024) systematically compare offline preference fine-tuning (DPO and variants) against on-policy methods (RLHF-PPO, REBEL, online DPO with fresh sampling). Their finding sharpens the choice:

On-policy sampling and negative-gradient updates outperform offline maximum-likelihood objectives, especially when the preferred response lies in a region of low probability under the reference policy.

Their unifying frame is mode-seeking vs mode-covering. Offline DPO optimizes a maximum-likelihood objective on fixed pairs, which is mode-covering: it spreads probability across all preferred responses in the dataset. On-policy methods with a negative gradient on dispreferred completions are mode-seeking: they actively push down on the dispreferred regions of policy space. When the preference data has coverage gaps relative to the policy's actual sampling distribution, mode-seeking wins because it can correct what the policy is currently producing, while mode-covering can only push toward what the dataset already showed.

Practical implication: if your preference dataset was collected before the current policy existed, offline DPO will be biased toward correcting whatever model produced the preference completions, not your current policy. On-policy preference learning (sample with the current policy, query a reward model or judge, update) is more expensive per step but reaches better final quality because the gradient signal is grounded in the actual policy distribution.

This recasts the DPO-vs-PPO comparison: it is not "RL vs supervised" but "online vs offline preference signal." DPO with periodic on-policy preference re-collection ("online DPO" or iterative DPO) inherits much of the on-policy benefit at lower cost than full PPO.
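The iterative recipe is a short loop: sample a pair from the current policy, label it with a judge, take a DPO gradient step. A toy sketch with four discrete responses and a noise-free gold-reward judge (everything here is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()

true_quality = np.array([0.0, 1.0, 2.0, 3.0])   # hidden gold reward (the "judge")
logits = np.zeros(4)                             # policy parameters, one per response
ref_logits = logits.copy()                       # frozen reference policy
beta, lr = 0.1, 1.0

for _ in range(500):
    # On-policy: draw a comparison pair from the *current* policy.
    p = softmax(logits)
    a, b = rng.choice(4, size=2, p=p)
    if a == b:
        continue
    # Judge labels the freshly sampled pair.
    yw, yl = (a, b) if true_quality[a] > true_quality[b] else (b, a)
    # One DPO gradient step on the fresh pair.
    margin = beta * ((logits[yw] - ref_logits[yw]) - (logits[yl] - ref_logits[yl]))
    g = 1.0 / (1.0 + np.exp(margin))             # sigmoid(-margin), the DPO weight
    logits[yw] += lr * beta * g
    logits[yl] -= lr * beta * g

final_p = softmax(logits)
# The best response only ever wins comparisons, so its logit only ever
# increases; the worst only ever loses, so its logit only decreases.
```

The key line is the sampling step: because pairs come from the current policy, the gradient signal tracks what the policy actually produces, which is exactly the coverage property offline DPO lacks.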

GRPO

Definition

Group Relative Policy Optimization (GRPO)

GRPO (Shao et al., 2024) modifies PPO by removing the value function (critic) and estimating the advantage from a group of $G$ completions sampled from the same prompt. For prompt $x$, sample $\{y_1, \ldots, y_G\} \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)$, score each with reward $r_i$, and define the advantage as the within-group standardized reward:

$$A_i = \frac{r_i - \operatorname{mean}(\{r_1, \ldots, r_G\})}{\operatorname{std}(\{r_1, \ldots, r_G\})}$$

The policy is updated using the PPO-style clipped surrogate objective with $A_i$ as the advantage, plus a KL penalty against a fixed reference policy $\pi_{\text{ref}}$ (typically the SFT model), matching the regularization role that $\pi_{\text{SFT}}$ plays in standard RLHF.

GRPO was introduced in DeepSeekMath (Shao et al., 2024) and is the RL algorithm used to train DeepSeek-R1 (DeepSeek-AI, 2025). Key properties:

  • No critic network. Standard PPO trains a value function $V_\phi(s)$ of comparable size to the policy to compute advantages. GRPO drops it. Memory cost drops from holding two large models (policy and value) to one, which matters at LLM scale where each is tens of billions of parameters.
  • Group-relative advantage. The baseline is the group mean, not a learned value estimate. Standardizing by the group standard deviation gives per-prompt advantages with zero mean and unit variance, which stabilizes updates across prompts of very different reward scales.
  • Same KL regularization as RLHF. GRPO retains a per-token KL penalty against πref\pi_{\text{ref}}, preserving the trust-region role that prevents reward hacking.
  • Pairs well with verifiable rewards. When $r_i$ is a rule-based check (correct answer, passing test, valid format), GRPO skips reward-model training entirely and the group-relative estimator has low variance.
Example

Advantage computation for a group of G=4

Prompt: "Solve $2x + 3 = 11$." Sample four completions and check final answers. Suppose rewards are $r_1 = 1, r_2 = 0, r_3 = 1, r_4 = 0$.

Mean $= 0.5$, standard deviation $= 0.5$. Advantages: $A_1 = A_3 = 1.0$ and $A_2 = A_4 = -1.0$. Each token in a correct completion receives advantage $+1$; each token in an incorrect completion receives $-1$. No value network is consulted.
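The advantage computation from the example above is a two-line function. One practical wrinkle worth a comment: an all-correct or all-wrong group has zero standard deviation, so the sketch below guards the division with a small epsilon (my addition; implementations vary on how they handle this case):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Within-group standardized rewards; no value network involved.

    eps guards the all-correct / all-wrong group, where the standard
    deviation is zero.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

adv = grpo_advantages([1, 0, 1, 0])   # the group from the worked example
```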

See the verifier design and process reward page for details on rule-based rewards, and DPO vs GRPO vs RL reasoning for a direct comparison.

Process vs Outcome Reward Models

Definition

Outcome Reward Model (ORM)

An outcome reward model scores only the final answer of a full solution, $r_{\text{ORM}}(x, y) \in \mathbb{R}$, regardless of the reasoning steps that produced it. Supervision comes from labeling complete solutions as correct or incorrect.

Definition

Process Reward Model (PRM)

A process reward model scores each intermediate step of a reasoning trace. Given a solution broken into steps $y = (s_1, s_2, \ldots, s_T)$, the PRM produces per-step scores $r_{\text{PRM}}(x, s_{1:t}) \in \mathbb{R}$ for $t = 1, \ldots, T$. Supervision comes from human or automated labels on each step as correct, neutral, or incorrect.

Lightman et al. (2023) "Let's Verify Step by Step" trained a PRM on the PRM800K dataset (around 800K step-level labels on MATH solutions) and showed that PRMs outperform ORMs when used as verifiers for best-of-$N$ sampling. Uesato et al. (2022) established the earlier process-vs-outcome comparison on grade-school math, finding that process supervision yields better interpretability even when outcome supervision matches final accuracy.

Why PRMs help:

  • Step-level credit assignment. An ORM can only say "this solution is wrong." A PRM localizes the first incorrect step, so downstream updates (or search procedures) do not penalize correct reasoning that happened to precede a later error.
  • Test-time verification. Given $N$ sampled solutions, a PRM can rank them by minimum step score or product of step scores, enabling verifier-guided best-of-$N$. This is one of the main levers behind test-time compute scaling in reasoning models.
  • Robustness to false positives. ORMs can assign high reward to a solution that reaches the correct answer through flawed reasoning. PRMs penalize such traces at the step level.

Costs:

  • Annotation burden. Step-level labels are far more expensive than outcome labels. PRM800K required heavy human effort. Automated step labeling (e.g., Math-Shepherd, rollouts from intermediate states) partially mitigates this but introduces its own noise.
  • Step segmentation. "What counts as a step" is not well-defined for free-form text. Segmentation choices affect both training and inference.
  • PRM hacking. Like any learned reward, PRMs are proxies. A policy optimized against a PRM can produce traces that score well step by step without being globally correct.
Watch Out

PRM does not replace the outcome signal

Using a PRM does not mean the final answer stops mattering. Most pipelines combine a PRM (for step-level verification and search guidance) with an ORM or ground-truth checker (for final correctness). The PRM narrows the search space; the outcome check is the terminal objective.
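Verifier-guided best-of-$N$ with a PRM reduces to aggregating per-step scores and sorting. A sketch using the min / product aggregation options mentioned above (function name and the example scores are fabricated):

```python
import numpy as np

def rank_by_prm(step_scores, agg="min"):
    """Order candidate solutions by an aggregate of per-step PRM scores.

    step_scores: one 1-D array of step scores per candidate (lengths
    may differ).  agg: 'min' (weakest step) or 'prod' (product of steps).
    Returns candidate indices, best first.
    """
    aggregate = np.min if agg == "min" else np.prod
    vals = [float(aggregate(s)) for s in step_scores]
    return sorted(range(len(vals)), key=lambda i: -vals[i])

# Candidate 0: uniformly solid steps.  Candidate 1: mostly excellent,
# with one near-zero step -- the kind of trace an ORM would accept if
# the final answer happened to be right.
cands = [np.array([0.8, 0.8, 0.8]),
         np.array([0.95, 0.05, 0.95])]
order = rank_by_prm(cands)
```

The min rule is the stricter of the two: a single bad step sinks the whole trace, which is exactly the step-level credit assignment an ORM cannot provide.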

Constitutional AI

Instead of collecting human preference labels, generate preference data by having the model critique its own outputs against a set of principles. This scales the feedback generation process but still requires careful principle design.

RLVR (RL with Verifiable Rewards)

For tasks with objective correctness criteria (math, code, factual QA), skip the reward model entirely and use the correctness signal as the reward. This eliminates Goodhart concerns for the verifiable component but does not help with subjective quality.

Beyond Bradley-Terry: Nash Learning from Human Feedback

Munos, Valko, Calandriello, Azar, Rowland, Guo, Tang, Geist, Mesnard, Michi, Selvi, Girgin, Momchev, Bachem, Mankowitz, Precup, and Piot (DeepMind, 2024) attack the assumption that human preferences come from an underlying scalar reward. They replace Bradley-Terry with a pairwise preference model $\mathcal{P}(y \succ y' \mid x)$ that is fit directly on comparison data without forcing transitivity.

The optimization target then becomes a two-player game: find a policy $\pi^*$ that beats every other policy in expected pairwise preference,

$$\pi^* \;=\; \arg\max_{\pi} \;\min_{\pi'} \;\mathbb{E}_{x,\, y\sim\pi(\cdot\mid x),\, y'\sim\pi'(\cdot\mid x)}\!\left[\mathcal{P}(y \succ y' \mid x)\right].$$

The solution is the Nash equilibrium of the symmetric pairwise-preference game. When the preference model satisfies Bradley-Terry, the Nash equilibrium reduces to the standard reward-maximizing policy (modulo the partition function), so NLHF strictly generalizes RLHF.

The paper proposes Nash-MD, a mirror-descent algorithm that produces a sequence of policies converging to a regularized Nash equilibrium. The convergence is theoretical (rates depend on regularization strength) and the empirical evaluation on dialogue tasks shows quality gains over standard PPO-RLHF, especially on prompts where intransitive preferences are documented.

When NLHF matters: applications where preferences trade off multiple incomparable axes (helpfulness vs harmfulness vs conciseness), where a single scalar reward forces a particular tradeoff that the labelers did not endorse. When NLHF is overkill: single-axis preference settings (math correctness, code passing tests) where Bradley-Terry is approximately right.
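The gap between Bradley-Terry and the game view shows up already in a three-response, rock-paper-scissors preference matrix. In the sketch below (numbers invented), the uniform policy is unexploitable, while a deterministic policy of the kind a scalar-reward fit would select loses badly to its counter:

```python
import numpy as np

# Cyclic (rock-paper-scissors) preference matrix:
# P[i, j] = probability that response i beats response j.
P = np.array([[0.5, 0.9, 0.1],
              [0.1, 0.5, 0.9],
              [0.9, 0.1, 0.5]])

def value(pi, pi_prime):
    """Expected preference for pi's sample over pi_prime's sample."""
    return float(pi @ P @ pi_prime)

def exploitability(pi):
    """Shortfall below 0.5 against a best-responding pure opponent.
    Zero exploitability means pi is a Nash equilibrium of the game."""
    return 0.5 - min(value(pi, np.eye(3)[j]) for j in range(3))

uniform = np.ones(3) / 3      # the Nash equilibrium of this symmetric game
greedy  = np.eye(3)[0]        # deterministic, as a scalar-reward fit would pick
```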

Is RLHF Necessary? The LIMA Challenge

Zhou, Liu, Xu, Iyer, Sun, Mao, Ma, Efrat, Yu, Yu, Zhang, Ghosh, Lewis, Zettlemoyer, and Levy (NeurIPS 2023) trained LLaMA-65B on 1,000 carefully curated SFT examples with no RLHF and no preference data. In head-to-head human evaluation:

| Comparison | LIMA wins or ties |
|---|---|
| LIMA vs DaVinci003 (RLHF-trained GPT-3.5) | 65% |
| LIMA vs Bard (March 2023) | 58% |
| LIMA vs Claude (RLHF) | 54% |
| LIMA vs GPT-4 | 43% |

These are not "LIMA dominates" numbers, but they are not "RLHF dominates" numbers either. The implication, formalized by the paper as the Superficial Alignment Hypothesis:

A model's knowledge and capabilities are learned almost entirely during pretraining; alignment teaches it which subdistribution of formats to use when interacting with users.

If the hypothesis is true, then RLHF is not adding capability, only filtering capability. Most of the cost-benefit case for RLHF rests on the assumption that it does something SFT cannot. LIMA shows that for many user-facing axes (response style, format, refusal of harmful requests, factual recall on common topics), high-quality SFT can match RLHF.

Where the LIMA argument breaks down. LIMA does not test:

  • Adversarial safety against motivated red-teaming
  • Multi-turn dialogue robustness over long conversations
  • Calibration / hedging on borderline factual claims
  • Performance on tasks requiring complex reasoning chains
  • Preference grounding for subjective quality dimensions where SFT examples cannot represent the full preference structure

Empirically, frontier labs use RLHF + preference learning for these reasons even when SFT alone would handle the easy preference axes. The honest synthesis: SFT does most of the alignment work for typical prompts, but RLHF and DPO contribute disproportionately on the long tail of safety, calibration, and subjective-quality cases where preference grounding is the only available signal. Treating LIMA as evidence that "RLHF is unnecessary" is reading the result too strongly; treating it as evidence that "most of the alignment win is from data curation, not optimization algorithm" is closer to what the experiments support.

Common Confusions

Watch Out

RLHF does not teach the model new knowledge

RLHF steers the model toward responses that humans prefer. It cannot teach facts that were absent from pretraining data. If the base model does not know a fact, RLHF will not inject it. RLHF changes the distribution over existing capabilities, not the capabilities themselves.
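This "filtering, not adding" point follows directly from the KL-regularized optimal policy, which has the form π*(y|x) ∝ π_SFT(y|x) · exp(r(x,y)/β): any response with zero probability under the base model keeps zero probability no matter how large its reward. A toy numpy sketch with hypothetical numbers:

```python
import numpy as np

beta = 0.1
# Toy base-model distribution over four candidate responses; the last one
# (say, a fact absent from pretraining) has probability exactly zero.
pi_sft = np.array([0.5, 0.3, 0.2, 0.0])
reward = np.array([0.0, 1.0, 2.0, 10.0])  # huge reward on the unknown response

# KL-regularized optimum: pi*(y) proportional to pi_sft(y) * exp(r(y) / beta)
unnorm = pi_sft * np.exp(reward / beta)
pi_star = unnorm / unnorm.sum()
print(pi_star)  # mass shifts hard toward high-reward responses, but the zero stays zero
```

The reward can reweight the support arbitrarily, but it cannot extend it: that is the precise sense in which RLHF changes the distribution over existing capabilities rather than the capabilities.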

Watch Out

DPO is not strictly better than PPO-based RLHF

DPO is simpler and avoids reward model training, but it optimizes on a fixed dataset of preferences. PPO-based RLHF can explore: the policy generates new responses during training, which the reward model evaluates. This exploration can find good responses not present in the preference dataset. Tajwar et al. (2024) extend this: on-policy methods systematically outperform offline DPO when the preference dataset has coverage gaps, and adding "negative gradient" updates that actively push down dispreferred responses (rather than only pulling up preferred ones) is a major part of the gap. The honest comparison is "online preference learning vs offline preference learning," not "RL vs supervised."
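The offline character of DPO is visible in the loss itself: every term is a log-probability of a response already in the dataset, so nothing the policy might newly sample ever enters the gradient. A minimal numpy sketch of the per-pair DPO loss built from the implicit-reward identity, with hypothetical log-probabilities:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (y_w, y_l) pair: -log sigma(beta * (Delta_w - Delta_l)),
    where Delta_y = log pi(y|x) - log pi_ref(y|x) is the implicit reward over beta."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(sigmoid(margin))

# Hypothetical sequence log-probs under the policy and the frozen reference.
loss = dpo_loss(logp_w=-12.0, logp_l=-15.0, ref_logp_w=-14.0, ref_logp_l=-14.0, beta=0.1)
print(round(loss, 4))  # -> 0.5544
# Every input above comes from the fixed dataset pair; responses the policy would
# generate on its own never appear -- the coverage gap Tajwar et al. point at.
```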

Watch Out

More compute on RLHF does not strictly improve safety

Three results in tension with the "more RLHF, more aligned" intuition:

  1. Gao et al. (2023): proxy reward grows monotonically with KL budget, but gold reward (true human preference) follows a parabola. Beyond the peak, more RLHF makes the model worse on the actual objective.
  2. Razin et al. (2024): for embedding-similar preference pairs, DPO can drive the preferred-response probability down, lowering Llama-3-8B-Instruct refusal rates from 74% to 33% under safety-tuning data.
  3. Sharma et al. (2023): more RLHF on labeler-preferred data amplifies sycophancy because labelers reliably prefer agreement to correction.

The pattern: RLHF amplifies the signal it is given. If the preference data encodes a bias, more RLHF encodes more of that bias. "Better RLHF" means better preference data and better optimization regularization, not more steps.
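The parabola in item 1 can be sketched from the functional form Gao et al. fit for best-of-n sampling, R_gold(d) = d(α − βd) with d = √KL; the coefficients below are hypothetical, chosen only to show the shape, and the proxy curve is a monotone stand-in rather than the paper's fit:

```python
import numpy as np

# Gao et al. (2023) fit gold reward under best-of-n as R_gold(d) = d * (alpha - beta * d),
# a parabola in d = sqrt(KL(pi || pi_init)). Coefficients are hypothetical; the proxy
# curve is an illustrative monotone stand-in, not the paper's fitted form.
alpha, beta = 1.0, 0.25
d = np.linspace(0.0, 8.0, 801)

gold = d * (alpha - beta * d)   # peaks, then declines: true quality degrades
proxy = 0.6 * np.sqrt(d)        # keeps rising: the reward model still reports progress

# Analytic peak: dR/dd = alpha - 2*beta*d = 0  =>  d* = alpha / (2*beta)
d_star = alpha / (2 * beta)
print(d_star)                   # -> 2.0
print(d[np.argmax(gold)])       # numeric peak agrees with the analytic one
```

The practical takeaway matches the text: the stopping point has to be estimated from something other than the proxy reward, because the proxy never signals the turn.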

Summary

  • RLHF has three stages: SFT for format, reward model for preferences, PPO for optimization against the reward model. Total InstructGPT human-label budget was ~33K preference comparisons + ~13K demonstrations, well within reach for groups outside frontier labs.
  • The Bradley-Terry assumption is doing a lot of work and several documented failure modes trace to it directly: sycophancy (Sharma 2023), intransitive preference violation (Munos 2024), label noise being amplified into systematic bias.
  • The KL penalty against the SFT policy is a safety mechanism that prevents reward hacking, not just a regularizer; reward-model ensembling (Coste 2023) further mitigates overoptimization at small added cost.
  • DPO is the maximum-likelihood preference fit under the same Bradley-Terry assumption as PPO, derivable from a single algebraic identity (Theorem dpo-implicit-reward-identity). It inherits the Bradley-Terry pathologies and adds a new one: likelihood displacement on embedding-similar pairs (Razin 2024).
  • The DPO-vs-PPO comparison is best understood as offline-vs-online preference learning (Tajwar 2024). Online methods reach better quality because the gradient signal is grounded in the policy's actual sampling distribution.
  • Nash-LHF (Munos 2024) generalizes Bradley-Terry to a two-player game with intransitive preference handling. GRPO uses verifiable rewards with group-relative scoring (DeepSeek 2024). LIMA (Zhou 2023) shows that high-quality SFT alone reaches 65% / 58% / 54% / 43% win rates against DaVinci003 / Bard / Claude / GPT-4, suggesting most of the alignment benefit is data curation, not algorithm.
  • The connection to hallucination theory: the same binary preference grading that drives RLHF sycophancy is the structure that Kalai et al. (2025) prove makes confident-bluffing the optimal policy under benchmark scoring.

Exercises

ExerciseCore

Problem

A reward model assigns $r(x, y_w) = 2.0$ and $r(x, y_l) = 0.5$ for a preference pair. What is the predicted probability that $y_w$ is preferred under the Bradley-Terry model?

ExerciseAdvanced

Problem

The optimal policy under the KL-penalized RLHF objective is $\pi^*(y \mid x) \propto \pi_{\text{SFT}}(y \mid x) \exp(r(x,y)/\beta)$. Derive this. Start from the objective $\max_\pi \mathbb{E}_\pi[r(x,y)] - \beta\, D_{\text{KL}}(\pi \,\|\, \pi_{\text{SFT}})$ and use calculus of variations or the known solution for KL-regularized optimization.

ExerciseResearch

Problem

Gao et al. (2023) observed that as the KL budget increases, reward model score increases monotonically but true quality (gold reward) peaks and then decreases. Propose an experiment to estimate the optimal KL budget for a new reward model without access to a gold reward model.

ExerciseAdvanced

Problem

Show that the DPO gradient on a single preference pair $(x, y_w, y_l)$ has the form

θLDPO  =  βσ ⁣(r^lr^w) ⁣[θlogπθ(ywx)θlogπθ(ylx)],\nabla_\theta \mathcal{L}_{\text{DPO}} \;=\; -\beta\,\sigma\!\left(\hat{r}_l - \hat{r}_w\right)\!\bigl[\nabla_\theta \log \pi_\theta(y_w\mid x) - \nabla_\theta \log \pi_\theta(y_l\mid x)\bigr],

where $\hat{r}_y = \beta \log \tfrac{\pi_\theta(y\mid x)}{\pi_{\text{ref}}(y\mid x)}$ is the implicit reward. Then explain why, when $\nabla_\theta \log \pi_\theta(y_w \mid x)$ and $\nabla_\theta \log \pi_\theta(y_l \mid x)$ point in nearly the same direction, the gradient step can decrease $\log \pi_\theta(y_w \mid x)$ even while reducing the loss.

ExerciseResearch

Problem

Design an empirical test of the Superficial Alignment Hypothesis (Zhou et al. 2023, LIMA) that goes beyond LIMA's own evaluation. The original test compared LIMA-65B (1,000 SFT examples, no RL) against DaVinci003, Bard, Claude, and GPT-4 on a 300-prompt human-eval set. Propose two probes that would distinguish "alignment is mostly capability filtering" from "alignment is teaching new behaviors."

References

    • Christiano, Leike, Brown, Martic, Legg, Amodei, "Deep Reinforcement Learning from Human Preferences" (NeurIPS 2017). arXiv:1706.03741. The pre-LLM origin: Bradley-Terry preference modeling for RL agents on Atari and MuJoCo. Sections 2-3.
    • Schulman, Wolski, Dhariwal, Radford, Klimov, "Proximal Policy Optimization Algorithms" (2017). arXiv:1707.06347. The PPO clipped surrogate objective.
    • Ouyang, Wu, Jiang, Almeida, Wainwright, Mishkin, Zhang, Agarwal, Slama, Ray, Schulman, Hilton, Kelton, Miller, Simens, Askell, Welinder, Christiano, Leike, Lowe, "Training language models to follow instructions with human feedback" (NeurIPS 2022, InstructGPT). arXiv:2203.02155. Sections 2-4 for pipeline; Table 6 for data sizes; Section 4.3 for PPO-ptx and the alignment tax.
    • Bai, Jones, Ndousse, Askell et al., "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback" (Anthropic, 2022). arXiv:2204.05862. The Anthropic HH dataset and labeler protocol.
    • Rafailov, Sharma, Mitchell, Ermon, Manning, Finn, "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (NeurIPS 2023). arXiv:2305.18290. Equations 5-7 for the implicit-reward identity and DPO loss; Theorem 1 for the bijection on Plackett-Luce families.
    • Razin, Malladi, Bhaskar, Chen, Arora, Hanin, "Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization" (2024). arXiv:2410.08847. The Centered Hidden Embedding Similarity (CHES) score and the 74.4% → 33.4% Llama-3-8B-Instruct refusal-rate result.
    • Tajwar, Singh, Sharma, Rafailov, Schneider, Xie, Ermon, Finn, Kumar, "Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data" (ICML 2024). arXiv:2404.14367. The mode-seeking-vs-mode-covering frame and the on-policy-beats-offline finding.
    • Azar, Guo, Piot, Munos, Rowland, Valko, Calandriello, "A General Theoretical Paradigm to Understand Learning from Human Preferences" (IPO, 2023). arXiv:2310.12036. The IPO objective with bounded gradients that mitigates likelihood displacement on noisy preference data.
    • Ethayarajh, Xu, Muennighoff, Jurafsky, Kiela, "KTO: Model Alignment as Prospect Theoretic Optimization" (2024). arXiv:2402.01306. Asymmetric loss inspired by prospect theory.
    • Sharma, Tong, Korbak et al., "Towards Understanding Sycophancy in Language Models" (Anthropic, 2023). arXiv:2310.13548. The empirical finding that preference models prefer sycophantic responses over truthful ones a non-negligible fraction of the time.
    • Gao, Schulman, Hilton, "Scaling Laws for Reward Model Overoptimization" (ICML 2023). arXiv:2210.10760. The parabolic gold-vs-proxy reward curve as a function of KL budget.
    • Coste, Anwar, Kirk, Krueger, "Reward Model Ensembles Help Mitigate Overoptimization" (ICLR 2024). arXiv:2310.02743. Worst-Case Optimization (WCO) and Uncertainty-Weighted Optimization (UWO); up to 70% best-of-N improvement; near-elimination of the parabolic peak.
    • Singhal, Goyal, Xu, Durrett, "A Long Way to Go: Investigating Length Correlations in RLHF" (2023). arXiv:2310.03716. RLHF reward models reward length above quality; documents the specific Goodhart pathology most commonly seen in deployed RLHF systems.
    • Munos, Valko, Calandriello, Azar, Rowland, Guo, Tang, Geist, Mesnard, Michi, Selvi, Girgin, Momchev, Bachem, Mankowitz, Precup, Piot, "Nash Learning from Human Feedback" (DeepMind, 2024). arXiv:2312.00886. Replaces Bradley-Terry with a pairwise preference model; defines the regularized Nash equilibrium; introduces Nash-MD via mirror descent.
    • Zhou, Liu, Xu, Iyer, Sun, Mao, Ma, Efrat, Yu, Yu, Zhang, Ghosh, Lewis, Zettlemoyer, Levy, "LIMA: Less Is More for Alignment" (NeurIPS 2023). arXiv:2305.11206. 1,000-example SFT-only baseline; the Superficial Alignment Hypothesis; head-to-head numbers vs DaVinci003, Bard, Claude, GPT-4.
    • Bai, Kadavath, Kundu, Askell et al., "Constitutional AI: Harmlessness from AI Feedback" (Anthropic, 2022). arXiv:2212.08073. RLAIF: replace human preference labels with model-generated preferences against a written constitution.
    • Shao, Wang, Zhu et al., "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (2024). arXiv:2402.03300. Section 4 introduces GRPO.
    • DeepSeek-AI, "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" (2025). arXiv:2501.12948. Production-scale GRPO with verifiable rewards.
    • Lightman, Kosaraju, Burda et al., "Let's Verify Step by Step" (OpenAI, 2023). arXiv:2305.20050. PRM800K dataset, process reward modeling.
    • Uesato, Kushman, Kumar, Song, Siegel, Wang, Creswell, Irving, Higgins, "Solving math word problems with process- and outcome-based feedback" (DeepMind, 2022). arXiv:2211.14275.
    • Wang, Li, Shao, Xu, Dai, Li, Chen, Wu, Sui, "Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations" (2024). arXiv:2312.08935.

Next Topics

  • DPO vs GRPO vs RL reasoning: detailed comparison of post-RLHF methods, with the verifiable-reward perspective
  • Constitutional AI: RLAIF, where preference data is generated by model self-critique against written principles
  • Hallucination theory: the Kalai-Nachum-Vempala-Zhang result that the binary preference grading used here drives confident-bluffing as the optimal benchmark policy

Last reviewed: April 19, 2026
