
Actor-Critic Methods

The dominant paradigm for deep RL and LLM training: an actor (policy network) guided by a critic (value network), with advantage estimation, PPO clipping, and entropy regularization.

Advanced · Tier 2 · Stable · Supporting · ~55 min

Why This Matters

REINFORCE has unbiased gradients but impractically high variance. Q-learning has low variance but cannot handle continuous actions and suffers from the deadly triad. Actor-critic methods combine the best of both: a critic (value function) reduces variance, while an actor (policy network) handles continuous and high-dimensional action spaces.

PPO, the most widely used actor-critic algorithm, powers the RLHF stage of ChatGPT, Claude, and virtually every instruction-tuned language model. SAC is the workhorse for continuous robotic control. Understanding actor-critic is understanding how modern AI systems are trained.

Mental Model

The actor proposes actions. The critic evaluates them. The actor improves by increasing the probability of actions the critic says are better than average (positive advantage) and decreasing the probability of actions that are worse than average (negative advantage). The critic improves by getting better at predicting how much total reward a state will yield.

This is policy iteration made differentiable: the critic does approximate policy evaluation, the actor does approximate policy improvement.

Formal Setup and Notation

We work in the MDP framework with a parameterized policy $\pi_\theta(a|s)$ (the actor) and a learned value function $V_\phi(s)$ (the critic). The advantage function is $A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$.

Definition

TD Error

The temporal difference (TD) error at time $t$ is:

$$\delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$$

This is a one-step estimate of the advantage: $\mathbb{E}[\delta_t \mid s_t, a_t] \approx A^\pi(s_t, a_t)$ when $V_\phi \approx V^\pi$.

Definition

Generalized Advantage Estimation

Generalized Advantage Estimation (GAE) is an exponentially weighted sum of multi-step TD errors:

$$\hat{A}_t^{\text{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}$$

where $\lambda \in [0,1]$ controls the bias-variance tradeoff:

  • $\lambda = 0$: one-step TD advantage $\hat{A}_t = \delta_t$ (low variance, high bias)
  • $\lambda = 1$: Monte Carlo advantage $\hat{A}_t = G_t - V_\phi(s_t)$ (high variance, low bias)
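
In code, GAE is computed with one backward pass over the trajectory, using the recursion $\hat{A}_t = \delta_t + \gamma\lambda \hat{A}_{t+1}$. A minimal numpy sketch (the function name and array conventions are illustrative, not from any particular library):

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """GAE for one trajectory.

    rewards: shape (T,)   rewards r_0 .. r_{T-1}
    values:  shape (T+1,) critic values V(s_0) .. V(s_T); the last entry bootstraps
    """
    deltas = rewards + gamma * values[1:] - values[:-1]   # one-step TD errors
    advantages = np.empty_like(deltas)
    running = 0.0
    for t in reversed(range(len(rewards))):               # A_t = delta_t + gamma*lam*A_{t+1}
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages
```

Real implementations also carry a termination mask so the recursion does not bootstrap across episode boundaries.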

Main Theorems

Proposition

GAE Bias-Variance Tradeoff

Statement

Define the $k$-step advantage estimator:

$$\hat{A}_t^{(k)} = \sum_{l=0}^{k-1} \gamma^l \delta_{t+l} = -V_\phi(s_t) + r_t + \gamma r_{t+1} + \cdots + \gamma^{k-1} r_{t+k-1} + \gamma^k V_\phi(s_{t+k})$$

The GAE estimator is the $\lambda$-weighted average:

$$\hat{A}_t^{\text{GAE}} = (1-\lambda) \sum_{k=1}^{\infty} \lambda^{k-1} \hat{A}_t^{(k)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}$$

When $V_\phi = V^\pi$ (perfect critic), all estimators are unbiased. When $V_\phi \neq V^\pi$, the bias of the $k$-step estimator is $O(\gamma^k \|V_\phi - V^\pi\|_\infty)$. Longer rollouts reduce bias from an imperfect critic, but variance grows with $k$ because more stochastic rewards enter the estimate. The parameter $\lambda$ interpolates smoothly between these extremes.
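
The equality of the two forms in the statement is easy to check numerically. A quick sketch with random stand-in TD errors and truncated sums (all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, lam, N = 0.9, 0.8, 500          # N truncates the infinite sums
deltas = rng.normal(size=N)            # stand-in TD errors delta_0 .. delta_{N-1}

# Right-hand side: sum_l (gamma*lam)^l * delta_l
rhs = np.sum((gamma * lam) ** np.arange(N) * deltas)

# Left-hand side: (1-lam) * sum_k lam^(k-1) * A^(k),
# where A^(k) = sum_{l<k} gamma^l * delta_l is the k-step estimator
k_step = np.cumsum(gamma ** np.arange(N) * deltas)   # A^(1), A^(2), ..., A^(N)
lhs = (1 - lam) * np.sum(lam ** np.arange(N) * k_step)

print(np.isclose(lhs, rhs))  # True, up to truncation error
```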

Intuition

With a perfect critic, the one-step TD advantage is unbiased and carries only a single step of reward and transition noise. An imperfect critic introduces bias, which you can reduce by relying more on actual observed rewards (larger $\lambda$). GAE provides a smooth dial between trusting the critic ($\lambda = 0$) and trusting the data ($\lambda = 1$). In practice, $\lambda = 0.95$ is a common default.

Why It Matters

Without GAE, you must choose between one-step TD (biased but stable) and Monte Carlo returns (unbiased but noisy). GAE gives a principled middle ground and is used in every modern policy gradient implementation, including PPO.

A2C and A3C

A2C (Advantage Actor-Critic) runs multiple parallel environments synchronously:

  1. Collect $n$-step rollouts from $K$ parallel environments
  2. Compute GAE advantages $\hat{A}_t$ for each rollout
  3. Update the actor: $\theta \leftarrow \theta + \alpha \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \hat{A}_t$
  4. Update the critic: $\phi \leftarrow \phi - \beta \sum_t \nabla_\phi (V_\phi(s_t) - G_t)^2$ (steps 3 and 4 are sketched in code below)
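
A minimal PyTorch sketch of steps 3 and 4 as a single combined loss for discrete actions (the function name, coefficients, and batch layout are illustrative assumptions, not from a specific library):

```python
import torch

def a2c_loss(logits, actions, advantages, values, returns,
             value_coef=0.5, entropy_coef=0.01):
    """Combined A2C loss for a batch of discrete-action rollout data.

    logits:     (B, num_actions) actor network outputs
    actions:    (B,) actions taken in the rollout
    advantages: (B,) GAE estimates (held fixed for the actor update)
    values:     (B,) critic outputs V_phi(s_t)
    returns:    (B,) empirical returns G_t, the critic's regression targets
    """
    dist = torch.distributions.Categorical(logits=logits)
    log_probs = dist.log_prob(actions)

    actor_loss = -(log_probs * advantages.detach()).mean()   # step 3, ascent as descent
    critic_loss = (values - returns).pow(2).mean()           # step 4
    entropy = dist.entropy().mean()                          # optional exploration bonus
    return actor_loss + value_coef * critic_loss - entropy_coef * entropy
```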

A3C (Asynchronous Advantage Actor-Critic), introduced by Mnih et al. (2016, ICML, arXiv:1602.01783), uses asynchronous CPU workers that each maintain a local copy of the parameters and apply Hogwild-style updates to a shared parameter server. A3C was historically important as the first demonstration that on-policy actor-critic could match DQN on Atari without a replay buffer. It has been largely replaced by synchronous A2C and PPO, which are simpler to reason about and equally effective on GPU hardware. Clemente, Castejón, and Chandra (2017, arXiv:1708.05144) showed empirically that synchronous A2C matches or exceeds A3C when the same compute is used.

PPO: The Practical Standard

Proposition

PPO Clipped Surrogate Objective

Statement

Define the probability ratio $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}$. The PPO-Clip objective is:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min\Big( r_t(\theta) \hat{A}_t, \;\; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t \Big) \right]$$

where $\epsilon$ (typically 0.1 to 0.2) limits how far the new policy can deviate from the old one in a single update.

Effect of the clipping:

  • When $\hat{A}_t > 0$ (good action): the objective is $\min(r_t \hat{A}_t, (1+\epsilon)\hat{A}_t)$, capping the benefit of increasing the action probability
  • When $\hat{A}_t < 0$ (bad action): the objective is $\min(r_t \hat{A}_t, (1-\epsilon)\hat{A}_t)$, capping the penalty reduction from decreasing the action probability

Intuition

TRPO solves a constrained optimization problem (maximize surrogate subject to KL constraint), which requires second-order methods. PPO approximates this with a simple clipping trick: if the policy ratio moves too far from 1, the gradient is zeroed out, preventing destructively large updates. The result is a first-order algorithm that is nearly as stable as TRPO and much simpler to implement.

Why It Matters

PPO is the algorithm used in RLHF for training instruction-following language models. Its simplicity (no KL constraint optimization, no conjugate gradients) makes it the default choice whenever policy optimization is needed. The clipping mechanism is the key innovation: it provides a soft trust region without the computational overhead of TRPO.

The full PPO loss combines three terms:

$$L(\theta, \phi) = L^{\text{CLIP}}(\theta) - c_1 \, L^{\text{VF}}(\phi) + c_2 \, H[\pi_\theta]$$

where $L^{\text{VF}} = (V_\phi(s_t) - G_t)^2$ is the value function loss and $H[\pi_\theta]$ is an entropy bonus encouraging exploration.
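
A minimal PyTorch sketch of this combined loss, assuming log-probabilities, GAE advantages, and per-state entropies are precomputed for the batch (names and coefficient defaults are illustrative):

```python
import torch

def ppo_loss(log_probs, old_log_probs, advantages, values, returns, entropy,
             clip_eps=0.2, c1=0.5, c2=0.01):
    """Negative of L(theta, phi) = L_CLIP - c1 * L_VF + c2 * H, for gradient descent.

    advantages are GAE estimates, assumed already detached from the graph.
    """
    ratio = torch.exp(log_probs - old_log_probs)               # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    l_clip = torch.min(unclipped, clipped).mean()              # pessimistic bound
    l_vf = (values - returns).pow(2).mean()                    # value function loss
    return -(l_clip - c1 * l_vf + c2 * entropy.mean())
```

Real implementations iterate this loss for several epochs over the same rollout and add the code-level tricks discussed under Common Confusions below.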

SAC: Maximum Entropy RL

Proposition

Soft Bellman Equation

Statement

In maximum entropy RL, the objective augments the standard return with an entropy term:

$$J(\pi) = \sum_{t=0}^{\infty} \gamma^t \, \mathbb{E}\left[ r_t + \alpha \, H[\pi(\cdot|s_t)] \right]$$

The corresponding soft Bellman equation is:

$$V^\pi(s) = \mathbb{E}_{a \sim \pi}\left[ Q^\pi(s,a) - \alpha \log \pi(a|s) \right]$$

$$Q^\pi(s,a) = R(s,a) + \gamma \, \mathbb{E}_{s' \sim P}\left[ V^\pi(s') \right]$$

The optimal policy is the Boltzmann distribution: $\pi^*(a|s) \propto \exp(Q^*(s,a)/\alpha)$.
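
The Boltzmann form follows from a short variational argument, a sketch of which is (setting the derivative of the entropy-regularized objective over distributions to zero):

$$\max_{\pi(\cdot|s)} \mathbb{E}_{a \sim \pi}\big[ Q(s,a) - \alpha \log \pi(a|s) \big] = \alpha \log \int \exp\big( Q(s,a)/\alpha \big)\, da, \qquad \pi^*(a|s) = \frac{\exp\big( Q(s,a)/\alpha \big)}{\int \exp\big( Q(s,a')/\alpha \big)\, da'}.$$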

Intuition

SAC adds a reward for being uncertain. The agent is encouraged to find all good actions, not just the single best one. This leads to better exploration and more robust policies. The temperature $\alpha$ controls how much randomness the policy retains: high $\alpha$ means more exploration, low $\alpha$ approaches the standard (deterministic) optimal policy.

Why It Matters

SAC achieves strong performance on continuous control benchmarks (robotic locomotion, manipulation). Its key advantages: (1) it is off-policy (uses a replay buffer, so it is sample-efficient), (2) the entropy bonus provides automatic exploration, and (3) it avoids the brittle hyperparameter tuning of epsilon-greedy or noise-based exploration.

SAC maintains three networks: a policy $\pi_\theta$ (the actor) and two Q-functions $Q_{\phi_1}, Q_{\phi_2}$ (the critics, using the double-Q trick to prevent overestimation). Optionally, the temperature $\alpha$ is learned automatically by targeting a desired entropy level.
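
A sketch of the critic target implied by the soft Bellman equation, assuming a policy callable that returns sampled actions with their log-probabilities and two callable target Q-networks (all handles and defaults are illustrative):

```python
import torch

@torch.no_grad()
def sac_critic_target(rewards, next_states, dones, policy, q1_targ, q2_targ,
                      alpha=0.2, gamma=0.99):
    """y = r + gamma * (min_i Q_i(s', a') - alpha * log pi(a'|s')), a' ~ pi(.|s')."""
    next_actions, next_log_probs = policy(next_states)      # fresh sample from the actor
    q_min = torch.min(q1_targ(next_states, next_actions),   # double-Q: pessimistic
                      q2_targ(next_states, next_actions))   # min of the two critics
    soft_value = q_min - alpha * next_log_probs             # soft V(s') with entropy term
    return rewards + gamma * (1.0 - dones) * soft_value
```

Each critic is then regressed toward this target with a squared loss, and the target networks track the online critics by Polyak averaging (the $\tau$ in the defaults table below).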

Why Actor-Critic Dominates

| Setting | Algorithm | Reason |
| --- | --- | --- |
| LLM training (RLHF) | PPO | Discrete tokens, on-policy stability |
| Robotic control | SAC | Continuous actions, sample efficiency |
| Game playing | A2C / PPO | Parallel environments, scalability |
| Fine-tuning with verifiers | GRPO / PPO | Verifiable rewards; GRPO replaces the learned critic with group-relative baselines |

The actor-critic framework is flexible enough to accommodate all these settings because it separates the policy representation (actor) from the value estimation (critic). You can swap architectures, loss functions, and training procedures while keeping the same fundamental structure.

Key Practical Defaults

The following hyperparameter defaults appear in the original papers and remain the starting point for practitioner code. They are not universal, but they are the numbers most re-implementations begin from.

| Parameter | Default | Source |
| --- | --- | --- |
| GAE $\lambda$ | 0.95 | Schulman et al. (2016) |
| Discount $\gamma$ | 0.99 | Schulman et al. (2016, 2017) |
| PPO clip $\epsilon$ | 0.2 | Schulman et al. (2017, Atari setup) |
| PPO epochs per rollout | 3 to 10 | Schulman et al. (2017) |
| Value coefficient $c_1$ | 0.5 to 1.0 | Schulman et al. (2017) |
| Entropy coefficient $c_2$ | 0.0 to 0.01 | Schulman et al. (2017) |
| Gradient clip norm | 0.5 | Andrychowicz et al. (2021) |
| SAC target entropy | $-\dim(A)$ | Haarnoja et al. (2018, v2) |
| SAC Polyak $\tau$ | 0.005 | Haarnoja et al. (2018) |

Andrychowicz et al. (2021, arXiv:2006.05990) tabulate sensitivity: advantage normalization is high-impact, the specific learning-rate schedule is medium-impact, and several hyperparameters often treated as load-bearing in practitioner folklore (orthogonal init scale, value-loss clip range) have minor effect once the others are tuned.

Common Confusions

Watch Out

The critic is not the reward model

In RLHF, the reward model scores outputs. The critic estimates the expected cumulative reward from a state under the current policy. The reward model is fixed after training; the critic is updated continuously during RL training. They serve different roles: the reward model defines the objective, the critic helps optimize it efficiently.

Watch Out

PPO is not the same as TRPO

TRPO solves a constrained optimization problem using conjugate gradients and line search. PPO uses a clipped surrogate objective with standard SGD. PPO does not enforce a hard KL constraint. The clipping is a soft approximation. In practice, PPO can violate the implicit trust region, which is both a feature (faster progress) and a risk (instability).

Watch Out

Actor-critic is not just REINFORCE with a baseline

REINFORCE with a baseline uses a learned value function to reduce variance but still uses Monte Carlo returns for the policy gradient. Actor-critic uses the critic for bootstrapping (TD estimates), which introduces bias but dramatically reduces variance. The distinction is whether the critic is used only as a baseline or also as a bootstrap target.

Watch Out

PPO's empirical performance is partly from implementation tricks, not the clipping theorem

The PPO-Clip objective is the theoretical contribution. Practitioner PPO includes extras that the original paper states briefly or not at all: advantage normalization per minibatch (subtract mean, divide by std), value function clipping, reward scaling, orthogonal weight initialization, learning-rate annealing, and gradient clipping at 0.5. Engstrom et al. (2020, ICLR, arXiv:2005.12729), "Implementation Matters in Deep Policy Gradients," removed these code-level optimizations and found that most of PPO's reported improvement over TRPO disappeared. Andrychowicz et al. (2021, ICLR, arXiv:2006.05990), "What Matters for On-Policy Deep Actor-Critic Methods?", ran 250k+ training runs and found that advantage normalization and the specific value-loss clipping choice were among the most load-bearing hyperparameters. Takeaway: if you reimplement PPO from the clipping equation alone, you will not reproduce the benchmark numbers. The details are not cosmetic.

Watch Out

PPO clipping caps commitment, not recovery

The min-operator in PPO-Clip is a pessimistic bound. It zeros the gradient only when the policy has already moved in the correct direction for the current advantage. The four cases:

  • $\hat{A}_t > 0$ and $r_t > 1+\epsilon$: clipped, gradient zero. The good action is already more than $(1+\epsilon)\times$ as likely as under the old policy, so stop reinforcing it.
  • $\hat{A}_t > 0$ and $r_t < 1-\epsilon$: unclipped, gradient flows. The good action has been made less likely; recover it.
  • $\hat{A}_t < 0$ and $r_t < 1-\epsilon$: clipped, gradient zero. The bad action is already less than $(1-\epsilon)\times$ as likely; stop suppressing.
  • $\hat{A}_t < 0$ and $r_t > 1+\epsilon$: unclipped, gradient flows. The bad action has been made more likely; push it back down.

The clip caps rewards for actions you have already committed to, but never caps corrective gradients when the policy is drifting the wrong way. Schulman et al. (2017, arXiv:1707.06347, Section 3) state this as a "pessimistic bound on the unclipped objective." Readers who memorize "clip at $1\pm\epsilon$" often collapse the four cases into one and lose the recovery property.
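
These four cases can be checked mechanically with autograd. A small sketch (values illustrative):

```python
import torch

def clip_grad(ratio, adv, eps=0.2):
    """Gradient of the per-sample clipped objective with respect to the ratio."""
    r = torch.tensor(ratio, requires_grad=True)
    obj = torch.min(r * adv, torch.clamp(r, 1 - eps, 1 + eps) * adv)
    obj.backward()
    return r.grad.item()

print(clip_grad(1.3, +1.0))  # A>0, r>1+eps: 0.0, stop reinforcing
print(clip_grad(0.7, +1.0))  # A>0, r<1-eps: +1.0, recover the good action
print(clip_grad(0.7, -1.0))  # A<0, r<1-eps: 0.0, stop suppressing
print(clip_grad(1.3, -1.0))  # A<0, r>1+eps: -1.0, push the bad action back down
```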

Summary

  • Actor-critic = policy network (actor) + value network (critic)
  • Advantage $A(s,a) = Q(s,a) - V(s)$ centers the policy gradient, reducing variance
  • GAE: exponentially weighted multi-step TD errors, $\lambda$ controls bias-variance
  • PPO: clipped surrogate objective, simple first-order trust region approximation
  • SAC: maximum entropy RL, adds entropy bonus for exploration, off-policy
  • Actor-critic is the dominant paradigm: PPO for LLMs, SAC for robotics

Exercises

Exercise · Core

Problem

Compute the GAE advantage $\hat{A}_0^{\text{GAE}(\gamma, \lambda)}$ for the 3-step trajectory

$$r_0 = 1,\quad r_1 = 0,\quad r_2 = 1,\quad V_\phi(s_0) = 2,\quad V_\phi(s_1) = 1,\quad V_\phi(s_2) = 1.5,\quad V_\phi(s_3) = 0,\quad \gamma = 0.9,\quad \lambda = 0.95.$$

Exercise · Core

Problem

In the PPO clipped objective with $\epsilon = 0.2$, if $r_t(\theta) = 1.5$ and $\hat{A}_t = 2$, what is the clipped objective value? What gradient signal does the actor receive?

Exercise · Core

Problem

In the PPO clipped objective with $\epsilon = 0.2$, you observe $\hat{A}_t = -1.5$ and $r_t(\theta) = 1.4$. Which term of the $\min$ is selected, and in which direction does the gradient push $\theta$? Compare to the case $\hat{A}_t = -1.5$ and $r_t(\theta) = 0.6$.

Exercise · Advanced

Problem

Why does SAC use two Q-networks and take the minimum of their predictions? Relate this to the overestimation problem in Q-learning.


References

Canonical:

  • Konda & Tsitsiklis, "Actor-Critic Algorithms" (NeurIPS 2000). First rigorous convergence analysis of two-timescale actor-critic.
  • Sutton & Barto, Reinforcement Learning: An Introduction, 2nd ed., Ch. 13 "Policy Gradient Methods" (2018). Canonical textbook treatment of REINFORCE, actor-critic, and the policy gradient theorem.
  • Schulman, Moritz, Levine, Jordan, Abbeel, "High-Dimensional Continuous Control Using Generalized Advantage Estimation" (ICLR 2016, arXiv:1506.02438). Introduces GAE.
  • Mnih et al., "Asynchronous Methods for Deep Reinforcement Learning" (ICML 2016, arXiv:1602.01783). A3C and the synchronous A2C variant.

Current:

  • Schulman, Wolski, Dhariwal, Radford, Klimov, "Proximal Policy Optimization Algorithms" (2017, arXiv:1707.06347). PPO.
  • Haarnoja, Zhou, Abbeel, Levine, "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor" (ICML 2018, arXiv:1801.01290); with v2 update "Soft Actor-Critic Algorithms and Applications" (2018, arXiv:1812.05905). SAC and automatic entropy tuning.
  • Engstrom, Ilyas, Santurkar, Tsipras, Janoos, Rudolph, Madry, "Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO" (ICLR 2020, arXiv:2005.12729). Shows most of PPO's empirical advantage comes from code-level optimizations, not the clipping theorem.
  • Andrychowicz, Raichuk, Stańczyk, Orsini, Girgin, Marinier, Hussenot, Geist, Pietquin, Michalski, Gelly, Bachem, "What Matters for On-Policy Deep Actor-Critic Methods? A Large-Scale Study" (ICLR 2021, arXiv:2006.05990). 250k+ training runs isolating each hyperparameter's effect.
  • Ouyang et al., "Training Language Models to Follow Instructions with Human Feedback" (2022, arXiv:2203.02155). RLHF with PPO.
  • Clemente, Castejón, Chandra, "Efficient Parallel Methods for Deep Reinforcement Learning" (2017, arXiv:1708.05144). Synchronous A2C matches asynchronous A3C.


Last reviewed: April 24, 2026
