
Actor-Critic Methods

The dominant paradigm for deep RL and LLM training: an actor (policy network) guided by a critic (value network), with advantage estimation, PPO clipping, and entropy regularization.

Advanced · Tier 2 · Stable · Supporting · ~55 min

Why This Matters

REINFORCE has unbiased gradients but impractically high variance. Q-learning has low variance but cannot handle continuous actions and suffers from the deadly triad. Actor-critic methods combine the best of both: a critic (value function) reduces variance, while an actor (policy network) handles continuous and high-dimensional action spaces.

PPO, the most widely used actor-critic algorithm, powers the RLHF stage of ChatGPT, Claude, and virtually every instruction-tuned language model. SAC is the workhorse for continuous robotic control. Understanding actor-critic is understanding how modern AI systems are trained.

Mental Model

The actor proposes actions. The critic evaluates them. The actor improves by increasing the probability of actions the critic says are better than average (positive advantage) and decreasing the probability of actions that are worse than average (negative advantage). The critic improves by getting better at predicting how much total reward a state will yield.

This is policy iteration made differentiable: the critic does approximate policy evaluation, the actor does approximate policy improvement.

Formal Setup and Notation

We work in the MDP framework with a parameterized policy $\pi_\theta(a|s)$ (the actor) and a learned value function $V_\phi(s)$ (the critic). The advantage function is $A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$.

Definition

TD Error

The temporal difference (TD) error at time $t$ is:

$$\delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$$

This is a one-step estimate of the advantage: $\mathbb{E}[\delta_t \mid s_t, a_t] \approx A^\pi(s_t, a_t)$ when $V_\phi \approx V^\pi$.

Definition

Generalized Advantage Estimation

Generalized Advantage Estimation (GAE) is an exponentially weighted sum of multi-step TD errors:

$$\hat{A}_t^{\text{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}$$

where $\lambda \in [0,1]$ controls the bias-variance tradeoff:

  • $\lambda = 0$: one-step TD advantage $\hat{A}_t = \delta_t$ (low variance, high bias)
  • $\lambda = 1$: Monte Carlo advantage $\hat{A}_t = G_t - V_\phi(s_t)$ (high variance, low bias)
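
In code, GAE is computed with one backward pass over the trajectory, using the recursion $\hat{A}_t = \delta_t + \gamma\lambda \hat{A}_{t+1}$. A minimal numpy sketch (the function name and array conventions are illustrative, not from any particular library):

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """GAE for one trajectory.

    rewards: shape (T,)   rewards r_0 .. r_{T-1}
    values:  shape (T+1,) critic values V(s_0) .. V(s_T); the last entry bootstraps
    """
    deltas = rewards + gamma * values[1:] - values[:-1]   # one-step TD errors
    advantages = np.empty_like(deltas)
    running = 0.0
    for t in reversed(range(len(rewards))):               # A_t = delta_t + gamma*lam*A_{t+1}
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages
```

Real implementations also carry a termination mask so the recursion does not bootstrap across episode boundaries.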

Main Theorems

Proposition

GAE Bias-Variance Tradeoff

Statement

Define the $k$-step advantage estimator:

$$\hat{A}_t^{(k)} = \sum_{l=0}^{k-1} \gamma^l \delta_{t+l} = -V_\phi(s_t) + r_t + \gamma r_{t+1} + \cdots + \gamma^{k-1} r_{t+k-1} + \gamma^k V_\phi(s_{t+k})$$

The GAE estimator is the $\lambda$-weighted average:

$$\hat{A}_t^{\text{GAE}} = (1-\lambda) \sum_{k=1}^{\infty} \lambda^{k-1} \hat{A}_t^{(k)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}$$

When $V_\phi = V^\pi$ (perfect critic), all estimators are unbiased. When $V_\phi \neq V^\pi$, the bias of the $k$-step estimator is $O(\gamma^k \|V_\phi - V^\pi\|_\infty)$. Longer rollouts reduce bias from an imperfect critic, but variance grows with $k$ because more stochastic rewards enter the estimate. The parameter $\lambda$ interpolates smoothly between these extremes.
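
The equality of the two forms in the statement is easy to check numerically. A quick sketch with random stand-in TD errors and truncated sums (all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, lam, N = 0.9, 0.8, 500          # N truncates the infinite sums
deltas = rng.normal(size=N)            # stand-in TD errors delta_0 .. delta_{N-1}

# Right-hand side: sum_l (gamma*lam)^l * delta_l
rhs = np.sum((gamma * lam) ** np.arange(N) * deltas)

# Left-hand side: (1-lam) * sum_k lam^(k-1) * A^(k),
# where A^(k) = sum_{l<k} gamma^l * delta_l is the k-step estimator
k_step = np.cumsum(gamma ** np.arange(N) * deltas)   # A^(1), A^(2), ..., A^(N)
lhs = (1 - lam) * np.sum(lam ** np.arange(N) * k_step)

print(np.isclose(lhs, rhs))  # True, up to truncation error
```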

Intuition

With a perfect critic, the one-step TD advantage is unbiased and carries only a single step of reward and transition noise. An imperfect critic introduces bias, which you can reduce by relying more on actual observed rewards (larger $\lambda$). GAE provides a smooth dial between trusting the critic ($\lambda = 0$) and trusting the data ($\lambda = 1$). In practice, $\lambda = 0.95$ is a common default.

Why It Matters

Without GAE, you must choose between one-step TD (biased but stable) and Monte Carlo returns (unbiased but noisy). GAE gives a principled middle ground and is used in every modern policy gradient implementation, including PPO.

A2C and A3C

A2C (Advantage Actor-Critic) runs multiple parallel environments synchronously:

  1. Collect $n$-step rollouts from $K$ parallel environments
  2. Compute GAE advantages $\hat{A}_t$ for each rollout
  3. Update the actor: $\theta \leftarrow \theta + \alpha \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \hat{A}_t$
  4. Update the critic: $\phi \leftarrow \phi - \beta \sum_t \nabla_\phi (V_\phi(s_t) - G_t)^2$ (steps 3 and 4 are sketched in code below)
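
A minimal PyTorch sketch of steps 3 and 4 as a single combined loss for discrete actions (the function name, coefficients, and batch layout are illustrative assumptions, not from a specific library):

```python
import torch

def a2c_loss(logits, actions, advantages, values, returns,
             value_coef=0.5, entropy_coef=0.01):
    """Combined A2C loss for a batch of discrete-action rollout data.

    logits:     (B, num_actions) actor network outputs
    actions:    (B,) actions taken in the rollout
    advantages: (B,) GAE estimates (held fixed for the actor update)
    values:     (B,) critic outputs V_phi(s_t)
    returns:    (B,) empirical returns G_t, the critic's regression targets
    """
    dist = torch.distributions.Categorical(logits=logits)
    log_probs = dist.log_prob(actions)

    actor_loss = -(log_probs * advantages.detach()).mean()   # step 3, ascent as descent
    critic_loss = (values - returns).pow(2).mean()           # step 4
    entropy = dist.entropy().mean()                          # optional exploration bonus
    return actor_loss + value_coef * critic_loss - entropy_coef * entropy
```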

A3C (Asynchronous Advantage Actor-Critic), introduced by Mnih et al. (2016, ICML, arXiv:1602.01783), uses asynchronous CPU workers that each maintain a local copy of the parameters and apply Hogwild-style updates to a shared parameter server. A3C was historically important as the first demonstration that on-policy actor-critic could match DQN on Atari without a replay buffer. It has been largely replaced by synchronous A2C and PPO, which are simpler to reason about and equally effective on GPU hardware. Clemente, Castejón, and Chandra (2017, arXiv:1708.05144) showed empirically that synchronous A2C matches or exceeds A3C when the same compute is used.

PPO: The Practical Standard

Proposition

PPO Clipped Surrogate Objective

Statement

Define the probability ratio $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}$. The PPO-Clip objective is:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min\Big( r_t(\theta) \hat{A}_t, \;\; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t \Big) \right]$$

where $\epsilon$ (typically 0.1 to 0.2) limits how far the new policy can deviate from the old one in a single update.

Effect of the clipping:

  • When $\hat{A}_t > 0$ (good action): the objective is $\min(r_t \hat{A}_t, (1+\epsilon)\hat{A}_t)$, capping the benefit of increasing the action probability
  • When $\hat{A}_t < 0$ (bad action): the objective is $\min(r_t \hat{A}_t, (1-\epsilon)\hat{A}_t)$, capping the penalty reduction from decreasing the action probability

Intuition

TRPO solves a constrained optimization problem (maximize surrogate subject to KL constraint), which requires second-order methods. PPO approximates this with a simple clipping trick: if the policy ratio moves too far from 1, the gradient is zeroed out, preventing destructively large updates. The result is a first-order algorithm that is nearly as stable as TRPO and much simpler to implement.

Why It Matters

PPO is the algorithm used in RLHF for training instruction-following language models. Its simplicity (no KL constraint optimization, no conjugate gradients) makes it the default choice whenever policy optimization is needed. The clipping mechanism is the key innovation: it provides a soft trust region without the computational overhead of TRPO.

The full PPO loss combines three terms:

$$L(\theta, \phi) = L^{\text{CLIP}}(\theta) - c_1 \, L^{\text{VF}}(\phi) + c_2 \, H[\pi_\theta]$$

where $L^{\text{VF}} = (V_\phi(s_t) - G_t)^2$ is the value function loss and $H[\pi_\theta]$ is an entropy bonus encouraging exploration.
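
A minimal PyTorch sketch of this combined loss, assuming log-probabilities, GAE advantages, and per-state entropies are precomputed for the batch (names and coefficient defaults are illustrative):

```python
import torch

def ppo_loss(log_probs, old_log_probs, advantages, values, returns, entropy,
             clip_eps=0.2, c1=0.5, c2=0.01):
    """Negative of L(theta, phi) = L_CLIP - c1 * L_VF + c2 * H, for gradient descent.

    advantages are GAE estimates, assumed already detached from the graph.
    """
    ratio = torch.exp(log_probs - old_log_probs)               # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    l_clip = torch.min(unclipped, clipped).mean()              # pessimistic bound
    l_vf = (values - returns).pow(2).mean()                    # value function loss
    return -(l_clip - c1 * l_vf + c2 * entropy.mean())
```

Real implementations iterate this loss for several epochs over the same rollout and add the code-level tricks discussed under Common Confusions below.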

SAC: Maximum Entropy RL

Proposition

Soft Bellman Equation

Statement

In maximum entropy RL, the objective augments the standard return with an entropy term:

$$J(\pi) = \sum_{t=0}^{\infty} \gamma^t \, \mathbb{E}\left[ r_t + \alpha \, H[\pi(\cdot|s_t)] \right]$$

The corresponding soft Bellman equation is:

$$V^\pi(s) = \mathbb{E}_{a \sim \pi}\left[ Q^\pi(s,a) - \alpha \log \pi(a|s) \right]$$

$$Q^\pi(s,a) = R(s,a) + \gamma \, \mathbb{E}_{s' \sim P}\left[ V^\pi(s') \right]$$

The optimal policy is the Boltzmann distribution: $\pi^*(a|s) \propto \exp(Q^*(s,a)/\alpha)$.
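
The Boltzmann form follows from a short variational argument, a sketch of which is (setting the derivative of the entropy-regularized objective over distributions to zero):

$$\max_{\pi(\cdot|s)} \mathbb{E}_{a \sim \pi}\big[ Q(s,a) - \alpha \log \pi(a|s) \big] = \alpha \log \int \exp\big( Q(s,a)/\alpha \big)\, da, \qquad \pi^*(a|s) = \frac{\exp\big( Q(s,a)/\alpha \big)}{\int \exp\big( Q(s,a')/\alpha \big)\, da'}.$$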

Intuition

SAC adds a reward for being uncertain. The agent is encouraged to find all good actions, not just the single best one. This leads to better exploration and more robust policies. The temperature $\alpha$ controls how much randomness the policy retains: high $\alpha$ means more exploration, low $\alpha$ approaches the standard (deterministic) optimal policy.

Why It Matters

SAC achieves strong performance on continuous control benchmarks (robotic locomotion, manipulation). Its key advantages: (1) it is off-policy (uses a replay buffer, so it is sample-efficient), (2) the entropy bonus provides automatic exploration, and (3) it avoids the brittle hyperparameter tuning of epsilon-greedy or noise-based exploration.

SAC maintains three networks: a policy $\pi_\theta$ (the actor) and two Q-functions $Q_{\phi_1}, Q_{\phi_2}$ (the critics, using the double-Q trick to prevent overestimation). Optionally, the temperature $\alpha$ is learned automatically by targeting a desired entropy level.
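
A sketch of the critic target implied by the soft Bellman equation, assuming a policy callable that returns sampled actions with their log-probabilities and two callable target Q-networks (all handles and defaults are illustrative):

```python
import torch

@torch.no_grad()
def sac_critic_target(rewards, next_states, dones, policy, q1_targ, q2_targ,
                      alpha=0.2, gamma=0.99):
    """y = r + gamma * (min_i Q_i(s', a') - alpha * log pi(a'|s')), a' ~ pi(.|s')."""
    next_actions, next_log_probs = policy(next_states)      # fresh sample from the actor
    q_min = torch.min(q1_targ(next_states, next_actions),   # double-Q: pessimistic
                      q2_targ(next_states, next_actions))   # min of the two critics
    soft_value = q_min - alpha * next_log_probs             # soft V(s') with entropy term
    return rewards + gamma * (1.0 - dones) * soft_value
```

Each critic is then regressed toward this target with a squared loss, and the target networks track the online critics by Polyak averaging (the $\tau$ in the defaults table below).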

Why Actor-Critic Dominates

| Setting | Algorithm | Reason |
| --- | --- | --- |
| LLM training (RLHF) | PPO | Discrete tokens, on-policy stability |
| Robotic control | SAC | Continuous actions, sample efficiency |
| Game playing | A2C / PPO | Parallel environments, scalability |
| Fine-tuning with verifiers | GRPO / PPO | Verifiable rewards; GRPO replaces the learned critic with group-relative baselines |

The actor-critic framework is flexible enough to accommodate all these settings because it separates the policy representation (actor) from the value estimation (critic). You can swap architectures, loss functions, and training procedures while keeping the same fundamental structure.

Key Practical Defaults

The following hyperparameter defaults appear in the original papers and remain the starting point for practitioner code. They are not universal, but they are the numbers most re-implementations begin from.

| Parameter | Default | Source |
| --- | --- | --- |
| GAE $\lambda$ | 0.95 | Schulman et al. (2016) |
| Discount $\gamma$ | 0.99 | Schulman et al. (2016, 2017) |
| PPO clip $\epsilon$ | 0.2 | Schulman et al. (2017, Atari setup) |
| PPO epochs per rollout | 3 to 10 | Schulman et al. (2017) |
| Value coefficient $c_1$ | 0.5 to 1.0 | Schulman et al. (2017) |
| Entropy coefficient $c_2$ | 0.0 to 0.01 | Schulman et al. (2017) |
| Gradient clip norm | 0.5 | Andrychowicz et al. (2021) |
| SAC target entropy | $-\dim(A)$ | Haarnoja et al. (2018, v2) |
| SAC Polyak $\tau$ | 0.005 | Haarnoja et al. (2018) |

Andrychowicz et al. (2021, arXiv:2006.05990) tabulate sensitivity: advantage normalization is high-impact, the specific learning-rate schedule is medium-impact, and several hyperparameters often treated as load-bearing in practitioner folklore (orthogonal init scale, value-loss clip range) have minor effect once the others are tuned.

Common Confusions

Watch Out

The critic is not the reward model

In RLHF, the reward model scores outputs. The critic estimates the expected cumulative reward from a state under the current policy. The reward model is fixed after training; the critic is updated continuously during RL training. They serve different roles: the reward model defines the objective, the critic helps optimize it efficiently.

Watch Out

PPO is not the same as TRPO

TRPO solves a constrained optimization problem using conjugate gradients and line search. PPO uses a clipped surrogate objective with standard SGD. PPO does not enforce a hard KL constraint. The clipping is a soft approximation. In practice, PPO can violate the implicit trust region, which is both a feature (faster progress) and a risk (instability).

Watch Out

Actor-critic is not just REINFORCE with a baseline

REINFORCE with a baseline uses a learned value function to reduce variance but still uses Monte Carlo returns for the policy gradient. Actor-critic uses the critic for bootstrapping (TD estimates), which introduces bias but dramatically reduces variance. The distinction is whether the critic is used only as a baseline or also as a bootstrap target.

Watch Out

PPO's empirical performance is partly from implementation tricks, not the clipping theorem

The PPO-Clip objective is the theoretical contribution. Practitioner PPO includes extras that the original paper states briefly or not at all: advantage normalization per minibatch (subtract mean, divide by std), value function clipping, reward scaling, orthogonal weight initialization, learning-rate annealing, and gradient clipping at 0.5. Engstrom et al. (2020, ICLR, arXiv:2005.12729), "Implementation Matters in Deep Policy Gradients," removed these code-level optimizations and found that most of PPO's reported improvement over TRPO disappeared. Andrychowicz et al. (2021, ICLR, arXiv:2006.05990), "What Matters for On-Policy Deep Actor-Critic Methods?", ran 250k+ training runs and found that advantage normalization and the specific value-loss clipping choice were among the most load-bearing hyperparameters. Takeaway: if you reimplement PPO from the clipping equation alone, you will not reproduce the benchmark numbers. The details are not cosmetic.

Watch Out

PPO clipping caps commitment, not recovery

The min-operator in PPO-Clip is a pessimistic bound. It zeros the gradient only when the policy has already moved in the correct direction for the current advantage. The four cases:

  • $\hat{A}_t > 0$ and $r_t > 1+\epsilon$: clipped, gradient zero. The good action is already more than $(1+\epsilon)\times$ as likely as under the old policy, so stop reinforcing it.
  • $\hat{A}_t > 0$ and $r_t < 1-\epsilon$: unclipped, gradient flows. The good action has been made less likely; recover it.
  • $\hat{A}_t < 0$ and $r_t < 1-\epsilon$: clipped, gradient zero. The bad action is already less than $(1-\epsilon)\times$ as likely; stop suppressing.
  • $\hat{A}_t < 0$ and $r_t > 1+\epsilon$: unclipped, gradient flows. The bad action has been made more likely; push it back down.

The clip caps rewards for actions you have already committed to, but never caps corrective gradients when the policy is drifting the wrong way. Schulman et al. (2017, arXiv:1707.06347, Section 3) state this as a "pessimistic bound on the unclipped objective." Readers who memorize "clip at $1\pm\epsilon$" often collapse the four cases into one and lose the recovery property.
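
These four cases can be checked mechanically with autograd. A small sketch (values illustrative):

```python
import torch

def clip_grad(ratio, adv, eps=0.2):
    """Gradient of the per-sample clipped objective with respect to the ratio."""
    r = torch.tensor(ratio, requires_grad=True)
    obj = torch.min(r * adv, torch.clamp(r, 1 - eps, 1 + eps) * adv)
    obj.backward()
    return r.grad.item()

print(clip_grad(1.3, +1.0))  # A>0, r>1+eps: 0.0, stop reinforcing
print(clip_grad(0.7, +1.0))  # A>0, r<1-eps: +1.0, recover the good action
print(clip_grad(0.7, -1.0))  # A<0, r<1-eps: 0.0, stop suppressing
print(clip_grad(1.3, -1.0))  # A<0, r>1+eps: -1.0, push the bad action back down
```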

Summary

  • Actor-critic = policy network (actor) + value network (critic)
  • Advantage $A(s,a) = Q(s,a) - V(s)$ centers the policy gradient, reducing variance
  • GAE: exponentially weighted multi-step TD errors, $\lambda$ controls bias-variance
  • PPO: clipped surrogate objective, simple first-order trust region approximation
  • SAC: maximum entropy RL, adds entropy bonus for exploration, off-policy
  • Actor-critic is the dominant paradigm: PPO for LLMs, SAC for robotics

Exercises

Exercise · Core

Problem

Compute the GAE advantage $\hat{A}_0^{\text{GAE}(\gamma, \lambda)}$ for the 3-step trajectory

$$r_0 = 1,\quad r_1 = 0,\quad r_2 = 1,\quad V_\phi(s_0) = 2,\quad V_\phi(s_1) = 1,\quad V_\phi(s_2) = 1.5,\quad V_\phi(s_3) = 0,\quad \gamma = 0.9,\quad \lambda = 0.95.$$

Exercise · Core

Problem

In the PPO clipped objective with $\epsilon = 0.2$, if $r_t(\theta) = 1.5$ and $\hat{A}_t = 2$, what is the clipped objective value? What gradient signal does the actor receive?

Exercise · Core

Problem

In the PPO clipped objective with $\epsilon = 0.2$, you observe $\hat{A}_t = -1.5$ and $r_t(\theta) = 1.4$. Which term of the $\min$ is selected, and in which direction does the gradient push $\theta$? Compare to the case $\hat{A}_t = -1.5$ and $r_t(\theta) = 0.6$.

Exercise · Advanced

Problem

Why does SAC use two Q-networks and take the minimum of their predictions? Relate this to the overestimation problem in Q-learning.


References

Canonical:

  • Konda & Tsitsiklis, "Actor-Critic Algorithms" (NeurIPS 2000). First rigorous convergence analysis of two-timescale actor-critic.
  • Sutton & Barto, Reinforcement Learning: An Introduction, 2nd ed., Ch. 13 "Policy Gradient Methods" (2018). Canonical textbook treatment of REINFORCE, actor-critic, and the policy gradient theorem.
  • Schulman, Moritz, Levine, Jordan, Abbeel, "High-Dimensional Continuous Control Using Generalized Advantage Estimation" (ICLR 2016, arXiv:1506.02438). Introduces GAE.
  • Mnih et al., "Asynchronous Methods for Deep Reinforcement Learning" (ICML 2016, arXiv:1602.01783). A3C and the synchronous A2C variant.

Current:

  • Schulman, Wolski, Dhariwal, Radford, Klimov, "Proximal Policy Optimization Algorithms" (2017, arXiv:1707.06347). PPO.
  • Haarnoja, Zhou, Abbeel, Levine, "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor" (ICML 2018, arXiv:1801.01290); with v2 update "Soft Actor-Critic Algorithms and Applications" (2018, arXiv:1812.05905). SAC and automatic entropy tuning.
  • Engstrom, Ilyas, Santurkar, Tsipras, Janoos, Rudolph, Madry, "Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO" (ICLR 2020, arXiv:2005.12729). Shows most of PPO's empirical advantage comes from code-level optimizations, not the clipping theorem.
  • Andrychowicz, Raichuk, Stańczyk, Orsini, Girgin, Marinier, Hussenot, Geist, Pietquin, Michalski, Gelly, Bachem, "What Matters for On-Policy Deep Actor-Critic Methods? A Large-Scale Study" (ICLR 2021, arXiv:2006.05990). 250k+ training runs isolating each hyperparameter's effect.
  • Ouyang et al., "Training Language Models to Follow Instructions with Human Feedback" (2022, arXiv:2203.02155). RLHF with PPO.
  • Clemente, Castejón, Chandra, "Efficient Parallel Methods for Deep Reinforcement Learning" (2017, arXiv:1708.05144). Synchronous A2C matches asynchronous A3C.


Last reviewed: April 24, 2026
