
Policy Gradient Theorem

The fundamental result enabling gradient-based optimization of parameterized policies: the policy gradient theorem and the algorithms it spawns.

Advanced · Tier 1 · Stable · Core spine · ~75 min

Why This Matters

[Interactive figure: sampled actions from a Gaussian policy push the mean μ toward higher-Q actions; per-sample score arrows average into the mean update Δμ.]

The Bellman equations give us a clean route to optimal control when the state space is small and dynamic programming is tractable. That is not the regime that matters most in modern RL. Atari from pixels, continuous robot control, large recommender systems, and RLHF for language models all require optimizing a parameterized policy directly.

In these settings, we parameterize the policy as $\pi_\theta$ (e.g., a neural network) and optimize the parameters $\theta$ by gradient ascent on the expected return. The policy gradient theorem tells us how to compute this gradient, even though the expected return depends on the entire trajectory distribution, which itself depends on $\theta$ in a complicated way.

This theorem is the foundation of REINFORCE, actor-critic methods, PPO, and the RLHF pipeline used to train modern language models.

Mental Model

We want to maximize $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$, the expected total reward under the policy. The challenge is that changing $\theta$ changes the probability of every trajectory $\tau$, which is a product of many policy and transition probabilities.

The log-derivative trick resolves this: instead of differentiating through the trajectory distribution directly, we express the gradient as an expectation under the current policy. This means we can estimate the gradient from sampled experience; no differentiable environment model is required.

The theorem packages two effects into one expectation:

  • the discounted occupancy $d_\gamma^{\pi_\theta}(s)$ says which states the current policy actually visits;
  • the score term $\nabla_\theta \log \pi_\theta(a|s)$ says how a parameter change reweights actions inside one visited state.

The non-obvious part is that the derivative of the state distribution does not disappear. It is absorbed into the visitation weights, which is why the final formula can be estimated from on-policy rollouts without differentiating the environment.

Formal Setup and Notation

We work in the MDP framework $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$ with a parameterized stochastic policy $\pi_\theta(a|s)$. A trajectory is $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots)$.

Definition

Policy Objective

The policy objective is the expected discounted return:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right] = \sum_s d^{\pi_\theta}_\gamma(s) \sum_a \pi_\theta(a|s)\, Q^{\pi_\theta}(s,a)$$

where $d^{\pi_\theta}_\gamma(s) = (1-\gamma)\sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid \pi_\theta)$ is the discounted state-visitation distribution. This is the distribution that makes the Sutton et al. (2000) policy gradient theorem exact. In most deep-RL implementations the gradient is estimated using the undiscounted empirical distribution of states visited along sampled trajectories, which introduces a well-documented bias. See Thomas (2014), "Bias in Natural Actor-Critic Algorithms," and Nota and Thomas (2020), "Is the Policy Gradient a Gradient?"
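To make the normalization concrete, here is a minimal numpy sketch (our own illustration, with an assumed two-state transition matrix under a fixed policy) that computes $d^{\pi_\theta}_\gamma$ in closed form and checks that it sums to one:

```python
import numpy as np

# Hypothetical 2-state MDP under a fixed policy: P_pi[s, s'] is the
# state-to-state transition matrix induced by the policy (an assumption).
P_pi = np.array([[0.9, 0.1],
                 [0.2, 0.8]])
mu0 = np.array([1.0, 0.0])   # start-state distribution
gamma = 0.95

# d_gamma = (1 - gamma) * mu0^T (I - gamma * P_pi)^{-1}, the closed form
# of (1 - gamma) * sum_t gamma^t P(s_t = s).
d_gamma = (1 - gamma) * mu0 @ np.linalg.inv(np.eye(2) - gamma * P_pi)
print(d_gamma, d_gamma.sum())  # a valid distribution: sums to 1
```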

Definition

Advantage Function

The advantage function measures how much better action $a$ is compared to the average action under $\pi$:

$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$$

By construction, $\sum_a \pi(a|s)\, A^\pi(s,a) = 0$ for all $s$.

Main Theorems

Theorem

Policy Gradient Theorem

Statement

The gradient of the policy objective is:

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s)\, Q^{\pi_\theta}(s,a)\right]$$

Equivalently, summing over time steps along a trajectory:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^t\, \nabla_\theta \log \pi_\theta(a_t|s_t)\, Q^{\pi_\theta}(s_t, a_t)\right]$$

Intuition

$\nabla_\theta \log \pi_\theta(a|s)$ points in the direction that increases the probability of action $a$ in state $s$. The theorem says: weight this direction by how good that action is (its $Q$-value), and average over the states and actions you actually visit. Actions that lead to high returns get their probabilities increased; actions that lead to low returns get decreased.

Proof Sketch

Write $V^{\pi_\theta}(s) = \sum_a \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s,a)$ and differentiate. Using $Q^{\pi_\theta}(s,a) = r(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V^{\pi_\theta}(s')$:

$$\nabla_\theta V^{\pi_\theta}(s) = \sum_a \bigl[ \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s,a) + \pi_\theta(a \mid s)\, \nabla_\theta Q^{\pi_\theta}(s,a) \bigr].$$

The $r(s,a)$ term has no $\theta$-dependence, so $\nabla_\theta Q^{\pi_\theta}(s,a) = \gamma \sum_{s'} P(s' \mid s,a)\, \nabla_\theta V^{\pi_\theta}(s')$. Substituting gives a one-step recursion on $\nabla_\theta V^{\pi_\theta}$.

Unrolling the recursion yields

$$\nabla_\theta V^{\pi_\theta}(s_0) = \sum_{s} \underbrace{\sum_{k=0}^{\infty} \gamma^k \Pr(s_0 \to s \text{ in } k \text{ steps}; \pi_\theta)}_{\eta(s)}\; \sum_a \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s,a),$$

where $\eta(s)$ is an unnormalized discounted visit count. The normalization is exact: $d^{\pi_\theta}_\gamma(s) = (1-\gamma)\, \eta(s)$, since $\sum_s \eta(s) = \sum_{k=0}^\infty \gamma^k = 1/(1-\gamma)$. The $(1-\gamma)$ factor matters: Sutton, McAllester, Singh, and Mansour (2000) state it explicitly, but many practical implementations drop it and use an estimator based on the undiscounted empirical state distribution, which is biased. See the Thomas (2014) and Nota and Thomas (2020) references above.

What this calculation delivers: the messy $\nabla_\theta Q^{\pi_\theta}$ and $\nabla_\theta d^{\pi_\theta}$ terms telescope into the discounted state-visitation weights; the gradient acts on $\pi_\theta$ only.

Now apply the log-derivative trick $\nabla_\theta \pi_\theta(a \mid s) = \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)$ and rescale $\eta$ by $(1-\gamma)$ to obtain the distribution $d^{\pi_\theta}_\gamma$:

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}\bigl[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s,a) \bigr].$$

The log-derivative step is what turns a sum over the discrete action set into an expectation you can estimate with samples, which is why this theorem is the starting point for every practical policy-gradient algorithm.

Why It Matters

This theorem makes policy optimization tractable. To estimate $\nabla_\theta J(\theta)$, sample trajectories under $\pi_\theta$, compute $\nabla_\theta \log \pi_\theta(a_t|s_t)$ by backpropagating through the policy network, multiply by a return estimate, and average. No environment model is needed. This is the basis of all policy gradient algorithms.

Failure Mode

The naive estimator has extremely high variance. A single trajectory gives a very noisy gradient estimate. Variance reduction techniques (baselines, advantage estimation) are essential for practical use.

Proposition

Baseline Does Not Change the Gradient

Statement

For any function $b: \mathcal{S} \to \mathbb{R}$ that depends only on the state:

$$\mathbb{E}_{a \sim \pi_\theta(\cdot|s)}\left[\nabla_\theta \log \pi_\theta(a|s)\, b(s)\right] = 0$$

Therefore, replacing $Q^{\pi_\theta}(s,a)$ with $A^{\pi_\theta}(s,a) = Q^{\pi_\theta}(s,a) - V^{\pi_\theta}(s)$ in the policy gradient does not change its expectation but reduces variance.

Intuition

This is the expected grad-log-prob (EGLP) lemma: $\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a \mid s)] = 0$ for any $s$ (a standard property of score functions in any parametric family with differentiable density). Multiplying by a state-only function $b(s)$ pulls $b(s)$ out of the expectation, so it still evaluates to zero. Subtracting a state-dependent constant from the reward therefore does not bias the gradient, but it does change the variance. $V^\pi(s)$ is usually a good baseline because it centers the returns around zero.

Proof Sketch

$$\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a|s)\, b(s)] = b(s) \sum_a \nabla_\theta \pi_\theta(a|s) = b(s)\, \nabla_\theta \sum_a \pi_\theta(a|s) = b(s)\, \nabla_\theta 1 = 0$$

The middle step uses the EGLP identity $\sum_a \nabla_\theta \pi_\theta(a \mid s) = \nabla_\theta \sum_a \pi_\theta(a \mid s) = \nabla_\theta 1 = 0$, which is exactly why the score function has zero mean.

Why It Matters

This is why subtracting the value-function baseline $V^\pi(s)$ to form the advantage $A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$ is everywhere in modern RL: the baseline is $V$, and the advantage is what remains after subtracting it. Using the advantage instead of the raw $Q$-value dramatically reduces variance, making practical training possible. Note that $V^\pi$ is the standard practical baseline, not the variance-minimizing one; the exact minimum-variance baseline is derived in the exercises.
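As a quick sanity check of the proposition, the following sketch (a hypothetical one-state softmax policy with made-up Q-values, not from the text) estimates the score-weighted gradient with and without the value baseline; the means agree while the variance drops:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical one-state, 3-action softmax policy with assumed Q-values.
theta = np.array([0.5, -0.2, 0.1])
Q = np.array([1.0, 3.0, 2.0])
pi = np.exp(theta - theta.max()); pi /= pi.sum()
V = pi @ Q                                   # value baseline b(s) = V(s)

def grad_log_pi(a):
    g = -pi.copy(); g[a] += 1.0              # score of a softmax: onehot(a) - pi
    return g

def estimate(baseline, n=50_000):
    actions = rng.choice(3, size=n, p=pi)
    samples = np.stack([grad_log_pi(a) * (Q[a] - baseline) for a in actions])
    return samples.mean(axis=0), samples.var(axis=0).sum()

for name, b in [("none", 0.0), ("V(s)", V)]:
    mean, var = estimate(b)
    print(f"baseline {name:>4}: grad ~ {np.round(mean, 3)}, total variance {var:.3f}")
```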

Where the Variance Comes From

The theorem gives an unbiased gradient identity; it does not give a low-variance estimator. Most practical policy-gradient engineering is about removing variance without quietly changing the objective.

| Noise source | What it does to the estimator | Standard fix |
| --- | --- | --- |
| Action sampling | Different sampled actions produce different score vectors even in the same state | Larger batches, lower-entropy policies late in training |
| Long-horizon returns | A reward at time $t+20$ perturbs the update for many earlier actions | Reward-to-go, baselines, GAE |
| State visitation randomness | Rare but important states are sampled unevenly across rollouts | On-policy data refresh, trust-region updates, better exploration |
| Critic approximation error | Replacing returns with $\hat A_t$ lowers variance but can add bias | Value-loss tuning, target stabilization, advantage normalization |

This is why the clean theorem lives next to messy algorithms. The identity is simple; the estimator quality depends on how much of this noise you remove.

Five Valid Weights for the Score Function

The policy gradient takes the general form

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \Phi_t\right],$$

and there are several valid choices for the weight $\Phi_t$ that share the same expectation but differ wildly in variance. Spinning Up enumerates five:

  1. Full return of the trajectory: $\Phi_t = R(\tau) = \sum_{k=0}^{T} \gamma^k r_k$. The naive REINFORCE weighting. Unbiased but very high variance.
  2. Reward-to-go: $\Phi_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$. Drops rewards received before time $t$, which cannot depend on action $a_t$.
  3. Reward-to-go minus a state baseline: $\Phi_t = \sum_{k=t}^{T} \gamma^{k-t} r_k - b(s_t)$. Valid for any $b$ by the EGLP lemma; $b(s) = V^\pi(s)$ is the standard choice.
  4. Action-value function: $\Phi_t = Q^\pi(s_t, a_t)$. This is the exact on-policy weight that emerges from the theorem.
  5. Advantage function: $\Phi_t = A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$. Case 4 with the value-function baseline. This is what almost every modern algorithm (A2C, PPO, TRPO, SAC with actor updates) actually uses.

All five are unbiased estimators of the policy gradient. Variance drops sharply from form 1 to form 5 because each step removes a zero-mean noise source.
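The first three forms are easy to compute from a single sampled trajectory. A minimal sketch, assuming hypothetical reward and critic arrays:

```python
import numpy as np

def phi_variants(rewards, values, gamma=0.99):
    """Compute three valid score-function weights from one sampled trajectory.

    rewards: r_0..r_T sampled along the trajectory
    values:  assumed baseline estimates V(s_0)..V(s_T) (e.g., from a critic)
    """
    T = len(rewards)
    # Form 2: reward-to-go, computed right-to-left in O(T).
    rtg = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        rtg[t] = running
    full_return = np.full(T, rtg[0])      # form 1: same weight at every t
    baselined = rtg - values              # form 3: reward-to-go minus b(s_t)
    return full_return, rtg, baselined

r = np.array([0.0, 0.0, 1.0, 0.0, 5.0])  # hypothetical rewards
v = np.array([1.0, 1.2, 1.5, 2.0, 4.0])  # hypothetical critic outputs
for name, phi in zip(["R(tau)", "reward-to-go", "rtg - V"], phi_variants(r, v)):
    print(f"{name:>13}: {np.round(phi, 3)}")
```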

Watch Out

Don't let the past distract you

Why is the reward-to-go (form 2) better than the full return (form 1)? Both are unbiased. But rewards received before time $t$ are independent of the action $a_t$ taken at time $t$, so $\mathbb{E}[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, r_k] = 0$ for $k < t$ by the EGLP lemma. Past rewards contribute zero to the expectation but nonzero to the variance of the sample estimate. Dropping them is pure variance reduction, with no bias introduced. Spinning Up calls this "don't let the past distract you."

The REINFORCE Algorithm

The simplest policy gradient algorithm. Sample a trajectory $\tau = (s_0, a_0, r_0, \ldots, s_T, a_T, r_T)$ under $\pi_\theta$, then update:

$$\theta \leftarrow \theta + \alpha \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)\, G_t$$

where $G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$ is the return from time $t$.

REINFORCE is unbiased but has high variance. In practice, subtract a baseline:

$$\theta \leftarrow \theta + \alpha \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)\, (G_t - b(s_t))$$
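A minimal, self-contained REINFORCE sketch, shown here on a hypothetical two-armed bandit (a one-step MDP, so $G_t$ is just the immediate reward) with a running-mean baseline; the arm means and step sizes are our own assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 2-armed Gaussian bandit; arm means are assumptions.
arm_means = np.array([0.2, 1.0])
theta = np.zeros(2)                       # softmax logits
baseline, alpha, beta = 0.0, 0.1, 0.05    # running-mean baseline, step sizes

for step in range(2000):
    pi = np.exp(theta - theta.max()); pi /= pi.sum()
    a = rng.choice(2, p=pi)
    r = rng.normal(arm_means[a], 1.0)
    grad_log_pi = -pi; grad_log_pi[a] += 1.0   # score of the softmax policy
    theta += alpha * grad_log_pi * (r - baseline)
    baseline += beta * (r - baseline)          # track average reward

print("final policy:", np.round(np.exp(theta) / np.exp(theta).sum(), 3))
```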

Actor-Critic Methods

Instead of waiting for a full trajectory to compute $G_t$, use a learned value function $V_\phi(s)$ as both a baseline and a bootstrap target:

$$\hat{A}_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t) = \delta_t$$

This is the TD(0) advantage estimate, the one-step TD residual $\delta_t$. Update the policy (actor) using:

$$\theta \leftarrow \theta + \alpha\, \nabla_\theta \log \pi_\theta(a_t|s_t)\, \hat{A}_t$$

Update the value function (critic) by minimizing $(V_\phi(s_t) - G_t)^2$ or using TD learning. Actor-critic trades some bias (from bootstrapping) for dramatically lower variance compared to REINFORCE. The one-step residual $\delta_t$ and the Monte Carlo return $G_t$ are two ends of a spectrum; n-step returns and the GAE$(\lambda)$ estimator below interpolate between them.
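One update of this scheme, sketched for a tabular softmax actor and a value table; the function signature and the rollout loop it assumes are illustrative, not a fixed API:

```python
import numpy as np

def actor_critic_step(theta, V, s, a, r, s_next, done,
                      alpha_pi=0.01, alpha_v=0.1, gamma=0.99):
    """One TD(0) actor-critic update for a tabular softmax policy.

    theta: logits array of shape (n_states, n_actions)
    V:     value table of shape (n_states,)
    The transition (s, a, r, s_next, done) is assumed to come from an
    external rollout loop.
    """
    # One-step TD residual doubles as the advantage estimate A_hat_t.
    target = r + (0.0 if done else gamma * V[s_next])
    delta = target - V[s]

    # Critic: move V(s) toward the bootstrap target.
    V[s] += alpha_v * delta

    # Actor: score of the softmax policy at (s, a) is onehot(a) - pi(.|s).
    logits = theta[s] - theta[s].max()
    pi = np.exp(logits) / np.exp(logits).sum()
    grad_log_pi = -pi; grad_log_pi[a] += 1.0
    theta[s] += alpha_pi * grad_log_pi * delta
    return delta
```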

Natural Policy Gradient

The natural gradient is due to Amari (1998), "Natural Gradient Works Efficiently in Learning," which applies information geometry to parameter estimation. Kakade (2001), "A Natural Policy Gradient," transferred the idea to RL by identifying the policy Fisher information as the correct metric on policy space.

Proposition

Natural Policy Gradient

Statement

The Fisher information matrix of the policy is:

$$F(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s)\, \nabla_\theta \log \pi_\theta(a|s)^\top\right]$$

The natural policy gradient is:

$$\tilde{\nabla}_\theta J(\theta) = F(\theta)^{-1}\, \nabla_\theta J(\theta)$$

This is the steepest ascent direction in the KL-divergence metric on policy space, rather than in Euclidean parameter space.

Intuition

The standard gradient depends on the parameterization: reparameterizing $\theta$ changes the gradient direction. The natural gradient is invariant to reparameterization. It asks: what is the best policy update of a given KL size? This is the right question because we care about how much the policy changes, not how much the parameters change.

Why It Matters

The natural policy gradient leads to more stable updates and is the theoretical foundation for TRPO. PPO shares the KL-bounded surrogate framing but is a first-order method, not a natural-gradient approximation.
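The invariance claim can be checked numerically in the simplest family: a Gaussian $\mathcal{N}(\mu, \sigma^2)$ with fixed $\sigma$, where the Fisher information in the $\mu$ chart is $1/\sigma^2$. The sketch below (our own example, using a made-up quadratic objective and the distorted chart $\mu = c^3$ that the explorer below also uses) shows that natural steps induce the same first-order move in $\mu$ from either chart, while Euclidean steps do not:

```python
import numpy as np

# Invariance check on a Gaussian mean family N(mu, sigma^2) with fixed sigma.
# The quadratic objective and the chart mu = c^3 are illustrative choices.
sigma = 0.5
dJ_dmu = lambda mu: -(mu - 2.0)              # gradient of J(mu) = -(mu - 2)^2 / 2

def steps_mu(mu, lr=0.1):
    g = dJ_dmu(mu)
    fisher = 1.0 / sigma**2                  # Fisher information in the mu chart
    return lr * g, lr * g / fisher           # Euclidean vs natural step in mu

def steps_c(c, lr=0.1):
    dmu_dc = 3 * c**2                        # chart map mu = c^3
    g_c = dJ_dmu(c**3) * dmu_dc              # chain rule: gradient in the c chart
    fisher_c = dmu_dc**2 / sigma**2          # Fisher transforms as a metric
    eucl, nat = lr * g_c, lr * g_c / fisher_c
    return eucl * dmu_dc, nat * dmu_dc       # induced first-order change in mu

mu0 = 1.35
print("mu chart (eucl, nat):", steps_mu(mu0))
print("c  chart (eucl, nat):", steps_c(mu0 ** (1 / 3)))
# Natural steps induce the same move in mu from either chart; Euclidean ones don't.
```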

Natural Gradient Invariance Explorer

Natural gradient rescales the step by the Fisher metric, so the move is defined in local KL geometry rather than by the raw coordinate chart.

[Interactive explorer: a slider sets the Gaussian mean μ; one chart plots the loss against the raw parameter μ, another against the distorted chart μ³. Euclidean and natural steps are overlaid, and a distribution-space readout confirms that the natural steps make the same distributional move across both charts.]

Fisher metric

In this Gaussian mean family, the Fisher metric sets the local KL scale that natural gradient uses to size the step.

Local KL

The metric is the local quadratic form of KL. Equal natural steps should therefore spend the same local KL budget.

Invariant move

Euclidean arrows depend on the chart. The natural arrows induce the same move in distribution space, which is the whole point of the metric correction.

The point of the explorer is simple: Euclidean parameter distance is chart dependent, but KL distance in policy space is not. Natural gradient chooses the update that is locally steepest in the geometry that the policy itself induces.

Connection to Trust Regions

The natural policy gradient motivates trust region methods:

TRPO (Trust Region Policy Optimization), Schulman et al. (2015), arXiv 1502.05477. Maximize the surrogate objective $\mathbb{E}\left[\frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)}\, \hat{A}(s,a)\right]$ subject to $\text{KL}(\pi_{\theta_{\text{old}}} \,\|\, \pi_\theta) \leq \delta$. The step direction is computed by linearizing the objective and quadratically approximating the KL constraint, which reduces to solving $F\mathbf{v} = \nabla_\theta L$ for the natural-gradient direction. TRPO solves this linear system with conjugate gradients using Fisher-vector products (no explicit Fisher matrix is formed), then performs a backtracking line search along that direction to enforce the exact KL bound and verify surrogate improvement. The CG-plus-line-search structure is what makes TRPO a trust-region method, not a pure natural-gradient step.

PPO (Proximal Policy Optimization), Schulman et al. (2017). Replace the hard KL constraint with a clipped surrogate:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}\left[\min\left(r_t(\theta)\hat{A}_t,\; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right]$$

where $r_t(\theta) = \pi_\theta(a_t|s_t) / \pi_{\theta_{\text{old}}}(a_t|s_t)$. PPO is a first-order method that takes Adam steps on $L^{\text{CLIP}}$ with standard backpropagation. It is not an approximate natural-gradient method, despite a common misreading. Engstrom et al. (2020), "Implementation Matters in Deep Policy Gradients" (arXiv 2005.12729), and Ilyas et al. (2020), "A Closer Look at Deep Policy Gradients" (arXiv 1811.02553), show that PPO's empirical advantage over vanilla policy gradient comes substantially from implementation details: value-function clipping, reward and advantage normalization, orthogonal weight initialization, learning-rate annealing, and observation normalization. The clipped objective alone does not account for PPO's reported gains. PPO is the algorithm used in the RLHF pipeline for language model training, but using it should not be confused with approximating the natural gradient.
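The clipped objective itself is a few lines. A minimal numpy sketch of the scalar loss (in practice it is maximized with Adam via autodiff; the batch values here are made up):

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, adv, eps=0.2):
    """Clipped surrogate objective, as a minimal numpy sketch.

    logp_new: log pi_theta(a_t|s_t) under the current policy
    logp_old: log pi_theta_old(a_t|s_t), recorded when the data was collected
    adv:      advantage estimates A_hat_t (e.g., from GAE)
    """
    ratio = np.exp(logp_new - logp_old)          # r_t(theta)
    clipped = np.clip(ratio, 1 - eps, 1 + eps)
    # The elementwise min makes the objective pessimistic: moves that push the
    # ratio outside [1-eps, 1+eps] in a favorable direction earn no extra credit.
    return np.mean(np.minimum(ratio * adv, clipped * adv))

# Hypothetical batch: values below are made up for illustration.
logp_new = np.array([-0.9, -1.1, -0.3])
logp_old = np.array([-1.0, -1.0, -1.0])
adv = np.array([1.0, -0.5, 2.0])
print(ppo_clip_objective(logp_new, logp_old, adv))
```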

Generalized Advantage Estimation

REINFORCE uses the full Monte Carlo return $G_t$. The one-step TD advantage uses $\hat{A}_t^{(1)} = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t) = \delta_t$. These are two endpoints of a spectrum; n-step advantages sit between them:

$$\hat{A}_t^{(n)} = \sum_{l=0}^{n-1} \gamma^l\, \delta_{t+l}$$

Schulman et al. (2016), "High-Dimensional Continuous Control Using Generalized Advantage Estimation" (arXiv 1506.02438), introduce the GAE$(\lambda)$ estimator as an exponentially weighted average over n-step advantages:

$$\hat{A}_t^{\text{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l\, \delta_{t+l}, \qquad \delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$$

Setting $\lambda = 0$ recovers the one-step TD advantage (low variance, high bias from $V_\phi$). Setting $\lambda = 1$ recovers Monte Carlo with a value baseline (unbiased, high variance). Intermediate $\lambda \in (0,1)$ trades bias against variance. GAE is the standard advantage estimator used with PPO and TRPO in practice.
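The infinite sum truncates at the rollout boundary and collapses into a single backward recursion $\hat A_t = \delta_t + \gamma\lambda\, \hat A_{t+1}$. A minimal sketch with hypothetical reward and value arrays:

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one finite rollout.

    rewards: r_0..r_{T-1}
    values:  V(s_0)..V(s_T), one longer than rewards (bootstrap value last)
    The reverse recursion A_t = delta_t + gamma*lam*A_{t+1} computes the
    truncated exponentially weighted sum in a single O(T) pass.
    """
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

# Hypothetical rollout: numbers are made up for illustration.
r = np.array([0.0, 1.0, 0.0, 2.0])
v = np.array([0.5, 0.6, 0.9, 1.2, 0.0])   # includes bootstrap V(s_T)
print(np.round(gae(r, v), 3))
```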

What the Theorem Does Not Give You for Free

Watch Out

Action-dependent baselines need a correction term

The state-only baseline result is exact because $\mathbb{E}_{a \sim \pi_\theta(\cdot|s)}[\nabla_\theta \log \pi_\theta(a|s)\, b(s)] = 0$. If you replace $b(s)$ with an action-dependent term $b(s,a)$, that identity fails in general. The estimator becomes biased unless you add an explicit correction term. This is why the standard theorem is careful about state-only baselines.

Watch Out

The theorem is on-policy

The expectation in the policy gradient theorem is under the current policy's occupancy measure. If the data come from an older policy or a replay buffer, you are no longer estimating the same expectation unless you add importance weights or use an off-policy actor-critic derivation. Off-policy data reuse is possible, but it is not a free corollary of the theorem.

Actor-Critic Variants

Beyond the vanilla actor-critic above, several variants are in common use:

  • A2C and A3C (Mnih et al. 2016, arXiv 1602.01783). Synchronous and asynchronous parallel actor-learners sharing a global network. A3C predates GPU-dominant training; A2C is the synchronous equivalent that is typically preferred on modern hardware.
  • DPG and DDPG. The deterministic policy gradient theorem (Silver et al. 2014) extends the stochastic PG theorem to deterministic policies $a = \mu_\theta(s)$: the gradient becomes $\nabla_\theta J = \mathbb{E}_{s \sim d^\mu}\left[\nabla_\theta \mu_\theta(s)\, \nabla_a Q^\mu(s,a)\big|_{a = \mu_\theta(s)}\right]$. DDPG (Lillicrap et al. 2016, arXiv 1509.02971) is the deep, off-policy actor-critic instantiation for continuous control.
  • SAC (Haarnoja et al. 2018, arXiv 1801.01290). Soft Actor-Critic adds an entropy bonus to the reward, optimizing $\mathbb{E}\left[\sum_t r_t + \alpha\, \mathcal{H}(\pi(\cdot \mid s_t))\right]$. Off-policy, stochastic, and the dominant algorithm on many continuous-control benchmarks.

Why the Log-Derivative Trick Works

The identity $\nabla_\theta \pi_\theta(a|s) = \pi_\theta(a|s)\, \nabla_\theta \log \pi_\theta(a|s)$ converts the gradient of a probability into an expectation under that probability. This is the same trick used in variational inference (the ELBO gradient) and in score-based diffusion models. It works whenever you need to differentiate an expectation with respect to the parameters of the distribution.

The key property is that $\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a|s)] = 0$: the score function has zero mean. This is what makes baselines valid.
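Both facts can be checked numerically on a distribution that has nothing to do with RL. A small sketch (our own example): estimate $\frac{d}{d\theta}\mathbb{E}_{x \sim \mathcal{N}(\theta,1)}[x^2]$, whose analytic value is $2\theta$, and verify the score has zero mean:

```python
import numpy as np

rng = np.random.default_rng(2)

# Score-function estimate of d/dtheta E_{x ~ N(theta, 1)}[x^2]; the analytic
# answer is 2*theta. Example values are our own.
theta = 1.5
x = rng.normal(theta, 1.0, size=200_000)
score = x - theta                       # d/dtheta log N(x; theta, 1)

print("estimate:", np.mean(score * x**2), " analytic:", 2 * theta)
print("score mean (should be ~0):", np.mean(score))
```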

Common Confusions

Watch Out

Policy gradients are not backpropagation through the environment

A common misconception is that policy gradients differentiate through the environment dynamics. They do not. The gradient flows only through $\log \pi_\theta$, i.e., through the policy network. The environment is treated as a black box that produces rewards and next states. This is why policy gradients work in environments where the dynamics are unknown.

Watch Out

REINFORCE is unbiased but not practical without variance reduction

The raw REINFORCE estimator has the correct expectation but enormous variance. A single trajectory might have a return of 100 or -50 depending on luck. Without a baseline, you need an impractical number of samples. The advantage function and multiple-sample estimators are not optional extras; they are necessary for the algorithm to work at all.

Watch Out

Natural gradient is not just preconditioning

While the natural gradient can be seen as preconditioning the gradient by F1F^{-1}, its justification is geometric: it is the steepest ascent direction when distance is measured by KL divergence between policies, not Euclidean distance between parameters. This distinction matters because the same KL change can correspond to very different parameter changes depending on the region of parameter space.

Summary

  • The policy gradient is $\nabla_\theta J(\theta) = \mathbb{E}[\nabla_\theta \log \pi_\theta(a|s)\, Q^{\pi_\theta}(s,a)]$
  • The log-derivative trick converts a gradient of an expectation into an expectation of a gradient
  • Subtracting a state-dependent baseline does not change the expected gradient but reduces variance
  • The value function $V^\pi$ is the canonical baseline. The advantage $A^\pi = Q^\pi - V^\pi$ is what remains after subtraction, not the baseline itself
  • The exact variance-minimizing baseline is a $\|\nabla \log \pi\|^2$-weighted average of $Q$-values (Weaver and Tao 2001, Greensmith et al. 2004), not $V^\pi$
  • Actor-critic: learn $V_\phi$ as a baseline, bootstrap to reduce variance at the cost of some bias
  • GAE$(\lambda)$ interpolates between the one-step TD advantage and the full Monte Carlo return
  • Natural policy gradient (Amari 1998, Kakade 2001) uses Fisher information to make updates invariant to parameterization
  • TRPO (Schulman et al. 2015) is a practical natural-gradient method using conjugate gradients and a backtracking line search
  • PPO is a first-order clipped-objective method, not a natural-gradient approximation; its empirical gains come substantially from implementation details

Exercises

Exercise (Core)

Problem

Show that $\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a|s)] = 0$ for any state $s$, assuming $\pi_\theta(a|s) > 0$ for all $a$.

Exercise (Core)

Problem

For a policy $\pi_\theta$ that is softmax over two actions with logits $\theta_1, \theta_2$, compute $\nabla_\theta \log \pi_\theta(a_1|s)$.

Exercise (Advanced)

Problem

Show that replacing the full trajectory return $R(\tau)$ with the reward-to-go $G_t = \sum_{k=t}^{T}\gamma^{k-t} r_k$ leaves the REINFORCE estimator unbiased. In other words, prove that rewards received before time $t$ do not contribute to the expectation of the $t$-th score term.

Exercise (Advanced)

Problem

Prove that the variance-minimizing baseline $b^*(s)$ for the REINFORCE estimator is
$$b^*(s) = \frac{\mathbb{E}_a[\|\nabla_\theta \log \pi_\theta(a|s)\|^2\, Q^{\pi}(s,a)]}{\mathbb{E}_a[\|\nabla_\theta \log \pi_\theta(a|s)\|^2]}$$
(a weighted average of $Q$-values, not simply $V^\pi(s)$).

References

Canonical:

  • Sutton, McAllester, Singh, Mansour, "Policy Gradient Methods for RL with Function Approximation" (2000)
  • Williams, "Simple Statistical Gradient-Following Algorithms for Connectionist RL" (1992). Original REINFORCE.
  • Amari, "Natural Gradient Works Efficiently in Learning," Neural Computation (1998). Origin of the natural gradient.
  • Kakade, "A Natural Policy Gradient" (2001). Natural gradient transferred to RL.
  • Silver, Lever, Heess, Degris, Wierstra, Riedmiller, "Deterministic Policy Gradient Algorithms" (2014). DPG theorem

Variance reduction and advantage estimation:

  • Weaver and Tao, "The Optimal Reward Baseline for Gradient-Based Reinforcement Learning" (2001)
  • Greensmith, Bartlett, Baxter, "Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning" (2004). Exact minimum-variance baseline.
  • Schulman, Moritz, Levine, Jordan, Abbeel, "High-Dimensional Continuous Control Using Generalized Advantage Estimation" (arXiv 1506.02438, 2016). GAE.
  • Thomas, "Bias in Natural Actor-Critic Algorithms" (2014). Discounted vs. undiscounted state distribution.
  • Nota and Thomas, "Is the Policy Gradient a Gradient?" (2020)

Trust-region and PPO:

  • Schulman, Levine, Abbeel, Jordan, Moritz, "Trust Region Policy Optimization" (arXiv 1502.05477, 2015). TRPO with CG and line search
  • Schulman, Wolski, Dhariwal, Radford, Klimov, "Proximal Policy Optimization Algorithms" (arXiv 1707.06347, 2017)
  • Engstrom, Ilyas, Santurkar, Tsipras, Janoos, Rudolph, Madry, "Implementation Matters in Deep Policy Gradients" (arXiv 2005.12729, 2020)
  • Ilyas, Engstrom, Santurkar, Tsipras, Janoos, Rudolph, Madry, "A Closer Look at Deep Policy Gradients" (arXiv 1811.02553, 2020)

Actor-critic variants:

  • Mnih, Badia, Mirza, Graves, Lillicrap, Harley, Silver, Kavukcuoglu, "Asynchronous Methods for Deep Reinforcement Learning" (arXiv 1602.01783, 2016). A3C and A2C
  • Lillicrap, Hunt, Pritzel, Heess, Erez, Tassa, Silver, Wierstra, "Continuous Control with Deep Reinforcement Learning" (arXiv 1509.02971, 2016). DDPG
  • Haarnoja, Zhou, Abbeel, Levine, "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor" (arXiv 1801.01290, 2018)

Textbook:

  • Sutton and Barto, Reinforcement Learning: An Introduction (2nd ed., 2018), Chapter 13
  • Agarwal, Jiang, Kakade, Sun, Reinforcement Learning: Theory and Algorithms (2022), Chapter 3


Last reviewed: April 23, 2026
