Policy Gradient Theorem
The fundamental result enabling gradient-based optimization of parameterized policies: the policy gradient theorem and the algorithms it spawns.
Why This Matters
The Bellman equations give us a clean route to optimal control when the state space is small and dynamic programming is tractable. That is not the regime that matters most in modern RL. Atari from pixels, continuous robot control, large recommender systems, and RLHF for language models all require optimizing a parameterized policy directly.
In these settings, we parameterize the policy as $\pi_\theta(a \mid s)$ (e.g., a neural network) and optimize the parameters $\theta$ by gradient ascent on the expected return. The policy gradient theorem tells us how to compute this gradient, even though the expected return depends on the entire trajectory distribution, which itself depends on $\theta$ in a complicated way.
This theorem is the foundation of REINFORCE, actor-critic methods, PPO, and the RLHF pipeline used to train modern language models.
Mental Model
We want to maximize $J(\theta)$, the expected total reward under the policy. The challenge is that changing $\theta$ changes the probability of every trajectory $\tau$, which is a product of many policy and transition probabilities.
The log-derivative trick resolves this: instead of differentiating through the trajectory distribution directly, we express the gradient as an expectation under the current policy. This means we can estimate the gradient from sampled experience; no differentiable environment model is required.
The theorem packages two effects into one expectation:
- the discounted occupancy $d^{\pi_\theta}(s)$ says which states the current policy actually visits;
- the score term $\nabla_\theta \log \pi_\theta(a \mid s)$ says how a parameter change reweights actions inside one visited state.
The non-obvious part is that the derivative of the state distribution does not disappear. It is absorbed into the visitation weights, which is why the final formula can be estimated from on-policy rollouts without differentiating the environment.
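This absorption can be checked numerically in a one-step setting. A minimal sketch, assuming a hypothetical two-action softmax policy with fixed per-action rewards: the sampled score-function estimate matches the analytic gradient of the expected reward.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: two-action softmax policy, fixed per-action rewards.
theta = np.array([0.5, -0.3])
rewards = np.array([1.0, 0.0])

def probs(t):
    e = np.exp(t - t.max())
    return e / e.sum()

p = probs(theta)

# Exact gradient of J(theta) = sum_a pi(a) r(a), using
# d pi(a)/d theta_k = pi(a) * (1[a == k] - pi(k)) for a softmax.
exact = np.array([sum(p[a] * ((a == k) - p[k]) * rewards[a] for a in range(2))
                  for k in range(2)])

# Score-function estimate: grad J = E_{a ~ pi}[ grad log pi(a) * r(a) ].
n = 200_000
actions = rng.choice(2, size=n, p=p)
score = np.eye(2)[actions] - p          # grad log pi(a) for a softmax
estimate = (score * rewards[actions][:, None]).mean(axis=0)

print("exact:   ", exact)
print("estimate:", estimate)
```

No derivative of the environment (here, the reward table) is ever taken; only the policy's log-probabilities are differentiated.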
Formal Setup and Notation
We work in the MDP framework with a parameterized stochastic policy $\pi_\theta(a \mid s)$. A trajectory is $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \dots)$.
Policy Objective
The policy objective is the expected discounted return:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{\infty} \gamma^t r_t\right] = \frac{1}{1-\gamma}\,\mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta(\cdot \mid s)}\!\left[r(s,a)\right],$$

where $d^{\pi_\theta}(s) = (1-\gamma)\sum_{t=0}^{\infty} \gamma^t \Pr(s_t = s \mid \pi_\theta)$ is the discounted state-visitation distribution. This is the distribution that makes the Sutton et al. (2000) policy gradient theorem exact. In most deep-RL implementations the gradient is estimated using the undiscounted empirical distribution of states visited along sampled trajectories, which introduces a well-documented bias. See Thomas (2014), "Bias in Natural Actor-Critic Algorithms," and Nota and Thomas (2020), "Is the Policy Gradient a Gradient?"
Advantage Function
The advantage function measures how much better action $a$ is compared to the average action under $\pi$:

$$A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s).$$

By construction, $\mathbb{E}_{a \sim \pi(\cdot \mid s)}\!\left[A^{\pi}(s,a)\right] = 0$ for all $s$.
Main Theorems
Policy Gradient Theorem
Statement
The gradient of the policy objective is:

$$\nabla_\theta J(\theta) = \frac{1}{1-\gamma}\,\mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta(\cdot \mid s)}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s,a)\right].$$

Equivalently, summing over time steps along a trajectory:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{\infty} \gamma^t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q^{\pi_\theta}(s_t, a_t)\right].$$
Intuition
$\nabla_\theta \log \pi_\theta(a \mid s)$ points in the direction that increases the probability of action $a$ in state $s$. The theorem says: weight this direction by how good that action is ($Q$-value), and average over the states and actions you actually visit. Actions that lead to high returns get their probabilities increased; actions that lead to low returns get decreased.
Proof Sketch
Write $V^{\pi_\theta}(s) = \sum_a \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s,a)$ and differentiate. Using the product rule:

$$\nabla_\theta V^{\pi}(s) = \sum_a \left[\nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi}(s,a) + \pi_\theta(a \mid s)\, \nabla_\theta Q^{\pi}(s,a)\right].$$

The reward term in $Q^{\pi}(s,a) = r(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V^{\pi}(s')$ has no $\theta$-dependence, so $\nabla_\theta Q^{\pi}(s,a) = \gamma \sum_{s'} P(s' \mid s,a)\, \nabla_\theta V^{\pi}(s')$. Substituting gives a one-step recursion on $\nabla_\theta V^{\pi}$.

Unrolling the recursion yields

$$\nabla_\theta V^{\pi}(s_0) = \sum_s \tilde{d}^{\pi}(s) \sum_a \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi}(s,a),$$

where $\tilde{d}^{\pi}(s) = \sum_{t=0}^{\infty} \gamma^t \Pr(s_t = s \mid s_0, \pi)$ is an unnormalized discounted visit count. The normalization is exact: $\sum_s \tilde{d}^{\pi}(s) = \frac{1}{1-\gamma}$, since $\sum_{t=0}^{\infty} \gamma^t = \frac{1}{1-\gamma}$. The factor $\frac{1}{1-\gamma}$ matters. Sutton, McAllester, Singh, Mansour 2000 state it explicitly, but many practical implementations drop it and use an estimator based on the undiscounted empirical state distribution, which is biased. See the Thomas (2014) and Nota and Thomas (2020) references above.

What this calculation delivers: the messy transition terms $P(s' \mid s,a)$ and recursive $\nabla_\theta V^{\pi}$ terms telescope into the discounted state-visitation weights; the gradient acts on $\pi_\theta$ only.

Now apply the log-derivative trick $\nabla_\theta \pi_\theta = \pi_\theta\, \nabla_\theta \log \pi_\theta$ and rescale by $1-\gamma$ to the normalized distribution $d^{\pi}(s) = (1-\gamma)\,\tilde{d}^{\pi}(s)$:

$$\nabla_\theta J(\theta) = \frac{1}{1-\gamma}\,\mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta(\cdot \mid s)}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s,a)\right].$$
The log-derivative step is what turns a sum over the discrete action set into an expectation you can estimate with samples, which is why this theorem is the starting point for every practical policy-gradient algorithm.
Why It Matters
This theorem makes policy optimization tractable. To estimate $\nabla_\theta J(\theta)$, sample trajectories under $\pi_\theta$, compute $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ by backpropagating through the policy network, multiply by a return estimate, and average. No environment model is needed. This is the basis of all policy gradient algorithms.
Failure Mode
The naive estimator has extremely high variance. A single trajectory gives a very noisy gradient estimate. Variance reduction techniques (baselines, advantage estimation) are essential for practical use.
Baseline Does Not Change the Gradient
Statement
For any function $b(s)$ that depends only on the state:

$$\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\right] = 0.$$

Therefore, replacing $Q^{\pi}(s,a)$ with $Q^{\pi}(s,a) - b(s)$ in the policy gradient does not change its expectation but reduces variance.
Intuition
This is the expected grad-log-prob (EGLP) lemma: $\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\right] = 0$ for any $s$ (a standard property of score functions in any parametric family with differentiable density). Multiplying by a state-only function $b(s)$ pulls $b(s)$ out of the expectation, so it still evaluates to zero. Subtracting a state-dependent constant from the reward therefore does not bias the gradient, but it does change the variance. $b(s) = V^{\pi}(s)$ is usually a good baseline because it centers the returns around zero.
Proof Sketch
$$\mathbb{E}_{a \sim \pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\right] = b(s) \sum_a \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s) = b(s) \sum_a \nabla_\theta \pi_\theta(a \mid s) = b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s) = b(s)\, \nabla_\theta 1 = 0.$$

The middle step uses the EGLP identity $\sum_a \nabla_\theta \pi_\theta(a \mid s) = \nabla_\theta \sum_a \pi_\theta(a \mid s) = 0$, which is exactly why the score function has zero mean.
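The lemma is easy to verify by simulation. A minimal sketch, assuming a hypothetical three-action softmax policy with made-up $Q$-values: subtracting the value baseline leaves the sample mean of the estimator unchanged but shrinks its variance.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical three-action softmax policy; made-up Q-values sharing a large
# common offset, so the baseline has a lot of variance to remove.
theta = np.array([0.2, -0.1, 0.4])
Q = np.array([12.0, 10.5, 11.0])

p = np.exp(theta - theta.max())
p /= p.sum()

n = 100_000
a = rng.choice(3, size=n, p=p)
score = np.eye(3)[a] - p                    # grad log pi(a) for a softmax

V = (p * Q).sum()                           # state value: E_a[Q(s, a)]
g_raw = score * Q[a][:, None]               # weight = Q(s, a)
g_adv = score * (Q[a] - V)[:, None]         # weight = Q(s, a) - V(s)

print(g_raw.mean(axis=0), g_adv.mean(axis=0))            # same expectation
print(g_raw.var(axis=0).sum(), g_adv.var(axis=0).sum())  # very different variance
```

The means agree to sampling error; the variance drop is dramatic because the common offset in the $Q$-values is pure noise that the baseline removes.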
Why It Matters
This is why subtracting the value function baseline to form the advantage is everywhere in modern RL. The baseline is $b(s) = V^{\pi}(s)$. The advantage $A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s)$ is what remains after subtracting it. Using the advantage instead of the raw $Q$-value dramatically reduces variance, making practical training possible. The value function is the standard practical baseline, not the variance-minimizing one. The exact minimum-variance baseline is derived in the exercises.
Where the Variance Comes From
The theorem gives an unbiased gradient identity; it does not give a low-variance estimator. Most practical policy-gradient engineering is about removing variance without quietly changing the objective.
| Noise source | What it does to the estimator | Standard fix |
|---|---|---|
| action sampling | Different sampled actions produce different score vectors even in the same state | larger batches, lower-entropy policies late in training |
| long-horizon returns | A reward at time $t$ perturbs the update for many earlier actions | reward-to-go, baselines, GAE |
| state visitation randomness | Rare but important states are sampled unevenly across rollouts | on-policy data refresh, trust-region updates, better exploration |
| critic approximation error | Replacing sampled returns with $V_\phi$-based estimates lowers variance but can add bias | value-loss tuning, target stabilization, advantage normalization |
This is why the clean theorem lives next to messy algorithms. The identity is simple; the estimator quality depends on how much of this noise you remove.
Five Valid Weights for the Score Function
The policy gradient takes the general form

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \Psi_t\right],$$

and there are several valid choices for the weight $\Psi_t$ that share the same expectation but differ wildly in variance. Spinning Up enumerates five:
- Full return of the trajectory: $\Psi_t = R(\tau) = \sum_{t'=0}^{T} r_{t'}$. The naive REINFORCE weighting. Unbiased but very high variance.
- Reward-to-go: $\Psi_t = \sum_{t'=t}^{T} r_{t'}$. Drops rewards received before time $t$, which cannot depend on action $a_t$.
- Reward-to-go minus a state baseline: $\Psi_t = \sum_{t'=t}^{T} r_{t'} - b(s_t)$. Valid for any $b$ by the EGLP lemma; $b(s_t) = V^{\pi}(s_t)$ is the standard choice.
- Action-value function: $\Psi_t = Q^{\pi}(s_t, a_t)$. This is the exact on-policy weight that emerges from the theorem.
- Advantage function: $\Psi_t = A^{\pi}(s_t, a_t)$. Case 4 with the value-function baseline. This is what almost every modern algorithm (A2C, PPO, TRPO, SAC with actor updates) actually uses.
All five are unbiased estimators of the policy gradient. Variance drops sharply from form 1 to form 5 because each step removes a zero-mean noise source.
Don't let the past distract you
Why is the reward-to-go (form 2) better than the full return (form 1)? Both are unbiased. But rewards received before time $t$ are independent of the action taken at time $t$, so $\mathbb{E}\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, r_{t'}\right] = 0$ for $t' < t$ by the EGLP lemma. Past rewards contribute zero to the expectation but nonzero to the variance of the sample estimate. Dropping them is pure variance reduction, no bias introduced. Spinning Up calls this "don't let the past distract you."
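The same point can be simulated. A toy sketch (all quantities made up): a noisy reward arrives before the action of interest, and we estimate that action's score term with and without it. The expectation is the same; the variance is not.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy two-step episode: a noisy reward r0 arrives before action a1 is taken,
# then r1 depends on a1. Estimate the gradient term for a1 two ways.
theta = np.array([0.0, 0.0])
p = np.exp(theta) / np.exp(theta).sum()

n = 200_000
r0 = rng.normal(0.0, 5.0, size=n)        # past reward: pure noise w.r.t. a1
a1 = rng.choice(2, size=n, p=p)
r1 = np.where(a1 == 0, 1.0, 0.0)         # reward actually caused by a1
score1 = np.eye(2)[a1] - p               # grad log pi(a1) for a softmax

g_full = score1 * (r0 + r1)[:, None]     # weight = full return
g_togo = score1 * r1[:, None]            # weight = reward-to-go

print(g_full.mean(axis=0), g_togo.mean(axis=0))            # same expectation
print(g_full.var(axis=0).sum(), g_togo.var(axis=0).sum())  # full return is noisier
```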
The REINFORCE Algorithm
The simplest policy gradient algorithm. Sample a trajectory under $\pi_\theta$, then update:

$$\theta \leftarrow \theta + \alpha \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t,$$

where $G_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$ is the return from time $t$.

REINFORCE is unbiased but has high variance. In practice, subtract a baseline:

$$\theta \leftarrow \theta + \alpha \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \big(G_t - b(s_t)\big).$$
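A minimal REINFORCE loop, assuming a hypothetical two-armed bandit (one-step episodes) with made-up reward means and a running-average baseline:

```python
import numpy as np

rng = np.random.default_rng(0)

# REINFORCE with a baseline on a toy 2-armed bandit.
# Arm 0 pays 1.0 in expectation, arm 1 pays 0.2 (hypothetical values).
means = np.array([1.0, 0.2])

theta = np.zeros(2)              # softmax logits
baseline, alpha, beta = 0.0, 0.1, 0.05

for _ in range(2000):
    p = np.exp(theta - theta.max()); p /= p.sum()
    a = rng.choice(2, p=p)
    r = means[a] + 0.1 * rng.standard_normal()
    score = np.eye(2)[a] - p                 # grad log pi(a)
    theta += alpha * (r - baseline) * score  # policy gradient step
    baseline += beta * (r - baseline)        # running-average baseline

p = np.exp(theta - theta.max()); p /= p.sum()
print(p)  # probability mass should concentrate on arm 0
```

The baseline here is a scalar running average rather than a learned $V_\phi$, which is all a one-step bandit needs.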
Actor-Critic Methods
Instead of waiting for a full trajectory to compute $G_t$, use a learned value function $V_\phi$ as both a baseline and a bootstrap target:

$$\hat{A}_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t).$$

This is the TD(0) advantage estimate, the one-step TD residual $\delta_t$. Update the policy (actor) using:

$$\theta \leftarrow \theta + \alpha\, \hat{A}_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t).$$

Update the value function (critic) by minimizing $\big(r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)\big)^2$ or using TD learning. Actor-critic trades some bias (from bootstrapping) for dramatically lower variance compared to REINFORCE. The 1-step residual $\delta_t$ and the Monte Carlo return $G_t$ are two ends of a spectrum; n-step returns and the GAE estimator below interpolate between them.
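A single actor-critic update can be sketched in tabular form. State and action counts, learning rates, and the sampled transition below are all illustrative:

```python
import numpy as np

# One actor-critic update on a tabular MDP (sizes are illustrative).
n_states, n_actions = 4, 2
theta = np.zeros((n_states, n_actions))   # softmax logits per state (actor)
V = np.zeros(n_states)                    # tabular value estimates (critic)
alpha_pi, alpha_v, gamma = 0.1, 0.2, 0.99

def policy(s):
    e = np.exp(theta[s] - theta[s].max())
    return e / e.sum()

def actor_critic_step(s, a, r, s_next, done):
    """TD(0) advantage: delta = r + gamma * V(s') - V(s)."""
    target = r + (0.0 if done else gamma * V[s_next])
    delta = target - V[s]
    V[s] += alpha_v * delta                  # critic: TD learning on V
    score = -policy(s); score[a] += 1.0      # grad log pi(a|s) for a softmax
    theta[s] += alpha_pi * delta * score     # actor: policy gradient step
    return delta

delta = actor_critic_step(s=0, a=1, r=1.0, s_next=2, done=False)
print(delta, V[0])
```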
Natural Policy Gradient
The natural gradient is due to Amari (1998), "Natural Gradient Works Efficiently in Learning," which applies information geometry to parameter estimation. Kakade (2001), "A Natural Policy Gradient," transferred the idea to RL by identifying the policy Fisher information as the correct metric on policy space.
Natural Policy Gradient
Statement
The Fisher information matrix of the policy is:

$$F(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta(\cdot \mid s)}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)^{\top}\right].$$

The natural policy gradient is:

$$\tilde{\nabla}_\theta J(\theta) = F(\theta)^{-1}\, \nabla_\theta J(\theta).$$

This is the steepest ascent direction in the KL-divergence metric on policy space, rather than in Euclidean parameter space.
Intuition
The standard gradient depends on the parameterization: reparameterizing changes the gradient direction. The natural gradient is invariant to reparameterization. It asks: what is the best policy update of a given KL size? This is the right question because we care about how much the policy changes, not how much the parameters change.
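The parameterization point can be made concrete with a one-parameter Bernoulli policy $\pi_\theta(a{=}1) = \sigma(\theta)$ and made-up rewards $r(1)=1$, $r(0)=0$. The vanilla gradient collapses when the sigmoid saturates; dividing by the Fisher information undoes that distortion:

```python
import numpy as np

# One-parameter Bernoulli policy pi(a=1) = sigmoid(theta), with hypothetical
# rewards r(1)=1, r(0)=0, so J(theta) = sigmoid(theta).
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def grads(theta):
    p = sigmoid(theta)
    vanilla = p * (1 - p)        # dJ/dtheta = sigmoid'(theta)
    # Fisher information in this parameterization:
    # F = E[(d log pi / d theta)^2] = p*(1-p)^2 + (1-p)*p^2 = p*(1-p)
    fisher = p * (1 - p)
    natural = vanilla / fisher   # constant: the geometry cancels the squash
    return vanilla, natural

for theta in (-4.0, 0.0, 4.0):
    v, nat = grads(theta)
    print(theta, v, nat)
```

The vanilla gradient is tiny at $\theta = \pm 4$ even though the policy is equally "wrong" in KL terms; the natural gradient is the same everywhere.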
Why It Matters
The natural policy gradient leads to more stable updates and is the theoretical foundation for TRPO. PPO shares the KL-bounded surrogate framing but is a first-order method, not a natural-gradient approximation.
The point is simple: Euclidean parameter distance is chart-dependent, but KL distance in policy space is not. The natural gradient chooses the update that is locally steepest in the geometry that the policy itself induces.
Connection to Trust Regions
The natural policy gradient motivates trust region methods:
TRPO (Trust Region Policy Optimization), Schulman et al. (2015), arXiv 1502.05477. Maximize the surrogate objective $\mathbb{E}\!\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)}\, \hat{A}(s,a)\right]$ subject to $\mathbb{E}\!\left[D_{\mathrm{KL}}\!\big(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s)\,\|\,\pi_\theta(\cdot \mid s)\big)\right] \le \delta$. The step direction is computed by linearizing the objective and quadratically approximating the KL constraint, which reduces to solving for the natural-gradient direction. TRPO solves this linear system with conjugate gradients using Fisher-vector products (no explicit Fisher matrix is formed), then performs a backtracking line search along that direction to enforce the exact KL bound and verify surrogate improvement. The CG plus line-search structure is what makes TRPO a trust-region method, not a pure natural-gradient step.
PPO (Proximal Policy Optimization), Schulman et al. (2017). Replace the hard KL constraint with a clipped surrogate:

$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\, \hat{A}_t,\ \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\, \hat{A}_t\big)\right],$$

where $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}$. PPO is a first-order method that takes Adam steps on $L^{\mathrm{CLIP}}$ with standard backpropagation. It is not an approximate natural-gradient method, despite a common misreading. Engstrom et al. (2020), "Implementation Matters in Deep Policy Gradients" (arXiv 2005.12729), and Ilyas et al. (2020), "A Closer Look at Deep Policy Gradients" (arXiv 1811.02553), show that PPO's empirical advantage over vanilla policy gradient comes substantially from implementation details: value-function clipping, reward and advantage normalization, orthogonal weight initialization, learning-rate annealing, and observation normalization. The clipped objective alone does not account for PPO's reported gains. PPO is the algorithm used in the RLHF pipeline for language model training, but using it should not be confused with approximating the natural gradient.
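The clipped surrogate is a few lines of array code. A sketch (the helper name is illustrative; $\epsilon = 0.2$ is the common default from the PPO paper):

```python
import numpy as np

# PPO clipped surrogate, negated so a minimizer like Adam can descend it.
def ppo_clip_loss(logp_new, logp_old, adv, epsilon=0.2):
    ratio = np.exp(logp_new - logp_old)           # pi_theta / pi_theta_old
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - epsilon, 1 + epsilon) * adv
    return -np.minimum(unclipped, clipped).mean()

# With a positive advantage, ratios above 1 + epsilon earn no extra credit:
# ratio 2 contributes only 1.2 * adv to the surrogate.
print(ppo_clip_loss(np.log(2.0) * np.ones(1), np.zeros(1), np.array([1.0])))
```

The elementwise `min` makes the bound one-sided in exactly the pessimistic direction: large ratio increases are ignored when the advantage is positive, and large decreases are ignored when it is negative.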
Generalized Advantage Estimation
REINFORCE uses the full Monte Carlo return $G_t = \sum_{t'=t}^{\infty} \gamma^{t'-t} r_{t'}$. The 1-step TD advantage uses $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$. These are two endpoints of a spectrum. n-step returns sit between them:

$$\hat{A}_t^{(n)} = \sum_{l=0}^{n-1} \gamma^l r_{t+l} + \gamma^n V(s_{t+n}) - V(s_t).$$

Schulman et al. (2016), "High-Dimensional Continuous Control Using Generalized Advantage Estimation" (arXiv 1506.02438), introduce the GAE estimator as an exponentially weighted average over n-step advantages:

$$\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}.$$

Setting $\lambda = 0$ recovers the 1-step TD advantage (low variance, high bias from $V$). Setting $\lambda = 1$ recovers Monte Carlo with a value baseline (unbiased, high variance). Intermediate $\lambda$ trades bias against variance. GAE is the standard advantage estimator used with PPO and TRPO in practice.
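The GAE sum collapses to the recursion $\hat{A}_t = \delta_t + \gamma\lambda\, \hat{A}_{t+1}$, one backward pass over the trajectory. A sketch with made-up rewards and value estimates:

```python
import numpy as np

# GAE(gamma, lambda) from one trajectory of rewards and value estimates.
def gae(rewards, values, gamma=0.99, lam=0.95):
    """values has length len(rewards) + 1 (bootstrap value for the final state)."""
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

rewards = np.array([0.0, 0.0, 1.0])
values = np.array([0.5, 0.5, 0.5, 0.0])   # made-up critic estimates

# lam=0 gives the 1-step TD advantage; lam=1 gives Monte Carlo minus V(s_t).
print(gae(rewards, values, gamma=1.0, lam=0.0))
print(gae(rewards, values, gamma=1.0, lam=1.0))
```

With $\gamma = 1$ here, the $\lambda = 1$ output is the undiscounted return minus the value estimate at each step, exactly the Monte Carlo endpoint of the spectrum.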
What the Theorem Does Not Give You for Free
Action-dependent baselines need a correction term
The state-only baseline result is exact because $\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\right] = 0$. If you replace $b(s)$ with an action-dependent term $b(s,a)$, that identity fails in general. The estimator becomes biased unless you add an explicit correction. This is why the standard theorem is careful about state-only baselines.
The theorem is on-policy
The expectation in the policy gradient theorem is under the current policy's occupancy measure. If the data come from an older policy or a replay buffer, you are no longer estimating the same expectation unless you add importance weights or use an off-policy actor-critic derivation. Off-policy data reuse is possible, but it is not a free corollary of the theorem.
Actor-Critic Variants
Beyond the vanilla actor-critic above, several variants are in common use:
- A2C and A3C (Mnih et al. 2016, arXiv 1602.01783). Synchronous and asynchronous parallel actor-learners sharing a global network. A3C predates GPU-dominant training; A2C is the synchronous equivalent that is typically preferred on modern hardware.
- DPG and DDPG. The deterministic policy gradient theorem (Silver et al. 2014) extends the stochastic PG theorem to deterministic policies $a = \mu_\theta(s)$: the gradient becomes $\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\mu}}\!\left[\nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu}(s,a)\big|_{a=\mu_\theta(s)}\right]$. DDPG (Lillicrap et al. 2016, arXiv 1509.02971) is the deep, off-policy actor-critic instantiation for continuous control.
- SAC (Haarnoja et al. 2018, arXiv 1801.01290). Soft Actor-Critic adds an entropy bonus to the reward, optimizing $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_t \gamma^t \big(r_t + \alpha\, \mathcal{H}(\pi_\theta(\cdot \mid s_t))\big)\right]$. Off-policy, stochastic, and the dominant algorithm on many continuous-control benchmarks.
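The deterministic gradient in the DPG bullet above can be sketched with a scalar policy $\mu_\theta(s) = \theta$ and a made-up quadratic critic: the chain rule $\nabla_\theta \mu \cdot \nabla_a Q$ walks the action toward the critic's argmax.

```python
import numpy as np

# Deterministic policy gradient sketch: mu_theta(s) = theta (scalar action),
# with a hypothetical quadratic critic Q(s, a) = -(a - 2)^2.
def dpg_grad(theta):
    a = theta                      # mu_theta(s)
    dq_da = -2.0 * (a - 2.0)       # grad_a Q(s, a) at a = mu_theta(s)
    dmu_dtheta = 1.0               # grad_theta mu_theta(s)
    return dmu_dtheta * dq_da

theta = 0.0
for _ in range(100):
    theta += 0.1 * dpg_grad(theta)   # ascend Q along the action gradient
print(theta)   # converges toward the critic's argmax a = 2
```

Note there is no score function here at all: the gradient flows through the critic's action derivative, which is why DPG needs a differentiable $Q$ but no action sampling.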
Why the Log-Derivative Trick Works
The identity $\nabla_\theta p_\theta(x) = p_\theta(x)\, \nabla_\theta \log p_\theta(x)$ converts a gradient of a probability into an expectation under that probability. This is the same trick used in variational inference (the ELBO gradient) and in score-based diffusion models. It works whenever you need to differentiate an expectation with respect to parameters of the distribution.
The key property is that $\mathbb{E}_{x \sim p_\theta}\!\left[\nabla_\theta \log p_\theta(x)\right] = 0$. The score function has zero mean. This is what makes baselines valid.
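The zero-mean property holds for any differentiable family, not just policies. A quick check for a Gaussian $\mathcal{N}(\mu, 1)$, where the score with respect to the mean is $x - \mu$:

```python
import numpy as np

rng = np.random.default_rng(2)

# Score of N(mu, 1) w.r.t. mu: d/dmu log p(x) = x - mu, and E[x - mu] = 0.
mu = 1.7   # arbitrary illustrative mean
x = rng.normal(mu, 1.0, size=1_000_000)
score = x - mu
print(score.mean())   # close to zero, up to sampling error
```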
Common Confusions
Policy gradients are not backpropagation through the environment
A common misconception is that policy gradients differentiate through the environment dynamics. They do not. The gradient flows only through $\log \pi_\theta(a \mid s)$, the policy network. The environment is treated as a black box that produces rewards and next states. This is why policy gradients work in environments where the dynamics are unknown.
REINFORCE is unbiased but not practical without variance reduction
The raw REINFORCE estimator has the correct expectation but enormous variance. A single trajectory might have a return of 100 or -50 depending on luck. Without a baseline, you need an impractical number of samples. The advantage function and multiple-sample estimators are not optional extras; they are necessary for the algorithm to work at all.
Natural gradient is not just preconditioning
While the natural gradient can be seen as preconditioning the gradient by , its justification is geometric: it is the steepest ascent direction when distance is measured by KL divergence between policies, not Euclidean distance between parameters. This distinction matters because the same KL change can correspond to very different parameter changes depending on the region of parameter space.
Summary
- The policy gradient is $\nabla_\theta J(\theta) = \frac{1}{1-\gamma}\,\mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s,a)\right]$
- The log-derivative trick converts a gradient of an expectation into an expectation of a gradient
- Subtracting a state-dependent baseline does not change the expected gradient but reduces variance
- The value function $V^{\pi}$ is the canonical baseline. The advantage $A^{\pi} = Q^{\pi} - V^{\pi}$ is what remains after subtraction, not the baseline itself
- The exact variance-minimizing baseline is a $\|\nabla_\theta \log \pi_\theta\|^2$-weighted average of $Q$-values (Weaver and Tao 2001, Greensmith et al. 2004), not $V^{\pi}$
- Actor-critic: learn $V_\phi$ as a baseline, bootstrap to reduce variance at the cost of some bias
- GAE interpolates between 1-step TD advantage and the full Monte Carlo return
- Natural policy gradient (Amari 1998, Kakade 2001) uses Fisher information to make updates invariant to parameterization
- TRPO (Schulman et al. 2015) is a practical natural-gradient method using conjugate gradients and a backtracking line search
- PPO is a first-order clipped-objective method, not a natural-gradient approximation; its empirical gains come substantially from implementation details
Exercises
Problem
Show that $\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\right] = 0$ for any state $s$, assuming $\pi_\theta(a \mid s) > 0$ for all $a$.
Problem
For a policy that is softmax over two actions with logits $\theta_1, \theta_2$, compute $\nabla_\theta \log \pi_\theta(a_1 \mid s)$.
Problem
Show that replacing the full trajectory return $R(\tau)$ with the reward-to-go $\sum_{t'=t}^{T} r_{t'}$ leaves the REINFORCE estimator unbiased. In other words, prove that rewards received before time $t$ do not contribute to the expectation of the $t$-th score term.
Problem
Prove that the variance-minimizing baseline for the REINFORCE estimator is $b^*(s) = \dfrac{\mathbb{E}_{a \sim \pi_\theta}\!\left[\|\nabla_\theta \log \pi_\theta(a \mid s)\|^2\, Q^{\pi}(s,a)\right]}{\mathbb{E}_{a \sim \pi_\theta}\!\left[\|\nabla_\theta \log \pi_\theta(a \mid s)\|^2\right]}$ (a weighted average of $Q$-values, not simply $V^{\pi}(s)$).
References
Canonical:
- Sutton, McAllester, Singh, Mansour, "Policy Gradient Methods for RL with Function Approximation" (2000)
- Williams, "Simple Statistical Gradient-Following Algorithms for Connectionist RL" (1992). original REINFORCE
- Amari, "Natural Gradient Works Efficiently in Learning," Neural Computation (1998). origin of the natural gradient
- Kakade, "A Natural Policy Gradient" (2001). natural gradient transferred to RL
- Silver, Lever, Heess, Degris, Wierstra, Riedmiller, "Deterministic Policy Gradient Algorithms" (2014). DPG theorem
Variance reduction and advantage estimation:
- Weaver and Tao, "The Optimal Reward Baseline for Gradient-Based Reinforcement Learning" (2001)
- Greensmith, Bartlett, Baxter, "Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning" (2004). exact minimum-variance baseline
- Schulman, Moritz, Levine, Jordan, Abbeel, "High-Dimensional Continuous Control Using Generalized Advantage Estimation" (arXiv 1506.02438, 2016). GAE
- Thomas, "Bias in Natural Actor-Critic Algorithms" (2014). discounted vs undiscounted state distribution
- Nota and Thomas, "Is the Policy Gradient a Gradient?" (2020)
Trust-region and PPO:
- Schulman, Levine, Abbeel, Jordan, Moritz, "Trust Region Policy Optimization" (arXiv 1502.05477, 2015). TRPO with CG and line search
- Schulman, Wolski, Dhariwal, Radford, Klimov, "Proximal Policy Optimization Algorithms" (arXiv 1707.06347, 2017)
- Engstrom, Ilyas, Santurkar, Tsipras, Janoos, Rudolph, Madry, "Implementation Matters in Deep Policy Gradients" (arXiv 2005.12729, 2020)
- Ilyas, Engstrom, Santurkar, Tsipras, Janoos, Rudolph, Madry, "A Closer Look at Deep Policy Gradients" (arXiv 1811.02553, 2020)
Actor-critic variants:
- Mnih, Badia, Mirza, Graves, Lillicrap, Harley, Silver, Kavukcuoglu, "Asynchronous Methods for Deep Reinforcement Learning" (arXiv 1602.01783, 2016). A3C and A2C
- Lillicrap, Hunt, Pritzel, Heess, Erez, Tassa, Silver, Wierstra, "Continuous Control with Deep Reinforcement Learning" (arXiv 1509.02971, 2016). DDPG
- Haarnoja, Zhou, Abbeel, Levine, "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor" (arXiv 1801.01290, 2018)
Textbook:
- Sutton and Barto, Reinforcement Learning: An Introduction (2nd ed., 2018), Chapter 13
- Agarwal, Jiang, Kakade, Sun, Reinforcement Learning: Theory and Algorithms (2022), Chapter 3
Next Topics
The natural next steps from policy gradients:
- Actor-Critic Methods: the variance-reduced family that turns the baseline into a learned critic
- Policy Optimization: PPO and TRPO: trust-region and clipped-surrogate descendants of the theorem
- RLHF and alignment: how policy-gradient ideas reappear in preference optimization and PPO-style post-training
Last reviewed: April 23, 2026