
Applied ML

Reward Systems and Reinforcement Learning Neuroscience

Dopamine as TD error, distributional RL as a model of dopamine variability, actor-critic mapped to cortex-striatum, and the model-based versus model-free dichotomy in human choice.

Advanced · Tier 3 · Stable · Reference · ~15 min

Why This Matters

Reinforcement learning is one of the few places where machine learning and neuroscience genuinely converge on the same equations. The Schultz-Dayan-Montague paper (1997, Science 275(5306)) showed that midbrain dopamine neurons encode a temporal-difference error signal almost exactly matching the \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t) that Sutton and Barto had proposed on engineering grounds. Two decades later, Dabney et al. (2020, Nature 577(7792)) showed that the heterogeneity of dopamine responses fits distributional RL better than scalar TD. The math the ML community uses to train agents keeps showing up in mouse and primate recordings.

For ML readers this matters in two directions. It is empirical evidence that TD-style algorithms are computationally natural enough that biology rediscovered them. It also gives an outside-view check on RL design choices: if a proposed architecture has no plausible biological substrate, that is not by itself disqualifying, but if a biologically observed mechanism (distributional codes, model-based arbitration) consistently outperforms scalar alternatives in animals, it is a hint about which inductive biases pay off.

Core Ideas

Dopamine as TD error. Schultz and colleagues recorded ventral tegmental area neurons in monkeys during classical conditioning. Before learning, dopamine cells fired on reward delivery. After learning, they fired on the predictive cue and went silent on expected reward; an omitted expected reward caused a phasic dip. This three-way pattern (positive on unexpected reward, zero on expected reward, negative on omitted reward) is the signature of a reward prediction error, exactly \delta_t from TD learning.
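The backward shift of the error signal can be reproduced with a few lines of TD(0). This is a minimal sketch, not the original analysis: the trial length, discount, and learning rate are invented, and the cue is modeled as arriving unpredictably from a zero-value baseline.

```python
import numpy as np

# TD(0) simulation of a conditioning trial (parameters invented for
# illustration). Time step 0 is cue onset; the reward arrives at the final
# step. The cue itself is unpredictable, so the pre-cue baseline value is 0.
T, gamma, alpha = 10, 0.98, 0.1
r = np.zeros(T)
r[-1] = 1.0                       # reward at the last step of the trial
V = np.zeros(T + 1)               # V[T] = 0: terminal, post-reward state

def run_trial(V):
    """Run one trial, returning the TD error at cue onset and at each step."""
    deltas = [gamma * V[0] - 0.0]              # error when the cue appears
    for t in range(T):
        d = r[t] + gamma * V[t + 1] - V[t]     # delta_t = r + gamma*V' - V
        V[t] += alpha * d
        deltas.append(d)
    return deltas                              # deltas[-1] is reward-time error

before = run_trial(V.copy())      # naive animal: big error at reward time
for _ in range(5000):
    after = run_trial(V)          # trained animal: error moves to the cue

print(round(before[-1], 2), round(before[0], 2))   # 1.0 0.0
print(round(after[-1], 2), round(after[0], 2))     # reward-time error ~0, cue error ~gamma**T
```

Before learning the entire error lands at reward delivery; after convergence the expected reward produces no error and the cue carries it instead, matching the recordings.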

Distributional RL in the brain. Dabney et al. (2020) noted that dopamine cells differ in their reversal points: some respond positively to small rewards, others only to large ones. Interpreted as a population of value predictors with different optimism levels, the population encodes a distribution over returns rather than its mean. This is the same architecture as C51 and QR-DQN. The behavioral signature (asymmetric updating from positive versus negative surprises) shows up in both mice and the trained agents.
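The asymmetric-update idea can be sketched directly (this is an illustrative expectile-style toy, not the Dabney et al. analysis code; the population sizes, learning rate, and bimodal reward distribution are made up). Each simulated "cell" scales positive surprises by its own optimism level tau and negative surprises by 1 - tau, and the population fans out across the return distribution instead of collapsing onto the mean.

```python
import random

# A population of value predictors with different update asymmetries learns
# distinct "reversal points" on a bimodal reward distribution, together
# encoding the distribution rather than its mean. All numbers are illustrative.
random.seed(0)
taus = [0.1, 0.25, 0.5, 0.75, 0.9]   # optimism levels, one per simulated cell
values = [0.0] * len(taus)
alpha = 0.02

for _ in range(20000):
    r = random.choice([0.0, 1.0])    # bimodal reward: 0 or 1, equally likely
    for i, tau in enumerate(taus):
        delta = r - values[i]
        # asymmetric learning: positive surprises weighted by tau,
        # negative surprises by (1 - tau)
        scale = tau if delta > 0 else (1 - tau)
        values[i] += alpha * scale * delta

# pessimistic cells settle low, optimistic cells high; tau = 0.5 tracks the mean
print([round(v, 2) for v in values])
```

The spread of learned values across cells is the "reversal point" heterogeneity: a reward of intermediate size is a positive surprise for the pessimistic cells and a negative one for the optimistic cells.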

Actor-critic mapping. A common functional map: dorsolateral striatum implements the actor (action selection), ventral striatum and orbitofrontal cortex implement the critic (state value), midbrain dopamine carries the TD error that updates both. The cortex-basal-ganglia-thalamus loop closes this in vivo. The mapping is a simplification: parallel loops handle different action sets and timescales, and many circuits do not fit cleanly. As a first-order picture it is useful.
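The functional map can be made concrete with a tabular actor-critic on a two-armed bandit. This is a sketch under invented parameters (reward probabilities, learning rates), with the simplified anatomical labels from the paragraph above attached as comments; it is not a circuit model.

```python
import math, random

# Tabular actor-critic on a two-armed bandit. The anatomical labels follow
# the simplified mapping in the text; all numbers are invented.
random.seed(1)
p_reward = [0.2, 0.8]              # arm 1 pays off more often
h = [0.0, 0.0]                     # actor: action preferences ("dorsolateral striatum")
V = 0.0                            # critic: state value ("ventral striatum / OFC")
lr_actor, lr_critic = 0.1, 0.1

for _ in range(2000):
    z = [math.exp(x) for x in h]                   # softmax policy over preferences
    a = random.choices((0, 1), weights=z)[0]
    r = 1.0 if random.random() < p_reward[a] else 0.0
    delta = r - V                  # prediction error: the broadcast "dopamine" signal
    V += lr_critic * delta         # critic update
    h[a] += lr_actor * delta       # actor update, driven by the same error

print(round(h[0], 2), round(h[1], 2))   # the better arm ends up preferred
```

The key structural point is that one scalar error updates both modules, which is exactly the role the TD-error account assigns to the midbrain dopamine projection.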

Model-based versus model-free in humans. Daw, Niv, and Dayan (2005, Nat. Neurosci. 8(12)) proposed that the brain runs both a model-based system (forward simulation through a learned transition model, slow but flexible) and a model-free system (cached values updated by TD, fast but inflexible), with arbitration weighted by relative uncertainty. The two-step task (Daw et al. 2011, Neuron 69) operationalized this and showed that human choices typically reflect a mixture, with the model-based weight rising under reflection and falling under stress, time pressure, or in some psychiatric conditions.
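The uncertainty-weighted arbitration idea reduces to a precision-weighted mixture. The sketch below is illustrative only: the function name and every number are invented, and real arbitration models estimate the uncertainties online rather than receiving them as arguments.

```python
# Illustrative sketch of the Daw et al. (2005) arbitration idea: blend a
# model-based and a model-free value estimate, weighting each by its
# reliability (inverse variance). All inputs below are made-up numbers.

def mixture_value(q_mb, q_mf, var_mb, var_mf):
    """Precision-weighted blend: the less uncertain system gets more weight."""
    w_mb = (1.0 / var_mb) / (1.0 / var_mb + 1.0 / var_mf)
    return w_mb * q_mb + (1.0 - w_mb) * q_mf, w_mb

# Early in learning the model-free cache is noisy, so planning dominates;
# after extensive training the cached values become reliable and take over.
early = mixture_value(q_mb=0.8, q_mf=0.3, var_mb=0.1, var_mf=1.0)
late = mixture_value(q_mb=0.8, q_mf=0.3, var_mb=0.1, var_mf=0.05)
print(round(early[1], 2), round(late[1], 2))   # 0.91 0.33
```

This reproduces the behavioral trajectory in outline: model-based weight is high when cached values are untrustworthy and falls as training makes the cache reliable.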

Definition

Reward Prediction Error

The reward prediction error is the difference between the reward just received plus the discounted value of the next state and the value of the current state: \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t). Positive error means the outcome was better than expected; negative error means it was worse.
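The definition is a one-liner in code. The helper name and the example numbers below are arbitrary choices for illustration.

```python
# Direct computation of the reward prediction error from the definition.
# gamma and the example values are arbitrary illustrative numbers.
gamma = 0.9

def rpe(r_t, v_next, v_now):
    """delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)"""
    return r_t + gamma * v_next - v_now

print(rpe(1.0, 0.0, 0.5))   # 0.5: better than expected, positive error
print(rpe(0.0, 0.0, 0.5))   # -0.5: omitted expected reward, negative error
```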

Proposition

Dopamine TD-Error Signature

Statement

In a learned conditioning task, a TD-error account predicts dopamine response to the cue, little response to the expected reward, and a dip when the expected reward is omitted.

Intuition

The signal moves from reward delivery to the earliest reliable predictor of reward. Once the reward is fully predicted, receiving it no longer changes the value estimate.

Failure Mode

The pattern is less clean when reward timing is uncertain, when motivational state changes the value scale, or when dopamine is measured too coarsely to isolate the phasic component.

ExerciseCore

Problem

If a cue predicts reward with high confidence and the reward arrives exactly as expected, what sign should \delta_t have at reward time?

Common Confusions

Watch Out

Dopamine is not just pleasure

Older accounts treated dopamine as the brain's reward signal in a hedonic sense. The TD-error account is sharper: dopamine encodes the change in expected future reward, not reward itself. A fully predicted reward delivers no dopamine response even though it remains pleasant.

Watch Out

Model-based RL is not always better

In animals and humans, neither system dominates. Model-free control is computationally cheap at decision time and suffices on stationary tasks; model-based control is more sample-efficient and wins when the environment changes or when an action's downstream consequences diverge from the cached value. Arbitration based on relative uncertainty is the empirically supported pattern, and the same hybrid strategy appears in ML architectures like Dyna.
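The Dyna hybrid mentioned above can be sketched in a few lines: model-free Q-learning updates from real experience, plus extra "planning" updates replayed from a learned model. The chain environment and all parameters here are invented for illustration.

```python
import random

# Minimal Dyna-Q on a 5-state chain: direct model-free updates plus planning
# updates replayed from a learned deterministic model. Numbers are made up.
random.seed(0)
N = 5                                  # states 0..4; reward on reaching state 4
ACTIONS = (1, -1)                      # move right / move left
Q = {(s, a): 0.0 for s in range(N) for a in ACTIONS}
model = {}                             # (s, a) -> (r, s_next)
alpha, gamma, eps, n_plan = 0.5, 0.9, 0.1, 10

def step(s, a):
    s2 = max(0, min(N - 1, s + a))
    return (1.0 if s2 == N - 1 else 0.0), s2

def backup(s, a, r, s2):
    Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in ACTIONS) - Q[(s, a)])

for _ in range(50):                    # episodes
    s = 0
    while s != N - 1:
        a = random.choice(ACTIONS) if random.random() < eps else \
            max(ACTIONS, key=lambda b: Q[(s, b)])
        r, s2 = step(s, a)
        backup(s, a, r, s2)            # direct (model-free) update
        model[(s, a)] = (r, s2)        # learn the transition model
        for _ in range(n_plan):        # planning: replay sampled model entries
            ps, pa = random.choice(list(model))
            backup(ps, pa, *model[(ps, pa)])
        s = s2

print(all(Q[(s, 1)] > Q[(s, -1)] for s in range(N - 1)))   # True: greedy policy moves right
```

The planning loop is the model-based ingredient: cached values get refreshed from the model between real steps, which is one reading of why a system with both components adapts faster than either alone.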

References

Canonical:

  • Schultz, Dayan, Montague, "A Neural Substrate of Prediction and Reward," Science 275(5306):1593-1599 (1997)
  • Sutton & Barto, Reinforcement Learning: An Introduction, 2nd ed., MIT Press (2018), Chapter 6
  • Niv, "Reinforcement Learning in the Brain," Journal of Mathematical Psychology 53(3):139-154 (2009)

Current:

  • Dabney, Kurth-Nelson, Uchida, Starkweather, Hassabis, Munos, Botvinick, "A Distributional Code for Value in Dopamine-Based Reinforcement Learning," Nature 577(7792):671-675 (2020)
  • Bellemare, Dabney, Munos, "A Distributional Perspective on Reinforcement Learning," ICML (2017)
  • Daw, Niv, Dayan, "Uncertainty-Based Competition Between Prefrontal and Dorsolateral Striatal Systems for Behavioral Control," Nature Neuroscience 8(12):1704-1711 (2005)
  • Daw, Gershman, Seymour, Dayan, Dolan, "Model-Based Influences on Humans' Choices and Striatal Prediction Errors," Neuron 69(6):1204-1215 (2011)


Last reviewed: April 27, 2026
