Reward Systems and Reinforcement Learning Neuroscience
Dopamine as TD error, distributional RL as a model of dopamine variability, actor-critic mapped to cortex-striatum, and the model-based versus model-free dichotomy in human choice.
Why This Matters
Reinforcement learning is one of the few places where machine learning and neuroscience genuinely converge on the same equations. The Schultz-Dayan-Montague paper (1997, Science 275(5306)) showed that midbrain dopamine neurons encode a temporal-difference error signal almost exactly matching the one that Sutton and Barto had proposed on engineering grounds. Two decades later, Dabney et al. (2020, Nature 577(7792)) showed that the heterogeneity of dopamine responses fits distributional RL better than scalar TD. The math the ML community uses to train agents keeps showing up in mouse and primate recordings.
For ML readers this matters in two directions. It is empirical evidence that TD-style algorithms are computationally natural enough that biology rediscovered them. It also gives an outside-view check on RL design choices: if a proposed architecture has no plausible biological substrate, that is not by itself disqualifying, but if a biologically observed mechanism (distributional codes, model-based arbitration) consistently outperforms scalar alternatives in animals, it is a hint about which inductive biases pay off.
Core Ideas
Dopamine as TD error. Schultz and colleagues recorded ventral tegmental area neurons in monkeys during classical conditioning. Before learning, dopamine cells fired on reward delivery. After learning, they fired on the predictive cue and went silent on expected reward; an omitted expected reward caused a phasic dip. This three-way pattern (positive on unexpected reward, zero on expected reward, negative on omitted reward) is the signature of a reward prediction error, exactly the δ term of TD learning.
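The migration of the error signal from reward to cue can be reproduced with a few lines of tabular TD(0). The setup below is a hypothetical five-step trial (cue at step 0, reward at the last step), with the inter-trial interval modeled as a state pinned at value 0 because cue timing is unpredictable; all parameters are illustrative.

```python
import numpy as np

# Hypothetical 5-step trial: cue at step 0, reward 1.0 at the last step.
T_STEPS, gamma, alpha = 5, 1.0, 0.1
V = np.zeros(T_STEPS + 1)                 # V[T_STEPS] is the post-reward state (0)
reward = np.zeros(T_STEPS)
reward[-1] = 1.0

def trial(V):
    """Run one trial, updating V in place; return TD errors per event."""
    deltas = np.zeros(T_STEPS + 1)
    deltas[0] = gamma * V[0]              # cue onset: transition from ITI (value 0)
    for t in range(T_STEPS):
        deltas[t + 1] = reward[t] + gamma * V[t + 1] - V[t]
        V[t] += alpha * deltas[t + 1]
    return deltas

before = trial(np.zeros(T_STEPS + 1))     # naive: error only at reward delivery
for _ in range(1000):                     # conditioning
    trial(V)
after = trial(V)
print(np.round(before, 2))                # error at reward, none at cue
print(np.round(after, 2))                 # error at cue, ~0 at reward
```

An omission trial (set `reward[-1] = 0.0` after training) produces the negative dip at the expected reward time.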
Distributional RL in the brain. Dabney et al. (2020) noted that dopamine cells differ in their reversal points: some respond positively to small rewards, others only to large ones. Interpreted as a population of value predictors with different optimism levels, the population encodes a distribution over returns rather than its mean. This is the same architecture as C51 and QR-DQN. The behavioral signature (asymmetric updating from positive versus negative surprises) shows up in both mice and the trained agents.
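The population-of-optimisms idea can be sketched with asymmetric, expectile-style value updates: each simulated "cell" scales positive surprises by τ and negative ones by 1 − τ, so its learned value (its reversal point) lands at a different point of the return distribution. The bimodal reward distribution, learning rate, and τ levels below are illustrative assumptions, not fitted values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
rewards = np.where(rng.random(20000) < 0.5, 1.0, 5.0)  # hypothetical bimodal rewards

taus = np.array([0.1, 0.3, 0.5, 0.7, 0.9])  # each "cell's" optimism level
V = np.zeros_like(taus)                     # one value predictor per cell
alpha = 0.02
for r in rewards:
    delta = r - V
    # expectile-style update: positive surprises scaled by tau, negative by 1 - tau
    V += alpha * np.where(delta > 0, taus * delta, (1 - taus) * delta)

print(np.round(V, 2))  # pessimistic cells settle near the low mode, optimistic near the high
```

The spread of reversal points across the population is what lets the set of scalar predictors jointly encode the whole return distribution rather than only its mean.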
Actor-critic mapping. A common functional map: dorsolateral striatum implements the actor (action selection), ventral striatum and orbitofrontal cortex implement the critic (state value), midbrain dopamine carries the TD error that updates both. The cortex-basal-ganglia-thalamus loop closes this in vivo. The mapping is a simplification: parallel loops handle different action sets and timescales, and many circuits do not fit cleanly. As a first-order picture it is useful.
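A minimal sketch of that functional mapping, as a one-state actor-critic bandit: the softmax actor stands in for the dorsolateral striatum, the scalar critic for the ventral striatum, and δ for the phasic dopamine signal that updates both. Reward means and learning rates are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
means = np.array([1.0, 2.0])     # hypothetical mean reward of each action
prefs = np.zeros(2)              # actor: action preferences -> softmax policy
V = 0.0                          # critic: value of the single state
lr_actor, lr_critic = 0.1, 0.1

for _ in range(3000):
    p = np.exp(prefs - prefs.max())
    p /= p.sum()
    a = rng.choice(2, p=p)
    r = means[a] + rng.normal(0.0, 0.5)
    delta = r - V                              # "dopamine": prediction error
    V += lr_critic * delta                     # critic update
    onehot = np.zeros(2)
    onehot[a] = 1.0
    prefs += lr_actor * delta * (onehot - p)   # actor update (policy-gradient step)

print(np.round(p, 2))            # policy has concentrated on the better action
```

Note the one-signal economy: the same δ that trains the critic also trains the actor, which is the point of the biological mapping.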
Model-based versus model-free in humans. Daw, Niv, and Dayan (2005, Nat. Neurosci. 8(12)) proposed that the brain runs both a model-based system (forward simulation through a learned transition model, slow but flexible) and a model-free system (cached values updated by TD, fast but inflexible), with arbitration weighted by relative uncertainty. The two-step task (Daw et al. 2011, Neuron 69) operationalized this and showed that human choices typically reflect a mixture, with the model-based weight rising under reflection and falling under stress, time pressure, or in some psychiatric conditions.
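At the level of choice values, the arbitration idea reduces to a weighted mixture of two estimates. A sketch with illustrative (not fitted) numbers for a two-step-style task:

```python
import numpy as np

P = np.array([[0.7, 0.3],         # P(second-stage state | first-stage action)
              [0.3, 0.7]])
Q_stage2 = np.array([0.8, 0.2])   # learned values of the two second-stage states
Q_mf = np.array([0.5, 0.4])       # cached (model-free) first-stage values

Q_mb = P @ Q_stage2               # model-based: simulate through the transition model
for w in (0.0, 0.5, 1.0):         # arbitration weight on the model-based system
    Q_net = w * Q_mb + (1 - w) * Q_mf
    print(w, np.round(Q_net, 2))
```

Fitting the weight w to trial-by-trial choices is how the two-step studies quantify how model-based a given subject is.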
Reward Prediction Error
The reward prediction error δ_t is the difference between received-plus-predicted future reward and the value expected before acting: δ_t = r_{t+1} + γ·V(s_{t+1}) − V(s_t). Positive error means the outcome was better than expected; negative error means it was worse.
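A worked numeric example, with made-up values:

```python
gamma = 0.9      # discount factor
r = 1.0          # reward actually received
V_next = 2.0     # predicted value of the next state
V_curr = 2.5     # value expected before acting
delta = r + gamma * V_next - V_curr
print(delta)     # positive: the outcome was better than expected
```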
Dopamine TD-Error Signature
Statement
In a learned conditioning task, a TD-error account predicts dopamine response to the cue, little response to the expected reward, and a dip when the expected reward is omitted.
Intuition
The signal moves from reward delivery to the earliest reliable predictor of reward. Once the reward is fully predicted, receiving it no longer changes the value estimate.
Failure Mode
The pattern is less clean when reward timing is uncertain, when motivational state changes the value scale, or when dopamine is measured too coarsely to isolate the phasic component.
Problem
If a cue predicts reward with high confidence and the reward arrives exactly as expected, what sign should δ have at reward time?
Common Confusions
Dopamine is not just pleasure
Older accounts treated dopamine as the brain's reward signal in a hedonic sense. The TD-error account is sharper: dopamine encodes the change in expected future reward, not reward itself. A fully predicted reward delivers no dopamine response even though it remains pleasant.
Model-based RL is not always better
In animals and humans, neither system dominates. Model-free control is computationally cheap at decision time and accurate once a stationary task has been learned; model-based control is more sample-efficient and wins when the environment changes or when an action's downstream consequences differ from the cached value. Arbitration based on uncertainty is the empirically supported pattern, and it carries over to ML hybrids like Dyna.
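A compact tabular Dyna-Q sketch on a hypothetical five-state chain shows the hybrid in miniature: direct model-free updates from experience, plus planning sweeps over a learned model of that experience. All constants are illustrative.

```python
import random
random.seed(0)

N_S, N_A = 5, 2                   # 5 states in a chain; actions 0 = left, 1 = right
def env_step(s, a):
    s2 = min(s + 1, N_S - 1) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == N_S - 1 else 0.0)   # reward at the terminal end

Q = [[0.0] * N_A for _ in range(N_S)]
model = {}                        # (s, a) -> (s2, r): learned deterministic model
alpha, gamma, eps, n_plan = 0.5, 0.95, 0.1, 10

for _ in range(50):
    s = 0
    while s != N_S - 1:
        if random.random() < eps:
            a = random.randrange(N_A)
        else:                     # greedy with random tie-breaking
            best = max(Q[s])
            a = random.choice([i for i in range(N_A) if Q[s][i] == best])
        s2, r = env_step(s, a)
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])  # model-free update
        model[(s, a)] = (s2, r)                                # model learning
        for _ in range(n_plan):                                # planning from the model
            ps, pa = random.choice(list(model))
            ps2, pr = model[(ps, pa)]
            Q[ps][pa] += alpha * (pr + gamma * max(Q[ps2]) - Q[ps][pa])
        s = s2

print([round(max(q), 2) for q in Q])   # values rise toward the rewarded end
```

Setting `n_plan = 0` recovers pure model-free Q-learning, which needs many more real steps to propagate value back along the chain.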
References
Canonical:
- Schultz, Dayan, Montague, "A Neural Substrate of Prediction and Reward," Science 275(5306):1593-1599 (1997)
- Sutton & Barto, Reinforcement Learning: An Introduction, 2nd ed., MIT Press (2018), Chapter 6
- Niv, "Reinforcement Learning in the Brain," Journal of Mathematical Psychology 53(3):139-154 (2009)
Current:
- Dabney, Kurth-Nelson, Uchida, Starkweather, Hassabis, Munos, Botvinick, "A Distributional Code for Value in Dopamine-Based Reinforcement Learning," Nature 577(7792):671-675 (2020)
- Bellemare, Dabney, Munos, "A Distributional Perspective on Reinforcement Learning," ICML (2017)
- Daw, Niv, Dayan, "Uncertainty-Based Competition Between Prefrontal and Dorsolateral Striatal Systems for Behavioral Control," Nature Neuroscience 8(12):1704-1711 (2005)
- Daw, Gershman, Seymour, Dayan, Dolan, "Model-Based Influences on Humans' Choices and Striatal Prediction Errors," Neuron 69(6):1204-1215 (2011)
Related Topics
Required prerequisites:
- Markov Decision Processes
- Policy Gradient Theorem
- Temporal Difference Learning
Derived topics:
- Actor-Critic Methods
- Model-Based Reinforcement Learning
- Predictive Coding and Autoencoders in the Brain
Last reviewed: April 27, 2026