
AI Safety

Catastrophic Forgetting

Fine-tuning a neural network on new data destroys knowledge of old data. The stability-plasticity dilemma, mitigation strategies (EWC, LwF, SI, MAS, replay, progressive networks), and implications for continual learning and LLM fine-tuning.

Advanced · Tier 2 · Current · Supporting · ~50 min

Why This Matters

[Interactive demo: sequential fine-tuning on four tasks (animals, colors, numbers, shapes). As each new task is trained, earlier tasks' learned items degrade into garbled strings; preserving them requires a mitigation (EWC, replay, capacity allocation).]

Catastrophic forgetting is one of the most fundamental limitations of neural networks. When you fine-tune a model on task B, it forgets task A. This is not a minor degradation. Performance on the old task can drop to chance level.

This matters practically because:

  • Fine-tuning LLMs on new instructions can destroy prior capabilities
  • Models cannot learn continuously from a stream of tasks without explicit mitigation
  • Safety alignment via fine-tuning can be undone by subsequent fine-tuning on other data
  • The problem limits how we deploy and update AI systems

Mental Model

Imagine learning to ride a bicycle, then learning to play piano. For humans, learning piano does not make you forget how to ride a bicycle. Neural networks are not like this. Training a network on piano data overwrites the weights that encoded bicycle riding.

The root cause: all knowledge in a neural network is stored in shared weights. Learning task B modifies the same weights that encoded task A via gradient descent. If the two tasks need different weight configurations, learning B destroys A.

Formal Setup and Notation

Consider a neural network with parameters $\theta$. Suppose we have two tasks, A and B, with data $\mathcal{D}_A$ and $\mathcal{D}_B$ and losses $\mathcal{L}_A(\theta)$ and $\mathcal{L}_B(\theta)$.

Sequential training: First train on $\mathcal{D}_A$ to get $\theta_A^*$. Then fine-tune on $\mathcal{D}_B$ starting from $\theta_A^*$ to get $\theta_{AB}$.

Catastrophic forgetting: $\mathcal{L}_A(\theta_{AB}) \gg \mathcal{L}_A(\theta_A^*)$. Performance on task A degrades severely after learning task B.
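This setup can be demonstrated end-to-end with a toy example. The sketch below (plain NumPy; the two linear regression tasks and all names are hypothetical, chosen so their optimal weights are incompatible) trains on task A, fine-tunes on task B, and measures how badly task A's loss degrades:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two toy regression tasks whose optimal weights are incompatible:
# task A's target is feature 0, task B's target is feature 1.
def make_task(w_true, n=200):
    X = rng.normal(size=(n, 2))
    return X, X @ w_true

XA, yA = make_task(np.array([1.0, 0.0]))   # task A
XB, yB = make_task(np.array([0.0, 1.0]))   # task B

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

def train(w, X, y, lr=0.1, steps=500):
    w = w.copy()
    for _ in range(steps):
        w -= lr * 2 * X.T @ (X @ w - y) / len(y)   # gradient descent on MSE
    return w

theta_A = train(np.zeros(2), XA, yA)       # theta_A^*: fits task A
theta_AB = train(theta_A, XB, yB)          # fine-tune on task B, no mitigation

print(mse(theta_A, XA, yA))    # near 0: task A learned
print(mse(theta_AB, XB, yB))   # near 0: task B learned
print(mse(theta_AB, XA, yA))   # large: task A forgotten
```

Because nothing anchors the weights, fine-tuning on B drives them all the way to B's optimum, and task A's loss ends up far above where it started.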

Core Definitions

Definition

Stability-Plasticity Dilemma

A learning system must balance two competing objectives:

  • Plasticity: The ability to learn new information from new data
  • Stability: The ability to retain previously learned knowledge

Maximum plasticity (unconstrained fine-tuning) causes catastrophic forgetting. Maximum stability (frozen weights) prevents learning anything new. Every continual learning method is a different point on this tradeoff.

Definition

Continual Learning (Lifelong Learning)

A learning setting where a model must learn from a sequence of tasks $\mathcal{T}_1, \mathcal{T}_2, \ldots, \mathcal{T}_K$ without access to data from previous tasks when learning the current task. The model must perform well on all tasks after training on the full sequence.

Definition

Task-Incremental vs. Class-Incremental

  • Task-incremental: The model knows which task it is being evaluated on (e.g., receives a task ID). Easier because the model can route to task-specific components.
  • Class-incremental: The model must distinguish between all classes from all tasks without knowing the task ID. Much harder because old classes must compete with new classes in a shared output space.

Why Forgetting Happens: The Weight-Level View

Consider the loss landscape. After training on task A, the parameters $\theta_A^*$ sit in a region of low $\mathcal{L}_A$. When we optimize $\mathcal{L}_B$ starting from $\theta_A^*$, the gradient of $\mathcal{L}_B$ pushes $\theta$ away from $\theta_A^*$ toward regions where $\mathcal{L}_B$ is low. In general, these regions have high $\mathcal{L}_A$.

The severity of forgetting depends on:

  1. Task similarity: If tasks A and B are similar, the low-loss regions overlap and forgetting is mild. If they are dissimilar, the regions are far apart and forgetting is catastrophic.
  2. Network capacity: Larger networks can potentially represent both tasks in different subnetworks, reducing interference.
  3. Training duration: The longer you train on B, the further $\theta$ moves from $\theta_A^*$, and the worse forgetting becomes.

Mitigation Strategies

Regularization-Based: Elastic Weight Consolidation (EWC)

Proposition

Elastic Weight Consolidation

Statement

After learning task A with parameters $\theta_A^*$, the EWC objective for learning task B is:

$$\mathcal{L}_{\text{EWC}}(\theta) = \mathcal{L}_B(\theta) + \frac{\lambda}{2} \sum_i F_i (\theta_i - \theta_{A,i}^*)^2$$

where $F_i$ is the diagonal of the empirical Fisher evaluated at $\theta_A^*$. Labels are sampled from the model's own predictive distribution, not from the data:

$$F = \mathbb{E}_{x \sim \mathcal{D}_A}\, \mathbb{E}_{y \sim p_{\theta_A^*}(\cdot \mid x)}\!\left[\nabla_\theta \log p_{\theta_A^*}(y \mid x)\, \nabla_\theta \log p_{\theta_A^*}(y \mid x)^\top\right]$$

EWC uses only the diagonal of this matrix. Kirkpatrick et al. 2017 use this model-sampled form; Kunstner, Balles, and Hennig 2019 show that substituting data labels (the "true Fisher" with empirical expectation over data labels, often also called empirical Fisher in the ML literature) gives a different matrix which is not a Gauss-Newton approximation except near convergence.
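A minimal sketch of this objective, assuming a toy linear softmax classifier and plain NumPy (all names are hypothetical). It estimates the diagonal Fisher with model-sampled labels, as in the statement above, then forms the EWC loss for task B:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

# Toy linear softmax classifier: logits = X @ W. W_A stands in for theta_A^*.
d, k, n = 3, 2, 500
W_A = rng.normal(size=(d, k))
X_A = rng.normal(size=(n, d))

# Diagonal Fisher at theta_A^*, with labels SAMPLED from the model's own
# predictive distribution (the model-sampled form, not data labels).
def diag_fisher(W, X, rng):
    F = np.zeros_like(W)
    for x, p in zip(X, softmax(X @ W)):
        y = rng.choice(k, p=p)                    # y ~ p_theta(. | x)
        g = np.outer(x, (np.arange(k) == y) - p)  # grad of log p_theta(y|x) wrt W
        F += g ** 2                               # keep only squared entries: diagonal
    return F / len(X)

F = diag_fisher(W_A, X_A, rng)

# EWC objective for task B: cross-entropy plus the quadratic anchor.
def ewc_loss(W, X_B, y_B, lam=100.0):
    P = softmax(X_B @ W)
    ce = -np.mean(np.log(P[np.arange(len(y_B)), y_B]))
    return ce + 0.5 * lam * np.sum(F * (W - W_A) ** 2)

X_B = rng.normal(size=(50, d))
y_B = rng.integers(0, k, size=50)
# At theta_A^* the penalty term vanishes exactly:
print(ewc_loss(W_A, X_B, y_B) == ewc_loss(W_A, X_B, y_B, lam=0.0))  # True
```

Note the per-example loop is for clarity only; in practice the Fisher pass is batched, and for large models it is the dominant extra cost of EWC.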

Intuition

EWC penalizes moving away from $\theta_A^*$, but not equally for all parameters. Parameters with large $F_i$ strongly determine task A predictions; perturbing them changes the model's output distribution and damages task A loss. Parameters with small $F_i$ are redundant for task A and can be repurposed for task B.

The Fisher information measures output-distribution curvature in parameter space. EWC uses its diagonal as a per-parameter importance weight.

Why It Matters

EWC can be read as a Laplace approximation to the posterior over $\theta$ after task A, with a diagonal Fisher as the inverse covariance. The penalty $\tfrac{\lambda}{2}\sum_i F_i(\theta_i - \theta_{A,i}^*)^2$ is the log-prior implied by that approximation when used as a prior for task B. The equivalence requires three aggressive approximations: quadratic expansion of the log-likelihood around $\theta_A^*$, a diagonal covariance, and empirical Fisher $\approx$ Hessian at the optimum. All three fail away from convergence or under strong parameter correlations, so "Bayesian EWC" is better stated as "Laplace plus diagonal approximation" rather than an exact posterior update.

Failure Mode

The diagonal assumption ignores parameter correlations; for large networks this is severe. The empirical Fisher equals the Gauss-Newton matrix only at a well-fit optimum, so $F$ is a poor curvature proxy early in training or under distribution shift (Kunstner, Balles, and Hennig 2019). The naive sum over tasks also accumulates: after $K$ tasks the penalty has $K$ quadratic terms and every parameter becomes constrained. This is "ossification". Online EWC (Schwarz et al. 2018) addresses this by keeping a single running Fisher estimate with exponential decay rather than one term per task.

Other Regularization Methods

Several methods use the same quadratic-penalty template as EWC but replace the Fisher with a different per-parameter importance score $\Omega_i$:

$$\mathcal{L}(\theta) = \mathcal{L}_B(\theta) + \frac{\lambda}{2} \sum_i \Omega_i (\theta_i - \theta_{A,i}^*)^2$$

  • Synaptic Intelligence (SI) (Zenke, Poole, Ganguli 2017, ICML). $\Omega_i$ is a path integral: the total contribution of parameter $i$ to the drop in training loss along the optimization trajectory, normalized by the squared distance traveled. Computed online during training; no post-hoc Fisher pass is needed.
  • Memory Aware Synapses (MAS) (Aljundi et al. 2018, ECCV). $\Omega_i$ is the expected gradient magnitude of the squared $\ell_2$ norm of the network's output with respect to $\theta_i$. Unsupervised: importance can be estimated without labels.
  • Online EWC (Schwarz et al. 2018, "Progress & Compress"). Replaces the sum of per-task Fisher terms with a single running Fisher updated by exponential moving average, avoiding the ossification problem.
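The shared template itself is a few lines; only the importance vector differs between methods. A NumPy sketch (the parameter values and importance scores here are hypothetical, standing in for whatever $\Omega$ a given method computes):

```python
import numpy as np

# Shared template for EWC / SI / MAS (sketch): task-B loss plus a quadratic
# anchor to theta_A^*, weighted by a per-parameter importance vector omega.
# Only how omega is computed differs between the methods.
def regularized_loss(loss_B, theta, theta_A, omega, lam=1.0):
    penalty = 0.5 * lam * np.sum(omega * (theta - theta_A) ** 2)
    return loss_B + penalty

theta_A = np.array([1.0, -2.0, 0.5])     # anchor: parameters after task A
theta = np.array([1.5, -2.0, 0.0])       # current parameters during task B
omega = np.array([10.0, 10.0, 0.01])     # hypothetical importances for task A

# Moving an important parameter (index 0) is penalized heavily;
# moving an unimportant one (index 2) is nearly free.
print(regularized_loss(0.0, theta, theta_A, omega))
```

The same movement of 0.5 in parameter space costs 1000x more at index 0 than at index 2, which is exactly the anisotropy that distinguishes these methods from a uniform penalty.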

Distillation-Based: Learning without Forgetting (LwF)

LwF (Li and Hoiem 2017, ECCV) keeps no replay buffer and no importance weights. Before training on task B, record the task A model's outputs $p_{\theta_A^*}(y \mid x_B)$ on the new task's inputs $x_B$. Then train on task B with an added distillation loss that keeps the new model's predictions on $x_B$ close to those stored outputs (typically a KL divergence with a temperature). The old model's behavior acts as a soft label; no task A data is stored.
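A minimal sketch of the distillation term, assuming temperature-softened softmax outputs and plain NumPy (names and the temperature value are illustrative, not LwF's exact hyperparameters):

```python
import numpy as np

rng = np.random.default_rng(4)

def softmax(Z, T=1.0):
    Z = Z / T                                # temperature softening
    Z = Z - Z.max(axis=-1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=-1, keepdims=True)

# LwF-style distillation term (sketch): KL divergence between the old
# model's recorded outputs on task B inputs and the new model's outputs
# on the same inputs, both softened by temperature T.
def lwf_distill(old_logits, new_logits, T=2.0):
    p_old = softmax(old_logits, T)           # recorded before task B training
    p_new = softmax(new_logits, T)
    return float(np.mean(np.sum(p_old * (np.log(p_old) - np.log(p_new)), axis=-1)))

old = rng.normal(size=(8, 5))                   # stored old-model logits on x_B
new = old + 0.5 * rng.normal(size=(8, 5))       # current model has drifted

print(lwf_distill(old, old))       # 0.0: no drift, no penalty
print(lwf_distill(old, new) > 0)   # True: drift is penalized
```

During task B training this term is added to the ordinary task B loss, so the optimizer trades off new-task accuracy against drift from the old model's behavior.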

Replay-Based: Experience Replay

Store a small buffer of examples from previous tasks. When learning task B, mix examples from the buffer with task B data. This directly prevents forgetting by keeping old data in the training stream.
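A toy NumPy sketch of this idea (hypothetical setup; the two tasks are deliberately incompatible for a single linear model, so replay yields a compromise rather than perfect retention):

```python
import numpy as np

rng = np.random.default_rng(2)

# Two incompatible linear regression tasks (toy setup): task A's target
# is feature 0, task B's is feature 1, under one shared weight vector.
XA = rng.normal(size=(300, 2)); yA = XA[:, 0]
XB = rng.normal(size=(300, 2)); yB = XB[:, 1]

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

def train(w, X, y, lr=0.1, steps=600):
    w = w.copy()
    for _ in range(steps):
        w -= lr * 2 * X.T @ (X @ w - y) / len(y)   # gradient descent on MSE
    return w

theta_A = train(np.zeros(2), XA, yA)

# Plain sequential fine-tuning on task B: task A is forgotten.
theta_seq = train(theta_A, XB, yB)

# Experience replay: keep a small buffer of task A examples and replay it
# (here oversampled to roughly match task B's size) alongside task B data.
buf = 32
X_mix = np.vstack([XB, np.tile(XA[:buf], (9, 1))])
y_mix = np.concatenate([yB, np.tile(yA[:buf], 9)])
theta_replay = train(theta_A, X_mix, y_mix)

print(mse(theta_seq, XA, yA))     # large: task A forgotten
print(mse(theta_replay, XA, yA))  # much smaller: task A partially retained
```

The replay rate matters: with only 32 buffered examples mixed once into 300 task B examples, the gradient is dominated by task B; oversampling the buffer rebalances the two losses.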

Variants:

  • Exact replay: Store and replay actual examples from old tasks. Simple but requires storage and may raise privacy concerns.
  • Generative replay: Train a generative model to produce synthetic examples from old tasks. No storage needed, but the generative model itself can suffer from forgetting.
  • Gradient Episodic Memory (GEM) (Lopez-Paz and Ranzato 2017). Use stored examples to constrain gradients: project the task B gradient so that it does not increase loss on any stored example from task A.
  • A-GEM (Averaged GEM) (Chaudhry et al. 2019, ICLR). Relaxes GEM's per-task constraint to a single averaged constraint over the episodic memory. One quadratic program per step becomes one dot-product check, making the method practical at scale.
  • DER++ (Dark Experience Replay) (Buzzega et al. 2020, NeurIPS). Stores past inputs together with the network's logits at the time they were seen, then combines a replay cross-entropy term with a distillation term on the stored logits. Unifies replay and LwF-style distillation in a single objective.
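Of these, the A-GEM projection step is small enough to sketch directly (NumPy; the gradient vectors are hypothetical stand-ins for the flattened task-B gradient and the averaged memory gradient):

```python
import numpy as np

# A-GEM-style gradient projection (sketch): if the task-B gradient g would
# increase loss on the episodic memory (negative dot product with the
# memory gradient g_ref), project g onto the constraint boundary; the
# full QP of per-task GEM reduces to this single dot-product check.
def agem_project(g, g_ref):
    dot = g @ g_ref
    if dot >= 0:
        return g                              # no conflict: use g unchanged
    return g - (dot / (g_ref @ g_ref)) * g_ref  # remove conflicting component

g = np.array([1.0, -1.0])        # task-B gradient (would hurt the memory)
g_ref = np.array([0.0, 1.0])     # averaged gradient on the memory buffer
g_proj = agem_project(g, g_ref)

print(g_proj)               # [1. 0.]: conflicting component removed
print(g_proj @ g_ref >= 0)  # True: projected update cannot hurt the memory
```

The projected gradient is then used in place of `g` for the optimizer step, so each update is guaranteed (to first order) not to increase the averaged memory loss.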

Architecture-Based: Progressive Neural Networks

Allocate a separate set of parameters for each task. Old parameters are frozen; new parameters are added with lateral connections to old ones.

  • Progressive networks (Rusu et al., 2016): Add a new column of layers for each task, with lateral connections from all previous columns. Forgetting is impossible (old weights are frozen), but the model grows linearly with the number of tasks.
  • PackNet: Prune the network after each task, freeing up capacity for new tasks. Avoids the growth problem but requires a pruning strategy.
  • Supermasks / lottery tickets: Find different binary masks over a shared set of weights for different tasks. Each task uses a different subnet.

Connection to LLM Fine-Tuning

Catastrophic forgetting has direct implications for LLM deployment:

Instruction tuning destroys base capabilities. When you fine-tune a base LLM on instruction-following data, the model can lose some of its original capabilities (e.g., factual knowledge, reasoning). This is why careful data mixing and evaluation on diverse benchmarks are essential during fine-tuning.

Safety alignment can be undone. If a model is aligned via RLHF or DPO to refuse harmful requests, subsequent fine-tuning on non-safety data can undo this alignment. Qi et al. 2024 ("Safety Alignment Should Be Made More Than Just a Few Tokens Deep", arXiv:2406.05946) show that current alignment is largely "shallow", concentrated in the first few response tokens, so even benign fine-tuning can erase refusal behavior by perturbing those tokens. Betley et al. 2025 ("Emergent Misalignment", arXiv:2502.17424) show that narrow fine-tuning on a specific misbehavior (e.g., insecure code) induces broader misaligned behavior across unrelated prompts, a catastrophic-forgetting-style failure of alignment. This motivates research on deeper and more robust alignment that survives fine-tuning.

LoRA and parameter-efficient fine-tuning (PEFT). Methods like LoRA (Low-Rank Adaptation) mitigate forgetting by only modifying a small number of parameters (low-rank updates to attention matrices). The base model weights are frozen, so forgetting of base capabilities is reduced. This is a practical form of the architecture-based approach.
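The core of the low-rank idea fits in a few lines. A NumPy sketch (dimensions, initialization, and names are illustrative, not the actual LoRA implementation, which applies this per attention matrix with a scaling factor):

```python
import numpy as np

rng = np.random.default_rng(3)

# LoRA-style update (sketch): the base weight W0 is frozen; fine-tuning
# trains only a low-rank correction B @ A with rank r << min(d_out, d_in).
d_out, d_in, r = 64, 64, 4
W0 = rng.normal(size=(d_out, d_in))       # frozen base weights
A = rng.normal(size=(r, d_in)) * 0.01     # trainable, rank r
B = np.zeros((d_out, r))                  # trainable; zero init => no change at start

def forward(x):
    return (W0 + B @ A) @ x               # base behavior + low-rank adaptation

x = rng.normal(size=d_in)
# At initialization the adapted model matches the base model exactly,
# and the base weights are never touched during fine-tuning.
print(np.allclose(forward(x), W0 @ x))            # True
print(r * (d_in + d_out), "vs", d_in * d_out)     # trainable vs full params
```

Because only `A` and `B` receive gradients, the base model's knowledge stored in `W0` cannot be overwritten; removing the adapter recovers the original model exactly.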

Common Confusions

Watch Out

Forgetting is not the same as overfitting

Overfitting means the model memorizes training data and fails to generalize to test data from the same distribution. Forgetting means the model loses performance on a different task it previously learned. A model can overfit to task B while simultaneously forgetting task A, but these are distinct phenomena with different causes and remedies.

Watch Out

Larger models do not automatically solve forgetting

While larger models have more capacity and can potentially represent multiple tasks, standard fine-tuning still causes forgetting in large models. The issue is not lack of capacity but the optimization process: gradients from the new task push shared weights away from the old optimum. LLMs with hundreds of billions of parameters still suffer from catastrophic forgetting during fine-tuning.

Watch Out

Multitask learning is not the same as continual learning

In multitask learning, you have simultaneous access to data from all tasks and train on them together. This avoids forgetting entirely but requires all data upfront. Continual learning requires learning tasks sequentially without revisiting old data. The distinction is about data availability, not model architecture.

Summary

  • Catastrophic forgetting: fine-tuning on new data destroys old knowledge.
  • Root cause: shared weights store all knowledge; updating for task B overwrites task A.
  • Stability-plasticity dilemma: you cannot maximize both.
  • Regularization: EWC (diagonal empirical Fisher, Laplace), Online EWC, SI (path integral), MAS (output-gradient norm), LwF (distillation).
  • Replay: experience replay, GEM, A-GEM, DER++ (replay plus distillation).
  • Architecture: progressive networks, PackNet, supermasks.
  • LoRA/PEFT: modify few parameters, preserve base model.
  • Safety implication: alignment is often shallow and fine-tuning can undo it (Qi et al. 2024; Betley et al. 2025).

Exercises

ExerciseCore

Problem

Explain why EWC uses the Fisher information matrix rather than a uniform penalty on all parameter changes. What would happen if you used a uniform penalty $\frac{\lambda}{2}\|\theta - \theta_A^*\|^2$ instead?

ExerciseAdvanced

Problem

After learning tasks A, B, and C sequentially with EWC, the penalty for task D has three terms (one for each previous task). Explain the "ossification" problem and propose a modification to address it.

ExerciseResearch

Problem

If an adversary can fine-tune a safety-aligned LLM on a small dataset of harmful examples, the alignment may be removed. Propose a defense strategy that makes the alignment robust to fine-tuning, and discuss its limitations.

References

Canonical:

  • McCloskey and Cohen, "Catastrophic Interference in Connectionist Networks" (1989), Psychology of Learning and Motivation 24, pp. 109-165.
  • French, "Catastrophic Forgetting in Connectionist Networks" (1999), Trends in Cognitive Sciences 3(4).
  • Kirkpatrick et al., "Overcoming Catastrophic Forgetting in Neural Networks" (EWC, PNAS 2017), Secs. 2-3.

Regularization and distillation methods:

  • Zenke, Poole, Ganguli, "Continual Learning Through Synaptic Intelligence" (ICML 2017).
  • Li and Hoiem, "Learning without Forgetting" (ECCV 2017; TPAMI 2018).
  • Aljundi et al., "Memory Aware Synapses: Learning What (Not) to Forget" (ECCV 2018).
  • Schwarz et al., "Progress & Compress: A Scalable Framework for Continual Learning" (ICML 2018), for Online EWC.
  • Kunstner, Balles, Hennig, "Limitations of the Empirical Fisher Approximation for Natural Gradient Descent" (NeurIPS 2019).

Replay and memory methods:

  • Lopez-Paz and Ranzato, "Gradient Episodic Memory for Continual Learning" (NeurIPS 2017).
  • Chaudhry et al., "Efficient Lifelong Learning with A-GEM" (ICLR 2019).
  • Buzzega et al., "Dark Experience for General Continual Learning" (DER++, NeurIPS 2020).

Architecture methods:

  • Rusu et al., "Progressive Neural Networks" (arXiv:1606.04671, 2016).
  • Hu et al., "LoRA: Low-Rank Adaptation of Large Language Models" (ICLR 2022).

Surveys:

  • De Lange et al., "A Continual Learning Survey: Defying Forgetting in Classification Tasks" (TPAMI 2021).
  • Parisi et al., "Continual Lifelong Learning with Neural Networks: A Review" (Neural Networks 113, 2019).

LLM alignment and fine-tuning fragility:

  • Qi et al., "Safety Alignment Should Be Made More Than Just a Few Tokens Deep" (arXiv:2406.05946, 2024).
  • Betley et al., "Emergent Misalignment: Narrow Fine-tuning Can Produce Broadly Misaligned LLMs" (arXiv:2502.17424, 2025).

Next Topics

The natural next steps from catastrophic forgetting:

  • Connections to Bayesian inference and posterior approximation
  • Implications for LLM alignment and safety

Last reviewed: April 18, 2026
