AI Safety
Constitutional AI
Anthropic's approach to alignment: replace human harmlessness labels with a constitution of principles that the model uses to self-evaluate, enabling scalable AI feedback.
Why This Matters
Reinforcement learning from human feedback (RLHF) requires human labelers to judge which model outputs are harmful and which are helpful. This creates a bottleneck: humans are expensive, slow, and inconsistent. Worse, asking humans to evaluate harmful content can be psychologically damaging.
Constitutional AI (CAI) offers an alternative: write down principles (a "constitution") and have the AI evaluate itself against those principles. This replaces human judgment on harmlessness with AI judgment guided by explicit rules.
Mental Model
Think of CAI as giving the model a rulebook instead of a human supervisor. In RLHF, a human looks at model outputs and says "this one is better." In CAI, the model itself reads its own outputs, consults the constitution, and says "this one violates principle 7, so I should revise it."
The constitution makes the criteria explicit, auditable, and modifiable. You can read the rules, debate them, and update them, unlike the implicit preferences of a pool of human labelers.
The Two Phases of CAI
Phase 1: Supervised Learning from AI Feedback (SL-CAI)
Constitutional Self-Critique
In the first phase, the model generates responses to potentially harmful prompts. It then critiques its own responses by referencing specific constitutional principles. Finally, it revises the response to address the critique. The revised responses form a supervised training dataset.
The process for each training example:
- Generate: the model produces a response to a harmful or ambiguous prompt
- Critique: the model evaluates its own response against a constitutional principle (e.g., "choose the response that is least likely to be harmful")
- Revise: the model rewrites the response to address the critique
- Repeat: multiple rounds of critique-and-revise can be applied
The final revised responses are used as supervised fine-tuning data. This replaces the human red-teaming and labeling step of RLHF for harmlessness.
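The generate-critique-revise loop can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: `critique_and_revise` and the prompt templates are invented names and wording, and `model` stands in for any callable that maps a prompt string to generated text. A toy deterministic model makes the sketch runnable without an LLM.

```python
# Sketch of the SL-CAI critique-and-revise loop (illustrative, not the paper's code).
import random

def critique_and_revise(model, prompt, principles, rounds=2, seed=0):
    """Generate a response, then repeatedly critique and revise it
    against randomly sampled constitutional principles."""
    rng = random.Random(seed)
    response = model(prompt)
    for _ in range(rounds):
        principle = rng.choice(principles)  # CAI samples one principle per round
        critique = model(
            f"Critique this response against the principle: {principle}\n"
            f"Response: {response}"
        )
        response = model(
            f"Revise the response to address the critique.\n"
            f"Critique: {critique}\nResponse: {response}"
        )
    return response  # the final revision becomes a supervised fine-tuning target

# Toy stand-in model so the loop runs without an LLM (assumption for illustration):
def toy_model(text):
    if text.startswith("Critique"):
        return "too blunt"
    if text.startswith("Revise"):
        return "revised: " + text.split("Response: ")[-1]
    return "initial answer"
```

In a real pipeline `model` would be an LLM API call, and the critique/revision prompts would include few-shot examples as in the paper.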
Phase 2: RLAIF (RL from AI Feedback)
RLAIF
Reinforcement Learning from AI Feedback (RLAIF) replaces human preference labels with AI-generated preference labels. A separate (or the same) model is given pairs of responses and asked which one better satisfies the constitutional principles. These AI preferences train a reward model, which is then used for RL fine-tuning (typically PPO), just as in standard RLHF. The underlying reward model training uses cross-entropy loss on the preference pairs.
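The pairwise labeling step can be sketched as follows, assuming a `judge` callable that maps a prompt string to text. The prompt template, the A/B parsing, and the toy judge are illustrative assumptions, not the paper's actual setup:

```python
# Sketch of RLAIF preference labeling: an AI judge picks the response that
# better satisfies a constitutional principle (template is an assumption).
def label_preference(judge, prompt, response_a, response_b, principle):
    verdict = judge(
        f"Principle: {principle}\n"
        f"Prompt: {prompt}\n"
        f"(A) {response_a}\n(B) {response_b}\n"
        "Which response better satisfies the principle? Answer A or B."
    )
    choice = verdict.strip().upper()[:1]  # crude parse of the judge's answer
    if choice == "A":
        return (response_a, response_b)  # (winner, loser) pair for the reward model
    return (response_b, response_a)

# Toy judge that always answers "B", just to exercise the plumbing:
toy_judge = lambda _prompt: "B"
```

The accumulated (winner, loser) pairs are then used exactly like human preference labels in reward-model training.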
The Bradley-Terry preference model used for the reward step is identical to RLHF: given a prompt $x$ and a winner/loser pair $(y_w, y_l)$,

$$P(y_w \succ y_l \mid x) = \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big),$$

and the reward model is trained by maximum likelihood on the AI-labeled preferences:

$$\mathcal{L}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\Big[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\Big].$$

The only thing that changes from RLHF to RLAIF is the source of the pairs $(y_w, y_l)$: human raters in RLHF, model judgments conditioned on a constitution in RLAIF. The PPO loop and KL-regularized policy update are unchanged.
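The per-pair reward-model loss is simple enough to check numerically. Here `bt_loss` is a hypothetical helper computing the Bradley-Terry negative log-likelihood, $-\log \sigma(r_w - r_l)$, from two scalar reward-model scores:

```python
# Numeric check of the Bradley-Terry reward-model loss, -log sigmoid(r_w - r_l).
# The same loss RLHF uses; only the label source (AI vs. human) differs.
import math

def bt_loss(r_winner, r_loser):
    """Negative log-likelihood that the winner beats the loser
    under the Bradley-Terry model."""
    margin = r_winner - r_loser
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

A reward model that already separates the pair (large positive margin) pays little loss; one that ranks the pair backwards pays a lot, pushing its scores toward the AI-labeled ordering.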
The key insight: for harmlessness, AI feedback guided by a constitution can match or exceed human feedback quality while being cheaper, faster, and more consistent.
Helpfulness training can still use human feedback. CAI specifically targets the harmlessness component of alignment.
The Constitution
Constitution
A constitution is a set of written principles that define desired model behavior. Each principle addresses a specific aspect of harmlessness or ethics. Principles are written in natural language and are designed to be interpretable by both humans and the model.
Example principles from the original CAI paper:
- Choose the response that is most supportive and encouraging of life
- Choose the response that is least racist, sexist, or socially biased
- Choose the response that is least likely to be used for illegal or harmful activities
- Choose the response that sounds most similar to what a wise, ethical person would say
The constitution can draw from multiple sources: the UN Declaration of Human Rights, corporate values, specific safety guidelines, or domain-specific rules.
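Operationally, a constitution is just data that the critique and labeling steps sample from. In this sketch, `CONSTITUTION` and `sample_principle` are hypothetical names, with principles paraphrased from the examples above:

```python
# A constitution as data: natural-language principles that the critique
# step samples from, one per round (names and wording are illustrative).
import random

CONSTITUTION = [
    "Choose the response that is most supportive and encouraging of life.",
    "Choose the response that is least racist, sexist, or socially biased.",
    "Choose the response least likely to aid illegal or harmful activities.",
    "Choose the response closest to what a wise, ethical person would say.",
]

def sample_principle(rng=None):
    """Sample one principle for a critique round, as the CAI paper does."""
    return (rng or random).choice(CONSTITUTION)
```

Because the constitution is plain text, updating model behavior means editing this list and rerunning the pipeline, rather than re-briefing a pool of human labelers.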
Why This Scales
The fundamental scaling advantage of CAI:
- No human labelers for harmlessness: eliminates the bottleneck of hiring, training, and managing human evaluators
- Consistency: the constitution is applied uniformly, unlike human labelers who may disagree or be inconsistent
- Speed: AI evaluation is orders of magnitude faster than human evaluation
- No harm to workers: evaluating harmful content can cause psychological distress to human labelers; AI evaluation carries no such cost
- Iterability: changing the constitution and retraining is faster than retraining human labeler pools
Limitations
Constitution quality is the critical bottleneck
The quality of the alignment is bounded by the quality of the constitution. Vague principles produce vague alignment. Principles that conflict produce inconsistent behavior. Missing principles create blind spots. The hard problem shifts from "label this output" to "write good rules", which is a different kind of hard, but still hard.
AI self-evaluation has its own failure modes
The model evaluating its own outputs can have systematic blind spots. If the model does not understand a particular harm (e.g., subtle cultural insensitivity), it will not catch it in the critique phase. Self-evaluation is bounded by the model's own capabilities and biases.
The constitution can encode the authors' blind spots
Whatever biases, cultural assumptions, or gaps exist in the constitution's authors will be reflected in the model's behavior. The constitution is more transparent than implicit human preferences (you can read and debate it), but transparency does not guarantee correctness.
Other Limitations
- Distribution shift: principles designed for current model capabilities may not cover behaviors that emerge at larger scales
- Gaming: a sufficiently capable model might learn to satisfy the letter of the constitution while violating its spirit
- Evaluation difficulty: measuring whether CAI actually produces safer models requires careful benchmarking that is itself imperfect
CAI vs RLHF
| Aspect | RLHF | CAI |
|---|---|---|
| Harmlessness labels | Human | AI (guided by constitution) |
| Helpfulness labels | Human | Human (typically) |
| Scalability | Limited by human labelers | Scales with compute |
| Consistency | Variable across labelers | Uniform (given same constitution) |
| Auditability | Implicit in labeler choices | Explicit in written principles |
| Failure mode | Labeler disagreement, bias | Constitution gaps, AI blind spots |
The Helpfulness-Harmlessness Tradeoff
Bai et al. (2022a, "Training a Helpful and Harmless Assistant") showed that naively optimizing a single reward model trained on both helpfulness and harmlessness preferences produces a Pareto frontier: models become more harmless at the cost of helpfulness, often degenerating into evasive responses ("I cannot help with that") on benign prompts.
CAI's empirical claim (Bai et al. 2022b, Figure 2) is that splitting the reward into a helpfulness reward trained on human preferences and a harmlessness reward trained on AI / constitutional preferences shifts the frontier: the resulting model is comparable in harmlessness to the strongest RLHF harmlessness model while being more helpful at the same harmlessness level.
The mechanism is that AI feedback is more permissive on borderline-but-benign prompts than human harmlessness raters, who are incentivized to play it safe by rewarding refusals more often.
Two related effects observed in subsequent work:
- Sycophancy (Sharma et al. 2024). RLHF-style preference learning rewards responses that agree with the user, even when the user is wrong. Constitutional feedback inherits this if the constitution does not explicitly penalize sycophantic agreement.
- Specification gaming on the constitution itself. A capable model can satisfy the literal constitution while violating its spirit -- e.g., refusing on the surface while still leaking the harmful content via a stylized format.
Related Approaches
CAI sits in a family of scalable-oversight methods that try to amplify human feedback:
- Sparrow (Glaese et al. 2022, DeepMind). Trains a dialogue agent against a set of explicit rule-classifiers (one per rule) plus a preference model; conceptually parallel to CAI but uses dedicated rule classifiers rather than an LLM judge prompted with a constitution.
- Debate (Irving, Christiano, Amodei 2018). Two AI debaters argue opposing sides of a question; a human judges the resulting transcript. The hope is that adversarial dynamics make subtle deception easier to detect.
- Recursive Reward Modeling (Leike et al. 2018). Train a sequence of reward models on increasingly complex tasks, where each step uses the previous reward model as part of the supervision signal.
- Process Reward Models (Lightman et al. 2023, "Let's Verify Step by Step", arXiv:2305.20050). Reward intermediate reasoning steps rather than only final answers; a finer-grained signal that complements outcome-based RLAIF.
- Constitutional Classifiers (Sharma et al., Anthropic 2025, arXiv:2501.18837). Trains lightweight constitution-aware classifiers as inference-time guards on top of a constitutionally trained model, narrowing the jailbreak surface that survives training.
Connection to Broader Alignment
CAI is not a complete solution to alignment. It addresses the specific problem of training models to be harmless using scalable feedback. Open questions include:
- How to write constitutions that remain adequate as models become more capable (closely related to ethics and fairness concerns)
- Whether self-evaluation remains reliable at superhuman capability levels
- How to handle genuine value disagreements (the constitution must choose)
- Integration with other alignment approaches (debate, interpretability, etc.)
- Robustness to adversarial training-time backdoors. Hubinger et al. (2024, "Sleeper Agents") show that deceptive behavior planted at pretraining or fine-tuning time can survive standard RLHF/CAI safety training, including constitutional self-critique. CAI does not, on its own, defeat training-time supply-chain attacks.
Summary
- CAI replaces human harmlessness labels with AI self-evaluation guided by a written constitution
- Two phases: supervised critique-and-revise, then RLAIF with AI preferences
- Scales better than RLHF because it removes the human labeler bottleneck
- The constitution is explicit, auditable, and modifiable
- Quality is bounded by the constitution: blind spots in the rules become blind spots in the model
- CAI typically handles harmlessness; helpfulness may still use human feedback
- Not a complete alignment solution, but a scalable component
Exercises
Problem
Explain the difference between RLHF and RLAIF in one sentence each. What does CAI replace and what does it keep from the standard RLHF pipeline?
Problem
Design a three-principle constitution for a coding assistant. For each principle, describe a concrete scenario where it would cause the model to revise an initial response. Then identify a blind spot your constitution does not cover.
Problem
A critic argues: "CAI just replaces human bias with constitutional bias; the constitution's authors impose their values on the model." Evaluate this critique. In what ways is it valid? In what ways does CAI improve on RLHF despite this limitation?
References
Canonical:
- Bai et al., "Constitutional AI: Harmlessness from AI Feedback" (Anthropic, arXiv:2212.08073, 2022) -- the canonical CAI paper. §2 covers SL-CAI, §3 covers RLAIF, §4 reports the helpfulness-harmlessness Pareto frontier.
- Bai et al., "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback" (Anthropic, arXiv:2204.05862, 2022) -- the RLHF baseline that CAI is measured against. Defines the helpfulness and harmlessness reward decomposition.
- Christiano et al., "Deep Reinforcement Learning from Human Preferences" (NeurIPS 2017) -- the Bradley-Terry preference-modeling foundation that both RLHF and RLAIF inherit.
Related approaches:
- Glaese et al., "Improving alignment of dialogue agents via targeted human judgements (Sparrow)" (DeepMind, arXiv:2209.14375, 2022) -- rule-based scalable feedback in the same generation as CAI.
- Irving, Christiano, Amodei, "AI Safety via Debate" (arXiv:1805.00899, 2018).
- Leike et al., "Scalable agent alignment via reward modeling" (arXiv:1811.07871, 2018) -- the recursive reward modeling proposal.
- Lightman et al., "Let's Verify Step by Step" (OpenAI, arXiv:2305.20050, 2023) -- the canonical process-reward-model paper.
Current:
- Anthropic, "The Claude Model Card and Evaluations" (2024)
- Lee et al., "RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback" (Google, arXiv:2309.00267, 2023) -- independent replication and extension of the CAI feedback-substitution claim.
- Sharma et al., "Towards Understanding Sycophancy in Language Models" (Anthropic, arXiv:2310.13548, 2024) -- documents and analyses sycophancy as a preference-learning failure mode that CAI inherits without explicit countermeasure.
Frontier:
- Hubinger et al., "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training" (Anthropic, arXiv:2401.05566, 2024) -- shows that deceptive policies planted at training time can survive standard CAI / RLHF pipelines, bounding the guarantees CAI alone provides.
- Sharma et al., "Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming" (Anthropic, arXiv:2501.18837, 2025) -- inference-time classifier guard that complements training-time CAI; the strongest extant report on sustained jailbreak resistance.
Last reviewed: April 26, 2026
Canonical graph
These links come from prerequisite edges in the curriculum graph.
Required prerequisites
- Reinforcement Learning from Human Feedback (layer 5 · tier 1)
- RLHF and Alignment (layer 4 · tier 2)
- Reward Hacking (layer 5 · tier 2)
Derived topics
- No published topic currently declares this as a prerequisite.