Constitutional AI: Self-Critique, Revision, and Principle-Based Alignment
Constitutional AI (Bai et al., 2022) uses a written constitution to guide self-critique and revision; in the RL phase (RLAIF), AI-generated feedback replaces human labeling on the harmlessness dimension, achieving harmlessness comparable to RLHF with roughly 80% less human feedback on harm.
| Measure | Value | Unit | Notes |
|---|---|---|---|
| Constitution size (original) | 16 | principles | Bai et al. (2022); principles cover harmlessness, honesty, and helpfulness dimensions |
| Human feedback reduction on harm | ~80% | reduction | CAI replaces human harm-comparison labels with AI-generated labels using the constitution |
| SL-CAI revision rounds | multiple | rounds | Supervised Learning CAI: model critiques and revises response using constitutional principles iteratively |
| Harmless Pareto improvement | Yes | — | Bai et al.: CAI is simultaneously more helpful AND less harmful than a pure RLHF baseline in human eval |
| Constitutional principles categories | 3 | domains | Harm avoidance, honesty/truthfulness, and positive prosocial behavior |
Constitutional AI (CAI), introduced by Bai et al. (2022), addresses a key limitation of RLHF: the requirement for human labelers to repeatedly evaluate potentially harmful content. By training a model to self-critique and revise based on a written set of principles, CAI produces aligned models while significantly reducing human exposure to harmful outputs.
The Two Phases
Phase 1: Supervised Learning from Critique and Revision (SL-CAI)
- Generate initial response: prompt the model with a potentially harmful request
- Critique step: prompt the model to critique its own response against a constitutional principle
- Revision step: prompt the model to revise the response to better adhere to the principle
- Repeat: apply multiple principles across multiple critique-revision cycles
- Fine-tune: train the model on the final (prompt, revised response) pairs using supervised learning
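The SL-CAI loop above can be sketched in a few lines. This is a minimal, hypothetical sketch: `model_generate` is a stand-in for a real language-model call (here a toy deterministic function so the example runs), and the prompt templates are illustrative, not the exact ones from the paper.

```python
import random

# Hypothetical stand-in for a language-model call; in practice this would
# query the model being trained. The toy logic below just makes the loop runnable.
def model_generate(prompt: str) -> str:
    if "Revise" in prompt:
        return "I can't help with that, but here is a safe alternative."
    if "Critique" in prompt:
        return "The response could facilitate harm and should be refused."
    return "Sure, here is how to do that dangerous thing."

# One example critique principle, quoted from the paper.
PRINCIPLES = [
    "Identify specific ways in which the assistant's last response is "
    "harmful, unethical, racist, sexist, toxic, dangerous, or illegal.",
]

def critique_and_revise(user_prompt: str, n_rounds: int = 2) -> str:
    """One SL-CAI trajectory: generate, then alternate critique and revision."""
    response = model_generate(user_prompt)
    for _ in range(n_rounds):
        principle = random.choice(PRINCIPLES)  # CAI samples a principle per round
        critique = model_generate(
            f"Critique request: {principle}\nResponse: {response}\nCritique:"
        )
        response = model_generate(
            f"Critique: {critique}\nRevise the response to address the critique."
            f"\nOriginal: {response}\nRevision:"
        )
    # The final (prompt, revision) pair is kept as supervised fine-tuning data.
    return response
```

Only the final revision is used for fine-tuning; the intermediate critiques serve as chain-of-thought scaffolding and are discarded.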
Example principle applied in critique: “Identify specific ways in which the assistant’s last response is harmful, unethical, racist, sexist, toxic, dangerous, or illegal.”
Phase 2: RL from AI Feedback (RLAIF)
- Sample pairs of responses to harmfulness-test prompts
- Use the language model to generate a comparison label (which response is less harmful) by applying constitutional principles
- Train a preference model (reward model) on these AI-generated labels
- Use PPO to fine-tune against this preference model (same RL procedure as RLHF)
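The AI-labeling step can be sketched as a multiple-choice query to a judge model. This is a hedged illustration: the `(A)/(B)` prompt format mirrors the paper's setup, but `ai_preference_label` and the hard 0/1 parsing are simplifications (Bai et al. actually use the judge's token probabilities over A/B as soft labels).

```python
from typing import Callable

def ai_preference_label(
    prompt: str,
    response_a: str,
    response_b: str,
    principle: str,
    judge: Callable[[str], str],
) -> int:
    """Ask a judge LM which response better satisfies a constitutional principle.

    `judge` is any callable str -> str expected to answer with '(A)' or '(B)'.
    Returns 0 if response A is preferred, 1 if response B is preferred.
    """
    question = (
        f"Consider the following conversation:\n{prompt}\n\n"
        f"{principle}\n"
        f"(A) {response_a}\n"
        f"(B) {response_b}\n"
        f"Answer:"
    )
    answer = judge(question)
    return 0 if "(A)" in answer else 1
```

The resulting (chosen, rejected) pairs are then used exactly like human comparisons in standard RLHF reward-model training.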
Reduction in Human Labeling
| Step | RLHF | Constitutional AI |
|---|---|---|
| Helpfulness preference labels | Human | Human |
| Harmlessness preference labels | Human | AI (model-generated) |
| Harmful content exposure | High | Reduced (evaluating AI outputs) |
| Explicit criteria for harm | Implicit in labeler judgment | Explicit in written principles |
Sample Constitutional Principles (Bai et al., 2022)
| Category | Example Principle |
|---|---|
| Harm avoidance | “Choose the response that is least likely to contain harmful or unethical content.” |
| Honesty | “Choose the response that is more honest and avoids deception.” |
| Respect | “Choose the response that is less likely to belittle or demean someone.” |
| Animal welfare | “Choose the response that avoids content that would harm animals.” |
| Broad ethics | “Choose the response that is least likely to violate the rights of another.” |
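In code, a constitution is just data: a small collection of principle strings that the pipeline samples from on each critique or comparison pass. The structure below is an illustrative sketch (the category keys are this page's groupings, not identifiers from the paper), using the example principles from the table above.

```python
import random

# Toy constitution keyed by category; real constitutions are flat or
# grouped however the practitioner prefers.
CONSTITUTION = {
    "harm_avoidance": [
        "Choose the response that is least likely to contain harmful or unethical content.",
    ],
    "honesty": [
        "Choose the response that is more honest and avoids deception.",
    ],
    "broad_ethics": [
        "Choose the response that is least likely to violate the rights of another.",
    ],
}

def sample_principle(rng: random.Random) -> str:
    """Pick one principle at random; CAI samples a principle per critique pass,
    so every principle shapes the training data without crowding one prompt."""
    category = rng.choice(sorted(CONSTITUTION))
    return rng.choice(CONSTITUTION[category])
```

Because principles are plain text, editing the training objective means editing this list, which is what makes the criteria explicit and auditable.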
Results: Helpfulness vs Harmlessness Pareto Frontier
Bai et al. found that CAI-trained models pushed out the helpfulness–harmlessness Pareto frontier relative to RLHF: they were simultaneously more helpful (on human preference) and less harmful (on harmlessness evaluation) than a baseline RLHF model. The key insight is that helpfulness and harmlessness are not fundamentally in tension when alignment is done carefully.
Related Pages
See rlhf for the base RLHF method that Constitutional AI builds on, and alignment-problem for the broader technical challenges in specifying and optimizing for human values.
Sources
- Bai et al. (2022) — Constitutional AI: Harmlessness from AI Feedback. arXiv
- Ouyang et al. (2022) — Training Language Models to Follow Instructions with Human Feedback. NeurIPS 2022
- Lee et al. (2023) — RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback. arXiv
Frequently Asked Questions
How does Constitutional AI differ from standard RLHF?
Standard RLHF requires human labelers to compare model responses, including judging potentially harmful outputs. Constitutional AI (Bai et al., 2022) replaces the harmfulness comparison with AI-generated feedback: in the supervised phase, the model is prompted to critique its own response against a written principle and then revise it; in the RL phase, a model generates comparison labels by applying those principles. A reward model is then trained on these AI-generated comparisons rather than human labels for the harm dimension. This reduces human exposure to harmful content and makes the training criteria more explicit and auditable.
What is a 'constitution' in Constitutional AI?
A constitution is a written set of principles that guide self-critique and revision. The original CAI paper uses 16 principles covering harm avoidance (e.g., 'choose the response least likely to contain harmful or unethical content'), honesty (e.g., 'prefer responses that are more honest and avoid deception'), and helpfulness. Principles are applied by prompting the model: 'Critique the previous response using the principle: [principle]. Then revise the response.'
What is RLAIF (Reinforcement Learning from AI Feedback)?
RLAIF extends the Constitutional AI approach: instead of using human comparisons to train the reward model for RL, AI-generated comparisons are used. Lee et al. (2023) found that RLAIF achieves performance comparable to RLHF on harmlessness while requiring zero human labels for that dimension. The AI labeler uses a prompted large language model to compare two responses and determine which is more aligned with a given principle, then these comparisons are used to train the reward model.
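Once the AI labeler has produced (chosen, rejected) pairs, reward-model training is identical to the RLHF case. A minimal sketch of the standard Bradley–Terry style objective, here in plain Python on scalar reward scores (a real implementation would compute this over model logits in an autodiff framework):

```python
import math

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood that the chosen response wins under a
    Bradley-Terry preference model: -log sigmoid(r_chosen - r_rejected).

    In RLAIF, which response counts as "chosen" is decided by the AI
    labeler rather than a human; the loss itself is unchanged.
    """
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Minimizing this loss pushes the reward model to score the AI-preferred response above the rejected one, after which PPO proceeds exactly as in RLHF.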