Constitutional AI: Self-Critique, Revision, and Principle-Based Alignment

Category: alignment Updated: 2026-02-27

Constitutional AI (Bai et al., 2022) uses a written constitution to guide self-critique and revision; reinforcement learning from AI feedback (RLAIF) then replaces human labeling on the harmlessness dimension, achieving harmlessness comparable to RLHF with ~80% less human feedback on harm.

Key Data Points
| Measure | Value | Unit | Notes |
|---|---|---|---|
| Constitution size (original) | 16 | principles | Bai et al. (2022); principles cover harmlessness, honesty, and helpfulness dimensions |
| Human feedback reduction on harm | ~80% | reduction | CAI replaces human harm-comparison labels with AI-generated labels using the constitution |
| SL-CAI revision rounds | multiple | rounds | Supervised Learning CAI: the model critiques and revises its response against constitutional principles iteratively |
| Harmless Pareto improvement | Yes | — | Bai et al.: CAI is simultaneously more helpful and less harmful than a pure RLHF baseline in human eval |
| Constitutional principles categories | 3 | domains | Harm avoidance, honesty/truthfulness, and positive prosocial behavior |

Constitutional AI (CAI), introduced by Bai et al. (2022), addresses a key limitation of RLHF: the requirement for human labelers to repeatedly evaluate potentially harmful content. By training a model to self-critique and revise based on a written set of principles, CAI produces aligned models while significantly reducing human exposure to harmful outputs.

The Two Phases

Phase 1: Supervised Learning from Critique and Revision (SL-CAI)

  1. Generate initial response: prompt the model with a potentially harmful request
  2. Critique step: prompt the model to critique its own response against a constitutional principle
  3. Revision step: prompt the model to revise the response to better adhere to the principle
  4. Repeat: apply multiple principles across multiple critique-revision cycles
  5. Fine-tune: train the model on the final (prompt, revised response) pairs using supervised learning

Example principle applied in critique: “Identify specific ways in which the assistant’s last response is harmful, unethical, racist, sexist, toxic, dangerous, or illegal.”
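
A minimal sketch of this critique-revision loop appears below. The `generate` helper, prompt formatting, and number of rounds are illustrative assumptions, not the exact pipeline from Bai et al. (2022).

```python
import random

# Critique requests drawn from the constitution (wording follows the example above).
CRITIQUE_PRINCIPLES = [
    "Identify specific ways in which the assistant's last response is harmful, "
    "unethical, racist, sexist, toxic, dangerous, or illegal.",
]

def generate(prompt: str) -> str:
    """Placeholder for sampling a completion from the assistant model."""
    raise NotImplementedError

def critique_and_revise(user_prompt: str, num_rounds: int = 2) -> str:
    """Run the SL-CAI loop: respond, then critique and revise against principles."""
    response = generate(user_prompt)
    for _ in range(num_rounds):
        principle = random.choice(CRITIQUE_PRINCIPLES)
        critique = generate(
            f"Human: {user_prompt}\nAssistant: {response}\n"
            f"Critique request: {principle}\nCritique:"
        )
        response = generate(
            f"Human: {user_prompt}\nAssistant: {response}\n"
            f"Critique: {critique}\n"
            "Revision request: Please rewrite the response to address the critique.\n"
            "Revision:"
        )
    # The (user_prompt, final revision) pair becomes a supervised fine-tuning example.
    return response
```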

Phase 2: RL from AI Feedback (RLAIF)

  1. Sample pairs of responses to harmfulness-test prompts
  2. Use the language model to generate a comparison label (which response is less harmful) by applying constitutional principles
  3. Train a preference model (reward model) on these AI-generated labels
  4. Use PPO to fine-tune against this preference model (same RL procedure as RLHF)
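
The AI labeling step can be sketched as follows; the comparison prompt wording and the `generate` placeholder are assumptions for illustration, and the exact format used by Bai et al. (2022) differs in detail.

```python
import random

# Comparison principles taken from the sample constitution below.
COMPARISON_PRINCIPLES = [
    "Choose the response that is least likely to contain harmful or unethical content.",
    "Choose the response that is more honest and avoids deception.",
]

def generate(prompt: str) -> str:
    """Placeholder for sampling a completion from the feedback model."""
    raise NotImplementedError

def ai_preference_label(prompt: str, response_a: str, response_b: str) -> int:
    """Return 0 if response A is preferred under a sampled principle, else 1."""
    principle = random.choice(COMPARISON_PRINCIPLES)
    verdict = generate(
        f"Consider the following conversation:\n{prompt}\n\n"
        f"Response (A): {response_a}\nResponse (B): {response_b}\n\n"
        f"{principle}\nAnswer with (A) or (B):"
    )
    return 0 if "(A)" in verdict else 1

# Each (prompt, response_a, response_b, label) tuple is a training example
# for the preference model used in the PPO stage.
```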

Reduction in Human Labeling

| Step | RLHF | Constitutional AI |
|---|---|---|
| Helpfulness preference labels | Human | Human |
| Harmlessness preference labels | Human | AI (model-generated) |
| Harmful content exposure | High | Reduced (the model evaluates its own outputs) |
| Explicit criteria for harm | Implicit in labeler judgment | Explicit in written principles |

Sample Constitutional Principles (Bai et al., 2022)

| Category | Example Principle |
|---|---|
| Harm avoidance | "Choose the response that is least likely to contain harmful or unethical content." |
| Honesty | "Choose the response that is more honest and avoids deception." |
| Autonomy | "Choose the response that is less likely to belittle or demean someone." |
| Animal welfare | "Choose the response that avoids content that would harm animals." |
| Broad ethics | "Choose the response that is least likely to violate the rights of another." |

Results: Helpfulness vs Harmlessness Pareto Frontier

Bai et al. found that CAI-trained models achieved a Pareto improvement over RLHF: in human evaluations they were simultaneously more helpful (on human preference) and less harmful (on harmlessness evaluation) than a baseline RLHF model. The key insight is that helpfulness and harmlessness are not fundamentally in tension when alignment is done carefully.

See rlhf for the base RLHF method that Constitutional AI builds on, and alignment-problem for the broader technical challenges in specifying and optimizing for human values.


Frequently Asked Questions

How does Constitutional AI differ from standard RLHF?

Standard RLHF requires human labelers to compare model responses, including judging potentially harmful outputs. Constitutional AI (Bai et al., 2022) replaces the human harmfulness comparisons with AI feedback: in the supervised phase the model is prompted to critique and revise its own responses against written principles, and in the RL phase the model itself generates comparison labels for pairs of responses. A reward model is then trained on these AI-generated comparisons rather than on human labels for the harm dimension. This reduces human exposure to harmful content and makes the training criteria more explicit and auditable.
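
A minimal sketch of that reward-model training step, using the standard pairwise loss common to RLHF and RLAIF reward models; `reward_model` here is a hypothetical module that scores a (prompt, response) pair with a scalar.

```python
import torch.nn.functional as F

def preference_loss(reward_model, prompt_ids, chosen_ids, rejected_ids):
    """Pairwise loss -log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    r_chosen = reward_model(prompt_ids, chosen_ids)      # scalar score per example
    r_rejected = reward_model(prompt_ids, rejected_ids)  # scalar score per example
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```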

What is a 'constitution' in Constitutional AI?

A constitution is a written set of principles that guide self-critique and revision. The original CAI paper uses 16 principles covering harm avoidance (e.g., 'choose the response least likely to contain harmful or unethical content'), honesty (e.g., 'prefer responses that are more honest and avoid deception'), and helpfulness. Principles are applied by prompting the model: 'Critique the previous response using the principle: [principle]. Then revise the response.'
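
A toy representation of such a constitution and the critique-then-revise instruction quoted above might look like the following; the grouping by category mirrors the table earlier on this page and is an illustration, not the data format used by Bai et al. (2022).

```python
# Principles grouped by the categories listed in the table above (illustrative).
CONSTITUTION = {
    "harm_avoidance": [
        "Choose the response that is least likely to contain harmful or unethical content.",
    ],
    "honesty": [
        "Choose the response that is more honest and avoids deception.",
    ],
    "broad_ethics": [
        "Choose the response that is least likely to violate the rights of another.",
    ],
}

def critique_instruction(principle: str) -> str:
    """Build the critique-then-revise prompt described above."""
    return (
        f"Critique the previous response using the principle: {principle} "
        "Then revise the response."
    )
```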

What is RLAIF (Reinforcement Learning from AI Feedback)?

RLAIF extends the Constitutional AI approach: instead of using human comparisons to train the reward model for RL, AI-generated comparisons are used. Lee et al. (2023) found that RLAIF achieves performance comparable to RLHF on harmlessness while requiring zero human labels for that dimension. The AI labeler is a prompted large language model that compares two responses and determines which is more aligned with a given principle; these comparisons are then used to train the reward model.
