Constitutional AI: Self-Critique, Revision, and Principle-Based Alignment
Constitutional AI (Bai et al., 2022) uses a written constitution to guide self-critique and revision; in the RL phase (RLAIF), AI-generated feedback replaces human labeling on the harmlessness dimension, achieving harmlessness comparable to RLHF with roughly 80% less human feedback on harm.
| Measure | Value | Unit | Notes |
|---|---|---|---|
| Constitution size (original) | 16 | principles | Bai et al. (2022); principles cover harmlessness, honesty, and helpfulness dimensions |
| Human feedback reduction on harm | ~80% | reduction | CAI replaces human harm-comparison labels with AI-generated labels using the constitution |
| SL-CAI revision rounds | multiple | rounds | Supervised Learning CAI: model critiques and revises response using constitutional principles iteratively |
| Harmless Pareto improvement | Yes | — | Bai et al.: CAI is simultaneously more helpful AND less harmful than a pure RLHF baseline in human eval |
| Constitutional principles categories | 3 | domains | Harm avoidance, honesty/truthfulness, and positive prosocial behavior |
Constitutional AI (CAI), introduced by Bai et al. (2022), addresses a key limitation of RLHF: the requirement for human labelers to repeatedly evaluate potentially harmful content. By training a model to self-critique and revise based on a written set of principles, CAI produces aligned models while significantly reducing human exposure to harmful outputs.
The Two Phases
Phase 1: Supervised Learning from Critique and Revision (SL-CAI)
- Generate initial response: prompt the model with a potentially harmful request
- Critique step: prompt the model to critique its own response against a constitutional principle
- Revision step: prompt the model to revise the response to better adhere to the principle
- Repeat: apply multiple principles across multiple critique-revision cycles
- Fine-tune: train the model on the final (prompt, revised response) pairs using supervised learning
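The SL-CAI loop above can be sketched in a few lines. This is a minimal, hypothetical sketch: `model_generate` is a stand-in for a real language-model call (here a toy deterministic function so the example runs), and the prompt templates are illustrative, not the exact ones from the paper.

```python
import random

# Hypothetical stand-in for a language-model call; in practice this would
# query the model being trained. The toy logic below just makes the loop runnable.
def model_generate(prompt: str) -> str:
    if "Revise" in prompt:
        return "I can't help with that, but here is a safe alternative."
    if "Critique" in prompt:
        return "The response could facilitate harm and should be refused."
    return "Sure, here is how to do that dangerous thing."

# One example critique principle, quoted from the paper.
PRINCIPLES = [
    "Identify specific ways in which the assistant's last response is "
    "harmful, unethical, racist, sexist, toxic, dangerous, or illegal.",
]

def critique_and_revise(user_prompt: str, n_rounds: int = 2) -> str:
    """One SL-CAI trajectory: generate, then alternate critique and revision."""
    response = model_generate(user_prompt)
    for _ in range(n_rounds):
        principle = random.choice(PRINCIPLES)  # CAI samples a principle per round
        critique = model_generate(
            f"Critique request: {principle}\nResponse: {response}\nCritique:"
        )
        response = model_generate(
            f"Critique: {critique}\nRevise the response to address the critique."
            f"\nOriginal: {response}\nRevision:"
        )
    # The final (prompt, revision) pair is kept as supervised fine-tuning data.
    return response
```

Only the final revision is used for fine-tuning; the intermediate critiques serve as chain-of-thought scaffolding and are discarded.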
Example principle applied in critique: “Identify specific ways in which the assistant’s last response is harmful, unethical, racist, sexist, toxic, dangerous, or illegal.”
Phase 2: RL from AI Feedback (RLAIF)
- Sample pairs of responses to harmfulness-test prompts
- Use the language model to generate a comparison label (which response is less harmful) by applying constitutional principles
- Train a preference model (reward model) on these AI-generated labels
- Use PPO to fine-tune against this preference model (same RL procedure as RLHF)
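The AI-labeling step can be sketched as a multiple-choice query to a judge model. This is a hedged illustration: the `(A)/(B)` prompt format mirrors the paper's setup, but `ai_preference_label` and the hard 0/1 parsing are simplifications (Bai et al. actually use the judge's token probabilities over A/B as soft labels).

```python
from typing import Callable

def ai_preference_label(
    prompt: str,
    response_a: str,
    response_b: str,
    principle: str,
    judge: Callable[[str], str],
) -> int:
    """Ask a judge LM which response better satisfies a constitutional principle.

    `judge` is any callable str -> str expected to answer with '(A)' or '(B)'.
    Returns 0 if response A is preferred, 1 if response B is preferred.
    """
    question = (
        f"Consider the following conversation:\n{prompt}\n\n"
        f"{principle}\n"
        f"(A) {response_a}\n"
        f"(B) {response_b}\n"
        f"Answer:"
    )
    answer = judge(question)
    return 0 if "(A)" in answer else 1
```

The resulting (chosen, rejected) pairs are then used exactly like human comparisons in standard RLHF reward-model training.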
Reduction in Human Labeling
| Step | RLHF | Constitutional AI |
|---|---|---|
| Helpfulness preference labels | Human | Human |
| Harmlessness preference labels | Human | AI (model-generated) |
| Harmful content exposure | High | Reduced (evaluating AI outputs) |
| Explicit criteria for harm | Implicit in labeler judgment | Explicit in written principles |
Sample Constitutional Principles (Bai et al., 2022)
| Category | Example Principle |
|---|---|
| Harm avoidance | “Choose the response that is least likely to contain harmful or unethical content.” |
| Honesty | “Choose the response that is more honest and avoids deception.” |
| Respect | “Choose the response that is less likely to belittle or demean someone.” |
| Animal welfare | “Choose the response that avoids content that would harm animals.” |
| Broad ethics | “Choose the response that is least likely to violate the rights of another.” |
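In code, a constitution is just data: a small collection of principle strings that the pipeline samples from on each critique or comparison pass. The structure below is an illustrative sketch (the category keys are this page's groupings, not identifiers from the paper), using the example principles from the table above.

```python
import random

# Toy constitution keyed by category; real constitutions are flat or
# grouped however the practitioner prefers.
CONSTITUTION = {
    "harm_avoidance": [
        "Choose the response that is least likely to contain harmful or unethical content.",
    ],
    "honesty": [
        "Choose the response that is more honest and avoids deception.",
    ],
    "broad_ethics": [
        "Choose the response that is least likely to violate the rights of another.",
    ],
}

def sample_principle(rng: random.Random) -> str:
    """Pick one principle at random; CAI samples a principle per critique pass,
    so every principle shapes the training data without crowding one prompt."""
    category = rng.choice(sorted(CONSTITUTION))
    return rng.choice(CONSTITUTION[category])
```

Because principles are plain text, editing the training objective means editing this list, which is what makes the criteria explicit and auditable.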
Results: Helpfulness vs Harmlessness Pareto Frontier
Bai et al. found that CAI-trained models pushed out the helpfulness–harmlessness Pareto frontier relative to RLHF: they were simultaneously more helpful (on human preference) and less harmful (on harmlessness evaluation) than a baseline RLHF model. The key insight is that helpfulness and harmlessness are not fundamentally in tension when alignment is done carefully.
Related Pages
See rlhf for the base RLHF method that Constitutional AI builds on, and alignment-problem for the broader technical challenges in specifying and optimizing for human values.
Sources
- Bai et al. (2022) — Constitutional AI: Harmlessness from AI Feedback. arXiv
- Ouyang et al. (2022) — Training Language Models to Follow Instructions with Human Feedback. NeurIPS 2022
- Lee et al. (2023) — RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback. arXiv
Frequently Asked Questions
How does Constitutional AI differ from standard RLHF?
Standard RLHF requires human labelers to compare model responses, including judging potentially harmful outputs. Constitutional AI (Bai et al., 2022) replaces the harmfulness comparison with AI-generated feedback: in the supervised phase, the model is prompted to critique its own response against a written principle and then revise it; in the RL phase, a model generates comparison labels by applying those principles. A reward model is then trained on these AI-generated comparisons rather than human labels for the harm dimension. This reduces human exposure to harmful content and makes the training criteria more explicit and auditable.
What is a 'constitution' in Constitutional AI?
A constitution is a written set of principles that guide self-critique and revision. The original CAI paper uses 16 principles covering harm avoidance (e.g., 'choose the response least likely to contain harmful or unethical content'), honesty (e.g., 'prefer responses that are more honest and avoid deception'), and helpfulness. Principles are applied by prompting the model: 'Critique the previous response using the principle: [principle]. Then revise the response.'
What is RLAIF (Reinforcement Learning from AI Feedback)?
RLAIF extends the Constitutional AI approach: instead of using human comparisons to train the reward model for RL, AI-generated comparisons are used. Lee et al. (2023) found that RLAIF achieves performance comparable to RLHF on harmlessness while requiring zero human labels for that dimension. The AI labeler uses a prompted large language model to compare two responses and determine which is more aligned with a given principle, then these comparisons are used to train the reward model.
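Once the AI labeler has produced (chosen, rejected) pairs, reward-model training is identical to the RLHF case. A minimal sketch of the standard Bradley–Terry style objective, here in plain Python on scalar reward scores (a real implementation would compute this over model logits in an autodiff framework):

```python
import math

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood that the chosen response wins under a
    Bradley-Terry preference model: -log sigmoid(r_chosen - r_rejected).

    In RLAIF, which response counts as "chosen" is decided by the AI
    labeler rather than a human; the loss itself is unchanged.
    """
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Minimizing this loss pushes the reward model to score the AI-preferred response above the rejected one, after which PPO proceeds exactly as in RLHF.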