RLHF: Reinforcement Learning from Human Feedback — Reward Model and PPO

Category: alignment Updated: 2026-02-27

RLHF trains a reward model on human pairwise preferences, then optimizes via PPO with KL penalty: R = r_θ(x,y) − β·KL(π_RL || π_SFT); introduced for language models by Stiennon et al. (NeurIPS 2020), extended by InstructGPT (Ouyang et al., 2022).

Key Data Points
| Measure | Value | Unit | Notes |
| --- | --- | --- | --- |
| RLHF reward function | R(x,y) = r_θ(x,y) − β·KL(π_RL ‖ π_SFT) | | r_θ = learned reward model; β = KL penalty coefficient; KL term penalizes divergence from SFT baseline |
| KL penalty coefficient (β) | 0.01–0.1 | | Typical range; higher β = more conservative; lower β = more reward optimization |
| InstructGPT human evaluation | 85 | % preference vs. baseline | Ouyang et al. (2022): labelers preferred InstructGPT-1.3B over GPT-3 175B outputs 85% of the time |
| Reward model training size | ~33,000 | comparison pairs | InstructGPT: 33K human pairwise comparisons used to train the reward model |
| SFT warmup dataset | ~13,000 | labeled prompts | InstructGPT: supervised fine-tuning on 13K high-quality human-written demonstrations first |

Reinforcement Learning from Human Feedback (RLHF) is a three-phase training procedure that aligns language model outputs with human preferences using pairwise comparison data and policy gradient optimization. Introduced by Stiennon et al. (2020) for summarization and extended to instruction-following by Ouyang et al. (2022), it has become the dominant method for producing helpful, harmless AI assistants.

The Three Phases

Phase 1: Supervised Fine-Tuning (SFT)

Fine-tune the base pre-trained model on a curated dataset of (prompt, human-written demonstration) pairs:

  • Labelers write high-quality responses to sampled prompts
  • Standard cross-entropy training; typically 1–3 epochs
  • Produces π_SFT: a starting policy for RL optimization
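The cross-entropy training in the bullets above reduces, per token position, to the standard negative log-likelihood of the demonstrated token. A minimal sketch (pure Python with toy scalar logits; the function name is illustrative, and real training averages this over every position of every demonstration):

```python
import math

def next_token_cross_entropy(logits: list[float], target: int) -> float:
    """Per-position SFT loss: -log softmax(logits)[target],
    computed with the log-sum-exp trick for numerical stability."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

# The loss is lowest when the demonstration's token gets the highest logit:
# next_token_cross_entropy([4.0, 0.0, 0.0], 0) < next_token_cross_entropy([0.0, 4.0, 0.0], 0)
```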

Phase 2: Reward Model Training

Collect human preference comparisons: given prompt x and two responses (y_A, y_B), label which is better.

Reward model objective: maximize P(y_A ≻ y_B | x) = σ(r_θ(x, y_A) − r_θ(x, y_B))
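This is a Bradley-Terry preference model: minimizing the negative log-likelihood, −log σ(r_θ(x, y_A) − r_θ(x, y_B)), pushes the preferred response's score above the rejected one's. A minimal per-pair sketch (pure Python; the function name and scalar scores are illustrative stand-ins for reward model outputs):

```python
import math

def pairwise_reward_loss(score_preferred: float, score_rejected: float) -> float:
    """Negative log-likelihood of the Bradley-Terry preference model:
    -log sigmoid(r(x, y_A) - r(x, y_B)) for a pair where y_A was preferred."""
    margin = score_preferred - score_rejected
    # equivalent to -log(sigmoid(margin))
    return math.log1p(math.exp(-margin))

# Loss falls as the preferred response's score pulls ahead of the rejected one's:
# pairwise_reward_loss(2.0, 0.5) < pairwise_reward_loss(0.0, 0.0) < pairwise_reward_loss(0.5, 2.0)
```

When the two scores tie, the loss is log 2 (the model assigns 50/50 to the pair), so training only stops pushing once preferred responses are reliably scored higher.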

| InstructGPT Data | Count |
| --- | --- |
| SFT demonstrations | ~13,000 |
| Comparison pairs | ~33,000 |
| Total prompts | ~40,000 |

Phase 3: RL Fine-Tuning with PPO

Optimize π_RL to maximize the augmented reward:

R(x, y) = r_θ(x, y) − β · KL(π_RL(y|x) || π_SFT(y|x))

PPO updates the policy using clipped objectives to prevent large, destabilizing policy updates.
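For a single sampled response, the augmented reward can be sketched with the standard single-sample KL estimate, log π_RL(y|x) − log π_SFT(y|x). A minimal version (pure Python; the function name and the β default are illustrative, with β picked from the typical 0.01–0.1 range noted above):

```python
def rlhf_reward(reward_model_score: float,
                logprob_rl: float,
                logprob_sft: float,
                beta: float = 0.05) -> float:
    """Augmented RLHF reward for one sampled response y to prompt x.

    kl_estimate = log pi_RL(y|x) - log pi_SFT(y|x) is the usual
    single-sample KL estimate; beta trades reward-model score
    against divergence from the SFT baseline."""
    kl_estimate = logprob_rl - logprob_sft
    return reward_model_score - beta * kl_estimate

# If the policy has not drifted from the SFT model, the penalty vanishes:
# rlhf_reward(1.0, -5.0, -5.0) == 1.0
```

With β at the high end of the typical range the policy stays close to π_SFT; at the low end it optimizes the reward model's score more aggressively.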

PPO Clip Objective

The PPO loss clips the probability ratio to prevent large updates:

L_CLIP(θ) = E[min(r_t(θ)·Â_t, clip(r_t(θ), 1−ε, 1+ε)·Â_t)]

where r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t) and ε = 0.2 (typical clipping range).
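One term of this expectation can be sketched directly from the formula (pure Python; the function name is illustrative, the value is to be maximized, and in practice it is averaged over many sampled tokens):

```python
import math

def ppo_clip_term(logprob_new: float, logprob_old: float,
                  advantage: float, eps: float = 0.2) -> float:
    """One term of L_CLIP: min(r * A, clip(r, 1 - eps, 1 + eps) * A),
    where r = exp(logprob_new - logprob_old) is the probability ratio."""
    ratio = math.exp(logprob_new - logprob_old)
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)

# An unchanged policy (ratio = 1) recovers the plain advantage:
# ppo_clip_term(0.0, 0.0, 2.0) == 2.0
# A large ratio with positive advantage is capped at (1 + eps) * advantage:
# ppo_clip_term(1.0, 0.0, 2.0) == 1.2 * 2.0
```

Note the asymmetry of the outer min: gains from moving far off-policy are capped, but losses are not, which is what keeps individual updates conservative.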

InstructGPT Results (Ouyang et al., 2022)

| Evaluation | InstructGPT 1.3B | GPT-3 175B | Winner |
| --- | --- | --- | --- |
| Human preference | preferred 85% of the time | | InstructGPT |
| Truthfulness (TruthfulQA) | 41% | 22% | InstructGPT (+19 points) |
| Toxicity (RealToxicityPrompts) | ~25% reduction | | InstructGPT |
| NLP benchmark performance | slight regression | | GPT-3 (RLHF hurts slightly) |

The human preference result, in which a 1.3B-parameter RLHF model outperforms the 175B-parameter supervised-only GPT-3, demonstrates that alignment training is highly efficient: a model over 100× smaller, given better training, can be more useful in practice.

See constitutional-ai for a feedback-reduction approach to alignment, reinforcement-learning-basics for the RL foundations, and alignment-problem for the broader context of why alignment is difficult.


Frequently Asked Questions

What are the three phases of RLHF training?

Phase 1 (SFT): fine-tune the pre-trained language model on a dataset of human-written demonstrations (prompt, response pairs) using standard supervised learning. Phase 2 (Reward model): collect human pairwise comparisons (which response is better?) and train a classifier to predict human preferences. Phase 3 (RL fine-tuning): use PPO to optimize the SFT model to maximize the reward model's score, with a KL divergence penalty to prevent the policy from collapsing to reward-hacking behaviors.

Why is a KL penalty needed in RLHF?

Without the KL penalty, the RL policy can 'reward hack' — finding inputs that fool the reward model into giving high scores without actually being helpful or truthful. The reward model is an imperfect proxy for human preferences and has exploitable weaknesses. The penalty R = r_θ(x,y) − β·KL(π_RL || π_SFT) keeps the optimized policy close to the supervised baseline, limiting how aggressively it can exploit reward model flaws. This is a direct application of Goodhart's Law: when the measure becomes a target, it ceases to be a good measure.

What did Stiennon et al. (2020) demonstrate about RLHF for summarization?

Stiennon et al. trained a reward model on ~64,000 human preference comparisons between TL;DR summaries, then optimized a GPT-3-based summarizer using PPO. The RLHF-optimized model was preferred by human evaluators 65–75% of the time over supervised fine-tuning baselines. This paper established that RLHF could significantly improve human-perceived quality beyond what standard supervised learning achieves.
