RLHF: Reinforcement Learning from Human Feedback — Reward Model and PPO

Category: alignment Updated: 2026-02-27

RLHF trains a reward model on human pairwise preferences, then optimizes via PPO with KL penalty: R = r_θ(x,y) − β·KL(π_RL || π_SFT); introduced for language models by Stiennon et al. (NeurIPS 2020), extended by InstructGPT (Ouyang et al., 2022).

Key Data Points
| Measure | Value | Unit | Notes |
| --- | --- | --- | --- |
| RLHF reward function | R(x,y) = r_θ(x,y) − β·KL(π_RL ‖ π_SFT) | | r_θ = learned reward model; β = KL penalty coefficient; KL term penalizes divergence from SFT baseline |
| KL penalty coefficient (β) | 0.01–0.1 | | Typical range; higher β = more conservative; lower β = more reward optimization |
| InstructGPT human evaluation | 85 | % preference vs. baseline | Ouyang et al. (2022): labelers preferred InstructGPT-1.3B over GPT-3 175B outputs 85% of the time |
| Reward model training size | ~33,000 | comparison pairs | InstructGPT: 33K human pairwise comparisons used to train the reward model |
| SFT warmup dataset | ~13,000 | labeled prompts | InstructGPT: supervised fine-tuning on 13K high-quality human-written demonstrations first |

Reinforcement Learning from Human Feedback (RLHF) is a three-phase training procedure that aligns language model outputs with human preferences using pairwise comparison data and policy gradient optimization. Introduced by Stiennon et al. (2020) for summarization and extended to instruction-following by Ouyang et al. (2022), it has become the dominant method for producing helpful, harmless AI assistants.

The Three Phases

Phase 1: Supervised Fine-Tuning (SFT)

Fine-tune the base pre-trained model on a curated dataset of (prompt, human-written demonstration) pairs:

  • Labelers write high-quality responses to sampled prompts
  • Standard cross-entropy training; typically 1–3 epochs
  • Produces π_SFT: a starting policy for RL optimization
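The cross-entropy training in the bullets above reduces, per token position, to the standard negative log-likelihood of the demonstrated token. A minimal sketch (pure Python with toy scalar logits; the function name is illustrative, and real training averages this over every position of every demonstration):

```python
import math

def next_token_cross_entropy(logits: list[float], target: int) -> float:
    """Per-position SFT loss: -log softmax(logits)[target],
    computed with the log-sum-exp trick for numerical stability."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

# The loss is lowest when the demonstration's token gets the highest logit:
# next_token_cross_entropy([4.0, 0.0, 0.0], 0) < next_token_cross_entropy([0.0, 4.0, 0.0], 0)
```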

Phase 2: Reward Model Training

Collect human preference comparisons: given prompt x and two responses (y_A, y_B), label which is better.

Reward model objective: maximize P(y_A ≻ y_B | x) = σ(r_θ(x, y_A) − r_θ(x, y_B))
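This is a Bradley-Terry preference model: minimizing the negative log-likelihood, −log σ(r_θ(x, y_A) − r_θ(x, y_B)), pushes the preferred response's score above the rejected one's. A minimal per-pair sketch (pure Python; the function name and scalar scores are illustrative stand-ins for reward model outputs):

```python
import math

def pairwise_reward_loss(score_preferred: float, score_rejected: float) -> float:
    """Negative log-likelihood of the Bradley-Terry preference model:
    -log sigmoid(r(x, y_A) - r(x, y_B)) for a pair where y_A was preferred."""
    margin = score_preferred - score_rejected
    # equivalent to -log(sigmoid(margin))
    return math.log1p(math.exp(-margin))

# Loss falls as the preferred response's score pulls ahead of the rejected one's:
# pairwise_reward_loss(2.0, 0.5) < pairwise_reward_loss(0.0, 0.0) < pairwise_reward_loss(0.5, 2.0)
```

When the two scores tie, the loss is log 2 (the model assigns 50/50 to the pair), so training only stops pushing once preferred responses are reliably scored higher.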

| InstructGPT Data | Count |
| --- | --- |
| SFT demonstrations | ~13,000 |
| Comparison pairs | ~33,000 |
| Total prompts | ~40,000 |

Phase 3: RL Fine-Tuning with PPO

Optimize π_RL to maximize the augmented reward:

R(x, y) = r_θ(x, y) − β · KL(π_RL(y|x) || π_SFT(y|x))

PPO updates the policy using clipped objectives to prevent large, destabilizing policy updates.
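For a single sampled response, the augmented reward can be sketched with the standard single-sample KL estimate, log π_RL(y|x) − log π_SFT(y|x). A minimal version (pure Python; the function name and the β default are illustrative, with β picked from the typical 0.01–0.1 range noted above):

```python
def rlhf_reward(reward_model_score: float,
                logprob_rl: float,
                logprob_sft: float,
                beta: float = 0.05) -> float:
    """Augmented RLHF reward for one sampled response y to prompt x.

    kl_estimate = log pi_RL(y|x) - log pi_SFT(y|x) is the usual
    single-sample KL estimate; beta trades reward-model score
    against divergence from the SFT baseline."""
    kl_estimate = logprob_rl - logprob_sft
    return reward_model_score - beta * kl_estimate

# If the policy has not drifted from the SFT model, the penalty vanishes:
# rlhf_reward(1.0, -5.0, -5.0) == 1.0
```

With β at the high end of the typical range the policy stays close to π_SFT; at the low end it optimizes the reward model's score more aggressively.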

PPO Clip Objective

The PPO loss clips the probability ratio to prevent large updates:

L_CLIP(θ) = E[min(r_t(θ)·Â_t, clip(r_t(θ), 1−ε, 1+ε)·Â_t)]

where r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t) and ε = 0.2 (typical clipping range).
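One term of this expectation can be sketched directly from the formula (pure Python; the function name is illustrative, the value is to be maximized, and in practice it is averaged over many sampled tokens):

```python
import math

def ppo_clip_term(logprob_new: float, logprob_old: float,
                  advantage: float, eps: float = 0.2) -> float:
    """One term of L_CLIP: min(r * A, clip(r, 1 - eps, 1 + eps) * A),
    where r = exp(logprob_new - logprob_old) is the probability ratio."""
    ratio = math.exp(logprob_new - logprob_old)
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)

# An unchanged policy (ratio = 1) recovers the plain advantage:
# ppo_clip_term(0.0, 0.0, 2.0) == 2.0
# A large ratio with positive advantage is capped at (1 + eps) * advantage:
# ppo_clip_term(1.0, 0.0, 2.0) == 1.2 * 2.0
```

Note the asymmetry of the outer min: gains from moving far off-policy are capped, but losses are not, which is what keeps individual updates conservative.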

InstructGPT Results (Ouyang et al., 2022)

| Evaluation | InstructGPT 1.3B | GPT-3 175B | Winner |
| --- | --- | --- | --- |
| Human preference | preferred 85% of the time | | InstructGPT |
| Truthfulness (TruthfulQA) | 41% | 22% | InstructGPT (+19 points) |
| Toxicity (RealToxicityPrompts) | ~25% reduction | | InstructGPT |
| NLP benchmark performance | slight regression | | GPT-3 (RLHF hurts slightly) |

The human preference result, in which a 1.3B-parameter RLHF model outperforms the 175B-parameter supervised-only GPT-3, demonstrates that alignment training is highly efficient: a model over 100× smaller, given better training, can be more useful in practice.

See constitutional-ai for a feedback-reduction approach to alignment, reinforcement-learning-basics for the RL foundations, and alignment-problem for the broader context of why alignment is difficult.


Frequently Asked Questions

What are the three phases of RLHF training?

Phase 1 (SFT): fine-tune the pre-trained language model on a dataset of human-written demonstrations (prompt, response pairs) using standard supervised learning. Phase 2 (Reward model): collect human pairwise comparisons (which response is better?) and train a classifier to predict human preferences. Phase 3 (RL fine-tuning): use PPO to optimize the SFT model to maximize the reward model's score, with a KL divergence penalty to prevent the policy from collapsing to reward-hacking behaviors.

Why is a KL penalty needed in RLHF?

Without the KL penalty, the RL policy can 'reward hack' — finding inputs that fool the reward model into giving high scores without actually being helpful or truthful. The reward model is an imperfect proxy for human preferences and has exploitable weaknesses. The penalty R = r_θ(x,y) − β·KL(π_RL || π_SFT) keeps the optimized policy close to the supervised baseline, limiting how aggressively it can exploit reward model flaws. This is a direct application of Goodhart's Law: when the measure becomes a target, it ceases to be a good measure.

What did Stiennon et al. (2020) demonstrate about RLHF for summarization?

Stiennon et al. trained a reward model on ~64,000 human preference comparisons between TL;DR summaries, then optimized a GPT-3-based summarizer using PPO. The RLHF-optimized model was preferred by human evaluators 65–75% of the time over supervised fine-tuning baselines. This paper established that RLHF could significantly improve human-perceived quality beyond what standard supervised learning achieves.
