Reinforcement Learning Basics: MDPs, Policy Gradients, and PPO

Category: alignment Updated: 2026-02-27

PPO (Schulman et al., 2017): clipped surrogate objective prevents destructive policy updates; achieves better sample efficiency than TRPO with simpler implementation; PPO is the standard RL optimizer in RLHF pipelines.

Key Data Points

| Measure | Value | Unit | Notes |
|---|---|---|---|
| PPO clip parameter ε | 0.2 | — | Schulman et al. default; clips policy ratio r(θ) to [1−ε, 1+ε] to prevent large updates |
| REINFORCE variance reduction | Baseline subtraction | — | Subtracting a state-dependent baseline (value function) reduces gradient variance without bias |
| Discount factor γ in language RL | 1.0 | — | RLHF pipelines typically use γ = 1 since reward is only given at the end of the sequence (episodic) |
| PPO rollout buffer size | 2048–8192 | tokens/steps | Typical RLHF implementations collect this many response tokens before each gradient update |
| KL penalty coefficient β | 0.01–0.1 | — | β scales the KL divergence from the reference policy in the RLHF reward: R = r_φ − β·KL |

Reinforcement learning (RL) provides the mathematical framework for training agents to maximize reward through interaction with an environment. For language model alignment, RL — specifically policy gradient methods and Proximal Policy Optimization (PPO) — is the optimization method used in RLHF to maximize human preference scores.

The MDP Framework for Language Models

In the language model setting, the MDP elements map to:

| MDP Concept | Language Model Equivalent |
|---|---|
| State s | Token sequence generated so far |
| Action a | Next token to generate (vocab size ~50K) |
| Transition P(s'\|s,a) | Deterministic: append a to s |
| Reward R(s,a) | Preference score at end of response |
| Policy π(a\|s) | Language model (softmax output) |
| Episode | One complete prompt-response pair |
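The mapping above can be sketched in code. This is an illustrative toy (all names are hypothetical, not from any RL library): the state is the token sequence so far, the transition deterministically appends the chosen token, and the reward is zero everywhere except the terminal step.

```python
# Toy model of the language-generation MDP described above.
from dataclasses import dataclass

@dataclass
class TokenMDPState:
    tokens: list  # token ids generated so far (the state s)

def transition(state: TokenMDPState, action: int) -> TokenMDPState:
    """Deterministic transition: append action token a to sequence s."""
    return TokenMDPState(tokens=state.tokens + [action])

def reward(done: bool, reward_model_score: float = 0.0) -> float:
    """Terminal-only reward: the preference score arrives at episode end."""
    return reward_model_score if done else 0.0

# One generation step of an episode:
s = TokenMDPState(tokens=[101, 2054])   # prompt tokens
s2 = transition(s, 318)                  # generate one token
r = reward(done=False)                   # 0.0 until the response ends
```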

Policy Gradient Theorem

The policy gradient theorem (Sutton & Barto, 2018) provides an unbiased gradient estimate for the expected return:

∇_θ J(θ) = E_π[G_t · ∇_θ log π_θ(a_t|s_t)]

Where G_t = Σ_{k=t}^{T} γ^{k-t} R_k is the discounted return from time t. This is the REINFORCE algorithm (Williams, 1992). The key insight: we can estimate this expectation by sampling trajectories from the current policy and computing the log-probability gradient.
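A minimal REINFORCE step can be written out directly for a toy softmax policy (a hypothetical two-action setup, not a language model): the analytic gradient of log π for a softmax over logits θ is the one-hot action indicator minus the probability vector, scaled by the sampled return G.

```python
# REINFORCE sketch: theta <- theta + lr * G * grad log pi(a | theta).
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_pi(theta, action):
    """d/d theta_i of log softmax(theta)[action] = 1[i == action] - pi_i."""
    pi = softmax(theta)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(pi)]

def reinforce_update(theta, action, G, lr=0.1):
    """One REINFORCE step using the sampled return G as the weight."""
    g = grad_log_pi(theta, action)
    return [t + lr * G * gi for t, gi in zip(theta, g)]

theta = [0.0, 0.0]
# A positive return on action 0 should raise its probability.
theta = reinforce_update(theta, action=0, G=2.0)
```

A negative return would push probability away from the sampled action, which is exactly the high-variance behavior the baseline in the next section is meant to tame.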

Advantage Estimation

Raw return G_t has high variance. The advantage function subtracts a state-value baseline:

A(s_t, a_t) = G_t - V(s_t)

Where V(s_t) = E_π[G_t | s_t] is the expected return from state s_t (estimated by a learned value network). This reduces variance without introducing bias. In practice, Generalized Advantage Estimation (GAE; Schulman et al., 2016) is used:

A^{GAE}_t = Σ_{l=0}^{∞} (γλ)^l δ_{t+l},  where  δ_t = R_t + γV(s_{t+1}) − V(s_t)

With λ ∈ [0,1] trading bias for variance.
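The GAE sum above telescopes into a backward recursion, gae_t = δ_t + γλ·gae_{t+1}, which is how implementations actually compute it. A sketch for a single episode (assumes `values` carries one extra bootstrap entry for V(s_T)):

```python
# Generalized Advantage Estimation via the backward recursion
# gae_t = delta_t + gamma * lam * gae_{t+1},
# delta_t = R_t + gamma * V(s_{t+1}) - V(s_t).
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    T = len(rewards)
    advantages = [0.0] * T
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

# Terminal-only reward, as in RLHF (gamma = 1); the final value entry is
# the bootstrap V(s_T), zero for a finished episode.
adv = gae_advantages(rewards=[0.0, 0.0, 1.0], values=[0.2, 0.4, 0.7, 0.0])
```

Setting λ = 1 recovers the plain baseline-subtracted return G_t − V(s_t); λ = 0 uses only the one-step TD error.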

PPO: Proximal Policy Optimization

PPO (Schulman et al., 2017) constrains policy updates to prevent destructive large steps. The clipped surrogate objective:

L^{CLIP}(θ) = E[min(r_t(θ)·A_t, clip(r_t(θ), 1-ε, 1+ε)·A_t)]

Where r_t(θ) = π_θ(a_t|s_t) / π_old(a_t|s_t) is the probability ratio between new and old policy.

| r_t(θ) value | A_t > 0 (good action) | A_t < 0 (bad action) |
|---|---|---|
| r_t = 1.0 | Gradient applies | Gradient applies |
| r_t = 1.3 (ε = 0.2) | Clipped (no further update) | Not clipped |
| r_t = 0.7 (ε = 0.2) | Not clipped | Clipped (no further update) |

The clip prevents the policy from deviating too far from the old policy in a single update step.
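The clipping behavior in the table can be checked directly for a single (log-probability, advantage) pair; real implementations vectorize this in torch or JAX, but the arithmetic is the same:

```python
# Scalar PPO clipped surrogate:
# L^CLIP = min(r * A, clip(r, 1 - eps, 1 + eps) * A),
# with r = exp(logp_new - logp_old).
import math

def ppo_clip_objective(logp_new, logp_old, advantage, eps=0.2):
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# Good action (A > 0), ratio 1.3: the clipped term caps the objective at 1.2*A.
capped = ppo_clip_objective(logp_new=math.log(1.3), logp_old=0.0, advantage=1.0)
# Bad action (A < 0), ratio 1.3: the unclipped term is smaller, so no clipping.
uncapped = ppo_clip_objective(logp_new=math.log(1.3), logp_old=0.0, advantage=-1.0)
```

Because the `min` takes the pessimistic branch, clipping only ever removes incentive to move further in a favorable direction; it never blocks gradients that correct a mistake.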

RLHF-Specific Reward Formulation

In RLHF, the reward combines the preference score with a KL penalty:

R_total(x, y) = r_φ(x, y) − β · KL[π_θ(·|x) || π_ref(·|x)]

Where:

  • r_φ(x, y) is the learned reward model score
  • β controls the strength of the KL penalty
  • π_ref is the supervised fine-tuned reference policy
  • KL divergence penalizes deviation from the reference policy

| β value | Effect |
|---|---|
| β = 0 | No KL constraint; policy collapses to reward hacking |
| β = 0.01 | Weak regularization; allows large policy changes |
| β = 0.1 | Standard; balances reward and policy stability |
| β = 1.0 | Strong constraint; limits adaptation to the reference |

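In practice the KL term is usually applied per token, using the log-probability ratio as a single-sample estimate of the KL divergence, with the reward model score added only at the final token. A sketch under those assumptions (function and argument names are illustrative, not from a specific library):

```python
# Per-token KL-shaped reward for RLHF:
# each token gets -beta * (log pi_theta - log pi_ref); the reward model
# score r_phi(x, y) is added at the final token only.
def kl_shaped_rewards(rm_score, logp_policy, logp_ref, beta=0.1):
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    rewards[-1] += rm_score
    return rewards

r = kl_shaped_rewards(rm_score=1.0,
                      logp_policy=[-1.0, -2.0, -0.5],
                      logp_ref=[-1.2, -1.8, -0.5],
                      beta=0.1)
```

Note the sign: tokens where the policy is more confident than the reference (logp_policy > logp_ref) are penalized, which is what keeps the policy anchored near π_ref.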
See rlhf for how PPO is applied in the full RLHF pipeline, gradient-descent for the underlying optimization methods, and alignment-problem for why RL is needed for alignment rather than supervised methods alone.


Frequently Asked Questions

What is the Markov Decision Process formalism?

An MDP is a tuple (S, A, P, R, γ) where S is a state space, A an action space, P(s'|s,a) a transition function, R(s,a) a reward function, and γ ∈ [0,1] a discount factor. An agent observes state s, takes action a, receives reward r, transitions to s', and repeats. The goal is to find a policy π(a|s) that maximizes expected discounted return E[Σ γ^t R_t]. In language model RL: states are token sequences so far, actions are next tokens, reward is the preference score from the reward model.

Why does vanilla policy gradient have high variance and how does PPO fix it?

The REINFORCE gradient estimator ∇J(θ) = E[G_t · ∇log π(a|s)] is unbiased but has high variance because G_t (return) can be large and noisy. PPO addresses this with: (1) advantage estimation using a learned value function V(s) as a baseline (A = G_t - V(s_t)); (2) clipped surrogate objective that bounds the policy update ratio r(θ) = π_θ(a|s)/π_old(a|s) to [1-ε, 1+ε]; (3) multiple gradient steps per rollout batch with early stopping. The clipping prevents catastrophically large policy updates that destabilize training.

What is the credit assignment problem in RL for language models?

In RLHF, a reward signal (human preference score) is given for an entire generated response (often 50–500 tokens). The credit assignment problem asks: which tokens in the response caused the high/low reward? With γ=1 and terminal reward, all tokens in the sequence receive the same return, making it difficult to identify which specific word choices were good or bad. This is why RLHF training is less sample-efficient than supervised learning — the reward signal is sparse and temporally delayed.
