Alignment is the process of tuning language models to follow human preferences — producing helpful, harmless, and honest outputs. This doc covers the theory and mathematics behind the major alignment methods.
Pretrained LLMs optimize for next-token prediction, not for following instructions or being safe. A model that maximizes P(next_token | context) may generate toxic content, hallucinate confidently, or simply continue the text while ignoring the user's intent. Alignment bridges the gap between “predicts text well” and “behaves as intended.”
Before alignment, models are typically fine-tuned on demonstration data:
- Input: instruction-response pairs from human annotators
- Output: a model that follows instructions in the demonstrated format
Supervised fine-tuning (SFT) provides the behavioral foundation — the model learns what good responses look like. Alignment then refines this using preference signals: which of two responses is better.
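The SFT step can be sketched as a masked cross-entropy that averages the negative log-likelihood over response tokens only, excluding the prompt. A minimal illustration with toy numbers; `sft_loss` and the values below are hypothetical, not from any particular library:

```python
def sft_loss(token_logprobs, response_mask):
    """Masked cross-entropy: average negative log-likelihood over
    response tokens only; prompt tokens are excluded from the loss."""
    losses = [-lp for lp, m in zip(token_logprobs, response_mask) if m]
    return sum(losses) / len(losses)

# Toy example: 3 prompt tokens (masked out) + 2 response tokens.
logprobs = [-0.1, -0.2, -0.3, -0.5, -0.7]   # log P(token | context)
mask     = [0, 0, 0, 1, 1]                  # 1 = response token
print(round(sft_loss(logprobs, mask), 3))   # 0.6
```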
RLHF begins by collecting human preferences: given a prompt and two model completions, annotators label which one is better.
The reward model r(x, y) is trained to predict these preferences using the Bradley-Terry model:
P(y_1 > y_2 | x) = σ(r(x, y_1) - r(x, y_2))
Loss function:
L_RM = -E_{(x, y_w, y_l) ~ D}[log σ(r(x, y_w) - r(x, y_l))]
The reward model is typically the same architecture as the LLM but with a scalar output head instead of a vocabulary head.
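The pairwise loss above can be sketched in a few lines. The `rm_loss` helper and the reward values are illustrative, not from any particular library:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def rm_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss: -log σ(r(x, y_w) - r(x, y_l)).
    Pushes the chosen completion's scalar reward above the rejected one's."""
    return -math.log(sigmoid(r_chosen - r_rejected))

# The loss shrinks as the reward margin grows:
print(round(rm_loss(1.0, 1.0), 3))  # 0.693  (no margin: log 2)
print(round(rm_loss(2.0, 0.0), 3))  # 0.127
```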
Maximize expected reward while staying close to the SFT policy:
max_θ E_{x~D, y~π_θ}[r(x, y)] - β · KL(π_θ || π_SFT)
PPO (Proximal Policy Optimization) implements this with clipped surrogate objectives:
L_PPO = E[min(ratio · A, clip(ratio, 1-ε, 1+ε) · A)]
where ratio = π_θ(a|s) / π_old(a|s) and A is the advantage estimate.
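The clipped surrogate for a single token can be sketched as follows (an illustrative helper, not a full PPO training loop):

```python
def ppo_clip(ratio, advantage, eps=0.2):
    """Per-token clipped surrogate: min(ratio*A, clip(ratio, 1-eps, 1+eps)*A).
    The min keeps the update pessimistic once the ratio leaves [1-eps, 1+eps]."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# Positive advantage: the objective stops growing once ratio > 1 + eps.
print(round(ppo_clip(1.5, 1.0), 3))   # 1.2  (clipped at 1 + eps)
# Negative advantage: the min keeps the full penalty rather than the
# clipped one, so large harmful updates are never hidden.
print(round(ppo_clip(1.5, -1.0), 3))  # -1.5
```

The asymmetry in the second case is the point of the `min`: clipping bounds how much the objective can improve per step, but never shrinks a penalty.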
Components in memory during RLHF training:
- Policy model: the LLM being trained
- Reference model: a frozen copy of the SFT policy, used for the KL penalty
- Reward model: frozen, scores each sampled completion
- Critic (value) model: trained alongside the policy to estimate advantages

This is why RLHF is memory-intensive: four models must coexist, often requiring multi-GPU setups.
DPO reformulates the RLHF objective to eliminate the reward model entirely. Key insight: the optimal policy under the KL-constrained reward maximization objective has a closed-form relationship with the reward:
r(x, y) = β · log(π_θ(y|x) / π_ref(y|x)) + C(x)

where π_ref is the frozen SFT policy and C(x) depends only on the prompt, so it cancels when two completions for the same prompt are compared.
Substituting back into the Bradley-Terry preference model:
L_DPO = -E_{(x, y_w, y_l)}[log σ(β · (log π_θ(y_w|x)/π_ref(y_w|x) - log π_θ(y_l|x)/π_ref(y_l|x)))]
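A minimal sketch of the per-pair DPO loss, assuming full-sequence log-probabilities have already been computed under both the policy and the frozen reference (`dpo_loss` and the numbers are illustrative):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log σ(β · [(log π_θ(y_w)/π_ref(y_w)) - (log π_θ(y_l)/π_ref(y_l))]).
    Only sequence log-probabilities are needed -- no reward model, no RL."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy identical to the reference: margin 0, loss = log 2.
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 3))  # 0.693
# Policy upweights the winner relative to the reference: loss drops.
print(round(dpo_loss(-8.0, -12.0, -10.0, -12.0), 4))   # 0.5981
```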
Advantages over RLHF:
- No reward model and no RL loop: a simple classification-style loss over preference pairs
- Only two models in memory (the policy and a frozen reference)
- More stable training with fewer hyperparameters

Disadvantages:
- Can overfit the preference data, since nothing bounds the implicit reward margin
- Purely offline: it learns from a static preference set with no on-policy sampling
IPO (Identity Preference Optimization) addresses DPO’s tendency to overfit the preference data: instead of pushing the log-ratio margin without bound, it regresses the margin toward a fixed target of 1/(2β):
L_IPO = E[(log π_θ(y_w|x)/π_ref(y_w|x) - log π_θ(y_l|x)/π_ref(y_l|x) - 1/(2β))²]
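The squared loss can be sketched the same way as the DPO example, with the logistic loss swapped for a regression onto the fixed target (`ipo_loss` and the numbers are illustrative):

```python
def ipo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Squared regression of the log-ratio margin onto the target 1/(2β).
    Unlike DPO's -log σ, the quadratic pulls the margin back toward a
    fixed target, so it cannot be driven to infinity on clean data."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return (margin - 1.0 / (2.0 * beta)) ** 2

# The loss is minimized exactly when the margin hits 1/(2β) = 5.0 here:
print(ipo_loss(-7.0, -12.0, -10.0, -10.0))  # 0.0
# A zero margin is penalized, not just left at a plateau:
print(ipo_loss(-10.0, -10.0, -10.0, -10.0))  # 25.0
```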
KTO (Kahneman-Tversky Optimization) works with unpaired binary feedback (thumbs up/down) instead of pairwise preferences. It is based on prospect theory: humans are more sensitive to losses than to gains.
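A simplified sketch of a KTO-style per-example loss. In the actual method the reference point is estimated from a batch KL term; here it is reduced to a plain `z_ref` argument, and all names and values are illustrative:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def kto_loss(logp, ref_logp, desirable, beta=0.1, z_ref=0.0,
             lam_d=1.0, lam_u=1.0):
    """KTO-style loss on a single unpaired example. Desirable examples
    are pushed above the reference point, undesirable ones below it --
    no paired completion for the same prompt is required."""
    margin = beta * (logp - ref_logp)
    if desirable:
        return lam_d * (1.0 - sigmoid(margin - z_ref))
    return lam_u * (1.0 - sigmoid(z_ref - margin))

# A thumbs-up example whose likelihood already exceeds the reference:
print(round(kto_loss(-8.0, -10.0, desirable=True), 3))  # 0.45
```

The λ weights implement the loss-aversion asymmetry: weighting undesirable examples more heavily than desirable ones mirrors prospect theory's claim that losses loom larger than gains.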
ORPO (Odds Ratio Preference Optimization) combines SFT and alignment into a single training stage by adding a penalty based on the odds ratio of the preferred vs. dispreferred response, so no separate reference model is kept in memory.
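A sketch of the odds-ratio term added on top of the ordinary SFT loss. Here p is a length-normalized likelihood, exp of the average per-token log-probability, and odds(y|x) = p / (1 − p); the helper names, λ value, and inputs are illustrative:

```python
import math

def orpo_penalty(avg_logp_w, avg_logp_l, lam=0.1):
    """Odds-ratio term: -λ · log σ(log odds(y_w|x) - log odds(y_l|x)).
    Rewards higher odds for the preferred response, using only the
    policy's own probabilities -- no frozen reference model."""
    def log_odds(avg_logp):
        p = math.exp(avg_logp)          # length-normalized likelihood
        return math.log(p / (1.0 - p))
    z = log_odds(avg_logp_w) - log_odds(avg_logp_l)
    return -lam * math.log(1.0 / (1.0 + math.exp(-z)))

def orpo_loss(sft_nll_w, avg_logp_w, avg_logp_l, lam=0.1):
    """Single-stage objective: SFT negative log-likelihood on the
    preferred response plus the weighted odds-ratio penalty."""
    return sft_nll_w + orpo_penalty(avg_logp_w, avg_logp_l, lam)
```

Because the penalty is computed from the policy alone, ORPO trains with a single model in memory, which is why it sits in the lowest compute row of the comparison table below.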
| Method | Training Data | Models in Memory | RL Required | Stability | Compute |
|---|---|---|---|---|---|
| RLHF (PPO) | Pairwise preferences | 4 | Yes | Low | Very High |
| DPO | Pairwise preferences | 2 | No | High | Medium |
| IPO | Pairwise preferences | 2 | No | High | Medium |
| KTO | Binary feedback | 2 | No | High | Medium |
| ORPO | Pairwise preferences | 1 | No | High | Low |
Alignment quality is bounded by preference data quality. Best practices:
- Write clear annotation guidelines and measure inter-annotator agreement
- Cover a diverse prompt distribution, including adversarial and edge cases
- Filter or down-weight pairs where annotators disagree
- Watch for length bias: annotators and reward models alike tend to favor longer responses
See NeMo Aligner for implementation details in NVIDIA’s stack.