NeMo Aligner is NVIDIA’s scalable toolkit for aligning large language models with human preferences. It integrates with Megatron-Core for distributed training and supports multiple alignment algorithms.
Pretrained and supervised fine-tuned (SFT) models generate fluent text but may produce harmful, biased, or unhelpful responses. Alignment tunes model behavior to be helpful, harmless, and honest using human feedback signals.
The typical post-training pipeline:
Pretrained Model → SFT (demonstrations) → Alignment (preferences) → Production
The classic three-stage alignment pipeline (RLHF): (1) SFT on demonstrations, (2) reward model training on human preference pairs, (3) PPO policy optimization against the trained reward model.
Reward model loss (Bradley-Terry preference model):
L_RM = -E[log σ(r(x, y_chosen) - r(x, y_rejected))]
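The Bradley-Terry loss above is just the negative log-sigmoid of the reward margin between the chosen and rejected completions. A minimal sketch for a single preference pair (plain Python, no framework; the function name is illustrative, not NeMo Aligner's API):

```python
import math

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-sigmoid of the reward margin for one preference pair."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A larger margin in favor of the chosen response gives a smaller loss;
# a reversed ordering (rejected scored higher) is penalized heavily.
loss_correct = bradley_terry_loss(2.0, -1.0)
loss_reversed = bradley_terry_loss(-1.0, 2.0)
```

With a zero margin the loss is log 2 ≈ 0.693, so minimizing it pushes the reward model to separate chosen from rejected completions.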
PPO objective with KL penalty:
L_PPO = -E[reward(x, y)] + β · KL(π_θ || π_SFT)
The KL term prevents the policy from deviating too far from the SFT model, avoiding reward hacking.
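In practice the KL penalty is often folded into the per-sample reward using the log-probability ratio as a single-sample KL estimate. A hedged sketch of that shaping step (illustrative names, not NeMo Aligner's internals):

```python
def kl_shaped_reward(reward: float,
                     logp_policy: float,
                     logp_sft: float,
                     beta: float = 0.1) -> float:
    """Reward shaped with a KL penalty: r - beta * (log pi_theta - log pi_SFT).

    (log pi_theta - log pi_SFT) under samples from pi_theta is a
    single-sample estimate of KL(pi_theta || pi_SFT).
    """
    return reward - beta * (logp_policy - logp_sft)

# When the policy matches the SFT model, the penalty vanishes.
r_same = kl_shaped_reward(1.0, logp_policy=-5.0, logp_sft=-5.0)
# When the policy assigns much higher probability than the SFT model
# (i.e. it has drifted), the effective reward is reduced.
r_drift = kl_shaped_reward(1.0, logp_policy=-2.0, logp_sft=-5.0)
```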
Direct Preference Optimization (DPO) eliminates the reward model by directly optimizing the policy on preference pairs:
L_DPO = -E[log σ(β · (log π_θ(y_w|x)/π_ref(y_w|x) - log π_θ(y_l|x)/π_ref(y_l|x)))]
Where y_w = chosen completion, y_l = rejected completion, π_ref = reference (SFT) model.
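Given sequence log-probabilities under the policy and the frozen reference model, the DPO loss for one pair reduces to a log-sigmoid of the scaled log-ratio difference. A minimal sketch (illustrative function, not NeMo Aligner's API):

```python
import math

def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair from sequence log-probs.

    logp_*      : log pi_theta(y|x) under the trained policy
    ref_logp_*  : log pi_ref(y|x) under the frozen SFT reference
    """
    logits = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# At initialization the policy equals the reference, so the loss is log 2.
loss_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Shifting probability mass toward the chosen completion lowers the loss.
loss_better = dpo_loss(-9.0, -13.0, -10.0, -12.0)
```

Note that only log-probabilities are needed, so the reward model and the PPO rollout loop both disappear; this is why DPO needs just two models in memory.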
SteerLM uses attribute-conditioned generation: annotate training data with quality attributes (helpfulness, correctness, coherence) and condition generation on desired attribute values at inference time. This offers finer-grained control than binary preference methods.
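Conditioning amounts to serializing the desired attribute values into the prompt the model was trained on. A hypothetical template to illustrate the idea (the actual SteerLM prompt format differs; names here are assumptions):

```python
def attribute_conditioned_prompt(user_prompt: str, attributes: dict) -> str:
    """Prepend desired attribute values to a prompt.

    Hypothetical template for illustration only; the real SteerLM
    training format is defined by the NeMo Aligner data pipeline.
    """
    attr_str = ",".join(f"{k}:{v}" for k, v in sorted(attributes.items()))
    return f"<attributes>{attr_str}</attributes>\n{user_prompt}"

prompt = attribute_conditioned_prompt(
    "Explain quantum computing in simple terms.",
    {"helpfulness": 4, "correctness": 4, "coherence": 4},
)
```

At inference time, raising or lowering an attribute value steers generation without retraining, which is the finer-grained control mentioned above.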
| Method | Reward Model | Training Stability | Compute Cost | Implementation Complexity |
|---|---|---|---|---|
| RLHF (PPO) | Required | Lower (RL instability) | High (3 models in memory) | High |
| DPO | Not needed | Higher | Medium (2 models) | Medium |
| SteerLM | Not needed | High | Medium | Medium |
{"prompt": ..., "chosen": ..., "rejected": ...}{
"prompt": "Explain quantum computing in simple terms.",
"chosen": "Quantum computing uses quantum bits (qubits) that can exist in multiple states simultaneously...",
"rejected": "Quantum computing is a type of computing that uses quantum mechanics. It is very complicated..."
}
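A quick sanity check on a JSONL preference file can catch schema problems before training. A minimal sketch, assuming one JSON record per line with the three keys shown above (the validator itself is illustrative, not part of NeMo Aligner):

```python
import json

REQUIRED_KEYS = {"prompt", "chosen", "rejected"}

def validate_preference_records(jsonl_text: str) -> list:
    """Parse JSONL preference data, requiring the three keys per record."""
    records = []
    for lineno, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"line {lineno}: missing keys {sorted(missing)}")
        records.append(record)
    return records

sample = (
    '{"prompt": "Explain quantum computing in simple terms.", '
    '"chosen": "Quantum computing uses qubits...", '
    '"rejected": "It is very complicated..."}'
)
records = validate_preference_records(sample)
```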