nvidia-generative-ai-notes

Knowledge Distillation

Knowledge Distillation (KD) trains a smaller student model to mimic a larger teacher model.

Instead of training on only ground-truth labels, the student also learns from:

- The teacher's output logits (soft targets)
- The teacher's intermediate hidden states
- The teacher's attention maps

NeMo supports distillation in:

- Megatron Core-based GPT models (logit distillation), via the NVIDIA TensorRT Model Optimizer (ModelOpt) integration

Typical Approaches

Logit Distillation (Most Common)

Loss function (soft targets use a temperature T; the KL term is scaled by T² so gradient magnitudes stay comparable across temperatures):

Loss = α * CE(student_logits, labels)
     + β * T² * KL(softmax(teacher_logits / T) || softmax(student_logits / T))
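The loss above can be sketched in PyTorch as follows. This is a minimal illustration, not NeMo's implementation; the function name `kd_loss` and the default hyperparameters are my own choices.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, alpha=0.5, beta=0.5, T=2.0):
    """Hard-label cross-entropy plus temperature-scaled KL distillation loss."""
    ce = F.cross_entropy(student_logits, labels)
    # KL(teacher || student) over temperature-softened distributions;
    # the T**2 factor compensates for the 1/T softening of the gradients.
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.log_softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
        log_target=True,
    ) * (T ** 2)
    return alpha * ce + beta * kl
```

With identical student and teacher logits the KL term vanishes, which is a quick sanity check when wiring this up.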

Hidden State Matching

Student matches:

- Hidden states of selected teacher layers (typically with an MSE or cosine loss)
- Teacher activations through a learned linear projection when student and teacher hidden sizes differ
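A minimal sketch of hidden-state matching with a projection, assuming the student's hidden size is smaller than the teacher's. The class name `HiddenStateMatcher` is illustrative, not a NeMo API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HiddenStateMatcher(nn.Module):
    """Projects student hidden states to the teacher's width, then applies MSE."""

    def __init__(self, student_dim, teacher_dim):
        super().__init__()
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_hidden, teacher_hidden):
        # The teacher's states are a fixed target: detach so no gradient
        # flows back into the (frozen) teacher.
        return F.mse_loss(self.proj(student_hidden), teacher_hidden.detach())
```

Usage: with hidden tensors of shape (batch, seq, dim), `HiddenStateMatcher(student_dim, teacher_dim)(s, t)` returns a scalar loss whose gradient updates only the student and the projection.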

Attention Distillation

The student is trained to reproduce the teacher's attention distributions, e.g. with a KL or MSE loss over per-head attention maps.
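As an illustration, a KL variant over attention maps might look like the following. The helper name `attention_kd_loss` and the tensor layout are assumptions for the sketch.

```python
import torch
import torch.nn.functional as F

def attention_kd_loss(student_attn, teacher_attn, eps=1e-8):
    """KL divergence from teacher to student attention distributions.

    Both tensors have shape (batch, heads, query_len, key_len) and each
    row along the last dimension is a softmax-normalized distribution.
    """
    # eps guards against log(0) on exactly-zero attention weights.
    return F.kl_div((student_attn + eps).log(), teacher_attn,
                    reduction="batchmean")
```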

NeMo/Megatron Core Setup

Teacher Model (eval mode) - weights frozen
       ↓
Generate logits
       ↓
Student Model (train mode)
       ↓
Compute:
  CE loss + KL divergence
       ↓
Backprop on student only
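The flow above can be sketched as one training step. Tiny linear models stand in for the teacher and student; `distillation_step` is a hypothetical helper, not NeMo code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_step(teacher, student, optimizer, inputs, labels,
                      alpha=0.5, beta=0.5, T=2.0):
    """One step of the flow above: frozen teacher forward, student forward,
    CE + KL loss, then backprop on the student only."""
    teacher.eval()
    with torch.no_grad():                       # teacher weights stay frozen
        teacher_logits = teacher(inputs)
    student.train()
    student_logits = student(inputs)
    loss = (alpha * F.cross_entropy(student_logits, labels)
            + beta * (T ** 2) * F.kl_div(
                  F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean"))
    optimizer.zero_grad()
    loss.backward()                             # gradients reach student only
    optimizer.step()
    return loss.item()
```

Because the teacher forward runs under `torch.no_grad()` and the optimizer only holds the student's parameters, the teacher's weights are provably untouched after the step.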

Code

Knowledge Distillation with Gemma2
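A hedged sketch of sequence-level logit distillation for causal LMs such as Gemma 2, assuming a Hugging Face Transformers setup. The function name `causal_lm_kd_loss` is my own, and the checkpoint names in the comments ("google/gemma-2-9b" teacher, "google/gemma-2-2b" student) are one plausible teacher/student pairing, not a prescribed one.

```python
import torch
import torch.nn.functional as F

def causal_lm_kd_loss(student_logits, teacher_logits, labels,
                      alpha=0.5, beta=0.5, T=2.0, ignore_index=-100):
    """KD loss over next-token logits of shape (batch, seq, vocab).

    Padding positions are masked from the CE term via ignore_index.
    """
    vocab = student_logits.size(-1)
    flat_s = student_logits.reshape(-1, vocab)
    flat_t = teacher_logits.reshape(-1, vocab)
    ce = F.cross_entropy(flat_s, labels.reshape(-1), ignore_index=ignore_index)
    # Flattening first makes "batchmean" average over tokens, not batches.
    kl = F.kl_div(F.log_softmax(flat_s / T, dim=-1),
                  F.softmax(flat_t / T, dim=-1),
                  reduction="batchmean") * (T ** 2)
    return alpha * ce + beta * kl

if __name__ == "__main__":
    # Hypothetical model loading (heavy; not run here):
    # from transformers import AutoModelForCausalLM, AutoTokenizer
    # teacher = AutoModelForCausalLM.from_pretrained("google/gemma-2-9b")
    # student = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b")
    # Feed the same batch through both models, then call causal_lm_kd_loss
    # on the shifted logits and labels.
    pass
```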