Knowledge Distillation (KD): training a smaller student model to mimic a larger teacher model.
Instead of training on ground-truth labels alone, the student also learns from the teacher's soft predictions (its output logit distribution).
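To make "soft predictions" concrete, here is a minimal sketch (PyTorch assumed; the logit values and temperature are made up for illustration):

```python
import torch
import torch.nn.functional as F

# Hypothetical teacher logits for one example over 4 classes.
teacher_logits = torch.tensor([4.0, 1.5, 0.5, -1.0])

hard_label = teacher_logits.argmax()                    # single class index: tensor(0)
soft_targets = F.softmax(teacher_logits / 2.0, dim=-1)  # temperature T=2 softens the distribution

# soft_targets is roughly [0.65, 0.19, 0.11, 0.05]: unlike the hard label,
# it also tells the student how the teacher ranks the other classes.
```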
NeMo supports knowledge distillation as part of its training workflows.
Loss function

Loss = α * CE(student_logits, labels) + β * KL(student_logits, teacher_logits)

The student matches both the ground-truth labels (via the cross-entropy term) and the teacher's soft logit distribution (via the KL term).
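A minimal sketch of this loss in PyTorch (the function name, the α/β weights, and the temperature T are illustrative choices, not a NeMo API):

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, alpha=0.5, beta=0.5, T=2.0):
    # Hard-label term: standard cross-entropy against the ground truth.
    ce = F.cross_entropy(student_logits, labels)

    # Soft-label term: KL divergence between temperature-softened distributions.
    # F.kl_div expects log-probabilities as input and probabilities as target.
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradient magnitudes stay comparable across temperatures

    return alpha * ce + beta * kl
```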
Attention Distillation
In attention distillation, the student mimics the teacher's attention maps in addition to its output logits; a sketch follows.
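A minimal sketch of an attention-matching loss (assuming per-layer attention maps of shape [batch, heads, seq, seq]; the MSE formulation and layer pairing shown here are one common choice, not a specific NeMo implementation):

```python
import torch
import torch.nn.functional as F

def attention_distillation_loss(student_attns, teacher_attns):
    # student_attns / teacher_attns: lists of attention-probability tensors of
    # shape [batch, heads, seq, seq], taken from matched layers (a layer-mapping
    # step is assumed when the student has fewer layers than the teacher).
    losses = [F.mse_loss(s, t) for s, t in zip(student_attns, teacher_attns)]
    return torch.stack(losses).mean()
```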
Training flow:

Teacher Model (eval mode) - weights frozen
↓
Generate logits
↓
Student Model (train mode)
↓
Compute:
CE loss + KL divergence
↓
Backprop on student only
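Putting the flow above into code, a single training step might look like this sketch (PyTorch assumed; kd_loss is the illustrative helper defined earlier):

```python
import torch

def distillation_step(teacher, student, optimizer, inputs, labels):
    # Teacher stays frozen in eval mode; no gradients flow through it.
    teacher.eval()
    with torch.no_grad():
        teacher_logits = teacher(inputs)

    # Student runs in train mode and is the only model that gets updated.
    student.train()
    student_logits = student(inputs)

    # Combined loss: cross-entropy on labels + KL against the teacher's logits.
    loss = kd_loss(student_logits, teacher_logits, labels)

    optimizer.zero_grad()
    loss.backward()   # backprop on the student only
    optimizer.step()
    return loss.item()
```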