Quantization reduces the numerical precision of model weights and activations — from FP32/FP16 to INT8, INT4, or FP8 — to decrease memory footprint, increase throughput, and reduce energy consumption with minimal accuracy loss.
A 70B parameter model in FP16 requires ~140 GB of memory for weights alone. Quantization changes the equation:
| Precision | Bits | 70B Model Size | Memory Savings |
|---|---|---|---|
| FP32 | 32 | 280 GB | Baseline |
| FP16/BF16 | 16 | 140 GB | 2x |
| FP8 | 8 | 70 GB | 4x |
| INT8 | 8 | 70 GB | 4x |
| INT4 | 4 | 35 GB | 8x |
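As a sanity check on the table, each size follows directly from bits per parameter. A quick sketch (weights only, ignoring KV cache and activations):

```python
# memory = n_params * bits / 8 bytes, reported in decimal GB (1e9 bytes)
N_PARAMS = 70e9

for name, bits in [("FP32", 32), ("FP16/BF16", 16), ("FP8/INT8", 8), ("INT4", 4)]:
    gb = N_PARAMS * bits / 8 / 1e9
    print(f"{name:10s} {gb:6.0f} GB  ({32 // bits}x smaller than FP32)")
```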
Beyond the memory savings, lower precision increases arithmetic throughput and reduces energy per operation. The underlying mapping is a simple affine transform:
Q(x) = round(x / scale) + zero_point
x̂ = (Q(x) - zero_point) × scale
Scale maps the floating-point range to the integer range. Zero point handles asymmetric distributions.
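A minimal round-trip of the two formulas above, using asymmetric INT8 (the `quantize`/`dequantize` helpers and the sample tensor are illustrative):

```python
import numpy as np

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    # Q(x) = round(x / scale) + zero_point, clipped to the integer range
    q = np.round(x / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.int8)

def dequantize(q, scale, zero_point):
    # x_hat = (Q(x) - zero_point) * scale
    return (q.astype(np.float32) - zero_point) * scale

# Asymmetric INT8: map [x_min, x_max] onto [-128, 127]
x = np.array([-0.4, 0.0, 1.3, 2.6], dtype=np.float32)
x_min, x_max = x.min(), x.max()
scale = (x_max - x_min) / 255.0
zero_point = int(round(-128 - x_min / scale))

q = quantize(x, scale, zero_point)
x_hat = dequantize(q, scale, zero_point)
print(q, x_hat)  # per-element reconstruction error is bounded by scale/2
```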
Post-training quantization (PTQ) quantizes a trained model without retraining. It requires only a small calibration dataset (128-512 samples) to estimate activation ranges.
Workflow:
1. Load the trained model.
2. Run the calibration samples through it to collect activation statistics.
3. Compute scales and zero points from those statistics.
4. Quantize the weights and activations and export the model.

Pros: Fast (minutes to hours), no training infrastructure needed.
Cons: Accuracy may degrade, especially at INT4.
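PTQ's calibration step can be sketched as a running-amax pass over the sample set, from which a symmetric INT8 scale falls out (the batch data and helper name here are illustrative):

```python
import numpy as np

def calibrate_amax(batches):
    """Track the running max |activation| over a small calibration set."""
    amax = 0.0
    for x in batches:
        amax = max(amax, float(np.abs(x).max()))
    return amax

# Hypothetical calibration set: 128 random activation batches
rng = np.random.default_rng(0)
calib = [rng.normal(size=(32, 64)).astype(np.float32) for _ in range(128)]

amax = calibrate_amax(calib)
scale = amax / 127.0  # symmetric INT8: map [-amax, amax] -> [-127, 127]
print(f"amax={amax:.3f}, scale={scale:.5f}")
```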
Quantization-aware training (QAT) simulates quantization during training so the model learns to be robust to reduced precision.
Mechanism: Insert fake quantization operators in the forward pass:
Forward: x → quantize → dequantize → next_layer (simulates quantization error)
Backward: Straight-Through Estimator (STE) — gradients pass through unchanged
Pros: Best accuracy retention, even at INT4.
Cons: Requires a full training run, expensive.
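The fake-quantization operator above can be sketched in a few lines (forward pass only; in a real framework the backward pass applies the STE, noted in the comment):

```python
import numpy as np

def fake_quantize(x, scale, qmin=-128, qmax=127):
    """Forward: quantize then immediately dequantize, so the next layer
    sees the quantization error while everything stays in floating point."""
    q = np.clip(np.round(x / scale), qmin, qmax)
    return q * scale
    # Backward (STE): d(fake_quantize)/dx is treated as 1 inside [qmin, qmax],
    # so gradients flow through the non-differentiable round() unchanged.

x = np.linspace(-1, 1, 5).astype(np.float32)
x_q = fake_quantize(x, scale=1 / 127)
print(x_q - x)  # the simulated quantization error the model learns to absorb
```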
Problem: Activations have outlier channels with large magnitudes, making them hard to quantize; weights are smooth and easy to quantize.
Solution (SmoothQuant): migrate quantization difficulty from activations to weights:
Y = X · W = (X · diag(s)^-1) · (diag(s) · W) = X̂ · Ŵ
Where s is a per-channel smoothing factor: s_j = max(|X_j|)^α / max(|W_j|)^(1-α), with α ∈ [0, 1].
After smoothing, both activations and weights are within quantizable ranges. Typical α = 0.5.
Result: W8A8 (8-bit weights, 8-bit activations) with near-lossless accuracy.
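The identity behind the smoothing trick can be checked numerically. A toy sketch (shapes and the outlier factor are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))
X[:, 2] *= 50.0                      # outlier activation channel
W = rng.normal(size=(4, 3))

alpha = 0.5
# s_j = max|X_j|^alpha / max|W_j|^(1-alpha), per input channel j
s = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=1) ** (1 - alpha)

X_hat = X / s                        # X . diag(s)^-1
W_hat = s[:, None] * W               # diag(s) . W

# The product is mathematically unchanged...
assert np.allclose(X_hat @ W_hat, X @ W)
# ...but the activation outlier channel has been tamed:
print(np.abs(X).max(axis=0))         # before smoothing
print(np.abs(X_hat).max(axis=0))     # after smoothing
```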
Key insight behind AWQ (Activation-aware Weight Quantization): not all weight channels are equally important. Channels corresponding to large activation magnitudes have a disproportionate impact on output quality.
Method:
s* = argmin_s ||Q(W · diag(s)) · diag(s)^-1 · X - W · X||
Result: INT4 weight quantization with minimal perplexity degradation. Works without retraining.
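A toy version of the scale search, using per-tensor round-to-nearest INT4 and a grid over α to pick the scales (helper names and sizes are illustrative; the real method searches per-group scales and uses proper calibration data):

```python
import numpy as np

def int4_rtn(w):
    """Round-to-nearest INT4 (per-tensor, symmetric) quantize-dequantize."""
    scale = np.abs(w).max() / 7.0
    return np.clip(np.round(w / scale), -8, 7) * scale

rng = np.random.default_rng(1)
X = rng.normal(size=(64, 16)); X[:, 0] *= 30.0   # salient activation channel
W = rng.normal(size=(16, 8))

ref = X @ W
amax = np.abs(X).max(axis=0)

best = None
for alpha in np.linspace(0.0, 1.0, 21):          # simple grid search
    s = amax ** alpha                            # activation-aware scales
    q_err = np.linalg.norm((X / s) @ int4_rtn(s[:, None] * W) - ref)
    if best is None or q_err < best[0]:
        best = (q_err, alpha)
print(f"best alpha={best[1]:.2f}, error={best[0]:.3f}")
```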
Based on Optimal Brain Compression (OBC), GPTQ quantizes weights layer by layer using second-order information:
H = 2X^T X (from calibration data)
w_q = argmin_q (w - q)² / [H^-1]_qq
δ_remaining = -(w - w_q) / [H^-1]_qq · (H^-1)_q,:
Result: INT4/INT3 with very low accuracy loss. Slower calibration than AWQ but sometimes better accuracy.
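A simplified single-row sketch of the update rule: quantize one column at a time and spread its error onto the remaining columns via H^-1 (the uniform toy codebook and damping constant are illustrative; the real implementation works block-wise with a Cholesky factorization):

```python
import numpy as np

def gptq_row(w, Hinv, quant):
    """Quantize a weight row column-by-column, compensating the
    not-yet-quantized columns with the H^-1-weighted error."""
    w = w.copy()
    q = np.empty_like(w)
    for i in range(len(w)):
        q[i] = quant(w[i])
        err = (w[i] - q[i]) / Hinv[i, i]
        w[i + 1:] -= err * Hinv[i, i + 1:]   # delta on remaining weights
    return q

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 16))               # calibration activations
w = rng.normal(size=16)
H = 2 * X.T @ X + 1e-2 * np.eye(16)          # damped Hessian
Hinv = np.linalg.inv(H)

grid = np.linspace(-2, 2, 16)                # toy 4-bit codebook
quant = lambda v: grid[np.abs(grid - v).argmin()]

q_gptq = gptq_row(w, Hinv, quant)
q_rtn = np.array([quant(v) for v in w])      # plain round-to-nearest
err = lambda q: np.linalg.norm(X @ (w - q))  # layer-output reconstruction error
print("gptq:", err(q_gptq), " rtn:", err(q_rtn))
```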
H100 GPUs support native FP8 (E4M3 for forward, E5M2 for backward) via Transformer Engine:
FP8 forward pass:
1. Track amax of input activation
2. Compute scale = FP8_MAX / amax
3. Quantize input to FP8: x_fp8 = cast_to_fp8(x * scale)
4. Matrix multiply in FP8
5. Output in FP16/BF16
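The five steps above can be simulated in plain NumPy (a sketch: the cast is approximated by clipping, whereas a real E4M3 cast also rounds the mantissa to 3 bits):

```python
import numpy as np

E4M3_MAX = 448.0  # largest representable E4M3 magnitude

def fp8_scaled_cast(x):
    amax = np.abs(x).max()          # 1. track amax of the input
    scale = E4M3_MAX / amax         # 2. scale = FP8_MAX / amax
    x_scaled = x * scale            # 3. bring values into FP8 range
    x_fp8 = np.clip(x_scaled, -E4M3_MAX, E4M3_MAX)  # 4. simulated cast
    return x_fp8, scale             # keep scale to undo after the FP8 matmul

x = np.random.default_rng(0).normal(size=(4, 8)).astype(np.float32)
x_fp8, scale = fp8_scaled_cast(x)
print(np.abs(x_fp8).max())          # ~448 by construction
```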
Integration: Megatron-Core enables FP8 with --fp8-format hybrid --transformer-impl transformer_engine.
Benefit: ~2x throughput over BF16 on H100 with minimal accuracy loss for training.
NVIDIA’s unified quantization toolkit, distributed as `nvidia-modelopt` (the TensorRT Model Optimizer):
```python
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_tensorrt_llm_checkpoint

# PTQ with INT4 AWQ; `calibrate` is a forward-loop callable that runs
# the calibration samples through the model
model = mtq.quantize(model, mtq.INT4_AWQ_CFG, forward_loop=calibrate)

# Export a TensorRT-LLM checkpoint ("llama" here is an example decoder type)
export_tensorrt_llm_checkpoint(model, decoder_type="llama",
                               export_dir="quantized_checkpoint/")
```
| Use Case | Recommended Method | Why |
|---|---|---|
| Training speedup on H100 | FP8 (Transformer Engine) | Native hardware support, minimal accuracy loss |
| Inference INT8 | SmoothQuant | Best W8A8 accuracy, well-supported |
| Inference INT4 (quality focus) | AWQ or GPTQ | Near-lossless at 4-bit |
| Inference INT4 (speed focus) | AWQ | Faster calibration, good TRT-LLM support |
| Maximum accuracy at low precision | QAT | Retraining compensates quantization error |
| Quick deployment | PTQ (basic) | Fastest, good enough for many workloads |
```
Training (FP8 via Transformer Engine in Megatron-Core)
        ↓
Post-Training Quantization (modelopt)
        ↓
Engine Build (TensorRT-LLM with quantized weights)
        ↓
Deployment (Triton / NIM)
```
See TensorRT-LLM for inference engine compilation with quantized models.