The NVIDIA generative AI stack is hardware-aware by design. Understanding GPU architectures, interconnects, and system topologies is essential for choosing optimal parallelism strategies and achieving peak training/inference throughput.
The current workhorses for LLM training and inference are the Hopper-generation H100 and H200:
| Spec | H100 SXM | H200 SXM |
|---|---|---|
| Tensor Cores | 4th gen (FP8 support) | 4th gen (FP8 support) |
| HBM | 80 GB HBM3 | 141 GB HBM3e |
| Memory bandwidth | 3.35 TB/s | 4.8 TB/s |
| FP16 Tensor | 989 TFLOPS | 989 TFLOPS |
| FP8 Tensor | 1,979 TFLOPS | 1,979 TFLOPS |
| NVLink | 900 GB/s | 900 GB/s |
| TDP | 700W | 700W |
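One way to read this table: the ratio of peak FLOPS to memory bandwidth (the roofline "ridge point") tells you how many FLOPs a kernel must perform per byte of memory traffic to be compute-bound rather than memory-bound. A quick sketch using the table's numbers (illustrative; dense-FLOPS figures as listed):

```python
# Ridge point = peak FLOPS / memory bandwidth, in FLOPs per byte.
# Kernels with lower arithmetic intensity are memory-bound.
specs = {
    "H100": {"fp16_flops": 989e12, "bw": 3.35e12},  # FLOP/s, bytes/s
    "H200": {"fp16_flops": 989e12, "bw": 4.8e12},
}

for name, s in specs.items():
    ridge = s["fp16_flops"] / s["bw"]
    print(f"{name}: compute-bound above ~{ridge:.0f} FLOP/byte")

# H100 lands around 295 FLOP/byte, H200 around 206: same compute,
# more bandwidth, so H200 mainly helps memory-bound work such as
# LLM decode, where intensity per token is low.
```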
Key features:
- 4th-generation Tensor Cores with native FP8 support via the Transformer Engine
- NVLink 4.0 at 900 GB/s for intra-node GPU-to-GPU communication
- H200 is the same Hopper silicon with larger, faster memory (141 GB HBM3e at 4.8 TB/s), which chiefly benefits memory-bound inference workloads
The next-generation Blackwell architecture:
| Spec | B200 |
|---|---|
| Tensor Cores | 5th gen (FP4/FP6 support) |
| HBM | 192 GB HBM3e |
| Memory bandwidth | 8 TB/s |
| FP8 Tensor | ~4,500 TFLOPS |
| FP4 Tensor | ~9,000 TFLOPS |
| NVLink | 1,800 GB/s |
Key improvements:
- 5th-generation Tensor Cores add FP4/FP6 data types for inference
- Roughly 2.3x the FP8 throughput of H100 and double the NVLink bandwidth
- 192 GB of HBM3e at 8 TB/s, about 2.4x H100's memory bandwidth
CPU + GPU unified architecture: the Grace Hopper Superchip (GH200) pairs a 72-core Grace Arm CPU with a Hopper GPU over the NVLink-C2C interconnect (900 GB/s), giving the GPU cache-coherent access to CPU memory for models and KV caches that overflow HBM.
Built into H100 and later GPUs, Transformer Engine automatically manages FP8 precision during training:
It tracks the absolute maximum (amax) of each tensor and computes a per-tensor scaling factor, scale = FP8_MAX / amax, which maps the tensor's observed range onto the FP8 representable range. Two FP8 formats are used:

| Format | Exponent | Mantissa | Range | Precision | Use |
|---|---|---|---|---|---|
| E4M3 | 4 bits | 3 bits | ±448 | Higher | Forward pass |
| E5M2 | 5 bits | 2 bits | ±57344 | Lower | Backward pass (gradients) |
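The range limits in the table follow directly from the bit layouts. A minimal sketch of both the format maxima and the scaling step (the layout rules below are the standard FP8 conventions: E5M2 reserves its top exponent for IEEE-style inf/NaN, E4M3 reserves only the all-ones code for NaN; the amax history values are made-up examples):

```python
def fp8_max(e_bits: int, m_bits: int, ieee_specials: bool) -> float:
    """Largest finite value representable in an FP8 format."""
    bias = 2 ** (e_bits - 1) - 1
    if ieee_specials:
        # E5M2: the top exponent code is reserved for inf/NaN
        max_exp = (2 ** e_bits - 2) - bias
        max_mantissa = 2 - 2 ** -m_bits           # 1.11...1
    else:
        # E4M3: only the all-ones encoding is NaN, so the top exponent
        # is usable but its largest mantissa pattern is excluded
        max_exp = (2 ** e_bits - 1) - bias
        max_mantissa = 2 - 2 * 2 ** -m_bits       # 1.11...0
    return max_mantissa * 2 ** max_exp

E4M3_MAX = fp8_max(4, 3, ieee_specials=False)     # 448.0
E5M2_MAX = fp8_max(5, 2, ieee_specials=True)      # 57344.0

# Delayed scaling: track amax over a sliding window, take the max,
# and map the observed range onto the FP8 representable range.
amax_history = [0.9, 1.7, 2.1, 1.3]               # per-tensor abs-max values
amax = max(amax_history)
scale = E4M3_MAX / amax                           # quantize: x_fp8 = x * scale
```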
Transformer Engine is integrated into Megatron-Core. Enable with:
```bash
# In Megatron-Core config
--fp8-format hybrid              # E4M3 forward, E5M2 backward
--fp8-amax-history-len 1024      # sliding window for amax tracking
--fp8-amax-compute-algo max      # how to compute amax from the history
--transformer-impl transformer_engine
```
Result: ~2x training throughput on H100 vs. BF16, with minimal accuracy loss for most architectures.
High-bandwidth, low-latency GPU-to-GPU interconnect:
| Generation | Per-GPU Bandwidth | GPUs Connected |
|---|---|---|
| NVLink 3.0 (A100) | 600 GB/s | Up to 8 via NVSwitch |
| NVLink 4.0 (H100) | 900 GB/s | Up to 8 via NVSwitch |
| NVLink 5.0 (B200) | 1,800 GB/s | Up to 8 via NVSwitch |
Use case: Tensor Parallelism within a node. TP requires all-reduce after every layer, demanding highest bandwidth and lowest latency.
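To see why TP needs the fastest link: a ring all-reduce moves roughly 2·(N−1)/N bytes per byte of payload, and TP issues one per layer. A rough time estimate under the idealized ring cost model (latency ignored; the activation shape is a made-up example):

```python
def allreduce_time(payload_bytes: float, n_gpus: int, link_bw: float) -> float:
    """Idealized ring all-reduce: each GPU sends and receives
    2*(N-1)/N of the buffer at link bandwidth. Returns seconds."""
    return 2 * (n_gpus - 1) / n_gpus * payload_bytes / link_bw

# Hypothetical activation tensor: batch 8 x seq 4096 x hidden 8192, BF16
act = 8 * 4096 * 8192 * 2                   # bytes (~512 MiB)
t_nvlink = allreduce_time(act, 8, 900e9)    # TP=8 over NVLink 4.0
t_ib     = allreduce_time(act, 8, 50e9)     # same op over one NDR IB link

# The IB version is 18x slower (the 900/50 bandwidth ratio), which is
# why TP stays inside the NVLink domain.
print(f"NVLink: {t_nvlink*1e3:.2f} ms, InfiniBand: {t_ib*1e3:.2f} ms")
```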
Chip-level switch that provides full-bandwidth NVLink connectivity between all GPUs in a node:
```
Without NVSwitch: GPU0 ←→ GPU1 ←→ GPU2                        (ring, limited paths)
With NVSwitch:    GPU0 ←→ GPU1, GPU0 ←→ GPU2, GPU1 ←→ GPU2   (full mesh)
```
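Why a switch rather than direct point-to-point wiring: full connectivity among N GPUs needs N·(N−1)/2 links, and each GPU's fixed NVLink budget would be split N−1 ways. A quick count:

```python
def mesh_links(n: int) -> int:
    """Number of pairwise connections in a full mesh of n GPUs."""
    return n * (n - 1) // 2

print(mesh_links(8))  # 28 pairwise connections for an 8-GPU node

# Direct wiring would split each GPU's 900 GB/s across 7 peers
# (~128 GB/s per pair); NVSwitch instead lets any pair communicate
# at the full per-GPU NVLink rate.
```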
Node-to-node networking for multi-node training:
| Generation | Bandwidth (per port) | Latency |
|---|---|---|
| HDR | 200 Gbps | ~1 μs |
| NDR | 400 Gbps | ~1 μs |
| XDR | 800 Gbps | <1 μs |
Use case: Pipeline Parallelism and Data Parallelism across nodes.
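Note the units: InfiniBand is quoted in gigabits per second, NVLink in gigabytes per second. Converting makes the comparison direct:

```python
def gbps_to_gbytes(gbps: float) -> float:
    """Convert link speed from Gbit/s to GB/s."""
    return gbps / 8  # 8 bits per byte

print(gbps_to_gbytes(400))  # NDR: 50.0 GB/s per port
print(gbps_to_gbytes(800))  # XDR: 100.0 GB/s per port
```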
Features:
- RDMA: network adapters read and write memory directly, without CPU involvement
- GPUDirect RDMA: NIC-to-GPU-memory transfers that bypass host memory entirely
- SHARP: in-network aggregation that offloads reductions (e.g. all-reduce) to the switches
- Adaptive routing to spread traffic and avoid congestion
The building block for large-scale training is the DGX system: a DGX H100 combines 8 H100 SXM GPUs, fully connected through NVSwitch, with ConnectX-7 NDR InfiniBand adapters for scale-out.
DGX SuperPOD: a cluster of DGX nodes joined by an InfiniBand fabric for training at scale.
The bandwidth hierarchy dictates where each parallelism strategy should operate:
```
Within GPU : ~3 TB/s (HBM bandwidth)        → Computation
NVLink     : 900 GB/s (H100)                → Tensor Parallelism (TP)
NVSwitch   : full mesh at NVLink speed      → Expert Parallelism (EP)
InfiniBand : 400 Gbps (~50 GB/s) per link   → Pipeline Parallelism (PP)
Cross-rack : multiple IB links              → Data Parallelism (DP)
```
Example: training LLaMA 70B on 64 GPUs (8 DGX H100 nodes):
- TP=8: within each DGX node, over NVLink
- PP=4: across 4 nodes per pipeline, over InfiniBand
- DP=2: two pipeline replicas, gradient sync over InfiniBand
This configuration keeps the most bandwidth-hungry traffic (TP all-reduces after every layer) on 900 GB/s NVLink, while only the less frequent pipeline activations and gradient synchronizations cross the slower InfiniBand links.
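The degrees must multiply out to the world size, and rank ordering determines which groups land on which interconnect. A simplified sketch of the usual convention (TP varies fastest so TP peers share a node; this is an illustrative layout, not necessarily Megatron's exact group construction):

```python
TP, PP, DP = 8, 4, 2
WORLD = TP * PP * DP          # 64 GPUs total
GPUS_PER_NODE = 8

def coords(rank: int):
    """Map a global rank to (dp, pp, tp) coordinates, TP fastest-varying."""
    tp = rank % TP
    pp = (rank // TP) % PP
    dp = rank // (TP * PP)
    return dp, pp, tp

# With TP == GPUS_PER_NODE, every TP group is 8 consecutive ranks,
# i.e. exactly one node: TP traffic stays on NVLink, while PP and DP
# peers sit on other nodes and communicate over InfiniBand.
for rank in range(WORLD):
    base = rank - rank % TP   # first rank of this rank's TP group
    assert all(r // GPUS_PER_NODE == rank // GPUS_PER_NODE
               for r in range(base, base + TP))
```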
NCCL discovers the system topology automatically (and can be overridden with NCCL_TOPO_FILE) for optimal algorithm selection.

See distributed training for parallelism strategy details and NCCL for communication primitives.