Megatron recipes are pre-configured training setups that encode best practices for common model architectures and scales. They specify model architecture, parallelism strategy, optimizer settings, and training hyperparameters — providing a tested, reproducible starting point.
| Model Size | Layers | Hidden | Heads | TP | PP | Micro BS | Global BS | Seq Len |
|---|---|---|---|---|---|---|---|---|
| 1.3B | 24 | 2048 | 16 | 1 | 1 | 4 | 256 | 2048 |
| 7B | 32 | 4096 | 32 | 1 | 1 | 2 | 512 | 4096 |
| 13B | 40 | 5120 | 40 | 2 | 1 | 1 | 512 | 4096 |
| 70B | 80 | 8192 | 64 | 8 | 4 | 1 | 1024 | 4096 |
| 175B | 96 | 12288 | 96 | 8 | 8 | 1 | 1536 | 2048 |
TP (Tensor Parallelism) stays within a node (NVLink); PP (Pipeline Parallelism) spans nodes (InfiniBand). Global batch size scales with the number of data-parallel replicas.
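To see how these dimensions interact, here is a minimal sketch in plain Python (not a NeMo or Megatron API); `derive_layout` and the 64-node cluster size are illustrative assumptions. It derives the data-parallel size and the gradient-accumulation steps needed to reach a recipe's global batch size.

```python
def derive_layout(num_nodes, gpus_per_node, tp, pp, micro_bs, global_bs):
    """Derive data-parallel size and gradient-accumulation steps for a recipe."""
    world_size = num_nodes * gpus_per_node
    assert world_size % (tp * pp) == 0, "TP * PP must divide the total GPU count"
    dp = world_size // (tp * pp)               # data-parallel replicas
    assert global_bs % (dp * micro_bs) == 0, "global batch must split evenly across replicas"
    grad_accum = global_bs // (dp * micro_bs)  # accumulation steps per optimizer step
    return dp, grad_accum

# 70B recipe from the table on a hypothetical 64-node cluster (8 GPUs each):
# TP=8, PP=4, micro batch 1, global batch 1024
dp, accum = derive_layout(64, 8, tp=8, pp=4, micro_bs=1, global_bs=1024)
print(dp, accum)  # 16 data-parallel replicas, 64 accumulation steps
```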
A complete recipe bundles all of these settings; loading one and overriding it for your hardware looks like this:
```python
from nemo.collections.llm import GPTConfig7B, pretrain
from nemo.lightning import MegatronStrategy

# Load a pre-defined recipe
config = GPTConfig7B()

# Override specific settings for your hardware
config.trainer.num_nodes = 4
config.trainer.devices = 8  # GPUs per node
config.model.tensor_model_parallel_size = 1
config.model.pipeline_model_parallel_size = 1

pretrain(config)
```
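With these overrides, the run uses 4 nodes × 8 GPUs = 32 devices. Since TP and PP are both 1, all 32 GPUs act as data-parallel replicas, so the 7B row's global batch of 512 with a micro batch of 2 implies 512 / (32 × 2) = 8 gradient-accumulation steps per optimizer step.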
When you change the global batch size from a recipe's default, scale the learning rate with it, either proportionally to sqrt(batch_size) or linearly (a minimal sketch follows the table below).

| Scenario | Approach |
|---|---|
| Standard architecture at known scale | Use recipe directly |
| Standard architecture, different hardware | Recipe + adjust parallelism |
| Custom architecture | Recipe as starting point, modify architecture params |
| Novel training objective | Custom config, use recipe optimizer/schedule settings |
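For the learning-rate adjustment mentioned above, here is a minimal sketch of the two common scaling rules; `scale_lr` and the base learning rate of 3e-4 are illustrative assumptions, not values taken from a recipe.

```python
import math

def scale_lr(base_lr, base_batch, new_batch, rule="sqrt"):
    """Adjust a recipe's learning rate when the global batch size changes."""
    ratio = new_batch / base_batch
    if rule == "sqrt":
        return base_lr * math.sqrt(ratio)  # square-root scaling
    if rule == "linear":
        return base_lr * ratio             # linear scaling
    raise ValueError(f"unknown rule: {rule}")

# e.g. doubling the 7B recipe's global batch from 512 to 1024,
# assuming an illustrative base learning rate of 3e-4
print(scale_lr(3e-4, 512, 1024, rule="sqrt"))    # ~4.24e-4
print(scale_lr(3e-4, 512, 1024, rule="linear"))  # 6.0e-4
```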