torchrun is PyTorch's official launcher for distributed training. It is framework-agnostic, and NeMo relies on it internally for multi-GPU and multi-node training.
What it does
torchrun \
--nproc_per_node=8 \
--nnodes=1 \
--master_port=29500 \
train.py \
trainer.devices=8 \
trainer.num_nodes=1
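Conceptually, torchrun spawns nproc_per_node worker processes and hands each one its coordinates through environment variables (RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT), which train.py then reads to initialize the process group. A simplified, illustrative sketch of that spawning behavior in plain Python (launch_workers is a hypothetical helper for illustration, not torchrun's actual implementation):

```python
import os
import subprocess
import sys

def launch_workers(script, nproc_per_node, master_port=29500):
    """Spawn one worker process per local GPU, mimicking (in a very
    simplified way) what torchrun does on a single node: each worker
    receives its rank and the rendezvous endpoint via environment
    variables, then runs the same training script."""
    procs = []
    for local_rank in range(nproc_per_node):
        env = dict(
            os.environ,
            RANK=str(local_rank),        # global rank (single node, so == local)
            LOCAL_RANK=str(local_rank),  # rank within this node
            WORLD_SIZE=str(nproc_per_node),
            MASTER_ADDR="127.0.0.1",
            MASTER_PORT=str(master_port),
        )
        procs.append(subprocess.Popen([sys.executable, script], env=env))
    # Wait for all workers and collect their exit codes.
    return [p.wait() for p in procs]
```

The real torchrun adds much more on top of this: rendezvous across nodes, elastic restarts, and failure handling.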
Think of torchrun as a low-level distributed process manager.
nemo run is part of NeMo 2.0’s new execution system. It’s a higher-level orchestration CLI built by NVIDIA.
It does more than launch processes: it manages how an experiment is configured, where it executes (a local machine, a Slurm cluster, or the cloud), and how runs are organized and reproduced. NeMo-Run simplifies the user experience for training NeMo models, while torchrun provides the underlying distributed execution capabilities.
nemo run \
model=llama3_8b \
trainer.devices=8 \
trainer.num_nodes=1 \
exp_manager.name=my_experiment
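One way to picture the relationship between the two tools: a higher-level launcher takes a declarative experiment description and ultimately composes the low-level torchrun invocation shown earlier. The sketch below is purely illustrative (ExperimentConfig and build_torchrun_cmd are hypothetical names, not NeMo-Run's real API), but it mirrors the division of labor described above:

```python
from dataclasses import dataclass

@dataclass
class ExperimentConfig:
    # Hypothetical fields mirroring the CLI overrides in the examples above.
    script: str = "train.py"
    devices: int = 8
    num_nodes: int = 1
    master_port: int = 29500
    name: str = "my_experiment"

def build_torchrun_cmd(cfg: ExperimentConfig) -> list:
    """Translate a high-level experiment description into the
    low-level torchrun command that actually runs the processes."""
    return [
        "torchrun",
        f"--nproc_per_node={cfg.devices}",
        f"--nnodes={cfg.num_nodes}",
        f"--master_port={cfg.master_port}",
        cfg.script,
        f"trainer.devices={cfg.devices}",
        f"trainer.num_nodes={cfg.num_nodes}",
        f"exp_manager.name={cfg.name}",
    ]
```

The orchestration layer owns the "what" (model, experiment name, resources); torchrun owns the "how" (spawning and coordinating the worker processes).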