nvidia-generative-ai-notes

torchrun

torchrun is PyTorch's official distributed training launcher. It is framework-agnostic: it manages processes, not models. NeMo relies on it internally for multi-GPU and multi-node training.

What it does

- Spawns one worker process per GPU (controlled by --nproc_per_node)
- Sets the environment variables each worker needs: RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT
- Handles rendezvous across nodes and can restart failed workers (elastic training)

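A minimal sketch of the kind of worker script torchrun launches. The environment variables are the ones torchrun actually sets; the all-reduce itself is just illustrative:

import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets these for every worker it spawns
    local_rank = int(os.environ["LOCAL_RANK"])
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    # Each worker binds to its own GPU and joins the process group
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # Toy all-reduce proving the workers can talk to each other
    t = torch.ones(1, device=local_rank)
    dist.all_reduce(t)  # default op is SUM, so t == world_size on every rank
    if rank == 0:
        print(f"world_size={world_size}, all_reduce result={t.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
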
Example (NeMo training via torchrun)

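# One node, 8 worker processes; the key=value args after train.py are
# Hydra-style config overrides consumed by the NeMo script, not by torchrun.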
torchrun \
  --nproc_per_node=8 \
  --nnodes=1 \
  --master_port=29500 \
  train.py \
  trainer.devices=8 \
  trainer.num_nodes=1

Think of torchrun as a low-level distributed process manager.

NeMo-Run - NeMo 2.0 Orchestrator

NeMo-Run is part of NeMo 2.0’s new execution system. It’s a higher-level orchestration library and CLI built by NVIDIA.

It doesn’t just launch processes; it manages:

- Configuration: tasks and models described as Python objects (run.Config / run.Partial)
- Execution: the same task can run locally or on a cluster (e.g., Slurm) through pluggable executors
- Management: packaging, launching, and reproducing whole experiments

NeMo-Run simplifies the user experience for training NeMo models, while torchrun provides the underlying distributed execution capabilities.

Example (NeMo 2.0 pretraining via the nemo CLI)

nemo llm pretrain \
  --factory llama3_8b \
  trainer.devices=8 \
  trainer.num_nodes=1 \
  log.name=my_experiment
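
The same recipe-style run can be configured from Python with the NeMo-Run API. A minimal sketch modeled on the NeMo 2.0 quickstart (the checkpoint path and experiment name are placeholders); note launcher="torchrun", which makes the division of labor between the two tools explicit:

import nemo_run as run
from nemo.collections import llm

# Preconfigured Llama 3 8B pretraining recipe (a NeMo 2.0 factory)
recipe = llm.llama3_8b.pretrain_recipe(
    dir="/checkpoints/llama3_8b",  # placeholder output path
    name="my_experiment",
    num_nodes=1,
    num_gpus_per_node=8,
)

# NeMo-Run decides where and how the job runs; torchrun still does
# the low-level per-node process launching
executor = run.LocalExecutor(ntasks_per_node=8, launcher="torchrun")

run.run(recipe, executor=executor)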