nvidia-generative-ai-notes

PyTorch Distributed Training

Example

Here is an example of running distributed model training across 2 nodes (2 GPUs each) in pure PyTorch, with no NeMo/Megatron involved. Run the same command on every node, changing only --node_rank:

torchrun \
  --nnodes=2 --nproc_per_node=2 \
  --node_rank=0 \
  --rdzv_backend=c10d \
  --rdzv_endpoint=<NODE0_IP>:29500 \
  train_byt5_ddp.py

How coordination works

  1. Each node runs the same command with different --node_rank.
  2. The node with node_rank=0 starts a small TCPStore server at <NODE0_IP>:29500. It is just a lightweight key-value store used for coordination (see the sketch after this list).
  3. Other nodes connect to it.
  4. They exchange:
    • World size
    • Rank assignments
    • NCCL connection info
    • Environment setup data
  5. Real GPU communication then happens via NCCL.
  6. NCCL opens its own high-performance channels, which may use:
    • NVLink (within a node)
    • InfiniBand (IB, across nodes)
    • TCP fallback
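
As a rough sketch of what that rendezvous store looks like (torchrun drives this for you; the host, port, and world size below just mirror the placeholders from the command above):

from datetime import timedelta

import torch.distributed as dist

# On the node with node_rank=0: host the store (is_master=True).
store = dist.TCPStore("<NODE0_IP>", 29500, world_size=4, is_master=True,
                      timeout=timedelta(seconds=60))

# On every other node: connect as a client.
# store = dist.TCPStore("<NODE0_IP>", 29500, world_size=4, is_master=False)

store.set("greeting", "hello from rank 0")  # any process can publish a key
print(store.get("greeting"))                # ...and any process can read it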

NCCL Magic (NVIDIA Collective Communications Library)

NCCL is a high-performance communication library optimized for NVIDIA GPUs.

It handles collective operations such as:

  • all-reduce (used to average gradients in DDP)
  • broadcast
  • all-gather
  • reduce-scatter

over transports such as:

  • NVLink (within a node)
  • PCIe
  • InfiniBand (across nodes)
  • Ethernet/TCP
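
For example, the all-reduce that DDP issues for gradients boils down to a call like this (assumes a process group has already been initialized, as shown later in these notes):

import torch
import torch.distributed as dist

# Assumes init_process_group(backend="nccl") has already run
# and each rank owns one GPU.
grad = torch.ones(4, device="cuda")
dist.all_reduce(grad, op=dist.ReduceOp.SUM)  # sum across all ranks
grad /= dist.get_world_size()                # DDP-style gradient averaging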

How PyTorch Knows About NCCL

PyTorch is compiled with NCCL support.

You can verify:

import torch

print(torch.cuda.nccl.version())              # e.g. (2, 18, 3)
print(torch.distributed.is_nccl_available())  # True if built with NCCL

Here is how NCCL gets activated:

# torchrun already exported MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE,
# so the default env:// initialization just works.
torch.distributed.init_process_group(
    backend="nccl"
)

Under the hood, PyTorch constructs a ProcessGroupNCCL, which creates and owns the NCCL communicators used by every subsequent collective call.

Flow

  1. torchrun → creates processes
  2. Rendezvous → TCPStore exchanges info
  3. Rank 0 generates the NCCL unique ID
  4. The ID is shared with all ranks via the store
  5. NCCL forms communicators
  6. Gradients flow via:
    • NVLink (same node)
    • InfiniBand / TCP (cross-node)

After that, PyTorch doesn’t micromanage NCCL.
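
Putting the flow together, here is a minimal sketch of what a script like train_byt5_ddp.py could look like (the tiny linear model is a placeholder; these notes do not show the real ByT5 setup):

import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun exports RANK, LOCAL_RANK, and WORLD_SIZE for every process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model; the real script would build/load ByT5 here.
    model = torch.nn.Linear(512, 512).to(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):
        x = torch.randn(8, 512, device=local_rank)
        loss = model(x).pow(2).mean()
        loss.backward()       # DDP all-reduces gradients via NCCL here
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()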

Note