Here is an example of running distributed model training on 2 nodes (with 2 GPUs each) in pure PyTorch (no NeMo/Megatron involved):
torchrun \
--nnodes=2 --nproc_per_node=2 \
--node_rank=0 \
--rdzv_backend=c10d \
--rdzv_endpoint=<NODE0_IP>:29500 \
train_byt5_ddp.py
--nnodes=2: total number of nodes in the cluster.
--node_rank: node identification; 0 for node 1 and 1 for node 2. This is the only flag that differs between the two nodes.
--nproc_per_node=2: tells each node to spawn 2 worker processes on this node (usually 1 per GPU).
--rdzv_endpoint=<NODE0_IP>:29500: tells every node to meet at the same coordinator address. The endpoint is just a lightweight key-value store for coordination.
--rdzv_backend=c10d: tells the nodes which rendezvous mechanism to use; c10d is PyTorch's distributed TCP-based coordination backend.
train_byt5_ddp.py is the same entrypoint on all nodes.
NCCL is a high-performance communication library optimized for NVIDIA GPUs.
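The bookkeeping torchrun performs from these flags can be sketched in plain Python. This is an illustrative sketch, not torchrun's real implementation, and the helper name worker_identities is hypothetical; the formulas mirror the RANK, LOCAL_RANK, and WORLD_SIZE values torchrun exports to each worker.

```python
# Sketch: how torchrun-style launchers derive per-worker identities
# from --nnodes, --nproc_per_node, and --node_rank.

def worker_identities(nnodes: int, nproc_per_node: int):
    """Yield (node_rank, local_rank, global_rank, world_size) per worker."""
    world_size = nnodes * nproc_per_node
    for node_rank in range(nnodes):
        for local_rank in range(nproc_per_node):
            global_rank = node_rank * nproc_per_node + local_rank
            yield node_rank, local_rank, global_rank, world_size

# Our example: 2 nodes x 2 GPUs -> 4 workers, global ranks 0..3.
for node, local, rank, world in worker_identities(nnodes=2, nproc_per_node=2):
    print(f"node {node}, local GPU {local} -> global rank {rank} of {world}")
```

With --nnodes=2 and --nproc_per_node=2 this yields four workers: ranks 0 and 1 on node 0, ranks 2 and 3 on node 1.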
It handles the collective communication operations DDP relies on (all-reduce, broadcast, all-gather, reduce-scatter).
Over: NVLink, PCIe, and network interconnects such as InfiniBand or Ethernet.
Prebuilt CUDA builds of PyTorch are compiled with NCCL support.
You can verify:
import torch
print(torch.cuda.nccl.version())
Here is how NCCL gets activated:
torch.distributed.init_process_group(
    backend="nccl"
)
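By default init_process_group uses the env:// init method, reading environment variables that torchrun exports for every worker before spawning it. The sketch below only simulates that handoff in-process (the values and NODE0_IP placeholder are assumptions matching the example launch above); it does not open any real connections.

```python
import os

# torchrun exports these variables for each worker; the env:// init
# method of init_process_group reads them to join the process group.
os.environ.update({
    "RANK": "0",             # global rank across all nodes (0..3 here)
    "LOCAL_RANK": "0",       # GPU index on this node
    "WORLD_SIZE": "4",       # 2 nodes x 2 processes per node
    "MASTER_ADDR": "NODE0_IP",   # placeholder for the coordinator host
    "MASTER_PORT": "29500",
})

rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])
print(f"worker {rank} of {world_size} would now join the process group")
```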
PyTorch does the following: it discovers the other ranks through the rendezvous, creates NCCL communicators for the process group, and from then on routes collectives (gradient all-reduce, broadcast of initial weights, etc.) through NCCL.
After that, PyTorch doesn't micromanage NCCL: NCCL itself chooses the transport (NVLink, PCIe, network) and schedules the actual communication.
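The end-to-end effect on training can be simulated in pure Python: each worker computes a local gradient, the gradients are averaged (what DDP's NCCL all-reduce accomplishes), and every worker applies the same averaged gradient, so all model replicas stay in sync. This is a sketch of the bookkeeping only; the function name ddp_step and the numbers are made up, not real DDP internals.

```python
# Simulate one DDP training step across 4 workers, in pure Python.

def ddp_step(weights, local_grads, lr=0.1):
    """Average per-rank gradients and apply one SGD update."""
    world_size = len(local_grads)
    avg = [sum(g) / world_size for g in zip(*local_grads)]
    # Every rank applies the identical averaged gradient, so all
    # replicas end up with the same weights.
    return [w - lr * g for w, g in zip(weights, avg)]

w = [1.0, 1.0]
grads = [[0.4, 0.0], [0.0, 0.4], [0.4, 0.4], [0.0, 0.0]]  # per-rank grads
print(ddp_step(w, grads))  # same updated weights on every replica
```

The averaged gradient here is [0.2, 0.2], so each weight moves by lr * 0.2 = 0.02 on every rank.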