The NCCL tests (https://github.com/NVIDIA/nccl-tests) provide benchmarking tools for NCCL collective operations over TCP/IP or RDMA interconnects.
```
mpirun --prefix /usr/local \
  --launch-agent prted \
  -np 2 -host 10.0.0.131,10.0.0.147 -N 1 \
  ./build/all_reduce_perf -b 8 -e 1G -f 2 -g 1
```
Command breakdown:
- `mpirun`: MPI launcher that starts the distributed job across multiple ranks (processes).
- `--prefix /usr/local`: adds /usr/local to the MPI runtime's search path so that the same MPI installation (binaries, libraries) is used on all hosts.
- `--launch-agent prted`: tells mpirun to use the Open MPI `prted` daemon on each node to spawn and manage the processes.
- `-np 2`: run with 2 MPI processes (2 ranks in total).
- `-host 10.0.0.131,10.0.0.147`: use these two machines as the hosts for the MPI ranks.
- `-N 1`: launch 1 MPI process per node (so 1 rank on each host, for a total of 2).
- `./build/all_reduce_perf`: NCCL test binary that benchmarks the all-reduce collective.
- `-b 8`: minimum message size is 8 bytes.
- `-e 1G`: maximum message size is 1 GiB.
- `-f 2`: increase message sizes by a factor of 2 between tests (geometric progression).
- `-g 1`: use 1 GPU per process (rank) in the benchmark.

Use `--prefix` or module systems so mpirun picks up identical installs everywhere.
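The `-b`/`-e`/`-f` flags define the geometric sweep of message sizes the benchmark runs through. A small Python sketch of that sweep (the function name `message_sizes` is ours, not part of the test suite):

```python
def message_sizes(begin=8, end=1 << 30, factor=2):
    """Yield the message sizes swept by -b 8 -e 1G -f 2."""
    size = begin
    while size <= end:
        yield size
        size *= factor

sizes = list(message_sizes())
# 28 sizes, from 8 B (2^3) up to 1 GiB (2^30)
```

Each size becomes one row in the benchmark's output table, so this sweep yields 28 measurements per collective.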
```
$ prte --version
$ mpirun --version
mpirun (Open MPI) 5.0.9

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Tue_Dec_16_07:23:41_PM_PST_2025
Cuda compilation tools, release 13.1, V13.1.115
Build cuda_13.1.r13.1/compiler.37061995_0

$ python -c "import torch; print(torch.cuda.nccl.version())"
(2, 26, 2)
```
Debugging and tuning notes:

- Set `NCCL_IB_DISABLE=1` to force TCP while debugging. Verify basic connectivity with `ping` and `ssh` before running NCCL tests.
- On `CUDA_ERROR_INVALID_DEVICE`, check that each host has the expected GPUs visible and that `CUDA_VISIBLE_DEVICES` or `nvidia-smi topo -m` matches your intended mapping. Start with `-g 1` and scale up once a 1-GPU-per-rank test is stable.
- Use `NCCL_DEBUG=INFO` (or `WARN`) to see NCCL's algorithm and topology choices.
- Use `NCCL_IB_HCA` and `NCCL_IB_GID_INDEX` to select InfiniBand adapters.
- Use `NCCL_SOCKET_IFNAME` to select TCP/IP interfaces.
- Keep `-b`, `-e`, `-f`, topology (intra-node vs inter-node), and the number of ranks fixed when comparing runs. Small messages are latency-dominated; large messages are bandwidth-dominated.
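NCCL environment variables must reach every rank, which with Open MPI's `mpirun` is done by exporting them with `-x`. A sketch of assembling a debug-mode command line for the benchmark above (the interface name `eth0` is an assumption; substitute your cluster's NIC):

```python
# Debug settings drawn from the notes above; eth0 is a placeholder interface.
env = {
    "NCCL_IB_DISABLE": "1",       # force TCP while debugging
    "NCCL_DEBUG": "INFO",         # log algorithm/topology choices
    "NCCL_SOCKET_IFNAME": "eth0", # pin the TCP interface
}

cmd = ["mpirun", "-np", "2", "-host", "10.0.0.131,10.0.0.147", "-N", "1"]
for key, value in env.items():
    cmd += ["-x", f"{key}={value}"]  # Open MPI: -x exports a variable to all ranks
cmd += ["./build/all_reduce_perf", "-b", "8", "-e", "1G", "-f", "2", "-g", "1"]

print(" ".join(cmd))
```

Once the TCP run is stable, drop `NCCL_IB_DISABLE=1` and re-run to compare RDMA against the TCP baseline.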