Simple NCCL bandwidth benchmark for Modal's multi-node training infrastructure. The benchmark measures all-reduce performance between nodes using a large tensor (500000x2000 float32, ~4 GB) to characterize the communication performance of your multi-node setup.
2 x 8 x H100, multi-node:

    modal run modal_train.py

The benchmark reports two key metrics:
- algbw (Algorithm Bandwidth): the effective bandwidth from the application's perspective (payload size divided by elapsed time).
- busbw (Bus Bandwidth): the hardware bandwidth actually utilized, which accounts for the extra data movement the collective performs.
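For a ring all-reduce, each rank moves 2(n-1)/n times the payload across the bus, so busbw = algbw * 2(n-1)/n. A minimal sketch of this relation (the function name is illustrative, not part of the benchmark):

```python
def busbw_from_algbw(algbw_gbps: float, n_ranks: int) -> float:
    """Convert all-reduce algorithm bandwidth to bus bandwidth.

    Each rank sends and receives 2*(n-1)/n times the payload size,
    so the hardware moves that factor more data than the application sees.
    """
    return algbw_gbps * 2 * (n_ranks - 1) / n_ranks

# With 16 ranks and algbw of 239.742 GBps, as in the example below:
print(round(busbw_from_algbw(239.742, 16), 3))  # 449.516
```

This closely reproduces the busbw figure in the example output below (the last digit differs only because the reported algbw is itself rounded).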
For example:
The average bandwidth of all_reduce with a 4.0GB payload (50 trials, 16 ranks):
algbw: 239.742 GBps (1917.9 Gbps)
busbw: 449.515 GBps (3596.1 Gbps)
For more details on these metrics, see the NVIDIA NCCL Tests documentation.
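The measurement itself can be sketched as a warmed-up timing loop over `torch.distributed.all_reduce`. This is not Modal's actual implementation; the function names, trial count, and warm-up count here are illustrative, and it assumes the process group is already initialized with the NCCL backend:

```python
import time


def bandwidths(payload_gb: float, avg_seconds: float, n_ranks: int) -> tuple[float, float]:
    """algbw = payload / time; busbw applies the all-reduce 2(n-1)/n factor."""
    algbw = payload_gb / avg_seconds
    return algbw, algbw * 2 * (n_ranks - 1) / n_ranks


def benchmark_all_reduce(trials: int = 50) -> tuple[float, float]:
    """Time all-reduce on a 500000x2000 float32 tensor (~4 GB).

    Returns (algbw, busbw) in GB/s. Assumes torch.distributed is
    already initialized with the NCCL backend on every rank.
    """
    # Imported lazily so the module also loads on machines without torch.
    import torch
    import torch.distributed as dist

    n = dist.get_world_size()
    x = torch.zeros(500_000, 2_000, dtype=torch.float32, device="cuda")
    payload_gb = x.numel() * x.element_size() / 1e9  # 4.0 GB

    # Warm-up iterations so lazy NCCL initialization does not skew timings.
    for _ in range(5):
        dist.all_reduce(x)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(trials):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    avg_seconds = (time.perf_counter() - start) / trials

    return bandwidths(payload_gb, avg_seconds, n)
```

Synchronizing before reading the clock matters: CUDA kernels launch asynchronously, so without `torch.cuda.synchronize()` the loop would time only kernel launches, not the communication itself.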
The benchmark automatically configures RDMA settings for Modal's infrastructure:
- Uses IPv6 for control plane (TCP) communication
- Uses IPv4 for data plane (RDMA) communication
- Configures optimal NCCL parameters for IB/RDMA
- Sets appropriate HCA device ordering
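The settings above are applied through NCCL environment variables. A hedged sketch of the kind of configuration involved; the variable names are real NCCL knobs, but the specific values (and the HCA list in particular) are illustrative and will differ on Modal's actual infrastructure:

```python
import os

# Representative NCCL settings for an IB/RDMA fabric (values are examples).
nccl_env = {
    # Control plane (TCP bootstrap) over IPv6.
    "NCCL_SOCKET_FAMILY": "AF_INET6",
    # Keep the InfiniBand/RDMA transport enabled for the data plane.
    "NCCL_IB_DISABLE": "0",
    # Explicit HCA ordering so each GPU uses its nearest NIC
    # (hypothetical device names shown here).
    "NCCL_IB_HCA": "mlx5_0,mlx5_1,mlx5_2,mlx5_3",
    # Allow GPUDirect RDMA between GPU memory and the NIC.
    "NCCL_NET_GDR_LEVEL": "2",
}
os.environ.update(nccl_env)
```

These must be set before NCCL initializes (i.e., before the first collective or `init_process_group` call), since NCCL reads them once at startup.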