Commit ba12b9b

hanwen-cluster authored and Himani Anil Deshpande committed
[integ-tests] Fix NCCL test on Ubuntu 24 on p6-b200
Add NCCL_SOCKET_FAMILY=AF_INET to force NCCL to use IPv4. On Ubuntu 24 with p6-b200, without this parameter, NCCL hangs on IPv6, which is not supported by ParallelCluster.

Reference: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-socket-family
1 parent c244455 commit ba12b9b

1 file changed: +2 -0 lines changed

tests/integration-tests/tests/common/data/nccl/nccl_tests_submit_openmpi.sh

Lines changed: 2 additions & 0 deletions
@@ -15,10 +15,12 @@ fi
 
 # -x NCCL_ALGO=ring is not needed after NCCL 2.12. NCCL autodetects which ALGO (ring or tree) to use
 # -x FI_EFA_USE_DEVICE_RDMA=1 is not needed from aws nccl pfi plugin version v1.6.0
+# -x NCCL_SOCKET_FAMILY=AF_INET forces NCCL to use IPv4. On Ubuntu 24 with p6-b200, without this parameter, NCCL hangs on IPv6, which is not supported by ParallelCluster
 mpirun \
 -x LD_LIBRARY_PATH=/shared/openmpi/nccl-${NCCL_VERSION}/build/lib/:${OFI_PATH}:$LD_LIBRARY_PATH \
 -x NCCL_DEBUG=WARNING \
 -x NCCL_TESTS_SPLIT_MASK=0x0 \
 -x RDMAV_FORK_SAFE=1 \
+-x NCCL_SOCKET_FAMILY=AF_INET \
 --bind-to none \
 /shared/openmpi/nccl-tests-${NCCL_BENCHMARKS_VERSION}/build/all_reduce_perf -b 1024 -e 8G -f 2 -g 1 -c 1 > /shared/nccl_tests.out
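
For illustration only, the new flag could be built by a small helper that validates the address family before handing it to mpirun. This is a sketch, not part of the commit: `nccl_socket_family_flag` is a hypothetical name, and the only values it accepts are the two the NCCL documentation lists for NCCL_SOCKET_FAMILY (AF_INET and AF_INET6). It prints the flag rather than launching mpirun, so it can be tried without a cluster.

```shell
#!/bin/bash
# Hypothetical helper (not in the ParallelCluster test suite): print the
# mpirun "-x" argument that pins NCCL to one socket address family.
nccl_socket_family_flag() {
    local family="${1:-AF_INET}"   # default to IPv4, as this commit does
    case "${family}" in
        AF_INET|AF_INET6)
            # "--" stops printf from parsing the leading "-x" as an option
            printf -- '-x NCCL_SOCKET_FAMILY=%s\n' "${family}"
            ;;
        *)
            echo "unsupported address family: ${family}" >&2
            return 1
            ;;
    esac
}

# Dry run: show the flag that would be added to the mpirun command line.
nccl_socket_family_flag AF_INET
```

The commit hard-codes AF_INET instead, which is the simpler choice given that the commit message states IPv6 is not supported by ParallelCluster.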
