Add nccl_env_vars file
1 parent 88389d3 commit bd50f7d

1 file changed, 20 additions and 0 deletions

@@ -0,0 +1,20 @@
+# This forces NCCL to use the libfabric plugin, enabling full use of the
+# Slingshot network. If the plugin cannot be found, applications will fail to
+# start. With the default value, applications would instead fall back to e.g.
+# TCP, which would be significantly slower than with the plugin. More information
+# about `NCCL_NET` can be found at
+# https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-net.
+export NCCL_NET="AWS Libfabric"
+# Use GPU Direct RDMA when GPU and NIC are on the same NUMA node. More
+# information about `NCCL_NET_GDR_LEVEL` can be found at
+# https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-net-gdr-level-formerly-nccl-ib-gdr-level.
+export NCCL_NET_GDR_LEVEL=PHB
+export NCCL_CROSS_NIC=1
+# These `FI` (libfabric) environment variables have been found to give the best
+# performance on the Alps network across a wide range of applications. Specific
+# applications may perform better with other values.
+export FI_CXI_DEFAULT_CQ_SIZE=131072
+export FI_CXI_DEFAULT_TX_SIZE=32768
+export FI_CXI_DISABLE_HOST_REGISTER=1
+export FI_CXI_RX_MATCH_MODE=software
+export FI_MR_CACHE_MONITOR=userfaultfd
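
Because a missing plugin makes applications fail at startup, it can help to confirm before launching a job that every variable from this file has actually been exported. The following is a minimal sketch of such a check, not part of the commit. The function name `check_nccl_env` is hypothetical, and the file name `nccl_env_vars` in the usage comment is assumed from the commit title (its location in the repository is not shown here).

```shell
# Hypothetical helper (not part of the commit): report which of the variables
# exported by the file above are absent from the current environment.
check_nccl_env() {
    missing=0
    for name in NCCL_NET NCCL_NET_GDR_LEVEL NCCL_CROSS_NIC \
                FI_CXI_DEFAULT_CQ_SIZE FI_CXI_DEFAULT_TX_SIZE \
                FI_CXI_DISABLE_HOST_REGISTER FI_CXI_RX_MATCH_MODE \
                FI_MR_CACHE_MONITOR; do
        # printenv exits non-zero when the variable is not exported, which is
        # the condition that matters for child processes such as NCCL jobs.
        if ! printenv "$name" >/dev/null; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Assumed usage, with the file in the current directory:
#   . ./nccl_env_vars && check_nccl_env
```

Since the comments in the file note that specific applications may perform better with other `FI` values, one option is to source the file first and then re-export only the variables an application needs to change, so the remaining defaults stay in place.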
