docs/software/communication
1 file changed, +20 -0

+# This forces NCCL to use the libfabric plugin, enabling full use of the
+# Slingshot network. If the plugin cannot be found, applications will fail to
+# start. With the default value, applications would instead fall back to e.g.
+# TCP, which would be significantly slower than with the plugin. More information
+# about `NCCL_NET` can be found at
+# https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-net.
+export NCCL_NET="AWS Libfabric"
+# Use GPU Direct RDMA when GPU and NIC are on the same NUMA node. More
+# information about `NCCL_NET_GDR_LEVEL` can be found at
+# https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-net-gdr-level-formerly-nccl-ib-gdr-level.
+export NCCL_NET_GDR_LEVEL=PHB
+export NCCL_CROSS_NIC=1
+# These `FI` (libfabric) environment variables have been found to give the best
+# performance on the Alps network across a wide range of applications. Specific
+# applications may perform better with other values.
+export FI_CXI_DEFAULT_CQ_SIZE=131072
+export FI_CXI_DEFAULT_TX_SIZE=32768
+export FI_CXI_DISABLE_HOST_REGISTER=1
+export FI_CXI_RX_MATCH_MODE=software
+export FI_MR_CACHE_MONITOR=userfaultfd
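
Because a missing plugin makes applications fail at startup with `NCCL_NET` set this way, it can help to confirm that every variable from this change actually reached the job environment before launching. The following is a minimal sketch of such a check, not part of the change itself: the function name `check_nccl_env` is hypothetical, and the check only tests that each variable is non-empty, not that its value suits a particular application.

```shell
# Hypothetical helper (not part of this change): report which of the
# variables exported above are missing from the current environment.
check_nccl_env() {
    missing=0
    for name in NCCL_NET NCCL_NET_GDR_LEVEL NCCL_CROSS_NIC \
                FI_CXI_DEFAULT_CQ_SIZE FI_CXI_DEFAULT_TX_SIZE \
                FI_CXI_DISABLE_HOST_REGISTER FI_CXI_RX_MATCH_MODE \
                FI_MR_CACHE_MONITOR; do
        # Look up the variable whose name is stored in $name (POSIX-compatible).
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "unset: $name" >&2
            missing=$((missing + 1))
        else
            echo "$name=$value"
        fi
    done
    # Succeed only if every variable was set.
    [ "$missing" -eq 0 ]
}
```

A job script could call `check_nccl_env || exit 1` right before starting the application. Setting `NCCL_DEBUG=INFO` for a test run is another way to inspect which network NCCL selected, since NCCL then logs its initialization details.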