|
| 1 | +# Install Driver and Dependencies |
| 2 | + |
| 3 | +The table below summarizes the software and drivers required to run pplx-kernels on a single node or on a multi-node cluster with Mellanox ConnectX or AWS Elastic Fabric Adapter (EFA) network interfaces. Configure your system according to the column that matches your setup. |
| 4 | + |
| 5 | +| Software | Single-node | Multi-node with ConnectX | Multi-node with EFA | |
| 6 | +|---------------------------|-------------|--------------------------|---------------------| |
| 7 | +| NVIDIA Driver | Y | Y | Y | |
| 8 | +| modprobe.d/nvidia.conf | | Y | | |
| 9 | +| GDRCopy Driver | | Y | Y | |
| 10 | +| GDRCopy Library | | Y | Y | |
| 11 | +| NVSHMEM Library | Y | Y | Y | |
| 12 | +| NVSHMEM_USE_GDRCOPY | | 1 | 1 | |
| 13 | +| NVSHMEM_IBRC_SUPPORT | | 1 | | |
| 14 | +| NVSHMEM_IBGDA_SUPPORT | | 1 | | |
| 15 | +| NVSHMEM_LIBFABRIC_SUPPORT | | | 1 | |
| 16 | +| Libfabric Library | | | Y | |
| 17 | +| EFA Driver | | | Y | |
| 18 | + |
| 19 | +## NVIDIA Driver Config |
| 20 | + |
| 21 | +To use IBGDA, the NVIDIA driver must be configured to allow the GPU to initiate communication. |
| 22 | + |
| 23 | +```bash |
| 24 | +echo 'options nvidia NVreg_EnableStreamMemOPs=1 NVreg_RegistryDwords="PeerMappingOverride=1;"' | sudo tee -a /etc/modprobe.d/nvidia.conf |
| 25 | +sudo update-initramfs -u |
| 26 | +sudo reboot |
| 27 | +``` |
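After the reboot, it can help to confirm that the options took effect. The sketch below assumes the loaded driver lists its parameters in `/proc/driver/nvidia/params`, with an `EnableStreamMemOPs` entry and a `RegistryDwords` entry. The function name `check_nvidia_ibgda_params` is hypothetical, and the optional path argument exists only so the check can be run against a sample file.

```bash
# Hypothetical helper: succeeds only if both IBGDA-related driver options are active.
# Assumption: the loaded driver exposes its parameters in /proc/driver/nvidia/params.
check_nvidia_ibgda_params() {
  local params_file="${1:-/proc/driver/nvidia/params}"

  # Return 2 when the file is missing (for example, the driver is not loaded).
  [ -r "$params_file" ] || { echo "cannot read $params_file" >&2; return 2; }

  # Both settings from /etc/modprobe.d/nvidia.conf must be present.
  grep -q '^EnableStreamMemOPs: *1' "$params_file" || return 1
  grep -q 'PeerMappingOverride=1' "$params_file" || return 1
}
```

A typical call would be `check_nvidia_ibgda_params && echo "IBGDA driver options active"`.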
| 28 | + |
| 29 | +## GDRCopy |
| 30 | + |
| 31 | +GDRCopy is required for multi-node setups (both ConnectX and EFA). |
| 32 | + |
| 33 | +```bash |
| 34 | +sudo apt-get install -y build-essential devscripts debhelper fakeroot pkg-config dkms |
| 35 | +wget -O gdrcopy-v2.4.4.tar.gz https://github.com/NVIDIA/gdrcopy/archive/refs/tags/v2.4.4.tar.gz |
| 36 | +tar xf gdrcopy-v2.4.4.tar.gz |
| 37 | +cd gdrcopy-2.4.4/ |
| 38 | +sudo make prefix=/opt/gdrcopy -j$(nproc) install |
| 39 | + |
| 40 | +cd packages/ |
| 41 | +CUDA=/usr/local/cuda ./build-deb-packages.sh |
| 42 | +sudo dpkg -i gdrdrv-dkms_2.4.4_amd64.Ubuntu22_04.deb \ |
| 43 | + gdrcopy-tests_2.4.4_amd64.Ubuntu22_04+cuda12.6.deb \ |
| 44 | + gdrcopy_2.4.4_amd64.Ubuntu22_04.deb \ |
| 45 | + libgdrapi_2.4.4_amd64.Ubuntu22_04.deb |
| 46 | +``` |
| 47 | + |
| 48 | +Verify installation: |
| 49 | + |
| 50 | +```bash |
| 51 | +/opt/gdrcopy/bin/gdrcopy_copybw |
| 52 | +``` |
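If `gdrcopy_copybw` fails, a quick check of the kernel side may narrow down the cause. The sketch below assumes the `gdrdrv` module appears in `/proc/modules` and creates the `/dev/gdrdrv` device node. The function name `check_gdrdrv` is hypothetical, and both paths are parameters only so the check can be exercised against sample files.

```bash
# Hypothetical helper: verifies that the GDRCopy kernel driver is loaded.
# Assumptions: the module is named gdrdrv and its device node is /dev/gdrdrv.
check_gdrdrv() {
  local modules_file="${1:-/proc/modules}"
  local device="${2:-/dev/gdrdrv}"

  # /proc/modules lists one module per line, starting with the module name.
  grep -q '^gdrdrv ' "$modules_file" || { echo "gdrdrv module not loaded" >&2; return 1; }

  # The device node must exist for user-space libgdrapi to open it.
  [ -e "$device" ] || { echo "$device is missing" >&2; return 1; }
}
```

Run it as `check_gdrdrv` with no arguments on the target machine.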
| 53 | + |
| 54 | +## NVSHMEM |
| 55 | + |
| 56 | +NVSHMEM has many build options. |
| 57 | +In addition to the required options listed at the top of this page, the following features are optional: |
| 58 | + |
| 59 | +* NVSHMEM_MPI_SUPPORT: For MPI support |
| 60 | +* NVSHMEM_PMIX_SUPPORT: For PMIx support (e.g., Slurm) |
| 61 | +* NVSHMEM_BUILD_HYDRA_LAUNCHER: For Hydra launcher |
| 62 | + |
| 63 | +Adjust the options in the following command to match your setup. |
| 64 | + |
| 65 | +```bash |
| 66 | +wget https://developer.nvidia.com/downloads/assets/secure/nvshmem/nvshmem_src_3.2.5-1.txz |
| 67 | +mkdir nvshmem_src_3.2.5-1 |
| 68 | +tar xf nvshmem_src_3.2.5-1.txz -C nvshmem_src_3.2.5-1 |
| 69 | +cd nvshmem_src_3.2.5-1/nvshmem_src |
| 70 | +mkdir -p build |
| 71 | +cd build |
| 72 | +cmake \ |
| 73 | + -DNVSHMEM_PREFIX=/opt/nvshmem-3.2.5 \ |
| 74 | + -DCMAKE_CUDA_ARCHITECTURES=90a \ |
| 75 | + -DNVSHMEM_MPI_SUPPORT=1 \ |
| 76 | + -DNVSHMEM_PMIX_SUPPORT=1 \ |
| 77 | + -DNVSHMEM_LIBFABRIC_SUPPORT=1 \ |
| 78 | + -DNVSHMEM_IBRC_SUPPORT=1 \ |
| 79 | + -DNVSHMEM_IBGDA_SUPPORT=1 \ |
| 80 | + -DNVSHMEM_BUILD_TESTS=1 \ |
| 81 | + -DNVSHMEM_BUILD_EXAMPLES=1 \ |
| 82 | + -DNVSHMEM_BUILD_HYDRA_LAUNCHER=1 \ |
| 83 | + -DNVSHMEM_BUILD_TXZ_PACKAGE=1 \ |
| 84 | + -DMPI_HOME=/opt/amazon/openmpi \ |
| 85 | + -DPMIX_HOME=/opt/amazon/pmix \ |
| 86 | + -DGDRCOPY_HOME=/opt/gdrcopy \ |
| 87 | + -DLIBFABRIC_HOME=/opt/amazon/efa \ |
| 88 | + -G Ninja \ |
| 89 | + .. |
| 90 | +ninja build |
| 91 | +sudo ninja install |
| 92 | +``` |
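The command above enables every transport at once. To build only what one cluster type needs, the transport rows of the table at the top of this page can be sketched as a small lookup. The function name `nvshmem_transport_flags` and the target names `single`, `connectx`, and `efa` are hypothetical; the flags themselves are taken from the table.

```bash
# Hypothetical helper: prints the transport-related CMake flags for one target,
# following the requirements table at the top of this page.
nvshmem_transport_flags() {
  case "$1" in
    single)
      # Single-node needs no remote transport flags.
      echo "" ;;
    connectx)
      echo "-DNVSHMEM_USE_GDRCOPY=1 -DNVSHMEM_IBRC_SUPPORT=1 -DNVSHMEM_IBGDA_SUPPORT=1" ;;
    efa)
      echo "-DNVSHMEM_USE_GDRCOPY=1 -DNVSHMEM_LIBFABRIC_SUPPORT=1" ;;
    *)
      echo "unknown target: $1 (expected single, connectx, or efa)" >&2
      return 1 ;;
  esac
}
```

The output is meant to be spliced into the `cmake` invocation unquoted, for example `cmake $(nvshmem_transport_flags efa) -DNVSHMEM_PREFIX=/opt/nvshmem-3.2.5 -G Ninja ..`, so that the shell splits it into separate arguments.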
| 93 | + |
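| 94 | +After installation, set the following environment variables. `NVSHMEM_HOME` and `LD_LIBRARY_PATH` apply to every setup; of the three transport groups, export only the one that matches your cluster: |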
| 95 | + |
| 96 | +```bash |
| 97 | +export NVSHMEM_HOME=/opt/nvshmem-3.2.5 |
| 98 | +export LD_LIBRARY_PATH=$NVSHMEM_HOME/lib:$LD_LIBRARY_PATH |
| 99 | + |
| 100 | +# For single-node |
| 101 | +export NVSHMEM_REMOTE_TRANSPORT=none |
| 102 | + |
| 103 | +# For multi-node with ConnectX |
| 104 | +export NVSHMEM_REMOTE_TRANSPORT=ibrc |
| 105 | +export NVSHMEM_IB_ENABLE_IBGDA=1 |
| 106 | + |
| 107 | +# For multi-node with EFA |
| 108 | +export NVSHMEM_REMOTE_TRANSPORT=libfabric |
| 109 | +export NVSHMEM_LIBFABRIC_PROVIDER=efa |
| 110 | +``` |
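Because the three transport groups are mutually exclusive, pasting the whole block into a shell profile would leave only the last group in effect. One way to avoid that is a small selector function, sketched below. The function name `nvshmem_set_transport_env` and the target names are hypothetical; the variables and values are the ones listed above.

```bash
# Hypothetical helper: exports the transport variables for exactly one cluster type
# and clears variables that belong to the other types.
nvshmem_set_transport_env() {
  case "$1" in
    single)
      unset NVSHMEM_IB_ENABLE_IBGDA NVSHMEM_LIBFABRIC_PROVIDER
      export NVSHMEM_REMOTE_TRANSPORT=none ;;
    connectx)
      unset NVSHMEM_LIBFABRIC_PROVIDER
      export NVSHMEM_REMOTE_TRANSPORT=ibrc
      export NVSHMEM_IB_ENABLE_IBGDA=1 ;;
    efa)
      unset NVSHMEM_IB_ENABLE_IBGDA
      export NVSHMEM_REMOTE_TRANSPORT=libfabric
      export NVSHMEM_LIBFABRIC_PROVIDER=efa ;;
    *)
      echo "usage: nvshmem_set_transport_env single|connectx|efa" >&2
      return 1 ;;
  esac
}
```

For example, `nvshmem_set_transport_env efa` in a job script sets up the EFA variables before launching the workload.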
| 111 | + |
| 112 | +To install the Hydra launcher: |
| 113 | + |
| 114 | +```bash |
| 115 | +cd nvshmem_src_3.2.5-1/nvshmem_src/ |
| 116 | +sed -i 's/^make/make -j/' scripts/install_hydra.sh |
| 117 | +sudo bash scripts/install_hydra.sh hydra-build /opt/hydra |
| 118 | +``` |
| 119 | + |
| 120 | +Verify the installation (replace `host1,host2` with the hostnames of your nodes): |
| 121 | + |
| 122 | +```bash |
| 123 | +# Using Hydra: |
| 124 | +/opt/hydra/bin/nvshmrun.hydra -genvlist LD_LIBRARY_PATH -hosts host1,host2 -n 2 -ppn 1 /opt/nvshmem-3.2.5/bin/perftest/device/pt-to-pt/shmem_put_latency |
| 125 | + |
| 126 | +# Using MPI: |
| 127 | +NVSHMEM_BOOTSTRAP=MPI mpirun -x LD_LIBRARY_PATH -x NVSHMEM_BOOTSTRAP -H host1,host2 /opt/nvshmem-3.2.5/bin/perftest/device/pt-to-pt/shmem_put_latency |
| 128 | +``` |