Commit 32ef585

add Install Driver and Dependencies guide
1 parent 50d830d commit 32ef585

2 files changed: 132 additions, 0 deletions

README.md

Lines changed: 4 additions & 0 deletions
@@ -1,5 +1,9 @@
# Perplexity MoE Kernels

## System Requirements

To learn how to set up the system drivers and dependencies, refer to the [Install Driver and Dependencies](docs/install-driver-and-dependencies.md) guide.

## Installation

docs/install-driver-and-dependencies.md

Lines changed: 128 additions & 0 deletions
@@ -0,0 +1,128 @@
# Install Driver and Dependencies

This guide summarizes the software and drivers required to run pplx-kernels on a single node or on a multi-node cluster with Mellanox ConnectX or AWS Elastic Fabric Adapter (EFA) network interfaces. In the table, Y marks a required component and 1 is the value to set for the corresponding NVSHMEM build option. Configure your system and software accordingly.

| Software                  | Single-node | Multi-node with ConnectX | Multi-node with EFA |
|---------------------------|-------------|--------------------------|---------------------|
| NVIDIA Driver             | Y           | Y                        | Y                   |
| modprobe.d/nvidia.conf    |             | Y                        |                     |
| GDRCopy Driver            |             | Y                        | Y                   |
| GDRCopy Library           |             | Y                        | Y                   |
| NVSHMEM Library           | Y           | Y                        | Y                   |
| NVSHMEM_USE_GDRCOPY       |             | 1                        | 1                   |
| NVSHMEM_IBRC_SUPPORT      |             | 1                        |                     |
| NVSHMEM_IBGDA_SUPPORT     |             | 1                        |                     |
| NVSHMEM_LIBFABRIC_SUPPORT |             |                          | 1                   |
| Libfabric Library         |             |                          | Y                   |
| EFA Driver                |             |                          | Y                   |

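The option rows of the table can be read as a lookup from cluster type to NVSHMEM build options. The sketch below shows one way to encode that lookup. `nvshmem_build_flags` is a hypothetical helper, not part of pplx-kernels or NVSHMEM, and it assumes each option can be passed to CMake as a `-D` definition, as the NVSHMEM build command in this guide does for most of them.

```bash
# Hypothetical helper (not part of pplx-kernels or NVSHMEM): print the
# NVSHMEM build options that the table above lists for a cluster type.
nvshmem_build_flags() {
    case "$1" in
        single-node)
            # The table lists no extra build options for single-node use.
            ;;
        connectx)
            printf '%s\n' \
                -DNVSHMEM_USE_GDRCOPY=1 \
                -DNVSHMEM_IBRC_SUPPORT=1 \
                -DNVSHMEM_IBGDA_SUPPORT=1
            ;;
        efa)
            printf '%s\n' \
                -DNVSHMEM_USE_GDRCOPY=1 \
                -DNVSHMEM_LIBFABRIC_SUPPORT=1
            ;;
        *)
            echo "usage: nvshmem_build_flags single-node|connectx|efa" >&2
            return 1
            ;;
    esac
}

# Example: show the options for a multi-node EFA cluster.
nvshmem_build_flags efa
```

The function only prints text, so it is safe to run on a machine without a GPU or network adapter.
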
## NVIDIA Driver Config

To use IBGDA, the NVIDIA driver must be configured to allow the GPU to initiate communication. Per the table above, this step applies only to multi-node clusters with ConnectX.

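```bash
# Append the driver module options that IBGDA needs
echo 'options nvidia NVreg_EnableStreamMemOPs=1 NVreg_RegistryDwords="PeerMappingOverride=1;"' | sudo tee -a /etc/modprobe.d/nvidia.conf
# Rebuild the initramfs so the options apply at boot, then reboot to reload the driver
sudo update-initramfs -u
sudo reboot
```
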
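After the reboot, it can help to confirm that the driver picked up the options. The sketch below assumes the driver exposes its parameters in `/proc/driver/nvidia/params` as `Key: value` lines. Both the path and the line format are assumptions here and may differ between driver versions. `ibgda_driver_params_ok` is a hypothetical helper name.

```bash
# Hypothetical helper: succeed only if both options from
# /etc/modprobe.d/nvidia.conf appear in the driver's parameter listing.
# Assumes lines such as:
#   EnableStreamMemOPs: 1
#   RegistryDwords: "PeerMappingOverride=1;"
ibgda_driver_params_ok() {
    local params_file="${1:-/proc/driver/nvidia/params}"
    grep -q '^EnableStreamMemOPs: 1' "$params_file" 2>/dev/null &&
        grep -q 'PeerMappingOverride=1' "$params_file" 2>/dev/null
}

if ibgda_driver_params_ok; then
    echo "driver options for IBGDA are active"
else
    echo "driver options for IBGDA are not active"
fi
```

The optional argument lets you point the check at a saved copy of the parameter file instead of the live one.
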
## GDRCopy

GDRCopy is required for multi-node setups, with either ConnectX or EFA.

```bash
# Install the build prerequisites
sudo apt-get install -y build-essential devscripts debhelper fakeroot pkg-config dkms
# Download, build, and install GDRCopy under /opt/gdrcopy
wget -O gdrcopy-v2.4.4.tar.gz https://github.com/NVIDIA/gdrcopy/archive/refs/tags/v2.4.4.tar.gz
tar xf gdrcopy-v2.4.4.tar.gz
cd gdrcopy-2.4.4/
sudo make prefix=/opt/gdrcopy -j$(nproc) install

# Build and install the Debian packages, including the gdrdrv DKMS driver.
# The package file names below assume Ubuntu 22.04 and CUDA 12.6.
cd packages/
CUDA=/usr/local/cuda ./build-deb-packages.sh
sudo dpkg -i gdrdrv-dkms_2.4.4_amd64.Ubuntu22_04.deb \
    gdrcopy-tests_2.4.4_amd64.Ubuntu22_04+cuda12.6.deb \
    gdrcopy_2.4.4_amd64.Ubuntu22_04.deb \
    libgdrapi_2.4.4_amd64.Ubuntu22_04.deb
```

Verify the installation by running the GDRCopy copy-bandwidth test:

```bash
/opt/gdrcopy/bin/gdrcopy_copybw
```

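If the bandwidth test fails, a first thing to rule out is that the GDRCopy kernel module is not loaded. The sketch below checks the kernel's module list for `gdrdrv`. `gdrdrv_loaded` is a hypothetical helper name, and the check covers only the module, not the user-space library.

```bash
# Hypothetical helper: succeed if the gdrdrv kernel module appears in the
# kernel's module list. Each line of /proc/modules starts with the module name.
gdrdrv_loaded() {
    local modules_file="${1:-/proc/modules}"
    grep -q '^gdrdrv ' "$modules_file" 2>/dev/null
}

if gdrdrv_loaded; then
    echo "gdrdrv is loaded"
else
    echo "gdrdrv is not loaded"
fi
```
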
## NVSHMEM

NVSHMEM has many build options. Besides the required options listed in the table at the top of this page, the following optional features are available:

* NVSHMEM_MPI_SUPPORT: MPI support
* NVSHMEM_PMIX_SUPPORT: PMIx support (e.g., Slurm)
* NVSHMEM_BUILD_HYDRA_LAUNCHER: Hydra launcher

The example below enables every transport and optional feature listed above, targets CUDA architecture 90a, and takes MPI, PMIx, and Libfabric from /opt/amazon. Change the options and paths to match your system.

```bash
wget https://developer.nvidia.com/downloads/assets/secure/nvshmem/nvshmem_src_3.2.5-1.txz
mkdir nvshmem_src_3.2.5-1
tar xf nvshmem_src_3.2.5-1.txz -C nvshmem_src_3.2.5-1
cd nvshmem_src_3.2.5-1/nvshmem_src
mkdir -p build
cd build
cmake \
    -DNVSHMEM_PREFIX=/opt/nvshmem-3.2.5 \
    -DCMAKE_CUDA_ARCHITECTURES=90a \
    -DNVSHMEM_MPI_SUPPORT=1 \
    -DNVSHMEM_PMIX_SUPPORT=1 \
    -DNVSHMEM_LIBFABRIC_SUPPORT=1 \
    -DNVSHMEM_IBRC_SUPPORT=1 \
    -DNVSHMEM_IBGDA_SUPPORT=1 \
    -DNVSHMEM_BUILD_TESTS=1 \
    -DNVSHMEM_BUILD_EXAMPLES=1 \
    -DNVSHMEM_BUILD_HYDRA_LAUNCHER=1 \
    -DNVSHMEM_BUILD_TXZ_PACKAGE=1 \
    -DMPI_HOME=/opt/amazon/openmpi \
    -DPMIX_HOME=/opt/amazon/pmix \
    -DGDRCOPY_HOME=/opt/gdrcopy \
    -DLIBFABRIC_HOME=/opt/amazon/efa \
    -G Ninja \
    ..
ninja
sudo ninja install
```

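The `*_HOME` paths in the CMake command are specific to one environment, so a quick pre-flight check before configuring can save a failed build. The sketch below reports which of the given directories are missing. `check_dirs_exist` is a hypothetical helper name, and the directory list in the usage example simply repeats the paths from the command above.

```bash
# Hypothetical helper: report every argument that is not an existing
# directory and return non-zero if at least one is missing.
check_dirs_exist() {
    local dir status=0
    for dir in "$@"; do
        if [ ! -d "$dir" ]; then
            echo "missing directory: $dir" >&2
            status=1
        fi
    done
    return "$status"
}

# Example: check the dependency locations used in the CMake command.
if check_dirs_exist /opt/amazon/openmpi /opt/amazon/pmix /opt/gdrcopy /opt/amazon/efa; then
    echo "all dependency directories found"
else
    echo "adjust the *_HOME options before running cmake"
fi
```
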
After installation, set the following environment variables. NVSHMEM_HOME and LD_LIBRARY_PATH apply in every case; of the three transport blocks, set only the one that matches your cluster:

```bash
export NVSHMEM_HOME=/opt/nvshmem-3.2.5
export LD_LIBRARY_PATH=$NVSHMEM_HOME/lib:$LD_LIBRARY_PATH

# For single-node
export NVSHMEM_REMOTE_TRANSPORT=none

# For multi-node with ConnectX
export NVSHMEM_REMOTE_TRANSPORT=ibrc
export NVSHMEM_IB_ENABLE_IBGDA=1

# For multi-node with EFA
export NVSHMEM_REMOTE_TRANSPORT=libfabric
export NVSHMEM_LIBFABRIC_PROVIDER=efa
```

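Because the three transport settings are alternatives, pasting the whole block into a shell leaves the last one (EFA) in effect. One way to make the choice explicit is a small function for a shell profile, sketched below. `set_nvshmem_transport` is a hypothetical helper name; the variable names and values are the ones listed above.

```bash
# Hypothetical helper: export the transport variables for one cluster type.
set_nvshmem_transport() {
    # Clear transport-specific variables left over from an earlier call.
    unset NVSHMEM_IB_ENABLE_IBGDA NVSHMEM_LIBFABRIC_PROVIDER
    case "$1" in
        single-node)
            export NVSHMEM_REMOTE_TRANSPORT=none
            ;;
        connectx)
            export NVSHMEM_REMOTE_TRANSPORT=ibrc
            export NVSHMEM_IB_ENABLE_IBGDA=1
            ;;
        efa)
            export NVSHMEM_REMOTE_TRANSPORT=libfabric
            export NVSHMEM_LIBFABRIC_PROVIDER=efa
            ;;
        *)
            echo "usage: set_nvshmem_transport single-node|connectx|efa" >&2
            return 1
            ;;
    esac
}

# Example: configure the current shell for single-node use.
set_nvshmem_transport single-node
echo "NVSHMEM_REMOTE_TRANSPORT=$NVSHMEM_REMOTE_TRANSPORT"
```

The function must be defined and called in the shell that launches the job (not in a subshell), otherwise the exported variables do not reach the launcher.
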
To install the Hydra launcher:

```bash
# Run from the directory where the NVSHMEM source archive was extracted
cd nvshmem_src_3.2.5-1/nvshmem_src/
# Let make run jobs in parallel to speed up the Hydra build
sed -i 's/^make/make -j/' scripts/install_hydra.sh
sudo bash scripts/install_hydra.sh hydra-build /opt/hydra
```

Verify the installation by running the NVSHMEM put-latency test across two hosts (replace host1 and host2 with your hostnames):

```bash
# Using Hydra:
/opt/hydra/bin/nvshmrun.hydra -genvlist LD_LIBRARY_PATH -hosts host1,host2 -n 2 -ppn 1 /opt/nvshmem-3.2.5/bin/perftest/device/pt-to-pt/shmem_put_latency

# Using MPI:
NVSHMEM_BOOTSTRAP=MPI mpirun -x LD_LIBRARY_PATH -x NVSHMEM_BOOTSTRAP -H host1,host2 /opt/nvshmem-3.2.5/bin/perftest/device/pt-to-pt/shmem_put_latency
```
