@@ -1,64 +1,41 @@
-Perplexity MoE Kernels
-==========
+# Perplexity MoE Kernels
 
-# Installation
+## Installation
 
 ```bash
 cd pplx-kernels
 TORCH_CUDA_ARCH_LIST=9.0a+PTX python3 setup.py bdist_wheel
 pip install dist/*.whl
 ```
 
-# Testing
+## Single-node Testing and Benchmarking
 
-To build the C++ tests and benchmarks:
-
-```bash
-cd pplx-kernels
-mkdir build-cmake
-cd build-cmake
-
-TORCH_PREFIX_PATH=$(python3 -c 'import torch; print(torch.utils.cmake_prefix_path)')
-
-cmake ../csrc \
-    -GNinja \
-    -DCMAKE_PREFIX_PATH=$TORCH_PREFIX_PATH \
-    -DTORCH_CUDA_ARCH_LIST=9.0a+PTX \
-    -DWITH_TESTS=ON \
-    -DWITH_BENCHMARKS=ON
-
-ninja test_all_to_all bench_all_to_all
-```
-
-To run the all-to-all tests on one node:
+Test:
 
 ```bash
-NVSHMEM_REMOTE_TRANSPORT=none mpirun -np 4 ./test_all_to_all
+pytest -svx --tb=short tests
 ```
 
-
-To run the all-to-all benchmarks on one node:
+Benchmark:
 
 ```bash
-NVSHMEM_REMOTE_TRANSPORT=none mpirun -np 4 ./bench_all_to_all
+python3 -m tests.bench_all_to_all
 ```
 
-
-# Inter-Node Benchmarks
-
-To test on a 32-device cluster spread across 4 nodes, run the following command on all nodes, alternating the rank from 0 to 3 and setting the master address to point to the rank-0 node:
+## Multi-node Testing and Benchmarking
 
 ```bash
-export NODE_RANK=<rank>
-export WORLD_SIZE=32
+export NODE_RANK= # 0, 1, ..., num_nodes-1
+export WORLD_SIZE= # num_nodes * 8
 export WORLD_LOCAL_SIZE=8
-export MASTER_ADDR=<master-address>
+export MASTER_ADDR= # IP address of rank-0 node
 export MASTER_PORT=29500
 export NVSHMEM_IB_ENABLE_IBGDA=1
-python3 -m tests.bench_all_to_all
 ```
 
-# Benchmark Results
+After setting these environment variables, the commands to run the tests and benchmarks are the same as in the single-node case.
+
+## Benchmark Results
 
 1 token per GPU:
 
@@ -92,3 +69,38 @@ python3 -m tests.bench_all_to_all
 | NVLINK NVSHMEM AtA | x | x | x | x | 6585.3μs ± 2.4μs |
 | IBGDA NVSHMEM AtA | 6180.1μs ± 344.7μs | 6916.3μs ± 315.4μs | 4603.4μs ± 133.1μs | 3444.8μs ± 15.3μs | x |
 | IBRC NVSHMEM AtA | 6378.5μs ± 375.9μs | 6625.1μs ± 371.3μs | 4371.3μs ± 148.8μs | 3410.1μs ± 20.2μs | x |
+
+
+## C++ Testing
+
+To build the C++ tests and benchmarks:
+
+```bash
+cd pplx-kernels
+mkdir build-cmake
+cd build-cmake
+
+TORCH_PREFIX_PATH=$(python3 -c 'import torch; print(torch.utils.cmake_prefix_path)')
+
+cmake ../csrc \
+    -GNinja \
+    -DCMAKE_PREFIX_PATH=$TORCH_PREFIX_PATH \
+    -DTORCH_CUDA_ARCH_LIST=9.0a+PTX \
+    -DWITH_TESTS=ON \
+    -DWITH_BENCHMARKS=ON
+
+ninja test_all_to_all bench_all_to_all
+```
+
+To run the all-to-all tests on one node:
+
+```bash
+NVSHMEM_REMOTE_TRANSPORT=none mpirun -np 4 ./test_all_to_all
+```
+
+
+To run the all-to-all benchmarks on one node:
+
+```bash
+NVSHMEM_REMOTE_TRANSPORT=none mpirun -np 4 ./bench_all_to_all
+```
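
The multi-node section of this diff leaves `NODE_RANK`, `WORLD_SIZE`, and `MASTER_ADDR` blank for the user to fill in, with comments describing how each is derived. As a rough illustration of that arithmetic, the sketch below computes the values for one node. The helper names `multi_node_env` and `to_exports` are hypothetical and not part of the repository, the address `10.0.0.1` is a placeholder, and the 8-GPUs-per-node figure is taken from `WORLD_LOCAL_SIZE=8` in the README.

```python
import shlex

# Assumption: 8 GPUs per node, mirroring WORLD_LOCAL_SIZE=8 in the README.
NUM_GPUS_PER_NODE = 8


def multi_node_env(num_nodes, node_rank, master_addr, master_port=29500):
    """Build the environment variables listed in the multi-node section.

    Hypothetical helper for illustration only; it is not part of pplx-kernels.
    """
    if not 0 <= node_rank < num_nodes:
        raise ValueError(f"node_rank must be in [0, {num_nodes - 1}]")
    return {
        "NODE_RANK": str(node_rank),                         # 0, 1, ..., num_nodes-1
        "WORLD_SIZE": str(num_nodes * NUM_GPUS_PER_NODE),    # num_nodes * 8
        "WORLD_LOCAL_SIZE": str(NUM_GPUS_PER_NODE),
        "MASTER_ADDR": master_addr,                          # IP address of rank-0 node
        "MASTER_PORT": str(master_port),
        "NVSHMEM_IB_ENABLE_IBGDA": "1",
    }


def to_exports(env):
    """Render the mapping as shell `export` lines, quoting values where needed."""
    return "\n".join(f"export {key}={shlex.quote(value)}" for key, value in env.items())


if __name__ == "__main__":
    # Example: the node with rank 2 in a 4-node cluster (placeholder address).
    print(to_exports(multi_node_env(num_nodes=4, node_rank=2, master_addr="10.0.0.1")))
```

With four nodes this gives `WORLD_SIZE=32`, which agrees with the 32-device, 4-node example in the removed README text. The printed `export` lines would be sourced on each node before running the same `pytest` or `python3 -m tests.bench_all_to_all` commands as in the single-node case.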