Perplexity MoE Kernels
==========

- Installation
- -----
+ # Installation

- ```
+ ```bash
cd pplx-kernels
- pip install -e . -vvv
+ TORCH_CUDA_ARCH_LIST=9.0a+PTX python3 setup.py bdist_wheel
+ pip install dist/*.whl
```

- Testing
- -----
+ # Testing

To build the C++ tests and benchmarks:

- ```
+ ```bash
cd pplx-kernels
mkdir build-cmake
cd build-cmake
@@ -33,25 +32,63 @@ ninja test_all_to_all bench_all_to_all

To run the all-to-all tests on one node:

- ```
- NVSHMEM_REMOTE_TRANSPORT=None mpirun -np 4 ./test_all_to_all
+ ```bash
+ NVSHMEM_REMOTE_TRANSPORT=none mpirun -np 4 ./test_all_to_all
```


To run the all-to-all benchmarks on one node:

- ```
- NVSHMEM_REMOTE_TRANSPORT=None mpirun -np 4 ./bench_all_to_all
+ ```bash
+ NVSHMEM_REMOTE_TRANSPORT=none mpirun -np 4 ./bench_all_to_all
```


- Inter-Node Benchmarks
- -----
+ # Inter-Node Benchmarks

- To test on a 32-device cluster spread across 4 nodes, run the following command on all nodes, alternating the rank from 0 to 3 and setting the master address to point to one of the nodes:
+ To test on a 32-device cluster spread across 4 nodes, run the following commands on every node, giving each node a distinct rank from 0 to 3 and setting the master address to point to the rank-0 node:

+ ```bash
+ export NODE_RANK=<rank>
+ export WORLD_SIZE=32
+ export WORLD_LOCAL_SIZE=8
+ export MASTER_ADDR=<master-address>
+ export MASTER_PORT=29500
+ export NVSHMEM_IB_ENABLE_IBGDA=1
+ python3 -m tests.bench_all_to_all
```
- cd pplx-kernels
- pip install -e . -vvv
- NVSHMEM_IB_ENABLE_IBGDA=1 NODE_RANK=<rank> WORLD_SIZE=32 WORLD_LOCAL_SIZE=8 MASTER_ADDR=<master-address> MASTER_PORT=29500 python3 -m tests.bench_all_to_all
- ```
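The launch settings above are related by simple arithmetic: 32 devices with 8 per node gives 4 nodes, so valid node ranks are 0 through 3. A minimal pre-flight sketch of that check follows. The function name `check_launch_env` is hypothetical and not part of pplx-kernels, and the sketch assumes `WORLD_SIZE` counts devices across all nodes while `WORLD_LOCAL_SIZE` counts devices per node, as the 32/8/4 example suggests.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of pplx-kernels): sanity-check launch values.
# Arguments: <world_size> <world_local_size> <node_rank>
check_launch_env() {
    local world_size="$1" local_size="$2" node_rank="$3"

    # Assumption: the world size is a whole number of equally sized nodes.
    if (( local_size <= 0 || world_size % local_size != 0 )); then
        echo "error: WORLD_SIZE must be a positive multiple of WORLD_LOCAL_SIZE" >&2
        return 1
    fi

    local num_nodes=$(( world_size / local_size ))

    # Node ranks run from 0 to num_nodes - 1.
    if (( node_rank < 0 || node_rank >= num_nodes )); then
        echo "error: NODE_RANK must be in [0, $(( num_nodes - 1 ))]" >&2
        return 1
    fi

    echo "nodes=${num_nodes} node_rank=${node_rank}"
}

# The 32-device, 4-node example: prints "nodes=4 node_rank=0".
check_launch_env 32 8 0
```

On a real node, the same function could be called with `"$WORLD_SIZE" "$WORLD_LOCAL_SIZE" "$NODE_RANK"` after the exports and before starting the benchmark.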
+
+ # Benchmark Results
+
+ 1 token per GPU:
+
+ | 1 tok per GPU | EP128 | EP64 | EP32 | EP16 | EP8 |
+ | :------------------:| :-----------------:| :----------------:| :----------------:| :----------------:| :---------------:|
+ | NVLINK Dispatch | x | x | x | x | 41.6μs ± 1.3μs |
+ | IBGDA Dispatch | 125.9μs ± 0.6μs | 121.0μs ± 0.2μs | 115.7μs ± 1.4μs | 102.7μs ± 8.7μs | x |
+ | IBRC Dispatch | 488.4μs ± 51.0μs | 525.0μs ± 9.4μs | 421.2μs ± 35.5μs | 290.5μs ± 4.7μs | x |
+ | NVLINK Combine | x | x | x | x | 41.7μs ± 3.0μs |
+ | IBGDA Combine | 63.2μs ± 8.3μs | 58.6μs ± 1.0μs | 55.4μs ± 0.8μs | 62.7μs ± 0.7μs | x |
+ | IBRC Combine | 786.8μs ± 149.8μs | 400.0μs ± 47.9μs | 122.1μs ± 38.2μs | 85.9μs ± 5.3μs | x |
+ | Torch AtA | 132.0μs ± 25.9μs | 101.6μs ± 15.7μs | 95.7μs ± 14.3μs | 109.7μs ± 3.1μs | 24.4μs ± 16.3μs |
+ | NVLINK NVSHMEM AtA | x | x | x | x | 59.9μs ± 30.7μs |
+ | IBGDA NVSHMEM AtA | 132.4μs ± 73.3μs | 95.3μs ± 23.5μs | 77.3μs ± 23.0μs | 71.7μs ± 14.6μs | x |
+ | IBRC NVSHMEM AtA | 258.8μs ± 145.3μs | 98.9μs ± 57.1μs | 63.2μs ± 20.3μs | 55.4μs ± 12.6μs | x |
+
+
+ 128 tokens per GPU:
+
+ | 128 tok per GPU | EP128 | EP64 | EP32 | EP16 | EP8 |
+ | :------------------:| :------------------:| :------------------:| :------------------:| :-----------------:| :-----------------:|
+ | DeepEP Dispatch | 192μs | 186μs | 182μs | 173μs | 163μs |
+ | NVLINK Dispatch | x | x | x | x | 83.6μs ± 1.0μs |
+ | IBGDA Dispatch | 307.7μs ± 3.0μs | 317.4μs ± 1.5μs | 427.6μs ± 1.4μs | 622.4μs ± 1.7μs | x |
+ | IBRC Dispatch | 2038.5μs ± 77.0μs | 1669.3μs ± 64.0μs | 973.5μs ± 37.9μs | 687.1μs ± 12.9μs | x |
+ | DeepEP Combine | 369μs | 353μs | 350μs | 329μs | 318μs |
+ | NVLINK Combine | x | x | x | x | 102.3μs ± 0.6μs |
+ | IBGDA Combine | 593.9μs ± 6.6μs | 529.9μs ± 6.7μs | 481.4μs ± 3.6μs | 668.1μs ± 3.4μs | x |
+ | IBRC Combine | 1184.8μs ± 79.7μs | 1058.5μs ± 49.6μs | 916.5μs ± 45.1μs | 633.4μs ± 14.0μs | x |
+ | Torch AtA | 4972.0μs ± 135.8μs | 5418.1μs ± 241.4μs | 4225.9μs ± 69.5μs | 3213.9μs ± 19.7μs | 699.9μs ± 2.2μs |
+ | NVLINK NVSHMEM AtA | x | x | x | x | 6585.3μs ± 2.4μs |
+ | IBGDA NVSHMEM AtA | 6180.1μs ± 344.7μs | 6916.3μs ± 315.4μs | 4603.4μs ± 133.1μs | 3444.8μs ± 15.3μs | x |
+ | IBRC NVSHMEM AtA | 6378.5μs ± 375.9μs | 6625.1μs ± 371.3μs | 4371.3μs ± 148.8μs | 3410.1μs ± 20.2μs | x |