Commit 7af3059

add benchmark results
1 parent 88bd737 commit 7af3059

1 file changed: +55 -18 lines changed

README.md

Lines changed: 55 additions & 18 deletions
@@ -1,20 +1,19 @@
 Perplexity MoE Kernels
 ==========
 
-Installation
------
+# Installation
 
-```
+```bash
 cd pplx-kernels
-pip install -e . -vvv
+TORCH_CUDA_ARCH_LIST=9.0a+PTX python3 setup.py bdist_wheel
+pip install dist/*.whl
 ```
 
-Testing
------
+# Testing
 
 To build the C++ tests and benchmarks:
 
-```
+```bash
 cd pplx-kernels
 mkdir build-cmake
 cd build-cmake
@@ -33,25 +32,63 @@ ninja test_all_to_all bench_all_to_all
 
 To run the all-to-all tests on one node:
 
-```
-NVSHMEM_REMOTE_TRANSPORT=None mpirun -np 4 ./test_all_to_all
+```bash
+NVSHMEM_REMOTE_TRANSPORT=none mpirun -np 4 ./test_all_to_all
 ```
 
 
 To run the all-to-all benchmarks on one node:
 
-```
-NVSHMEM_REMOTE_TRANSPORT=None mpirun -np 4 ./bench_all_to_all
+```bash
+NVSHMEM_REMOTE_TRANSPORT=none mpirun -np 4 ./bench_all_to_all
 ```
 
 
-Inter-Node Benchmarks
------
+# Inter-Node Benchmarks
 
-To test on a 32-device cluster spread across 4 nodes, run the following command on all nodes, alternating the rank from 0 to 3 and setting the master address to point to one of the nodes:
+To test on a 32-device cluster spread across 4 nodes, run the following command on all nodes, alternating the rank from 0 to 3 and setting the master address to point to the rank-0 node:
 
+```bash
+export NODE_RANK=<rank>
+export WORLD_SIZE=32
+export WORLD_LOCAL_SIZE=8
+export MASTER_ADDR=<master-address>
+export MASTER_PORT=29500
+export NVSHMEM_IB_ENABLE_IBGDA=1
+python3 -m tests.bench_all_to_all
 ```
-cd pplx-kernels
-pip install -e . -vvv
-NVSHMEM_IB_ENABLE_IBGDA=1 NODE_RANK=<rank> WORLD_SIZE=32 WORLD_LOCAL_SIZE=8 MASTER_ADDR=<master-address> MASTER_PORT=29500 python3 -m tests.bench_all_to_all
-```
+
+# Benchmark Results
+
+1 token per GPU:
+
+| 1 tok per GPU | EP128 | EP64 | EP32 | EP16 | EP8 |
+|:------------------:|:-----------------:|:----------------:|:----------------:|:----------------:|:---------------:|
+| NVLINK Dispatch | x | x | x | x | 41.6μs ± 1.3μs |
+| IBGDA Dispatch | 125.9μs ± 0.6μs | 121.0μs ± 0.2μs | 115.7μs ± 1.4μs | 102.7μs ± 8.7μs | x |
+| IBRC Dispatch | 488.4μs ± 51.0μs | 525.0μs ± 9.4μs | 421.2μs ± 35.5μs | 290.5μs ± 4.7μs | x |
+| NVLINK Combine | x | x | x | x | 41.7μs ± 3.0μs |
+| IBGDA Combine | 63.2μs ± 8.3μs | 58.6μs ± 1.0μs | 55.4μs ± 0.8μs | 62.7μs ± 0.7μs | x |
+| IBRC Combine | 786.8μs ± 149.8μs | 400.0μs ± 47.9μs | 122.1μs ± 38.2μs | 85.9μs ± 5.3μs | x |
+| Torch AtA | 132.0μs ± 25.9μs | 101.6μs ± 15.7μs | 95.7μs ± 14.3μs | 109.7μs ± 3.1μs | 24.4μs ± 16.3μs |
+| NVLINK NVSHMEM AtA | x | x | x | x | 59.9μs ± 30.7μs |
+| IBGDA NVSHMEM AtA | 132.4μs ± 73.3μs | 95.3μs ± 23.5μs | 77.3μs ± 23.0μs | 71.7μs ± 14.6μs | x |
+| IBRC NVSHMEM AtA | 258.8μs ± 145.3μs | 98.9μs ± 57.1μs | 63.2μs ± 20.3μs | 55.4μs ± 12.6μs | x |
+
+
+128 tokens per GPU:
+
+| 128 tok per GPU | EP128 | EP64 | EP32 | EP16 | EP8 |
+|:------------------:|:------------------:|:------------------:|:------------------:|:-----------------:|:-----------------:|
+| DeepEP Dispatch | 192μs | 186μs | 182μs | 173μs | 163μs |
+| NVLINK Dispatch | x | x | x | x | 83.6μs ± 1.0μs |
+| IBGDA Dispatch | 307.7μs ± 3.0μs | 317.4μs ± 1.5μs | 427.6μs ± 1.4μs | 622.4μs ± 1.7μs | x |
+| IBRC Dispatch | 2038.5μs ± 77.0μs | 1669.3μs ± 64.0μs | 973.5μs ± 37.9μs | 687.1μs ± 12.9μs | x |
+| DeepEP Combine | 369μs | 353μs | 350μs | 329μs | 318μs |
+| NVLINK Combine | x | x | x | x | 102.3μs ± 0.6μs |
+| IBGDA Combine | 593.9μs ± 6.6μs | 529.9μs ± 6.7μs | 481.4μs ± 3.6μs | 668.1μs ± 3.4μs | x |
+| IBRC Combine | 1184.8μs ± 79.7μs | 1058.5μs ± 49.6μs | 916.5μs ± 45.1μs | 633.4μs ± 14.0μs | x |
+| Torch AtA | 4972.0μs ± 135.8μs | 5418.1μs ± 241.4μs | 4225.9μs ± 69.5μs | 3213.9μs ± 19.7μs | 699.9μs ± 2.2μs |
+| NVLINK NVSHMEM AtA | x | x | x | x | 6585.3μs ± 2.4μs |
+| IBGDA NVSHMEM AtA | 6180.1μs ± 344.7μs | 6916.3μs ± 315.4μs | 4603.4μs ± 133.1μs | 3444.8μs ± 15.3μs | x |
+| IBRC NVSHMEM AtA | 6378.5μs ± 375.9μs | 6625.1μs ± 371.3μs | 4371.3μs ± 148.8μs | 3410.1μs ± 20.2μs | x |
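
Reading the added tables is easier with the cells turned into numbers. The sketch below is not part of the commit. It is a small, hypothetical helper that parses a cell such as `125.9μs ± 0.6μs`, a bare `192μs`, or an `x` placeholder, and computes the ratio between two cells. The names `parse_cell` and `ratio` are made up for illustration, and the README does not say whether the `±` figure is a standard deviation or some other spread, so the code only carries it along as `spread_us`.

```python
import re
from typing import Optional, Tuple

# Matches "125.9μs ± 0.6μs" and "192μs". Both the Greek mu (U+03BC) and the
# micro sign (U+00B5) are accepted, since copies of the table may use either.
_CELL = re.compile(
    r"^\s*([0-9]+(?:\.[0-9]+)?)[\u03bc\u00b5]s"
    r"(?:\s*\u00b1\s*([0-9]+(?:\.[0-9]+)?)[\u03bc\u00b5]s)?\s*$"
)


def parse_cell(cell: str) -> Optional[Tuple[float, Optional[float]]]:
    """Return (mean_us, spread_us) for a table cell, or None for an 'x' cell."""
    match = _CELL.match(cell)
    if match is None:
        return None  # 'x' or anything else that is not a latency
    mean_us = float(match.group(1))
    spread_us = float(match.group(2)) if match.group(2) else None
    return mean_us, spread_us


def ratio(numerator: str, denominator: str) -> Optional[float]:
    """Ratio of the mean latencies of two cells, or None if either is missing."""
    top, bottom = parse_cell(numerator), parse_cell(denominator)
    if top is None or bottom is None:
        return None
    return top[0] / bottom[0]


# Cells copied from the "1 token per GPU" table, EP128 column.
ibrc_dispatch = "488.4μs ± 51.0μs"
ibgda_dispatch = "125.9μs ± 0.6μs"
print(f"IBRC / IBGDA dispatch at EP128: {ratio(ibrc_dispatch, ibgda_dispatch):.2f}x")
```

Returning `None` for `x` cells keeps unmeasured configurations out of any comparison instead of treating them as zero.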
