Commit 50d830d

update readme
1 parent 514656b commit 50d830d

File tree: 1 file changed (+49, −37 lines)


README.md

Lines changed: 49 additions & 37 deletions
@@ -1,64 +1,41 @@
-Perplexity MoE Kernels
-==========
+# Perplexity MoE Kernels

-# Installation
+## Installation

 ```bash
 cd pplx-kernels
 TORCH_CUDA_ARCH_LIST=9.0a+PTX python3 setup.py bdist_wheel
 pip install dist/*.whl
 ```

-# Testing
+## Single-node Testing and Benchmarking

-To build the C++ tests and benchmarks:
-
-```bash
-cd pplx-kernels
-mkdir build-cmake
-cd build-cmake
-
-TORCH_PREFIX_PATH=$(python3 -c 'import torch; print(torch.utils.cmake_prefix_path)')
-
-cmake ../csrc \
-  -GNinja \
-  -DCMAKE_PREFIX_PATH=$TORCH_PREFIX_PATH \
-  -DTORCH_CUDA_ARCH_LIST=9.0a+PTX \
-  -DWITH_TESTS=ON \
-  -DWITH_BENCHMARKS=ON
-
-ninja test_all_to_all bench_all_to_all
-```
-
-To run the all-to-all tests on one node:
+Test:

 ```bash
-NVSHMEM_REMOTE_TRANSPORT=none mpirun -np 4 ./test_all_to_all
+pytest -svx --tb=short tests
 ```

-
-To run the all-to-all benchmarks on one node:
+Benchmark:

 ```bash
-NVSHMEM_REMOTE_TRANSPORT=none mpirun -np 4 ./bench_all_to_all
+python3 -m tests.bench_all_to_all
 ```

-
-# Inter-Node Benchmarks
-
-To test on a 32-device cluster spread across 4 nodes, run the following command on all nodes, alternating the rank from 0 to 3 and setting the master address to point to the rank-0 node:
+## Multi-node Testing and Benchmarking

 ```bash
-export NODE_RANK=<rank>
-export WORLD_SIZE=32
+export NODE_RANK=  # 0, 1, ..., num_nodes-1
+export WORLD_SIZE=  # num_nodes * 8
 export WORLD_LOCAL_SIZE=8
-export MASTER_ADDR=<master-address>
+export MASTER_ADDR=  # IP address of rank-0 node
 export MASTER_PORT=29500
 export NVSHMEM_IB_ENABLE_IBGDA=1
-python3 -m tests.bench_all_to_all
 ```

-# Benchmark Results
+After setting these environment variables, the commands to run the tests and benchmarks are the same as in the single-node case.
+
+## Benchmark Results

 1 token per GPU:

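For concreteness, the placeholder variables in the new multi-node section might be filled in as follows for a 4-node, 32-GPU cluster (the example the old "Inter-Node Benchmarks" section described). The rank and the address `10.0.0.1` are illustrative values, not from this commit:

```shell
# Hypothetical 4-node cluster with 8 GPUs per node; run on the rank-0 node.
export NODE_RANK=0                 # this node's rank: 0, 1, 2, or 3
export WORLD_SIZE=$((4 * 8))       # num_nodes * GPUs-per-node = 32
export WORLD_LOCAL_SIZE=8
export MASTER_ADDR=10.0.0.1        # illustrative IP of the rank-0 node
export MASTER_PORT=29500
export NVSHMEM_IB_ENABLE_IBGDA=1

echo "$WORLD_SIZE"                 # prints 32
```

On the other nodes only `NODE_RANK` changes; `MASTER_ADDR` always points at the rank-0 node.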
@@ -92,3 +69,38 @@ python3 -m tests.bench_all_to_all
 | NVLINK NVSHMEM AtA | x | x | x | x | 6585.3μs ± 2.4μs |
 | IBGDA NVSHMEM AtA | 6180.1μs ± 344.7μs | 6916.3μs ± 315.4μs | 4603.4μs ± 133.1μs | 3444.8μs ± 15.3μs | x |
 | IBRC NVSHMEM AtA | 6378.5μs ± 375.9μs | 6625.1μs ± 371.3μs | 4371.3μs ± 148.8μs | 3410.1μs ± 20.2μs | x |
+
+
+## C++ Testing
+
+To build the C++ tests and benchmarks:
+
+```bash
+cd pplx-kernels
+mkdir build-cmake
+cd build-cmake
+
+TORCH_PREFIX_PATH=$(python3 -c 'import torch; print(torch.utils.cmake_prefix_path)')
+
+cmake ../csrc \
+  -GNinja \
+  -DCMAKE_PREFIX_PATH=$TORCH_PREFIX_PATH \
+  -DTORCH_CUDA_ARCH_LIST=9.0a+PTX \
+  -DWITH_TESTS=ON \
+  -DWITH_BENCHMARKS=ON
+
+ninja test_all_to_all bench_all_to_all
+```
+
+To run the all-to-all tests on one node:
+
+```bash
+NVSHMEM_REMOTE_TRANSPORT=none mpirun -np 4 ./test_all_to_all
+```
+
+
+To run the all-to-all benchmarks on one node:
+
+```bash
+NVSHMEM_REMOTE_TRANSPORT=none mpirun -np 4 ./bench_all_to_all
+```
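As a sketch of the arithmetic behind the multi-node launch variables introduced by this commit (the rank formula follows the usual torch.distributed convention and is an illustration, not code from this repository):

```python
# WORLD_SIZE = num_nodes * WORLD_LOCAL_SIZE, and a process's global rank
# is conventionally NODE_RANK * WORLD_LOCAL_SIZE + local_rank.
# Illustrative convention sketch only, not taken from pplx-kernels.

num_nodes = 4
world_local_size = 8                      # GPUs per node
world_size = num_nodes * world_local_size # 32 for the 4-node example

def global_rank(node_rank: int, local_rank: int) -> int:
    """Global rank of the local_rank-th GPU on the node_rank-th node."""
    return node_rank * world_local_size + local_rank

assert world_size == 32
assert global_rank(0, 0) == 0                 # first GPU on the rank-0 node
assert global_rank(3, 7) == world_size - 1    # last GPU on the last node
```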
