Commit c5d3454 (parent 88a4d77)

Added updated TP results with new custom collectives: pytorch/pytorch#114001
1 file changed: README.md (+14 lines, -8 lines)
@@ -44,7 +44,7 @@ export MODEL_REPO=meta-llama/Llama-2-7b-chat-hf
 ```
 
 ## Benchmarks
-Benchmarks run on an A100-80GB, power limited to 330W.
+Benchmarks run on an 8xA100-80GB, power limited to 330W with a hybrid cube mesh topology. Note that all benchmarks are run at *batch size=1*, making the reported tokens/s numbers equivalent to "tokens/s/user". In addition, they are run with a very small prompt length (just 5 tokens).
 
 | Model | Technique | Tokens/Second | Memory Bandwidth (GB/s) |
 | -------- | ------- | ------ | ------ |
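A rough sanity check on the Memory Bandwidth column, assuming (this is an inference, not stated in the commit) that it is computed as tokens/s multiplied by the bytes of weights read per token: 104.9 tokens/s × ~13.3 GB of fp16 Llama-2-7B weights ≈ 1397 GB/s, which matches the single-GPU row in the table below.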
@@ -62,13 +62,19 @@ Benchmarks run on an A100-80GB, power limited to 330W.
 | Model | Number of GPUs | Tokens/Second | Memory Bandwidth (GB/s) |
 | -------- | ------- | ------ | ------ |
 | Llama-2-7B | 1 | 104.9 | 1397.31 |
-| | 2 | 136.27 | 954.01 |
-| | 4 | 168.78 | 635.09 |
-| | 8 | 179.27 | 395.85 |
+| | 2 | 168.84 | 1181.99 |
+| | 4 | 254.02 | 955.83 |
+| | 8 | 328.43 | 704.10 |
 | Llama-2-70B | 1 | OOM | |
-| | 2 | 20.53 | 1426.41 |
-| | 4 | 34.15 | 1204.62 |
-| | 8 | 47.25 | 858.28 |
+| | 2 | 21.32 | 1481.87 |
+| | 4 | 38.01 | 1340.76 |
+| | 8 | 62.50 | 1135.29 |
+
+### Tensor Parallelism + Quantization
+| Model | Technique | Tokens/Second | Memory Bandwidth (GB/s) |
+| Llama-2-70B | Base | 62.50 | 1135.29 |
+| | 8-bit | 80.44 | 752.04 |
+| | 4-bit (G=32) | 90.77 | 548.10 |
 
 ### AMD
 Benchmarks run on one GCD of a MI-250x.
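For context on how the quantized rows above would be produced: a minimal sketch, assuming the quantize.py modes and checkpoint naming described elsewhere in this README; the exact flags and output filenames are assumptions, not part of this commit.

```bash
# Assumed flow, not part of this commit: quantize the checkpoint first
# (int8, and int4 with groupsize 32 to match the "G=32" row).
python quantize.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --mode int8
python quantize.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --mode int4 --groupsize 32

# Then run the tensor-parallel launch from this commit against the quantized weights.
ENABLE_INTRA_NODE_COMM=1 torchrun --standalone --nproc_per_node=8 generate.py \
  --compile --checkpoint_path checkpoints/$MODEL_REPO/model_int8.pth
```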
@@ -126,7 +132,7 @@ Note: Running on an A100 80GB, albeit power-limited to 330 watts. Empirically, s
 
 ## Tensor Parallelism
 ```bash
-torchrun --standalone --nproc_per_node=2 generate.py --compile --checkpoint_path checkpoints/$MODEL_REPO/model.pth
+ENABLE_INTRA_NODE_COMM=1 torchrun --standalone --nproc_per_node=2 generate.py --compile --checkpoint_path checkpoints/$MODEL_REPO/model.pth
 ```
 
 ## Experimental
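The new ENABLE_INTRA_NODE_COMM=1 environment variable opts into the custom intra-node collectives from pytorch/pytorch#114001 referenced in the commit message. Presumably the GPU counts in the benchmark tables map directly to --nproc_per_node; a sketch for the 8-GPU rows, under that assumption:

```bash
# Assumption: --nproc_per_node matches the "Number of GPUs" column in the tables above.
ENABLE_INTRA_NODE_COMM=1 torchrun --standalone --nproc_per_node=8 generate.py \
  --compile --checkpoint_path checkpoints/$MODEL_REPO/model.pth
```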
