Commit e71d268

Merge pull request #166 from yanboliang/llama3-8b: Llama3 8b perf numbers on A100
2 parents 9b908fb + 744c927

File tree: 1 file changed (+7, −0)


README.md

Lines changed: 7 additions & 0 deletions
@@ -70,6 +70,7 @@ codellama/CodeLlama-34b-Python-hf
 mistralai/Mistral-7B-v0.1
 mistralai/Mistral-7B-Instruct-v0.1
 mistralai/Mistral-7B-Instruct-v0.2
+meta-llama/Meta-Llama-3-8B
 ```

 For example, to convert Llama-2-7b-chat-hf
@@ -89,6 +90,8 @@ Benchmarks run on an 8xA100-80GB, power limited to 330W with a hybrid cube mesh
 | Llama-2-70B | Base   | OOM          ||
 | | 8-bit        | 19.13 | 1322.58 |
 | | 4-bit (G=32) | 25.25 | 1097.66 |
+| Llama-3-8B  | Base   | 94.25  | 1411.95 |
+| | 8-bit        | 139.55 | 1047.23 |

 ### Speculative Sampling
 [Verifier: Llama-70B (int4), Draft: Llama-7B (int4)](./scripts/speculate_70B_int4.sh): 48.4 tok/s
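The new Llama-3-8B rows can be sanity-checked against each other: for a memory-bound decoder, achieved bandwidth is roughly tokens/second times the bytes of weights streamed per generated token. A minimal sketch of that arithmetic, assuming Llama-3-8B has about 8×10⁹ parameters stored in bf16 (2 bytes each) — an assumption on our part, not something stated in the diff:

```python
# Rough sanity check: achieved memory bandwidth ~= tokens/sec * bytes of
# weights read per token. The ~8e9 parameter count and bf16 (2-byte) width
# are assumptions, not stated in the diff.
tokens_per_sec = 94.25    # Llama-3-8B, Base row in the table above
bandwidth_gb_s = 1411.95  # reported Memory Bandwidth (GB/s) for that row

bytes_per_token_gb = bandwidth_gb_s / tokens_per_sec
print(f"~{bytes_per_token_gb:.2f} GB read per token")
# ~15 GB per token, in the right ballpark for 8e9 bf16 weights (~16 GB).
```

The 8-bit row tells the same story at roughly half the bytes per token, which is why its tokens/second is correspondingly higher.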
@@ -104,6 +107,10 @@ Benchmarks run on an 8xA100-80GB, power limited to 330W with a hybrid cube mesh
 | | 2 | 21.32 | 1481.87 |
 | | 4 | 38.01 | 1340.76 |
 | | 8 | 62.50 | 1135.29 |
+| Llama-3-8B | 1 | 94.19  | 1411.76 |
+| | 2 | 150.48 | 1208.80 |
+| | 4 | 219.77 | 991.63  |
+| | 8 | 274.65 | 768.55  |

 ### Tensor Parallelism + Quantization
 | Model | Technique | Tokens/Second | Memory Bandwidth (GB/s) |
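The added tensor-parallel rows show sub-linear scaling for Llama-3-8B: going from 1 to 8 GPUs roughly triples throughput rather than multiplying it by eight. A small sketch computing speedup and parallel efficiency from the table values (the variable names are ours):

```python
# Speedup and parallel efficiency for the Llama-3-8B tensor-parallel rows,
# using tokens/second exactly as reported in the table above.
tok_s = {1: 94.19, 2: 150.48, 4: 219.77, 8: 274.65}

for gpus, rate in tok_s.items():
    speedup = rate / tok_s[1]        # throughput relative to a single GPU
    efficiency = speedup / gpus      # fraction of ideal linear scaling
    print(f"{gpus} GPU(s): {speedup:.2f}x speedup, {efficiency:.0%} efficiency")
```

This pattern is expected for a small, memory-bound model: per-GPU communication and kernel-launch overheads grow relative to the shrinking per-GPU share of the weights.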

0 commit comments
