Merge pull request #166 from yanboliang/llama3-8b

yanboliang · web-flow · commit e71d26801288 · 2024-06-16T19:51:45.000-07:00
Llama3 8b perf numbers on A100
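
A quick sanity check on the two single-GPU rows this commit adds to the benchmark table. This is a minimal sketch, not part of the change itself: it assumes the "Memory Bandwidth (GB/s)" column is approximately the bytes read per generated token times tokens per second, which the README text in this diff does not state. The helper name `implied_gb_per_token` is hypothetical.

```python
# Rows added in this commit: (model, technique, tokens/s, memory bandwidth in GB/s)
ROWS = [
    ("Llama-3-8B", "Base", 94.25, 1411.95),
    ("Llama-3-8B", "8-bit", 139.55, 1047.23),
]


def implied_gb_per_token(tokens_per_second, bandwidth_gb_s):
    """Return GB read per generated token.

    ASSUMPTION: bandwidth ~= (bytes read per token) * (tokens per second).
    """
    return bandwidth_gb_s / tokens_per_second


for model, technique, tok_s, bw in ROWS:
    gb = implied_gb_per_token(tok_s, bw)
    print(f"{model} {technique}: {gb:.2f} GB per token")
```

Under that assumption, the Base row implies about 15.0 GB read per token and the 8-bit row about 7.5 GB, so the two new rows are consistent with 8-bit quantization roughly halving the weight footprint.
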
diff --git a/README.md b/README.md
@@ -70,6 +70,7 @@ codellama/CodeLlama-34b-Python-hf
 mistralai/Mistral-7B-v0.1
 mistralai/Mistral-7B-Instruct-v0.1
 mistralai/Mistral-7B-Instruct-v0.2
+meta-llama/Meta-Llama-3-8B
 ```
 
 For example, to convert Llama-2-7b-chat-hf
@@ -89,6 +90,8 @@ Benchmarks run on an 8xA100-80GB, power limited to 330W with a hybrid cube mesh
 | Llama-2-70B | Base    | OOM     ||
 |           | 8-bit   | 19.13    | 1322.58 |
 |           | 4-bit (G=32)   | 25.25    | 1097.66 |
+| Llama-3-8B  | Base    |  94.25  | 1411.95 |
+|           | 8-bit   | 139.55   | 1047.23 |
 
 ### Speculative Sampling
 [Verifier: Llama-70B (int4), Draft: Llama-7B (int4)](./scripts/speculate_70B_int4.sh): 48.4 tok/s
@@ -104,6 +107,10 @@ Benchmarks run on an 8xA100-80GB, power limited to 330W with a hybrid cube mesh
 |           | 2   | 21.32   | 1481.87 |
 |           | 4   | 38.01   | 1340.76 |
 |           | 8   | 62.50   | 1135.29 |
+| Llama-3-8B  | 1    |  94.19  | 1411.76 |
+|           | 2   | 150.48   | 1208.80 |
+|           | 4   | 219.77   | 991.63 |
+|           | 8   | 274.65   | 768.55 |
 
 ### Tensor Parallelism + Quantization
 | Model    | Technique | Tokens/Second | Memory Bandwidth (GB/s) |