
Commit 9233f5c

charlotte edits
Signed-off-by: simon-mo <[email protected]>
1 parent 3ca20af commit 9233f5c

File tree

2 files changed: +3 additions, −3 deletions


_posts/2025-04-05-llama4.md

Lines changed: 3 additions & 3 deletions
@@ -60,7 +60,7 @@ vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 \
 
 **Performance:**
 
-With the configurations above, we observe the following output tokens/s. Note that Scout is smaller but running with bfloat16 while Maverick is running with fp8.
+With the configurations above, we observe the following output tokens/s for Scout-BF16 and Maverick-FP8:
 
 ![](/assets/figures/llama4/perf.png)
 
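For context, the `vllm serve` invocation truncated in the hunk header above is the serving command whose throughput the changed sentence describes. A minimal sketch of that invocation follows; the flag values are assumptions for an 8-GPU H100 node, not figures taken from this commit:

```bash
# Sketch of the serving setup referenced above. --tensor-parallel-size and
# --max-model-len are standard vLLM flags; the values here are illustrative,
# not the post's exact configuration.
vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 \
  --tensor-parallel-size 8 \
  --max-model-len 430000
```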

@@ -75,7 +75,7 @@ While more performance enhancements are on the way, we believe the Llama 4 model
 **Other Hardware Support & Quantizations:**
 
 * A100: We have verified that the bf16 versions of the models work well on A100 GPUs.
-* INT4: An INT4-quantized version of the Scout model checkpoint is currently a work in progress. Stay tuned for updates.
+* INT4: An INT4-quantized version of the Scout model checkpoint that fits on a single H100 GPU is currently a work in progress. Stay tuned for updates.
 * AMD MI300X: You can run Llama 4 on AMD MI300X GPUs by building [vLLM from source](https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html?device=rocm) and using the same commands as above.
 
 **Inference Accuracy Validation:**
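
The MI300X bullet in the hunk above points to a from-source build. A hedged sketch of what that typically looks like is below; the ROCm-specific prerequisites live in the linked installation guide and are assumed rather than reproduced here:

```bash
# Hedged sketch: build vLLM from source on an MI300X host, then reuse the
# same serve commands as above. Assumes ROCm and a ROCm build of PyTorch
# are already installed per the linked guide.
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .
```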
@@ -85,7 +85,7 @@ We validated inference accuracy against the official Meta report using lm-eval-h
 |----------|---------|---------|
 | Reported | 80.5 | 90 |
 | H100 FP8 | 80.4 | 89.4 |
-| AMD BF16 | 80.4 | 89.4 |
+| AMD MI300x BF16 | 80.4 | 89.4 |
 | H200 BF16 | 80.2 | 89.3 |
 
 ## Efficient Architecture and Cluster Scale Serving
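
Per the hunk header's context line, the table was produced with lm-eval-harness. A sketch of such a run against a vLLM backend is below; the task name is a placeholder, since the header row naming the benchmark columns sits outside this hunk:

```bash
# Hedged sketch of an lm-eval-harness run using its vLLM backend.
# mmlu_pro is illustrative only; the actual benchmarks are named in the
# table header, which this diff hunk does not show.
lm_eval --model vllm \
  --model_args pretrained=meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8,tensor_parallel_size=8 \
  --tasks mmlu_pro \
  --batch_size auto
```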

assets/figures/llama4/perf.png

38.2 KB
