
Commit 3ca20af

committed: comments
Signed-off-by: simon-mo <[email protected]>
1 parent: 99c0847

File tree: 1 file changed (+4, -3 lines)


_posts/2025-04-05-llama4.md

Lines changed: 4 additions & 3 deletions
@@ -7,7 +7,7 @@ thumbnail-img: /assets/figures/llama4/perf.png
 share-img: /assets/figures/llama4/perf.png
 ---
 
-We're excited to announce that vLLM now supports the [Llama 4 herd of models](https://ai.meta.com/blog/llama-4-multimodal-intelligence/): **Scout** (17B-16E) and **Maverick** (17B-128E). You can run these powerful long-context, natively multi-modal (up to 8 images!), mixture-of-experts models in vLLM today by updating to version v0.8.3 or later:
+We're excited to announce that vLLM now supports the [Llama 4 herd of models](https://ai.meta.com/blog/llama-4-multimodal-intelligence/): **Scout** (17B-16E) and **Maverick** (17B-128E). You can run these powerful long-context, natively multi-modal (up to 8-10 images with good results), mixture-of-experts models in vLLM today by updating to version v0.8.3 or later:
 
 ```
 pip install -U vllm
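
For context, the quick-start this hunk touches boils down to installing the release and then serving a checkpoint. As a hedged illustration (not part of this commit), a Scout serve command might look like the sketch below; the flags mirror the Maverick command visible in the next hunk's header, and the tensor-parallel size and context length are assumptions for an 8-GPU node.

```
# Illustrative sketch, not from this commit: serving the Scout checkpoint
# after `pip install -U vllm` (v0.8.3 or later). The parallelism and
# context-length values are assumptions for an 8-GPU node.
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tensor-parallel-size 8 \
  --max-model-len 1000000
```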
@@ -60,9 +60,10 @@ vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 \
 
 **Performance:**
 
-With the configurations above, we observe the following output tokens/s:
+With the configurations above, we observe the following output tokens/s. Note that Scout is smaller but running with bfloat16, while Maverick is running with fp8.
 
 ![](/assets/figures/llama4/perf.png)
+
 While more performance enhancements are on the way, we believe the Llama 4 models' efficient architecture and relatively small size make them practical for scaled usage today.
 
 **Tips for Performance and Long Context:**
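
Once served as in this hunk's header, the model is reachable through vLLM's OpenAI-compatible HTTP API. A minimal query sketch follows, assuming the default port 8000 and the Maverick model name from the serve command:

```
# Minimal sketch: query the OpenAI-compatible endpoint exposed by
# `vllm serve` (default port 8000). The model name matches the serve
# command in the hunk header above.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
        "messages": [{"role": "user", "content": "Say hello."}]
      }'
```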
@@ -74,7 +75,7 @@ While more performance enhancements are on the way, we believe the Llama 4 model
 **Other Hardware Support & Quantizations:**
 
 * A100: We have verified that the bf16 versions of the models work well on A100 GPUs.
-* INT4: An INT4-quantized version of the model checkpoint is currently a work in progress. Stay tuned for updates.
+* INT4: An INT4-quantized version of the Scout model checkpoint is currently a work in progress. Stay tuned for updates.
 * AMD MI300X: You can run Llama 4 on AMD MI300X GPUs by building [vLLM from source](https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html?device=rocm) and using the same commands as above.
 
 **Inference Accuracy Validation:**
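
For the MI300X bullet in the hunk above, the generic build-from-source flow is sketched below. This is illustrative only; the ROCm-specific prerequisites and exact steps are covered by the linked installation guide.

```
# Rough sketch of a generic source build, as the MI300X bullet suggests.
# ROCm prerequisites (ROCm toolkit, a ROCm build of PyTorch) are assumed
# to be installed per the linked docs; exact steps may differ.
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .
```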
