
Commit 3ca20af

committed: comments
Signed-off-by: simon-mo <[email protected]>
1 parent: 99c0847

File tree: 1 file changed (+4, -3 lines)


_posts/2025-04-05-llama4.md

Lines changed: 4 additions & 3 deletions
@@ -7,7 +7,7 @@ thumbnail-img: /assets/figures/llama4/perf.png
 share-img: /assets/figures/llama4/perf.png
 ---
 
-We're excited to announce that vLLM now supports the [Llama 4 herd of models](https://ai.meta.com/blog/llama-4-multimodal-intelligence/): **Scout** (17B-16E) and **Maverick** (17B-128E). You can run these powerful long-context, natively multi-modal (up to 8 images!), mixture-of-experts models in vLLM today by updating to version v0.8.3 or later:
+We're excited to announce that vLLM now supports the [Llama 4 herd of models](https://ai.meta.com/blog/llama-4-multimodal-intelligence/): **Scout** (17B-16E) and **Maverick** (17B-128E). You can run these powerful long-context, natively multi-modal (up to 8-10 images with good results), mixture-of-experts models in vLLM today by updating to version v0.8.3 or later:
 
 ```
 pip install -U vllm
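
For context, the quick-start this hunk touches boils down to installing the release and then serving a checkpoint. As a hedged illustration (not part of this commit), a Scout serve command might look like the sketch below; the flags mirror the Maverick command visible in the next hunk's header, and the tensor-parallel size and context length are assumptions for an 8-GPU node.

```
# Illustrative sketch, not from this commit: serving the Scout checkpoint
# after `pip install -U vllm` (v0.8.3 or later). The parallelism and
# context-length values are assumptions for an 8-GPU node.
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tensor-parallel-size 8 \
  --max-model-len 1000000
```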
@@ -60,9 +60,10 @@ vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 \
 
 **Performance:**
 
-With the configurations above, we observe the following output tokens/s:
+With the configurations above, we observe the following output tokens/s. Note that Scout is smaller but running with bfloat16, while Maverick is running with fp8.
 
 ![](/assets/figures/llama4/perf.png)
+
 While more performance enhancements are on the way, we believe the Llama 4 models' efficient architecture and relatively small size make them practical for scaled usage today.
 
 **Tips for Performance and Long Context:**
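
Once served as in this hunk's header, the model is reachable through vLLM's OpenAI-compatible HTTP API. A minimal query sketch follows, assuming the default port 8000 and the Maverick model name from the serve command:

```
# Minimal sketch: query the OpenAI-compatible endpoint exposed by
# `vllm serve` (default port 8000). The model name matches the serve
# command in the hunk header above.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
        "messages": [{"role": "user", "content": "Say hello."}]
      }'
```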
@@ -74,7 +75,7 @@ While more performance enhancements are on the way, we believe the Llama 4 model
 **Other Hardware Support & Quantizations:**
 
 * A100: We have verified that the bf16 versions of the models work well on A100 GPUs.
-* INT4: An INT4-quantized version of the model checkpoint is currently a work in progress. Stay tuned for updates.
+* INT4: An INT4-quantized version of the Scout model checkpoint is currently a work in progress. Stay tuned for updates.
 * AMD MI300X: You can run Llama 4 on AMD MI300X GPUs by building [vLLM from source](https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html?device=rocm) and using the same commands as above.
 
 **Inference Accuracy Validation:**
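
For the MI300X bullet in the hunk above, the generic build-from-source flow is sketched below. This is illustrative only; the ROCm-specific prerequisites and exact steps are covered by the linked installation guide.

```
# Rough sketch of a generic source build, as the MI300X bullet suggests.
# ROCm prerequisites (ROCm toolkit, a ROCm build of PyTorch) are assumed
# to be installed per the linked docs; exact steps may differ.
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .
```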
