For a non-quantized path, vLLM on Arm can run BF16 end-to-end through its oneDNN integration, which routes to Arm-optimized kernels via ACL on aarch64.
Use this BF16 setup to establish a quality reference baseline, then compare throughput and latency against your INT4 deployment to quantify the performance/accuracy trade-offs on your Arm system.
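
As a starting point, the sketch below shows one way to run that BF16 baseline through the vLLM Python API. The model ID is only an example, so swap in the model you have been working with; the snippet assumes the aarch64 CPU build of vLLM used throughout this Learning Path.

```python
# Minimal BF16 baseline with the vLLM Python API (example model ID is an assumption).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumption: replace with your model
    dtype="bfloat16",                          # keep the end-to-end pipeline in BF16
)

sampling = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain KV cache reuse in one paragraph."], sampling)
print(outputs[0].outputs[0].text)
```

You can then rerun your earlier benchmark commands against this BF16 configuration and compare the results with your INT4 numbers.
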
## Go beyond: power up your vLLM workflow
Now that you’ve successfully quantized, served, and benchmarked a model using vLLM on Arm, you can build on what you’ve learned to push performance, scalability, and usability even further.
## Try different models
Explore other Hugging Face models that work well with vLLM and take advantage of Arm acceleration:
- Meta Llama 2 and Llama 3: these versatile models work well for general tasks, and you can try them to compare BF16 and INT4 performance
- Qwen and Qwen-Chat: these models support multiple languages and are tuned for instructions, giving you high-quality results
- Gemma (Google): this compact and efficient model is a good choice for edge devices or deployments where cost matters

You can quantize and serve any of these models using the same `quantize_vllm_models.py` script. Just update the model name in the script.
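
If you want a quick smoke test after quantizing a new model, a minimal sketch is shown below. The local path is a placeholder for wherever your run of `quantize_vllm_models.py` wrote the quantized weights, so adjust it to match your output directory.

```python
# Quick smoke test of a freshly quantized model (output path is a placeholder).
from vllm import LLM, SamplingParams

# Assumption: point this at the directory your quantization run produced.
llm = LLM(model="./Qwen2.5-7B-Instruct-int4")

result = llm.generate(
    ["List two benefits of INT4 quantization for CPU inference."],
    SamplingParams(temperature=0.2, max_tokens=96),
)
print(result[0].outputs[0].text)
```
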
You can also try connecting a chat client by linking your server with OpenAI-compatible user interfaces such as [Open WebUI](https://github.com/open-webui/open-webui).
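
Before pointing a full UI at the server, you can check the OpenAI-compatible endpoint from Python. The sketch below assumes the server you started earlier is listening on the default port 8000 on localhost and does not require an API key.

```python
# Sanity-check vLLM's OpenAI-compatible endpoint (assumes localhost:8000, no auth).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Ask the server which model it is serving, then send a chat request to it.
model_name = client.models.list().data[0].id
reply = client.chat.completions.create(
    model=model_name,
    messages=[{"role": "user", "content": "Give me one tip for serving LLMs on Arm CPUs."}],
    max_tokens=128,
)
print(reply.choices[0].message.content)
```

Open WebUI can use the same base URL when you add the server as an OpenAI-compatible connection, so the address that works for this script also works for the chat client.
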
Continue exploring how Arm efficiency, oneDNN and ACL acceleration, and vLLM dynamic batching work together to provide fast, sustainable, and scalable AI inference on modern Arm architectures.