
Commit 093b889

Refine language and formatting in inference guide
1 parent 617b726 commit 093b889


content/learning-paths/servers-and-cloud-computing/vllm-acceleration/3-run-inference-and-serve.md

Lines changed: 16 additions & 15 deletions
@@ -9,17 +9,17 @@ layout: learningpathall
## Batch Sizing in vLLM

vLLM uses dynamic continuous batching to maximize hardware utilization. Two key parameters govern this process:
- * `max_model_len` — The maximum sequence length (number of tokens per request).
+ * `max_model_len`, which is the maximum sequence length (number of tokens per request).
No single prompt or generated sequence can exceed this limit.
- * `max_num_batched_tokens` — The total number of tokens processed in one batch across all requests.
+ * `max_num_batched_tokens`, which is the total number of tokens processed in one batch across all requests.
The sum of input and output tokens from all concurrent requests must stay within this limit.

Together, these parameters determine how much memory the model can use and how effectively CPU threads are saturated.
On Arm-based servers, tuning them helps achieve stable throughput while avoiding excessive paging or cache thrashing.
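These two settings correspond to the `--max-model-len` and `--max-num-batched-tokens` flags of `vllm serve`. A minimal sketch, with an illustrative model name and placeholder values rather than tuned recommendations:

```bash
# Illustrative only: cap each request at 4096 tokens and each batch at 8192 tokens
vllm serve <your-model> \
  --max-model-len 4096 \
  --max-num-batched-tokens 8192
```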

## Serve an OpenAI‑compatible API

- Start vLLM’s OpenAI-compatible API server using the quantized INT4 model and environment variables optimized for performance.
+ Start vLLM’s OpenAI-compatible API server using the quantized INT4 model and environment variables optimized for performance:

```bash
export VLLM_TARGET_DEVICE=cpu
@@ -125,9 +125,9 @@ This validates multi‑request behavior and shows aggregate throughput in the se
(APIServer pid=4474) INFO: 127.0.0.1:44120 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=4474) INFO 11-10 01:01:06 [loggers.py:221] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 57.5 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
```
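Requests like the one in the log above can be sent with any OpenAI-compatible client. A minimal `curl` sketch, assuming the server listens on the default port 8000 and that the model name matches the one passed to `vllm serve`:

```bash
# Send a single chat completion request to the OpenAI-compatible endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "<served-model-name>",
        "messages": [{"role": "user", "content": "Hello from an Arm server"}],
        "max_tokens": 64
      }'
```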
- ## Optional: Serve a BF16 (Non-Quantized) Model
+ ## Serve a BF16 (non-quantized) model (optional)

- For a non-quantized path, vLLM on Arm can run BF16 end-to-end using its oneDNN integration (which routes to Arm-optimized kernels via ACL under aarch64).
+ For a non-quantized path, vLLM on Arm can run BF16 end-to-end using its oneDNN integration (which routes to Arm-optimized kernels using ACL under aarch64).

```bash
vllm serve deepseek-ai/DeepSeek-V2-Lite \
@@ -136,17 +136,18 @@ vllm serve deepseek-ai/DeepSeek-V2-Lite \
```
Use this BF16 setup to establish a quality reference baseline, then compare throughput and latency against your INT4 deployment to quantify the performance/accuracy trade-offs on your Arm system.
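One lightweight way to make that comparison is to time an identical request against both deployments. A sketch, assuming the INT4 server listens on port 8000 and the BF16 server on port 8001 (both ports and the model name are placeholders for your setup):

```bash
# Time the same prompt against the INT4 (8000) and BF16 (8001) servers
for port in 8000 8001; do
  echo "Port ${port}:"
  curl -s -o /dev/null -w "  total time: %{time_total}s\n" \
    http://localhost:${port}/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "<served-model-name>", "messages": [{"role": "user", "content": "Summarize vLLM in one sentence."}], "max_tokens": 128}'
done
```

The server-side throughput logs shown earlier give a complementary tokens-per-second view for the same runs.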

- ## Go Beyond: Power Up Your vLLM Workflow
+ ## Go beyond: power up your vLLM workflow
Now that you’ve successfully quantized, served, and benchmarked a model using vLLM on Arm, you can build on what you’ve learned to push performance, scalability, and usability even further.

- **Try Different Models**
- Extend your workflow to other models on Hugging Face that are compatible with vLLM and can benefit from Arm acceleration:
- * Meta Llama 2 / Llama 3 – Strong general-purpose baselines; excellent for comparing BF16 vs INT4 performance.
- * Qwen / Qwen-Chat – High-quality multilingual and instruction-tuned models.
- * Gemma (Google) – Compact and efficient architecture; ideal for edge or cost-optimized serving.
-
- You can quantize and serve them using the same `quantize_vllm_models.py` recipe, just update the model name.
+ ## Try different models
+ Explore other Hugging Face models that work well with vLLM and take advantage of Arm acceleration:

- **Connect a chat client:** Link your server with OpenAI-compatible UIs like [Open WebUI](https://github.com/open-webui/open-webui)
+ - Meta Llama 2 and Llama 3: these versatile models work well for general tasks, and you can try them to compare BF16 and INT4 performance
+ - Qwen and Qwen-Chat: these models support multiple languages and are tuned for instructions, giving you high-quality results
+ - Gemma (Google): this compact and efficient model is a good choice for edge devices or deployments where cost matters

- You can continue exploring how Arm’s efficiency, oneDNN+ACL acceleration, and vLLM’s dynamic batching combine to deliver fast, sustainable, and scalable AI inference on modern Arm architectures.
+ You can quantize and serve any of these models using the same `quantize_vllm_models.py` script. Just update the model name in the script.
+
+ You can also try connecting a chat client by linking your server with OpenAI-compatible user interfaces such as [Open WebUI](https://github.com/open-webui/open-webui).
+
+ Continue exploring how Arm efficiency, oneDNN and ACL acceleration, and vLLM dynamic batching work together to provide fast, sustainable, and scalable AI inference on modern Arm architectures.
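For the chat-client step, a sketch of one way to point Open WebUI at the local vLLM endpoint, assuming Docker is available and the server listens on port 8000 (the variable names follow Open WebUI’s documented OpenAI connection settings; adjust host networking for your platform):

```bash
# Run Open WebUI and route its completions to the local vLLM server
docker run -d --name open-webui -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 \
  -e OPENAI_API_KEY=unused \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main
```

With this mapping, the chat interface is reachable on http://localhost:3000 and requests are forwarded to the vLLM server started earlier.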
