
Commit 6b13787

[feat]: Add learning path for vLLM high throughput serving on aarch64 server CPUs
Signed-off-by: Nikhil Gupta <[email protected]>
1 parent e8bad70 commit 6b13787

File tree

5 files changed: +466 −0 lines changed

Lines changed: 120 additions & 0 deletions

---
title: Overview and Optimized Build
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## What is vLLM?

vLLM is an open‑source, high‑throughput inference and serving engine for large language models. It focuses on efficient execution of the LLM prefill and decode phases with:

- Continuous batching to keep hardware busy across many requests.
- KV cache management to sustain concurrency during generation.
- Token streaming so results appear as they are produced.

You interact with vLLM in multiple ways:

- OpenAI‑compatible server: expose `/v1/chat/completions` for easy integration.
- Python API: load a model and generate locally when needed.

vLLM works well with Hugging Face models, supports single‑prompt and batch workloads, and scales from quick tests to production serving.

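For a quick feel for the Python API path, the snippet below is a minimal offline generation sketch; the model name matches the TinyLlama model used later in this Learning Path, and the sampling settings are illustrative only:

```python
from vllm import LLM, SamplingParams

# Minimal offline generation sketch (model and sampling values are examples)
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", dtype="bfloat16")
params = SamplingParams(temperature=0.7, max_tokens=64)

# Generate a single completion and print the text
outputs = llm.generate(["Explain continuous batching in one sentence."], params)
print(outputs[0].outputs[0].text)
```
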
## What you build

You build a CPU‑optimized vLLM for aarch64 with oneDNN and the Arm Compute Library (ACL). You then validate the build with a quick offline chat example.

## Why this is fast on Arm

- Optimized kernels: the aarch64 vLLM build uses oneDNN directly with the Arm Compute Library for key operations.
- 4‑bit weight quantization: INT4 quantization is supported and accelerated by Arm KleidiAI microkernels.
- Efficient MoE execution: fused INT4 quantized expert layers reduce memory traffic and improve throughput.
- Optimized paged attention: an Arm SIMD‑tuned paged attention implementation in vLLM.
- System tuning: thread affinity and `tcmalloc` help keep latency and allocator overhead low under load.

## Before you begin

- Use Python 3.12 on Ubuntu 22.04 or later.
- Make sure you have at least 32 vCPUs, 64 GB RAM, and 32 GB of free disk space.

Install the minimum system package used by vLLM on Arm:

```bash
sudo apt-get update -y
sudo apt-get install -y libnuma-dev
```

Optional performance helper you can install now or later:

```bash
sudo apt-get install -y libtcmalloc-minimal4
```

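If you install `tcmalloc`, you can preload it so vLLM picks it up. The library path below is the usual Ubuntu aarch64 location (it is also used in the serving step later) and may differ on your distribution:

```bash
# Preload tcmalloc for lower allocator overhead (path may vary by distribution)
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4
```
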
{{% notice Note %}}
On aarch64, vLLM’s CPU backend automatically builds with the Arm Compute Library via oneDNN.
{{% /notice %}}

## Build vLLM for aarch64 CPU

Create and activate a virtual environment:

```bash
python3 -m venv vllm_env
source vllm_env/bin/activate
python -m pip install --upgrade pip
```

Clone vLLM and install build requirements:

```bash
git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout 5fb4137
pip install -r requirements/cpu.txt -r requirements/cpu-build.txt
```

Build a wheel targeted at CPU:

```bash
VLLM_TARGET_DEVICE=cpu python3 setup.py bdist_wheel
```

Install the wheel. Use `--no-deps` for incremental installs to avoid clobbering your environment:

```bash
pip install --force-reinstall dist/*.whl              # fresh install
# pip install --no-deps --force-reinstall dist/*.whl  # incremental build
```

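As a quick sanity check that the wheel landed in the active virtual environment, you can import the package and print its version (an optional check, not part of the build itself):

```bash
python -c "import vllm; print(vllm.__version__)"
```
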
{{% notice Tip %}}
Do not delete the vLLM repository. The local checkout is required for correct inference on aarch64 CPUs after installing the wheel.
{{% /notice %}}

## Quick validation with offline inference

Run the built‑in chat example to confirm the build:

```bash
python examples/offline_inference/basic/chat.py \
    --dtype=bfloat16 \
    --model TinyLlama/TinyLlama-1.1B-Chat-v1.0
```

You should see tokens streaming and a final response. This verifies the optimized vLLM build on your Arm server.

```output
Generated Outputs:
--------------------------------------------------------------------------------
Prompt: None

Generated text: 'The Importance of Higher Education\n\nHigher education is a fundamental right'
--------------------------------------------------------------------------------
Adding requests: 100%|████████████████████| 10/10 [00:00<00:00, 9552.05it/s]
Processed prompts: 100%|████████████████████| 10/10 [00:01<00:00, 6.78it/s, est. speed input: 474.32 toks/s, output: 108.42 toks/s]
...
```

{{% notice Note %}}
As CPU support in vLLM continues to mature, manual builds will be replaced by a simple `pip install` flow for easier setup in the near future.
{{% /notice %}}

Lines changed: 148 additions & 0 deletions

---
title: Quantize an LLM to INT4 for Arm Platform
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

You can accelerate many LLMs on Arm CPUs with 4‑bit quantization. This guide uses `deepseek-ai/DeepSeek-V2-Lite` as the example model, which is accelerated by the INT4 path in vLLM using Arm KleidiAI microkernels.

## Install quantization tools

Install the vLLM model quantization packages:

```bash
pip install --no-deps compressed-tensors
pip install llmcompressor
```

Reinstall your locally built vLLM if you rebuilt it:

```bash
pip install --no-deps dist/*.whl
```

If your chosen model is gated on Hugging Face, authenticate first:

```bash
huggingface-cli login
```

## INT4 quantization recipe

Save the following as `quantize_vllm_models.py`:

```python
import argparse
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

from compressed_tensors.quantization import QuantizationScheme
from compressed_tensors.quantization.quant_args import (
    QuantizationArgs,
    QuantizationStrategy,
    QuantizationType,
)

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization.quantization import QuantizationModifier


def main():
    parser = argparse.ArgumentParser(description="Quantize a model using INT4 with minmax or mse and dynamic activation quantization.")
    parser.add_argument("model_id", type=str, help="Model identifier or path")
    parser.add_argument("--method", type=str, choices=["minmax", "mse"], default="mse", help="Quantization method")
    parser.add_argument("--scheme", type=str, choices=["channelwise", "groupwise"], required=True, help="Quantization scheme for weights")
    parser.add_argument("--groupsize", type=int, default=32, help="Group size for groupwise quantization")
    args = parser.parse_args()

    # Extract base model name for output dir
    base_model_name = os.path.basename(args.model_id.rstrip("/"))
    act_tag = "a8dyn"
    suffix = f"{args.method}-{args.scheme}"
    if args.scheme == "groupwise":
        suffix += f"-g{args.groupsize}"
    output_dir = f"{base_model_name}-w4{act_tag}-{suffix}"

    print(f"Loading model '{args.model_id}'...")
    model = AutoModelForCausalLM.from_pretrained(
        args.model_id, trust_remote_code=True
    )
    model = model.to(torch.float32)
    tokenizer = AutoTokenizer.from_pretrained(args.model_id)

    # Weight quantization args
    strategy = QuantizationStrategy.CHANNEL if args.scheme == "channelwise" else QuantizationStrategy.GROUP
    weights_args = QuantizationArgs(
        num_bits=4,
        type=QuantizationType.INT,
        strategy=strategy,
        symmetric=True,
        dynamic=False,
        group_size=args.groupsize if args.scheme == "groupwise" else None,
        observer=args.method,
    )

    # Activation quantization
    input_acts = QuantizationArgs(
        num_bits=8,
        type=QuantizationType.INT,
        strategy=QuantizationStrategy.TOKEN,
        symmetric=False,
        dynamic=True,
        observer=None,
    )
    output_acts = None

    # Create quantization scheme and recipe
    scheme = QuantizationScheme(
        targets=["Linear"],
        weights=weights_args,
        input_activations=input_acts,
        output_activations=output_acts,
    )
    recipe = QuantizationModifier(
        config_groups={"group_0": scheme},
        ignore=["lm_head"],
    )

    # Run quantization
    oneshot(
        model=model,
        recipe=recipe,
        tokenizer=tokenizer,
        output_dir=output_dir,
        trust_remote_code_model=True,
    )

    print(f"Quantized model saved to: {output_dir}")


if __name__ == "__main__":
    main()
```

This script creates an Arm KleidiAI‑accelerated 4‑bit quantized copy of the model and saves it to a new directory.

## Quantize the DeepSeek‑V2‑Lite model

### Quantization parameter tuning

1. Choose the `minmax` method for faster quantization, or `mse` for more accurate but slower quantization.
2. `channelwise` is a good default for most models.
3. `groupwise` can improve accuracy further; `--groupsize 32` is common.

```bash
# DeepSeek example
python quantize_vllm_models.py deepseek-ai/DeepSeek-V2-Lite \
    --scheme channelwise --method mse
```

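If you want to try the `groupwise` scheme instead, the same script accepts the flags below. This is an optional alternative run, and the output directory name changes accordingly (for example, `DeepSeek-V2-Lite-w4a8dyn-mse-groupwise-g32`):

```bash
# Optional: groupwise variant (can improve accuracy further; see the tuning notes above)
python quantize_vllm_models.py deepseek-ai/DeepSeek-V2-Lite \
    --scheme groupwise --groupsize 32 --method mse
```
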
The 4‑bit quantized DeepSeek‑V2‑Lite model from the channelwise command above is stored in the directory:

```text
DeepSeek-V2-Lite-w4a8dyn-mse-channelwise
```

You will load this quantized model directory with vLLM in the next step.

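Before moving on, you can optionally confirm that the quantization artifacts were written. A directory listing (a simple check; exact file names vary by model and llmcompressor version) should show a model config, tokenizer files, and quantized safetensors weights:

```bash
ls DeepSeek-V2-Lite-w4a8dyn-mse-channelwise
```
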
Lines changed: 126 additions & 0 deletions

---
title: Serve high throughput inference with vLLM
weight: 4

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## About batch sizing in vLLM

vLLM enforces two limits to balance memory use and throughput: a per‑sequence length limit (`max_model_len`) and a per‑batch token limit (`max_num_batched_tokens`). No single request can exceed the sequence limit, and the sum of tokens in a batch must stay within the batch limit. For example, with both limits set to 4096, one request can use the full 4096‑token context, while several shorter prompts can be batched together as long as their combined token count stays at or below 4096.

## Serve an OpenAI‑compatible API

Start the server with sensible CPU default parameters and a quantized model:

```bash
export VLLM_TARGET_DEVICE=cpu
export VLLM_CPU_KVCACHE_SPACE=32
export VLLM_CPU_OMP_THREADS_BIND="0-$(($(nproc)-1))"
export VLLM_MLA_DISABLE=1
export ONEDNN_DEFAULT_FPMATH_MODE=BF16
export OMP_NUM_THREADS="$(nproc)"
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4

vllm serve DeepSeek-V2-Lite-w4a8dyn-mse-channelwise \
    --dtype float32 --max-model-len 4096 --max-num-batched-tokens 4096
```

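Once the server is up, you can confirm a single request end to end before generating batch traffic. The `curl` call below is a minimal sketch against the OpenAI‑compatible endpoint, assuming the default port 8000:

```bash
# Single-request smoke test against the chat completions endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "DeepSeek-V2-Lite-w4a8dyn-mse-channelwise",
        "messages": [{"role": "user", "content": "Explain KV cache in one sentence."}],
        "max_tokens": 64
      }'
```
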
## Run a multi‑request batch

After confirming that a single request works, as shown above, simulate concurrent load with a small OpenAI‑compatible client. Save this as `batch_test.py`:

```python
import asyncio
import time
from openai import AsyncOpenAI

# vLLM's OpenAI-compatible server
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

model = "DeepSeek-V2-Lite-w4a8dyn-mse-channelwise"  # vLLM server model

# Batch of 8 prompts
messages_list = [
    [{"role": "user", "content": "Explain Big O notation with two examples."}],
    [{"role": "user", "content": "Show a simple recursive function and explain how it works."}],
    [{"role": "user", "content": "Draft a polite email requesting a project deadline extension."}],
    [{"role": "user", "content": "Explain what a hash function is and common uses."}],
    [{"role": "user", "content": "Explain binary search and its time complexity."}],
    [{"role": "user", "content": "Write a Python function that checks if a string is a palindrome."}],
    [{"role": "user", "content": "Explain how caching improves performance with a simple analogy."}],
    [{"role": "user", "content": "Explain the difference between supervised and unsupervised learning."}],
]

CONCURRENCY = 8

async def run_one(i: int, messages):
    resp = await client.chat.completions.create(
        model=model,
        messages=messages,
        max_tokens=128,  # Change as per configuration
    )
    return i, resp.choices[0].message.content

async def main():
    t0 = time.time()
    sem = asyncio.Semaphore(CONCURRENCY)

    async def guarded_run(i, msgs):
        async with sem:
            try:
                return await run_one(i, msgs)
            except Exception as e:
                return i, f"[ERROR] {type(e).__name__}: {e}"

    tasks = [asyncio.create_task(guarded_run(i, msgs)) for i, msgs in enumerate(messages_list, start=1)]
    results = await asyncio.gather(*tasks)  # order corresponds to tasks list

    # Print outputs in input order
    results.sort(key=lambda x: x[0])
    for idx, out in results:
        print(f"\n=== Output {idx} ===\n{out}\n")

    print(f"Batch completed in: {time.time() - t0:.2f}s")

if __name__ == "__main__":
    asyncio.run(main())
```

Run 8 concurrent requests against your server:

```bash
python3 batch_test.py
```

This validates multi‑request behavior and shows aggregate throughput in the server logs.

```output
(APIServer pid=4474) INFO 11-10 01:00:56 [loggers.py:221] Engine 000: Avg prompt throughput: 19.7 tokens/s, Avg generation throughput: 187.2 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.6%, Prefix cache hit rate: 0.0%
(APIServer pid=4474) INFO: 127.0.0.1:44060 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=4474) INFO: 127.0.0.1:44134 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=4474) INFO: 127.0.0.1:44076 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=4474) INFO: 127.0.0.1:44068 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=4474) INFO: 127.0.0.1:44100 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=4474) INFO: 127.0.0.1:44112 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=4474) INFO: 127.0.0.1:44090 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=4474) INFO: 127.0.0.1:44120 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=4474) INFO 11-10 01:01:06 [loggers.py:221] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 57.5 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
```

## Optional: Serve a BF16 non-quantized model

For a BF16 path on Arm, vLLM is accelerated by its direct oneDNN integration, which provides optimized aarch64 kernels for the non-quantized model:

```bash
vllm serve deepseek-ai/DeepSeek-V2-Lite \
    --dtype bfloat16 --max-model-len 4096 \
    --max-num-batched-tokens 4096
```

## Go Beyond: Power Up Your vLLM Workflow

Now that you’ve successfully quantized and served a model using vLLM on Arm, here are some further ways to explore:

* **Try different models:** Apply the same steps to other [Hugging Face models](https://huggingface.co/models) such as Llama, Qwen, or Gemma.

* **Connect a chat client:** Link your server with OpenAI-compatible UIs such as [Open WebUI](https://github.com/open-webui/open-webui).
