
Commit c7cf14a

Merge pull request #2540 from pareenaverma/content_review
Tech review of INT4 vllm LP
2 parents 6345817 + db0058c commit c7cf14a

File tree

4 files changed (+126, -62 lines)


content/learning-paths/servers-and-cloud-computing/vllm-acceleration/1-overview-and-build.md

Lines changed: 63 additions & 33 deletions
Original file line number | Diff line number | Diff line change
@@ -8,101 +8,128 @@ layout: learningpathall
88

99
## What is vLLM?
1010

11-
vLLM is an open‑source, high‑throughput inference and serving engine for large language models. It focuses on efficient execution of the LLM inference prefill and decode phases with:
12-
13-
- Continuous batching to keep hardware busy across many requests.
14-
- KV cache management to sustain concurrency during generation.
15-
- Token streaming so results appear as they are produced.
16-
17-
You interact with vLLM in multiple ways:
18-
19-
- OpenAI‑compatible server: expose `/v1/chat/completions` for easy integration.
20-
- Python API: load a model and generate locally when needed.
21-
22-
vLLM works well with Hugging Face models, supports single‑prompt and batch workloads, and scales from quick tests to production serving.
11+
vLLM is an open-source, high-throughput inference and serving engine for large language models (LLMs).
12+
It’s designed to make LLM inference faster, more memory-efficient, and scalable, particularly during the prefill (context processing) and decode (token generation) phases.
13+
14+
### Key Features
15+
* Continuous Batching – Dynamically combines incoming inference requests into a single large batch, maximizing CPU/GPU utilization and throughput.
16+
* KV Cache Management – Efficiently stores and reuses key-value attention states, sustaining concurrency across multiple active sessions while minimizing memory overhead.
17+
* Token Streaming – Streams generated tokens as they are produced, enabling real-time responses for chat or API scenarios.
18+
### Interaction Modes
19+
You can use vLLM in two main ways:
20+
* OpenAI-Compatible REST Server:
21+
vLLM provides a `/v1/chat/completions` endpoint compatible with the OpenAI API schema, making it drop-in ready for tools like LangChain, LlamaIndex, and the official OpenAI Python SDK.
22+
* Python API:
23+
Load and serve models programmatically within your own Python scripts for flexible local inference and evaluation (see the short example below).
24+
25+
vLLM supports Hugging Face Transformer models out-of-the-box and scales seamlessly from single-prompt testing to production batch inference.
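For a quick look at the Python API route mentioned above, here is a minimal sketch; the model name is only an example, and any Hugging Face model supported by vLLM can be substituted:

```python
from vllm import LLM, SamplingParams

# Load a small Hugging Face chat model locally (example model; swap in your own).
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", dtype="bfloat16")

# Sampling settings: light sampling with a capped output length.
params = SamplingParams(temperature=0.7, max_tokens=64)

# Generate completions for a small batch of prompts in one call.
outputs = llm.generate(
    ["What is vLLM?", "Name one benefit of continuous batching."],
    params,
)

for output in outputs:
    print(output.outputs[0].text)
```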
2326

2427
## What you build
2528

26-
You build a CPU‑optimized vLLM for aarch64 with oneDNN and the Arm Compute Library (ACL). You then validate the build with a quick offline chat example.
29+
In this learning path, you will build a CPU-optimized version of vLLM targeting the Arm64 architecture, integrated with oneDNN and the Arm Compute Library (ACL).
30+
This build enables high-performance LLM inference on Arm servers, leveraging specialized Arm math libraries and kernel optimizations.
31+
After compiling, you’ll validate your build by running a local chat example to confirm functionality and measure baseline inference speed.
2732

2833
## Why this is fast on Arm
2934

35+
vLLM’s performance on Arm servers is driven by both software optimization and hardware-level acceleration.
36+
Each component of this optimized build contributes to higher throughput and lower latency during inference:
37+
3038
- Optimized kernels: The aarch64 vLLM build uses direct oneDNN with the Arm Compute Library for key operations.
31-
- 4‑bit weight quantization: INT4 quantization support & acceleration by Arm KleidiAI microkernels.
32-
- Efficient MoE execution: Fused INT4 quantized expert layers reduce memory traffic and improve throughput.
33-
- Optimized Paged attention: Arm SIMD tuned paged attention implementation in vLLM.
34-
- System tuning: Thread affinity and `tcmalloc` help keep latency and allocator overhead low under load.
39+
- 4‑bit weight quantization: vLLM supports INT4 quantized models, and Arm accelerates this using KleidiAI microkernels, which take advantage of DOT-product (SDOT/UDOT) instructions.
40+
- Efficient MoE execution: For Mixture-of-Experts (MoE) models, vLLM fuses INT4 quantized expert layers to reduce intermediate memory transfers, which minimizes bandwidth bottlenecks.
41+
- Optimized paged attention: The paged attention mechanism, which handles token reuse during long-sequence generation, is SIMD-tuned for Arm’s NEON and SVE (Scalable Vector Extension) pipelines.
42+
- System tuning: Using thread affinity ensures efficient CPU core pinning and balanced thread scheduling across Arm clusters.
43+
Additionally, enabling tcmalloc (Thread-Caching Malloc) reduces allocator contention and memory fragmentation under high-throughput serving loads.
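The system-tuning points above translate into a few environment settings at launch time. The following is a minimal sketch; the tcmalloc library path, core range, and KV cache size are assumptions for a typical Ubuntu aarch64 machine and should be adjusted to your system:

```bash
# Preload tcmalloc to reduce allocator contention (from libtcmalloc-minimal4; path may differ).
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4

# Pin vLLM's CPU worker threads to a fixed core range (example: cores 0-31).
export VLLM_CPU_OMP_THREADS_BIND=0-31

# Reserve space (in GiB) for the KV cache on the CPU backend.
export VLLM_CPU_KVCACHE_SPACE=32
```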
3544

3645
## Before you begin
3746

38-
- Use Python 3.12 on Ubuntu 22.04+
39-
- Make sure you have at least 32 vCPUs, 64 GB RAM, and 32 GB free disk.
47+
Verify that your environment meets the following requirements:
48+
49+
* Python version: Use Python 3.12 on Ubuntu 22.04 LTS or later.
50+
* Hardware requirements: At least 32 vCPUs, 64 GB RAM, and 64 GB of free disk space.
4051

41-
Install the minimum system package used by vLLM on Arm:
52+
This Learning Path was validated on an AWS Graviton4 c8g.12xlarge instance with 64 GB of attached storage.
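To confirm that a machine meets these requirements before building, you can run a few standard checks (these commands are not part of the original steps; they are shown for convenience):

```bash
uname -m               # expect: aarch64
nproc                  # vCPU count (32 or more recommended)
free -h                # memory (64 GB or more recommended)
df -h .                # free disk space in the working directory
python3.12 --version   # confirm Python 3.12 is installed
```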
4253

54+
### Install Build Dependencies
55+
56+
Install the following packages required for compiling vLLM and its dependencies on Arm64:
4357
```bash
4458
sudo apt-get update -y
4559
sudo apt-get install -y build-essential cmake libnuma-dev
46-
sudo apt install python3.12-venv python3.12-dev
60+
sudo apt install -y python3.12-venv python3.12-dev
4761
```
4862

49-
Optional performance helper you can install now or later:
63+
You can optionally install tcmalloc, a fast memory allocator from Google’s gperftools, which improves performance under high concurrency:
5064

5165
```bash
5266
sudo apt-get install -y libtcmalloc-minimal4
5367
```
5468

5569
{{% notice Note %}}
56-
On aarch64, vLLM’s CPU backend automatically builds with Arm Compute Library via oneDNN.
70+
On aarch64, vLLM’s CPU backend automatically builds with the Arm Compute Library (ACL) through oneDNN.
71+
This ensures optimized Arm kernels are used for matrix multiplications, layer normalization, and activation functions without additional configuration.
5772
{{% /notice %}}
5873

59-
## Build vLLM for aarch64 CPU
74+
## Build vLLM for Arm64 CPU
75+
You’ll now build vLLM optimized for Arm (aarch64) servers with oneDNN and the Arm Compute Library (ACL) automatically enabled in the CPU backend.
6076

61-
Create and activate a virtual environment:
77+
1. Create and Activate a Python Virtual Environment
78+
It’s best practice to build vLLM inside an isolated environment to prevent conflicts between system and project dependencies:
6279

6380
```bash
6481
python3.12 -m venv vllm_env
6582
source vllm_env/bin/activate
6683
python3 -m pip install --upgrade pip
6784
```
6885

69-
Clone vLLM and install build requirements:
86+
2. Clone vLLM and Install Build Requirements
87+
Download the official vLLM source code and install its CPU-specific build dependencies:
7088

7189
```bash
7290
git clone https://github.com/vllm-project/vllm.git
7391
cd vllm
7492
git checkout 5fb4137
7593
pip install -r requirements/cpu.txt -r requirements/cpu-build.txt
7694
```
95+
The specific commit (`5fb4137`) pins a verified version of vLLM that adds Arm CPUs to the list of supported build targets, ensuring compatibility and optimized performance on Arm-based systems.
7796

78-
Build a wheel targeted at CPU:
97+
3. Build the vLLM Wheel for CPU
98+
Run the following command to compile and package vLLM as a Python wheel optimized for CPU inference:
7999

80100
```bash
81101
VLLM_TARGET_DEVICE=cpu python3 setup.py bdist_wheel
82102
```
103+
The output wheel will appear under `dist/` and include all compiled C++/PyBind modules.
83104

84-
Install the wheel. Use `--no-deps` for incremental installs to avoid clobbering your environment:
105+
4. Install the Wheel
106+
Install the freshly built wheel into your active environment:
85107

86108
```bash
87109
pip install --force-reinstall dist/*.whl # fresh install
88110
# pip install --no-deps --force-reinstall dist/*.whl # incremental build
89111
```
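As an optional sanity check (not part of the original steps), confirm that the installed wheel imports cleanly in the active virtual environment:

```bash
python3 -c "import vllm; print(vllm.__version__)"
```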
90112

91113
{{% notice Tip %}}
92-
Do NOT delete vLLM repo. Local vLLM repository is required for corect inferencing on aarch64 CPU after installing the wheel.
114+
Do not delete the local vLLM source directory.
115+
The repository contains C++ extensions and runtime libraries required for correct CPU inference on aarch64 after wheel installation.
93116
{{% /notice %}}
94117

95-
## Quick validation via offline inferencing
118+
## Quick Validation via Offline Inference
96119

97-
Run the built‑in chat example to confirm the build:
120+
Once your Arm-optimized vLLM build completes, validate it by running a small offline inference example. This confirms that the CPU backend and the oneDNN and ACL optimizations were compiled into your build correctly.
121+
Run the built-in chat example included in the vLLM repository:
98122

99123
```bash
100124
python examples/offline_inference/basic/chat.py \
101125
--dtype=bfloat16 \
102126
--model TinyLlama/TinyLlama-1.1B-Chat-v1.0
103127
```
104128

105-
You should see tokens streaming and a final response. This verifies the optimized vLLM build on your Arm server.
129+
Explanation:
130+
* `--dtype=bfloat16` runs inference in bfloat16 precision. Recent Arm processors support the BFloat16 (BF16) number format in PyTorch; for example, AWS Graviton3 and Graviton4 processors support BF16.
131+
* `--model` specifies a small Hugging Face model for testing (TinyLlama-1.1B-Chat), ideal for functional validation before deploying larger models.
132+
You should see token streaming in the console, followed by a generated output confirming that vLLM’s inference pipeline is working correctly.
106133

107134
```output
108135
Generated Outputs:
@@ -117,5 +144,8 @@ Processed prompts: 100%|██████████████████
117144
```
118145

119146
{{% notice Note %}}
120-
As CPU support in vLLM continues to mature, manual builds will be replaced by a simple `pip install` flow for easier setup in near future.
147+
As CPU support in vLLM continues to mature, these manual build steps will eventually be replaced by a streamlined pip install workflow for aarch64, simplifying future deployments on Arm servers.
121148
{{% /notice %}}
149+
150+
You have now verified that your vLLM Arm64 build runs correctly and performs inference using Arm-optimized kernels.
151+
Next, you’ll proceed to model quantization, where you’ll compress LLM weights to INT4 precision using llmcompressor and benchmark the resulting performance improvements.

content/learning-paths/servers-and-cloud-computing/vllm-acceleration/2-quantize-model.md

Lines changed: 19 additions & 9 deletions
Original file line number | Diff line number | Diff line change
@@ -5,33 +5,39 @@ weight: 3
55
### FIXED, DO NOT MODIFY
66
layout: learningpathall
77
---
8+
## Accelerating LLMs with 4-bit Quantization
89

9-
You can accelerate many LLMs on Arm CPUs with 4‑bit quantization. In this guide, we use `deepseek-ai/DeepSeek-V2-Lite` as the example model which gets accelerated by the INT4 path in vLLM using Arm KleidiAI microkernels.
10+
You can accelerate many LLMs on Arm CPUs with 4‑bit quantization. In this section, you’ll quantize the `deepseek-ai/DeepSeek-V2-Lite` model to 4-bit integer (INT4) weights.
11+
The quantized model runs efficiently through vLLM’s INT4 inference path, which is accelerated by Arm KleidiAI microkernels.
1012

1113
## Install quantization tools
1214

13-
Install the vLLM model quantization packages
15+
Install the quantization dependencies used by vLLM and the llmcompressor toolkit:
1416

1517
```bash
1618
pip install --no-deps compressed-tensors
1719
pip install llmcompressor
1820
```
19-
20-
Reinstall your locally built vLLM if you rebuilt it:
21+
* `compressed-tensors` provides the underlying tensor storage and compression utilities used for quantized model formats.
22+
* `llmcompressor` includes quantization, pruning, and weight clustering utilities compatible with Hugging Face Transformers and vLLM runtime formats.
23+
24+
If you recently rebuilt vLLM, reinstall your locally built wheel to ensure compatibility with the quantization extensions:
2125

2226
```bash
2327
pip install --no-deps dist/*.whl
2428
```
2529

26-
If your chosen model is gated on Hugging Face, authenticate first:
30+
Authenticate with Hugging Face (if required):
31+
32+
If the model you plan to quantize is gated on Hugging Face (for example, DeepSeek or proprietary models), log in with your Hugging Face credentials before downloading the model weights:
2733

2834
```bash
2935
huggingface-cli login
3036
```
3137

32-
## INT4 Quantization recipe
38+
## INT4 Quantization Recipe
3339

34-
Save the following as `quantize_vllm_models.py`:
40+
Using a file editor of your choice, save the following code into a file named `quantize_vllm_models.py`:
3541

3642
```python
3743
import argparse
@@ -124,22 +130,26 @@ if __name__ == "__main__":
124130
main()
125131
```
126132

127-
This script creates a Arm KleidiAI 4‑bit quantized copy of the vLLM model and saves it to a new directory.
133+
This script creates an Arm KleidiAI INT4-quantized copy of the model and saves it to a new directory.
128134

129135
## Quantize DeepSeek‑V2‑Lite model
130136

131137
### Quantization parameter tuning
138+
Quantization parameters determine how the model’s floating-point weights and activations are converted into lower-precision integer formats. Choosing the right combination is essential for balancing model accuracy, memory footprint, and runtime throughput on Arm CPUs.
139+
132140
1. You can choose the `minmax` method (faster quantization) or the `mse` method (more accurate but slower).
133141
2. `channelwise` is a good default for most models.
134142
3. `groupwise` can improve accuracy further; `--groupsize 32` is common.
135143

144+
Execute the following command to quantize the DeepSeek-V2-Lite model:
145+
136146
```bash
137147
# DeepSeek example
138148
python3 quantize_vllm_models.py deepseek-ai/DeepSeek-V2-Lite \
139149
--scheme channelwise --method mse
140150
```
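The command above uses the channelwise scheme. To experiment with groupwise quantization instead, a hedged variant of the same command (assuming the script accepts `--scheme groupwise` together with the `--groupsize` option mentioned above) would look like this:

```bash
# Groupwise variant: slower to quantize, but can improve accuracy
python3 quantize_vllm_models.py deepseek-ai/DeepSeek-V2-Lite \
    --scheme groupwise --groupsize 32 --method mse
```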
141151

142-
The 4-bit quantized DeepSeek-V2-Lite will be stored the directory:
152+
The channelwise command above generates an INT4 quantized model directory such as:
143153

144154
```text
145155
DeepSeek-V2-Lite-w4a8dyn-mse-channelwise

content/learning-paths/servers-and-cloud-computing/vllm-acceleration/3-run-inference-and-serve.md

Lines changed: 36 additions & 10 deletions
Original file line number | Diff line number | Diff line change
@@ -6,13 +6,20 @@ weight: 4
66
layout: learningpathall
77
---
88

9-
## About batch sizing in vLLM
9+
## Batch Sizing in vLLM
1010

11-
vLLM enforces two limits to balance memory use and throughput: a per‑sequence length (`max_model_len`) and a per‑batch token limit (`max_num_batched_tokens`). No single request can exceed the sequence limit, and the sum of tokens in a batch must stay within the batch limit.
11+
vLLM uses dynamic continuous batching to maximize hardware utilization. Two key parameters govern this process:
12+
* `max_model_len` — The maximum sequence length (number of tokens per request).
13+
No single prompt or generated sequence can exceed this limit.
14+
* `max_num_batched_tokens` — The total number of tokens processed in one batch across all requests.
15+
The sum of input and output tokens from all concurrent requests must stay within this limit.
16+
17+
Together, these parameters determine how much memory the model can use and how effectively CPU threads are saturated.
18+
On Arm-based servers, tuning them helps achieve stable throughput while avoiding excessive paging or cache thrashing.
1219

1320
## Serve an OpenAI‑compatible API
1421

15-
Start the server with sensible CPU default parameters and a quantized model:
22+
Start vLLM’s OpenAI-compatible API server using the quantized INT4 model and environment variables optimized for performance.
1623

1724
```bash
1825
export VLLM_TARGET_DEVICE=cpu
@@ -27,9 +34,19 @@ vllm serve DeepSeek-V2-Lite-w4a8dyn-mse-channelwise \
2734
--dtype float32 --max-model-len 4096 --max-num-batched-tokens 4096
2835
```
2936

37+
The server now exposes the standard OpenAI-compatible `/v1/chat/completions` endpoint.
38+
39+
You can test it using any OpenAI-style client library to measure tokens-per-second throughput and response latency on your Arm-based server.
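For example, a quick request with `curl` (the model name must match the directory you passed to `vllm serve`; the prompt is arbitrary):

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "DeepSeek-V2-Lite-w4a8dyn-mse-channelwise",
        "messages": [{"role": "user", "content": "Give me one fact about Arm CPUs."}],
        "max_tokens": 64
      }'
```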
40+
3041
## Run multi‑request batch
42+
After verifying a single request in the previous section, simulate concurrent load against the OpenAI-compatible server to exercise vLLM’s continuous batching scheduler.
3143

32-
After confirming a single request works explained in previous example, simulate concurrent load with a small OpenAI API compatible client. Save this as `batch_test.py`:
44+
About the client:
45+
* It uses `AsyncOpenAI` with `base_url="http://localhost:8000/v1"` to target the local vLLM server.
46+
* A semaphore caps concurrency at 8 simultaneous requests (adjust `CONCURRENCY` to scale the load).
47+
* `max_tokens` limits the tokens generated per request, which directly affects batch size and KV cache use.
48+
49+
Save the code below in a file named `batch_test.py`:
3350

3451
```python
3552
import asyncio
@@ -88,7 +105,7 @@ if __name__ == "__main__":
88105
asyncio.run(main())
89106
```
90107

91-
Run 8 concurrent requests against your server:
108+
Run 8 concurrent requests:
92109

93110
```bash
94111
python3 batch_test.py
@@ -108,19 +125,28 @@ This validates multi‑request behavior and shows aggregate throughput in the se
108125
(APIServer pid=4474) INFO: 127.0.0.1:44120 - "POST /v1/chat/completions HTTP/1.1" 200 OK
109126
(APIServer pid=4474) INFO 11-10 01:01:06 [loggers.py:221] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 57.5 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
110127
```
111-
## Optional: Serving BF16 non-quantized model
128+
## Optional: Serve a BF16 (Non-Quantized) Model
112129

113-
For a BF16 path on Arm, vLLM is acclerated by direct oneDNN integration in vLLM which allows aarch64 model to be hyperoptimized.
130+
For a non-quantized path, vLLM on Arm can run BF16 end-to-end using its oneDNN integration (which routes to Arm-optimized kernels via ACL under aarch64).
114131

115132
```bash
116133
vllm serve deepseek-ai/DeepSeek-V2-Lite \
117134
--dtype bfloat16 --max-model-len 4096 \
118135
--max-num-batched-tokens 4096
119136
```
137+
Use this BF16 setup to establish a quality reference baseline, then compare throughput and latency against your INT4 deployment to quantify the performance/accuracy trade-offs on your Arm system.
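One simple way to compare the two deployments is to replay the same concurrent load from `batch_test.py` against each server and note the average generation throughput reported in the server logs. A sketch, assuming the `model` field in `batch_test.py` is updated to match whichever model is currently being served:

```bash
# With the INT4 server running:
time python3 batch_test.py    # note wall-clock time and the server's reported tokens/s

# Stop the INT4 server, start the BF16 server above, then repeat:
time python3 batch_test.py
```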
120138

121139
## Go Beyond: Power Up Your vLLM Workflow
122-
Now that you’ve successfully quantized and served a model using vLLM on Arm, here are some further ways to explore:
140+
Now that you’ve successfully quantized, served, and benchmarked a model using vLLM on Arm, you can build on what you’ve learned to push performance, scalability, and usability even further.
141+
142+
**Try Different Models**
143+
Extend your workflow to other models on Hugging Face that are compatible with vLLM and can benefit from Arm acceleration:
144+
* Meta Llama 2 / Llama 3 – Strong general-purpose baselines; excellent for comparing BF16 vs INT4 performance.
145+
* Qwen / Qwen-Chat – High-quality multilingual and instruction-tuned models.
146+
* Gemma (Google) – Compact and efficient architecture; ideal for edge or cost-optimized serving.
147+
148+
You can quantize and serve them using the same `quantize_vllm_models.py` recipe; just update the model name.
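For example, a Llama run with the same recipe might look like this (the model ID is illustrative, and gated models require `huggingface-cli login` first):

```bash
# Apply the same INT4 recipe to a different Hugging Face model
python3 quantize_vllm_models.py meta-llama/Llama-3.1-8B-Instruct \
    --scheme channelwise --method mse
```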
123149

124-
* **Try different models:** Apply the same steps to other [Hugging Face models](https://huggingface.co/models) like Llama, Qwen or Gemma.
150+
**Connect a chat client:** Link your server with OpenAI-compatible UIs like [Open WebUI](https://github.com/open-webui/open-webui)
125151

126-
* **Connect a chat client:** Link your server with OpenAI-compatible UIs like [Open WebUI](https://github.com/open-webui/open-webui)
152+
You can continue exploring how Arm’s efficiency, oneDNN+ACL acceleration, and vLLM’s dynamic batching combine to deliver fast, sustainable, and scalable AI inference on modern Arm architectures.
