content/learning-paths/servers-and-cloud-computing/vllm-acceleration/1-overview-and-build.md (+63 −33)
@@ -8,101 +8,128 @@ layout: learningpathall
## What is vLLM?
vLLM is an open-source, high-throughput inference and serving engine for large language models (LLMs). It is designed to make LLM inference faster, more memory-efficient, and scalable, particularly during the prefill (context processing) and decode (token generation) phases.

### Key Features

* Continuous batching – dynamically combines incoming inference requests into a single large batch, maximizing CPU/GPU utilization and throughput.
* KV cache management – efficiently stores and reuses key-value attention states, sustaining concurrency across multiple active sessions while minimizing memory overhead.
* Token streaming – streams generated tokens as they are produced, enabling real-time responses for chat and API scenarios.

### Interaction Modes

You can use vLLM in two main ways:

* OpenAI-compatible REST server: vLLM provides a `/v1/chat/completions` endpoint compatible with the OpenAI API schema, making it drop-in ready for tools such as LangChain, LlamaIndex, and the official OpenAI Python SDK.
* Python API: load and serve models programmatically within your own Python scripts for flexible local inference and evaluation (see the sketch after this list).

vLLM supports Hugging Face Transformers models out of the box and scales seamlessly from single-prompt testing to production batch inference.
## What you build
In this Learning Path, you will build a CPU-optimized version of vLLM targeting the Arm64 (aarch64) architecture, integrated with oneDNN and the Arm Compute Library (ACL). This build enables high-performance LLM inference on Arm servers by leveraging specialized Arm math libraries and kernel optimizations. After compiling, you'll validate the build by running a local chat example to confirm functionality and measure baseline inference speed.
## Why this is fast on Arm
vLLM's performance on Arm servers comes from both software optimization and hardware-level acceleration. Each component of this optimized build contributes to higher throughput and lower latency during inference:

- Optimized kernels: The aarch64 vLLM build uses direct oneDNN integration with the Arm Compute Library (ACL) for key operations.
- 4-bit weight quantization: vLLM supports INT4 quantized models, and Arm accelerates this path with KleidiAI microkernels, which take advantage of the DOT-product (SDOT/UDOT) instructions.
- Efficient MoE execution: For Mixture-of-Experts (MoE) models, vLLM fuses INT4 quantized expert layers to reduce intermediate memory transfers, minimizing bandwidth bottlenecks.
- Optimized paged attention: The paged attention mechanism, which handles token reuse during long-sequence generation, is SIMD-tuned for Arm NEON and SVE (Scalable Vector Extension) pipelines.
- System tuning: Thread affinity ensures efficient CPU core pinning and balanced thread scheduling across Arm clusters, while tcmalloc (Thread-Caching Malloc) reduces allocator contention and memory fragmentation under high-throughput serving loads. A usage sketch follows this list.
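The sketch below shows one way to apply these system-tuning settings on an Ubuntu aarch64 host; the library path and core range are typical example values, not taken from the Learning Path:

```bash
# Preload tcmalloc so the serving process uses the thread-caching allocator
# (path assumes Ubuntu's libtcmalloc-minimal4 package on aarch64).
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4

# Bind vLLM's OpenMP worker threads to a fixed core range (example: cores 0-31).
export VLLM_CPU_OMP_THREADS_BIND=0-31
```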
## Before you begin
Verify that your environment meets the following requirements:

* Python version: Use Python 3.12 on Ubuntu 22.04 LTS or later.
* Hardware: At least 32 vCPUs, 64 GB RAM, and 64 GB of free disk space.

This Learning Path was validated on an AWS Graviton4 c8g.12xlarge instance with 64 GB of attached storage.
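You can confirm these requirements with a few standard commands (shown here only as a convenience check, not as part of the original steps):

```bash
nproc                  # number of vCPUs (expect 32 or more)
free -h                # total memory (expect 64 GB or more)
df -h .                # free disk space in the working directory
python3.12 --version   # confirm Python 3.12 is available
```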
### Install Build Dependencies
Install the following packages required for compiling vLLM and its dependencies on Arm64:
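The full package list is collapsed in the diff above; as a sketch under the assumption of a standard Ubuntu aarch64 toolchain (not the verbatim list from the Learning Path), the dependencies typically look like:

```bash
# Typical compiler toolchain and headers for vLLM's CPU backend (assumed set;
# the exact package list may differ from the original Learning Path).
sudo apt-get update
sudo apt-get install -y gcc-12 g++-12 python3-dev cmake ninja-build libnuma-dev git
```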
You can optionally install tcmalloc, a fast memory allocator from Google’s gperftools, which improves performance under high concurrency:
```bash
sudo apt-get install -y libtcmalloc-minimal4
```
{{% notice Note %}}
On aarch64, vLLM’s CPU backend automatically builds with the Arm Compute Library (ACL) through oneDNN. This ensures optimized Arm kernels are used for matrix multiplications, layer normalization, and activation functions without additional configuration.
{{% /notice %}}
## Build vLLM for Arm64 CPU
You’ll now build vLLM optimized for Arm (aarch64) servers with oneDNN and the Arm Compute Library (ACL) automatically enabled in the CPU backend.

1. Create and Activate a Python Virtual Environment

It’s best practice to build vLLM inside an isolated environment to prevent conflicts between system and project dependencies:
```bash
python3.12 -m venv vllm_env
source vllm_env/bin/activate
python3 -m pip install --upgrade pip
```
2. Clone vLLM and Install Build Requirements
Download the official vLLM source code and install its CPU-specific build dependencies:
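The clone and install commands are collapsed here; a typical flow, assuming the standard vLLM repository and a CPU requirements file (the file name is an assumption and varies by vLLM version), looks like:

```bash
# Clone the vLLM source and check out the pinned commit mentioned below.
git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout 5fb4137

# Install CPU build requirements (file name may differ between vLLM versions).
pip install -r requirements/cpu.txt
```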
The specific commit (5fb4137) pins a verified version of vLLM that officially adds Arm CPUs to the list of supported build targets, ensuring full compatibility and optimized performance for Arm-based systems.
3. Build the vLLM Wheel for CPU
Run the following command to compile and package vLLM as a Python wheel optimized for CPU inference:
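The build command is collapsed here; a commonly used invocation for a CPU-targeted wheel (shown as an assumption, not the verbatim command from the Learning Path) is:

```bash
# Target the CPU backend explicitly; the resulting wheel lands in ./dist,
# matching the later `pip install --no-deps dist/*.whl` step.
VLLM_TARGET_DEVICE=cpu python setup.py bdist_wheel
```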
Do not delete the local vLLM source directory. The repository contains C++ extensions and runtime libraries required for correct CPU inference on aarch64 after the wheel is installed.
{{% /notice %}}
## Quick Validation via Offline Inference

Once your Arm-optimized vLLM build completes, validate it by running a small offline inference example. This confirms that the CPU-specific backend and the oneDNN and ACL optimizations were correctly compiled into your build.

Run the built-in chat example included in the vLLM repository:
```bash
python examples/offline_inference/basic/chat.py \
    --dtype=bfloat16 \
    --model TinyLlama/TinyLlama-1.1B-Chat-v1.0
```
Explanation:

* `--dtype=bfloat16` runs inference in bfloat16 precision. Recent Arm processors, such as AWS Graviton3 and Graviton4, support the BFloat16 (BF16) number format in PyTorch.
* `--model` specifies a small Hugging Face model for testing (TinyLlama-1.1B-Chat), ideal for functional validation before deploying larger models.

You should see tokens streaming in the console, followed by a generated response, confirming that vLLM’s inference pipeline is working correctly.
As CPU support in vLLM continues to mature, these manual build steps will eventually be replaced by a streamlined pip install workflow for aarch64, simplifying future deployments on Arm servers.
{{% /notice %}}
You have now verified that your vLLM Arm64 build runs correctly and performs inference using Arm-optimized kernels.
Next, you’ll proceed to model quantization, where you’ll compress LLM weights to INT4 precision using llmcompressor and benchmark the resulting performance improvements.
content/learning-paths/servers-and-cloud-computing/vllm-acceleration/2-quantize-model.md (+19 −9)
@@ -5,33 +5,39 @@ weight: 3
### FIXED, DO NOT MODIFY
layout: learningpathall
---
## Accelerating LLMs with 4-bit Quantization
You can accelerate many LLMs on Arm CPUs with 4-bit quantization. In this section, you’ll quantize the `deepseek-ai/DeepSeek-V2-Lite` model to 4-bit integer (INT4) weights. The quantized model runs efficiently through vLLM’s INT4 inference path, which is accelerated by Arm KleidiAI microkernels.
## Install quantization tools
Install the quantization dependencies used by vLLM and the llmcompressor toolkit:
```bash
pip install --no-deps compressed-tensors
pip install llmcompressor
```
* `compressed-tensors` provides the underlying tensor storage and compression utilities used for quantized model formats.
* `llmcompressor` includes quantization, pruning, and weight clustering utilities compatible with Hugging Face Transformers and vLLM runtime formats.
If you recently rebuilt vLLM, reinstall your locally built wheel to ensure compatibility with the quantization extensions:
```bash
pip install --no-deps dist/*.whl
```
Authenticate with Hugging Face (if required). If the model you plan to quantize is gated on Hugging Face (for example, DeepSeek or proprietary models), log in to authenticate your credentials before downloading the model weights:
```bash
huggingface-cli login
```
## INT4 Quantization Recipe

Using a file editor of your choice, save the following code into a file named `quantize_vllm_models.py`:
```python
import argparse
@@ -124,22 +130,26 @@ if __name__ == "__main__":
    main()
```
This script creates an Arm KleidiAI INT4 quantized copy of the vLLM model and saves it to a new directory.
## Quantize DeepSeek‑V2‑Lite model
### Quantization parameter tuning
Quantization parameters determine how the model’s floating-point weights and activations are converted into lower-precision integer formats. Choosing the right combination is essential for balancing model accuracy, memory footprint, and runtime throughput on Arm CPUs.
1. You can choose the `minmax` (faster) or `mse` (more accurate but slower) quantization method.
2. `channelwise` is a good default for most models.
3. `groupwise` can improve accuracy further; `--groupsize 32` is common.
Execute the following command to quantize the DeepSeek-V2-Lite model:
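The command itself is collapsed here; assuming the script exposes flags matching the parameters above (the flag names below are illustrative, not taken from the script), the invocation might look like:

```bash
# Hypothetical flags -- check the argparse definitions in quantize_vllm_models.py
# for the real argument names before running.
python quantize_vllm_models.py deepseek-ai/DeepSeek-V2-Lite \
    --method mse \
    --scheme channelwise \
    --output-dir DeepSeek-V2-Lite-INT4
```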
content/learning-paths/servers-and-cloud-computing/vllm-acceleration/3-run-inference-and-serve.md (+36 −10)
@@ -6,13 +6,20 @@ weight: 4
layout: learningpathall
---
## Batch Sizing in vLLM
vLLM uses dynamic continuous batching to maximize hardware utilization. Two key parameters govern this process:

* `max_model_len` — the maximum sequence length (number of tokens per request). No single prompt or generated sequence can exceed this limit.
* `max_num_batched_tokens` — the total number of tokens processed in one batch across all requests. The sum of input and output tokens from all concurrent requests must stay within this limit.

Together, these parameters determine how much memory the model can use and how effectively CPU threads are saturated. On Arm-based servers, tuning them helps achieve stable throughput while avoiding excessive paging or cache thrashing.
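As a concrete illustration using the values from the serving examples later in this section: with `max_num_batched_tokens` set to 4096, eight concurrent requests that each contribute 512 prompt-plus-generated tokens fill one batch exactly (8 × 512 = 4096), so a ninth request waits for a later scheduling step, while any single request remains capped at `max_model_len` tokens regardless of how empty the batch is.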
## Serve an OpenAI‑compatible API
Start vLLM’s OpenAI-compatible API server using the quantized INT4 model and environment variables optimized for performance.
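The full command is collapsed here; a representative invocation, assuming the INT4 model directory produced in the previous section and vLLM's documented CPU environment variables (both assumptions, not the verbatim listing), looks like:

```bash
# Reserve KV-cache memory (GiB) and pin OpenMP threads -- values are examples.
export VLLM_CPU_KVCACHE_SPACE=32
export VLLM_CPU_OMP_THREADS_BIND=0-31

# Serve the locally quantized model directory (path is an assumed name).
vllm serve ./DeepSeek-V2-Lite-INT4 \
    --dtype bfloat16 \
    --max-model-len 4096 \
    --max-num-batched-tokens 4096
```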
The server now exposes the standard OpenAI-compatible `/v1/chat/completions` endpoint.
You can test it using any OpenAI-style client library to measure tokens-per-second throughput and response latency on your Arm-based server.
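As a quick smoke test (an illustrative request, not part of the original steps), send one chat completion with curl:

```bash
# The "model" value must match the name the server registered
# (by default, the path or model ID passed to `vllm serve`).
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "./DeepSeek-V2-Lite-INT4",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 32
      }'
```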
## Run multi‑request batch
After verifying a single request in the previous section, simulate concurrent load against the OpenAI-compatible server to exercise vLLM’s continuous batching scheduler.

About the client:

* It uses `AsyncOpenAI` with `base_url="http://localhost:8000/v1"` to target the vLLM server.
* A semaphore caps concurrency at 8 simultaneous requests (adjust `CONCURRENCY` to scale the load).
* `max_tokens` limits generated tokens per request, which directly affects batch size and KV cache use.

Save the code below in a file named `batch_test.py`:
```python
import asyncio
@@ -88,7 +105,7 @@ if __name__ == "__main__":
    asyncio.run(main())
```
Run 8 concurrent requests:
```bash
python3 batch_test.py
```
@@ -108,19 +125,28 @@ This validates multi‑request behavior and shows aggregate throughput in the se
(APIServer pid=4474) INFO: 127.0.0.1:44120 - "POST /v1/chat/completions HTTP/1.1" 200 OK

For a non-quantized path, vLLM on Arm can run BF16 end-to-end using its oneDNN integration (which routes to Arm-optimized kernels via ACL under aarch64):
```bash
vllm serve deepseek-ai/DeepSeek-V2-Lite \
    --dtype bfloat16 --max-model-len 4096 \
    --max-num-batched-tokens 4096
```
Use this BF16 setup to establish a quality reference baseline, then compare throughput and latency against your INT4 deployment to quantify the performance/accuracy trade-offs on your Arm system.
## Go Beyond: Power Up Your vLLM Workflow
Now that you’ve successfully quantized, served, and benchmarked a model using vLLM on Arm, you can build on what you’ve learned to push performance, scalability, and usability even further.

**Try different models:** Extend your workflow to other [Hugging Face models](https://huggingface.co/models) that are compatible with vLLM and can benefit from Arm acceleration:

* Meta Llama 2 / Llama 3 – strong general-purpose baselines; excellent for comparing BF16 vs INT4 performance.
* Qwen / Qwen-Chat – high-quality multilingual and instruction-tuned models.
* Gemma (Google) – compact and efficient architecture; ideal for edge or cost-optimized serving.

You can quantize and serve them using the same `quantize_vllm_models.py` recipe; just update the model name.
**Connect a chat client:** Link your server with OpenAI-compatible UIs like [Open WebUI](https://github.com/open-webui/open-webui)
You can continue exploring how Arm’s efficiency, oneDNN+ACL acceleration, and vLLM’s dynamic batching combine to deliver fast, sustainable, and scalable AI inference on modern Arm architectures.