
Commit 927b1f5

Revise overview and build instructions for vLLM
Updated the document to reflect the new title and improved clarity in the key features and instructions. Enhanced formatting and consistency throughout the text.
1 parent a363c8f commit 927b1f5

File tree

1 file changed: 61 additions, 30 deletions


content/learning-paths/servers-and-cloud-computing/vllm-acceleration/1-overview-and-build.md

@@ -1,5 +1,5 @@
 ---
-title: Overview and Optimized Build
+title: Build and validate vLLM for Arm64 inference on Azure Cobalt 100
 weight: 2
 
 ### FIXED, DO NOT MODIFY
@@ -8,50 +8,57 @@ layout: learningpathall
 
 ## What is vLLM?
 
-vLLM is an open-source, high-throughput inference and serving engine for large language models (LLMs).
-It’s designed to make LLM inference faster, more memory-efficient, and scalable, particularly during the prefill (context processing) and decode (token generation) phases of inference.
+vLLM is an open-source, high-throughput inference and serving engine for large language models (LLMs). It’s designed to make LLM inference faster, more memory-efficient, and scalable, particularly during the prefill (context processing) and decode (token generation) phases of inference.
 
-### Key Features
-* Continuous Batching – Dynamically combines incoming inference requests into a single large batch, maximizing CPU/GPU utilization and throughput.
-* KV Cache Management – Efficiently stores and reuses key-value attention states, sustaining concurrency across multiple active sessions while minimizing memory overhead.
-* Token Streaming – Streams generated tokens as they are produced, enabling real-time responses for chat or API scenarios.
-### Interaction Modes
+## Key features
+* Continuous batching: dynamically merges incoming inference requests into larger batches, maximizing Arm CPU utilization and overall throughput
+* KV cache management: efficiently stores and reuses key-value attention states, sustaining concurrency across multiple active sessions while minimizing memory overhead
+* Token streaming: streams generated tokens as they are produced, enabling real-time responses for chat or API scenarios
+## Interaction modes
 You can use vLLM in two main ways:
-* OpenAI-Compatible REST Server:
-vLLM provides a /v1/chat/completions endpoint compatible with the OpenAI API schema, making it drop-in ready for tools like LangChain, LlamaIndex, and the official OpenAI Python SDK.
-* Python API:
-Load and serve models programmatically within your own Python scripts for flexible local inference and evaluation.
+- Using an OpenAI-compatible REST server: vLLM provides a /v1/chat/completions endpoint compatible with the OpenAI API schema, making it drop-in ready for tools like LangChain, LlamaIndex, and the official OpenAI Python SDK
+- Using a Python API: load and serve models programmatically within your own Python scripts for flexible local inference and evaluation
 
 vLLM supports Hugging Face Transformer models out-of-the-box and scales seamlessly from single-prompt testing to production batch inference.
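As a sketch of the REST mode described above, a request to the /v1/chat/completions endpoint can be prepared as follows. The port (8000) and model name are assumptions here: the model mirrors the TinyLlama checkpoint used later in this Learning Path, and the payload follows the OpenAI chat schema.

```bash
# Build an OpenAI-style chat payload (model name is an example; use the
# model your server actually loads).
cat > /tmp/vllm_chat_payload.json <<'EOF'
{
  "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
  "messages": [
    {"role": "user", "content": "Summarize vLLM in one sentence."}
  ],
  "max_tokens": 64
}
EOF

# Sanity-check that the payload is valid JSON before sending it:
python3 -m json.tool /tmp/vllm_chat_payload.json > /dev/null && echo "payload OK"

# With a vLLM server running locally (assumed on port 8000), send it with:
# curl http://localhost:8000/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d @/tmp/vllm_chat_payload.json
```

Because the endpoint follows the OpenAI schema, the same payload works unchanged with the official OpenAI Python SDK pointed at your server's base URL.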
 
-## What you build
+## What you'll build
 
-In this learning path, you will build a CPU-optimized version of vLLM targeting the Arm64 architecture, integrated with oneDNN and the Arm Compute Library (ACL).
+In this Learning Path, you'll build a CPU-optimized version of vLLM targeting the Arm64 architecture, integrated with oneDNN and the Arm Compute Library (ACL).
 This build enables high-performance LLM inference on Arm servers, leveraging specialized Arm math libraries and kernel optimizations.
 After compiling, you’ll validate your build by running a local chat example to confirm functionality and measure baseline inference speed.
 
 ## Why this is fast on Arm
 
+vLLM achieves high performance on Arm servers by combining software and hardware optimizations. Here’s why your build runs fast:
+
+- Arm-optimized kernels: vLLM uses oneDNN and the Arm Compute Library to accelerate matrix multiplications, normalization, and activation functions. These libraries are tuned for Arm’s aarch64 architecture.
+- Efficient quantization: INT4 quantized models run faster on Arm because KleidiAI microkernels use DOT-product instructions (SDOT/UDOT) available on Arm CPUs.
+- Paged attention tuning: the paged attention mechanism is optimized for Arm’s NEON and SVE pipelines, improving token reuse and throughput during long-sequence generation.
+- MoE fusion: for Mixture-of-Experts models, vLLM fuses INT4 expert layers to reduce memory transfers and bandwidth bottlenecks.
+- Thread affinity and memory allocation: setting thread affinity ensures balanced CPU core usage, while tcmalloc reduces memory fragmentation and allocator contention.
+
+These optimizations work together to deliver higher throughput and lower latency for LLM inference on Arm servers.
+
 vLLM’s performance on Arm servers is driven by both software optimization and hardware-level acceleration.
 Each component of this optimized build contributes to higher throughput and lower latency during inference:
 
-- Optimized kernels: The aarch64 vLLM build uses direct oneDNN with the Arm Compute Library for key operations.
+- Optimized kernels: the aarch64 vLLM build uses direct oneDNN with the Arm Compute Library for key operations.
 - 4‑bit weight quantization: vLLM supports INT4 quantized models, and Arm accelerates this using KleidiAI microkernels, which take advantage of DOT-product (SDOT/UDOT) instructions.
-- Efficient MoE execution: For Mixture-of-Experts (MoE) models, vLLM fuses INT4 quantized expert layers to reduce intermediate memory transfers, which minimizes bandwidth bottlenecks
-- Optimized Paged attention: The paged attention mechanism, which handles token reuse during long-sequence generation, is SIMD-tuned for Arm’s NEON and SVE (Scalable Vector Extension) pipelines.
-- System tuning: Using thread affinity ensures efficient CPU core pinning and balanced thread scheduling across Arm clusters.
+- Efficient MoE execution: for Mixture-of-Experts (MoE) models, vLLM fuses INT4 quantized expert layers to reduce intermediate memory transfers, which minimizes bandwidth bottlenecks.
+- Optimized paged attention: the paged attention mechanism, which handles token reuse during long-sequence generation, is SIMD-tuned for Arm’s NEON and SVE (Scalable Vector Extension) pipelines.
+- System tuning: using thread affinity ensures efficient CPU core pinning and balanced thread scheduling across Arm clusters.
 Additionally, enabling tcmalloc (Thread-Caching Malloc) reduces allocator contention and memory fragmentation under high-throughput serving loads.
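The thread-affinity and tcmalloc points above can be sketched as environment settings. The values, core range, and library path below are assumptions to adapt to your own instance; `VLLM_CPU_KVCACHE_SPACE` and `VLLM_CPU_OMP_THREADS_BIND` are the vLLM CPU backend's tuning variables.

```bash
# Illustrative values only -- tune these for your instance size.
export VLLM_CPU_KVCACHE_SPACE=40         # GiB reserved for the KV cache (assumed value)
export VLLM_CPU_OMP_THREADS_BIND=0-31    # pin OpenMP worker threads to cores 0-31

# Preload tcmalloc if present (Ubuntu package: libtcmalloc-minimal4);
# the path below is the usual aarch64 location -- adjust if yours differs.
TCMALLOC=/usr/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4
[ -f "$TCMALLOC" ] && export LD_PRELOAD="$TCMALLOC"

echo "KV cache: ${VLLM_CPU_KVCACHE_SPACE} GiB, thread bind: ${VLLM_CPU_OMP_THREADS_BIND}"
```

Set these in the shell that launches vLLM so the serving process inherits them.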
 
-## Before you begin
+## Set up your environment
 
-Verify that your environment meets the following requirements:
+Before you begin, make sure your environment meets these requirements:
 
-Python version: Use Python 3.12 on Ubuntu 22.04 LTS or later.
-Hardware requirements: At least 32 vCPUs, 64 GB RAM, and 64 GB of free disk space.
+- Python 3.12 on Ubuntu 22.04 LTS or newer
+- At least 32 vCPUs, 64 GB RAM, and 64 GB of free disk space
 
-This Learning Path was validated on an AWS Graviton4 c8g.12xlarge instance with 64 GB of attached storage.
+This Learning Path was tested on an AWS Graviton4 c8g.12xlarge instance with 64 GB of attached storage.
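The requirements above can be sanity-checked before building. This is a small sketch using standard Linux tools; the 32-vCPU and 64 GB thresholds in the comments mirror the prose.

```bash
# Report the resources the build needs: Python version, vCPUs, RAM, disk, arch.
python3 --version
echo "vCPUs: $(nproc)"                           # want >= 32
free -g | awk '/^Mem:/ {print "RAM (GB): " $2}'  # want >= 64
df -BG --output=avail . | tail -1                # free disk, want >= 64G
uname -m                                         # should print aarch64 on Arm64
```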
 
-### Install Build Dependencies
+## Install build dependencies
 
 Install the following packages required for compiling vLLM and its dependencies on Arm64:
 ```bash
@@ -74,7 +81,7 @@ This ensures optimized Arm kernels are used for matrix multiplications, layer no
 ## Build vLLM for Arm64 CPU
 You’ll now build vLLM optimized for Arm (aarch64) servers with oneDNN and the Arm Compute Library (ACL) automatically enabled in the CPU backend.
 
-1. Create and Activate a Python Virtual Environment
+## Create and activate a Python virtual environment
 It’s best practice to build vLLM inside an isolated environment to prevent conflicts between system and project dependencies:
 
 ```bash
@@ -83,7 +90,7 @@ source vllm_env/bin/activate
 python3 -m pip install --upgrade pip
 ```
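If you want to see the isolation in action first, here is a throwaway sketch; the `/tmp/vllm_env_demo` path is an example only, while the step above uses `vllm_env` in the current directory.

```bash
# Create a disposable virtual environment and confirm python3 resolves inside it.
python3 -m venv /tmp/vllm_env_demo
. /tmp/vllm_env_demo/bin/activate
command -v python3   # should print a path under /tmp/vllm_env_demo
deactivate
# Remove /tmp/vllm_env_demo when you are done experimenting.
```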
 
-2. Clone vLLM and Install Build Requirements
+## Clone vLLM and install build requirements
 Download the official vLLM source code and install its CPU-specific build dependencies:
 
 ```bash
@@ -94,15 +101,15 @@ pip install -r requirements/cpu.txt -r requirements/cpu-build.txt
 ```
 The specific commit (5fb4137) pins a verified version of vLLM that officially adds Arm CPUs to the list of supported build targets, ensuring full compatibility and optimized performance for Arm-based systems.
 
-3. Build the vLLM Wheel for CPU
+## Build the vLLM wheel for CPU
 Run the following command to compile and package vLLM as a Python wheel optimized for CPU inference:
 
 ```bash
 VLLM_TARGET_DEVICE=cpu python3 setup.py bdist_wheel
 ```
 The output wheel will appear under dist/ and include all compiled C++/PyBind modules.
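A quick way to confirm the wheel targets Arm64 is to check the platform tag in its file name. The file name below is illustrative only; the version and Python tag from your build will differ.

```bash
# The wheel name encodes the platform tag; on Arm64 it should contain
# "linux_aarch64". Substitute the actual file from your dist/ directory.
WHEEL=vllm-0.0.0-cp312-cp312-linux_aarch64.whl   # example name only
case "$WHEEL" in
  *linux_aarch64*) echo "aarch64 wheel: $WHEEL" ;;
  *)               echo "not an aarch64 wheel: $WHEEL" ;;
esac
```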
 
-4. Install the Wheel
+## Install the wheel
 Install the freshly built wheel into your active environment:
 
 ```bash
@@ -115,7 +122,31 @@ Do not delete the local vLLM source directory.
 The repository contains C++ extensions and runtime libraries required for correct CPU inference on aarch64 after wheel installation.
 {{% /notice %}}
 
-## Quick validation via Offline Inferencing
+## Validate your build with offline inference
+
+Run a quick test to confirm your Arm-optimized vLLM build works as expected. Use the built-in chat example to perform offline inference and verify that oneDNN and Arm Compute Library optimizations are active.
+
+```bash
+python examples/offline_inference/basic/chat.py \
+    --dtype=bfloat16 \
+    --model TinyLlama/TinyLlama-1.1B-Chat-v1.0
+```
+
+This command runs a small Hugging Face model in bfloat16 precision, streaming generated tokens to the console. You should see output similar to:
+
+```output
+Generated Outputs:
+--------------------------------------------------------------------------------
+Prompt: None
+
+Generated text: 'The Importance of Higher Education\n\nHigher education is a fundamental right'
+--------------------------------------------------------------------------------
+Adding requests: 100%|██████████| 10/10 [00:00<00:00, 9552.05it/s]
+Processed prompts: 100%|██████████| 10/10 [00:01<00:00, 6.78it/s, est. speed input: 474.32 toks/s, output: 108.42 toks/s]
+...
+```
+
+If you see token streaming and generated text, your vLLM build is correctly configured for Arm64 inference.
 
 Once your Arm-optimized vLLM build completes, you can validate it by running a small offline inference example. This ensures that the CPU-specific backend and oneDNN and ACL optimizations were correctly compiled into your build.
 Run the built-in chat example included in the vLLM repository:
@@ -144,7 +175,7 @@ Processed prompts: 100%|██████████
 ```
 
 {{% notice Note %}}
-As CPU support in vLLM continues to mature, these manual build steps will eventually be replaced by a streamlined pip install workflow for aarch64, simplifying future deployments on Arm servers.
+As CPU support in vLLM continues to mature, these manual build steps will eventually be replaced by a streamlined `pip` install workflow for aarch64, simplifying future deployments on Arm servers.
 {{% /notice %}}
 
 You have now verified that your vLLM Arm64 build runs correctly and performs inference using Arm-optimized kernels.
