---
title: Overview and Optimized Build
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## What is vLLM?

vLLM is an open‑source, high‑throughput inference and serving engine for large language models. It focuses on efficient execution of the prefill and decode phases of LLM inference, with:

- Continuous batching to keep hardware busy across many requests.
- KV cache management to sustain concurrency during generation.
- Token streaming so results appear as they are produced.

You interact with vLLM in multiple ways:

- OpenAI‑compatible server: exposes `/v1/chat/completions` for easy integration.
- Python API: load a model and generate locally when needed (a short sketch follows below).

vLLM works well with Hugging Face models, supports single‑prompt and batch workloads, and scales from quick tests to production serving.
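
For example, the Python API can be exercised with a few lines of offline generation. This is a minimal sketch; the model name and sampling values are only illustrative (the same TinyLlama model is used for validation later in this Learning Path):

```python
from vllm import LLM, SamplingParams

# Load a small chat model and generate locally (illustrative model choice)
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", dtype="bfloat16")
params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(["Explain continuous batching in one sentence."], params)
print(outputs[0].outputs[0].text)
```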

## What you build

You build a CPU‑optimized vLLM for aarch64 with oneDNN and the Arm Compute Library (ACL). You then validate the build with a quick offline chat example.

## Why this is fast on Arm

- Optimized kernels: The aarch64 vLLM build uses direct oneDNN with the Arm Compute Library for key operations.
- 4‑bit weight quantization: INT4 weights are supported and accelerated by Arm KleidiAI microkernels.
- Efficient MoE execution: Fused INT4 quantized expert layers reduce memory traffic and improve throughput.
- Optimized paged attention: an Arm SIMD‑tuned paged attention implementation in vLLM.
- System tuning: Thread affinity and `tcmalloc` help keep latency and allocator overhead low under load.

## Before you begin

- Use Python 3.12 on Ubuntu 22.04 or later.
- Make sure you have at least 32 vCPUs, 64 GB of RAM, and 32 GB of free disk space; an optional way to check is shown below.
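
The following convenience sketch (Linux-specific, not required for the build) prints the vCPU count, total memory, and free disk space:

```python
import os
import shutil

# vCPU count
print(f"vCPUs: {os.cpu_count()}")

# Total memory in GiB, read from /proc/meminfo (Linux-specific)
with open("/proc/meminfo") as f:
    mem_kib = int(next(line for line in f if line.startswith("MemTotal")).split()[1])
print(f"RAM: {mem_kib / 1024**2:.1f} GiB")

# Free disk space for the current directory
print(f"Free disk: {shutil.disk_usage('.').free / 1024**3:.1f} GiB")
```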

Install the minimum system package used by vLLM on Arm:

```bash
sudo apt-get update -y
sudo apt-get install -y libnuma-dev
```

Optional performance helper you can install now or later:

```bash
sudo apt-get install -y libtcmalloc-minimal4
```

{{% notice Note %}}
On aarch64, vLLM’s CPU backend automatically builds with Arm Compute Library via oneDNN.
{{% /notice %}}

## Build vLLM for aarch64 CPU

Create and activate a virtual environment:

```bash
python3 -m venv vllm_env
source vllm_env/bin/activate
python -m pip install --upgrade pip
```

Clone vLLM and install build requirements:

```bash
git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout 5fb4137
pip install -r requirements/cpu.txt -r requirements/cpu-build.txt
```

Build a wheel targeted at CPU:

```bash
VLLM_TARGET_DEVICE=cpu python3 setup.py bdist_wheel
```

Install the wheel. Use `--no-deps` for incremental installs to avoid clobbering your environment:

```bash
pip install --force-reinstall dist/*.whl # fresh install
# pip install --no-deps --force-reinstall dist/*.whl # incremental build
```
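
To confirm the wheel is importable, run a quick version check from the active virtual environment (the exact version string depends on the commit you checked out):

```python
# Sanity check: the locally built wheel should import cleanly
import vllm

print(vllm.__version__)
```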

{{% notice Tip %}}
Do NOT delete the vLLM repository. The local repository is required for correct inference on aarch64 CPUs after installing the wheel.
{{% /notice %}}

## Quick validation via offline inference

Run the built‑in chat example to confirm the build:

```bash
python examples/offline_inference/basic/chat.py \
--dtype=bfloat16 \
--model TinyLlama/TinyLlama-1.1B-Chat-v1.0
```

You should see tokens streaming and a final response. This verifies the optimized vLLM build on your Arm server.

```output
Generated Outputs:
--------------------------------------------------------------------------------
Prompt: None

Generated text: 'The Importance of Higher Education\n\nHigher education is a fundamental right'
--------------------------------------------------------------------------------
Adding requests: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 9552.05it/s]
Processed prompts: 100%|████████████████████████████████████████████████████████████████████████| 10/10 [00:01<00:00, 6.78it/s, est. speed input: 474.32 toks/s, output: 108.42 toks/s]
...
```

{{% notice Note %}}
As CPU support in vLLM continues to mature, the manual build will be replaced by a simple `pip install` flow for easier setup in the near future.
{{% /notice %}}
---
title: Quantize an LLM to INT4 for Arm Platform
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

You can accelerate many LLMs on Arm CPUs with 4‑bit quantization. This guide uses `deepseek-ai/DeepSeek-V2-Lite` as the example model; its INT4 weights are executed through vLLM's Arm KleidiAI‑accelerated path.

## Install quantization tools

Install the vLLM model quantization packages:

```bash
pip install --no-deps compressed-tensors
pip install llmcompressor
```

Reinstall your locally built vLLM if you rebuilt it:

```bash
pip install --no-deps dist/*.whl
```

If your chosen model is gated on Hugging Face, authenticate first:

```bash
huggingface-cli login
```

## INT4 Quantization recipe

Save the following as `quantize_vllm_models.py`:

```python
import argparse
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

from compressed_tensors.quantization import QuantizationScheme
from compressed_tensors.quantization.quant_args import (
    QuantizationArgs,
    QuantizationStrategy,
    QuantizationType,
)

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization.quantization import QuantizationModifier


def main():
    parser = argparse.ArgumentParser(description="Quantize a model using INT4 with minmax or mse and dynamic activation quantization.")
    parser.add_argument("model_id", type=str, help="Model identifier or path")
    parser.add_argument("--method", type=str, choices=["minmax", "mse"], default="mse", help="Quantization method")
    parser.add_argument("--scheme", type=str, choices=["channelwise", "groupwise"], required=True, help="Quantization scheme for weights")
    parser.add_argument("--groupsize", type=int, default=32, help="Group size for groupwise quantization")
    args = parser.parse_args()

    # Extract base model name for output dir
    base_model_name = os.path.basename(args.model_id.rstrip("/"))
    act_tag = "a8dyn"
    suffix = f"{args.method}-{args.scheme}"
    if args.scheme == "groupwise":
        suffix += f"-g{args.groupsize}"
    output_dir = f"{base_model_name}-w4{act_tag}-{suffix}"

    print(f"Loading model '{args.model_id}'...")
    model = AutoModelForCausalLM.from_pretrained(
        args.model_id, trust_remote_code=True
    )
    model = model.to(torch.float32)
    tokenizer = AutoTokenizer.from_pretrained(args.model_id)

    # Weight quantization args
    strategy = QuantizationStrategy.CHANNEL if args.scheme == "channelwise" else QuantizationStrategy.GROUP
    weights_args = QuantizationArgs(
        num_bits=4,
        type=QuantizationType.INT,
        strategy=strategy,
        symmetric=True,
        dynamic=False,
        group_size=args.groupsize if args.scheme == "groupwise" else None,
        observer=args.method,
    )

    # Activation quantization
    input_acts = QuantizationArgs(
        num_bits=8,
        type=QuantizationType.INT,
        strategy=QuantizationStrategy.TOKEN,
        symmetric=False,
        dynamic=True,
        observer=None,
    )
    output_acts = None

    # Create quantization scheme and recipe
    scheme = QuantizationScheme(
        targets=["Linear"],
        weights=weights_args,
        input_activations=input_acts,
        output_activations=output_acts,
    )
    recipe = QuantizationModifier(
        config_groups={"group_0": scheme},
        ignore=["lm_head"],
    )

    # Run quantization
    oneshot(
        model=model,
        recipe=recipe,
        tokenizer=tokenizer,
        output_dir=output_dir,
        trust_remote_code_model=True,
    )

    print(f"Quantized model saved to: {output_dir}")


if __name__ == "__main__":
    main()
```

This script creates an Arm KleidiAI‑compatible, 4‑bit quantized copy of the model and saves it to a new directory.

## Quantize DeepSeek‑V2‑Lite model

### Quantization parameter tuning
1. You can choose the `minmax` method (faster quantization) or `mse` (more accurate, but slower).
2. `channelwise` is a good default for most models.
3. `groupwise` can improve accuracy further; `--groupsize 32` is common.

```bash
# DeepSeek example
python quantize_vllm_models.py deepseek-ai/DeepSeek-V2-Lite \
--scheme channelwise --method mse
```

The 4-bit quantized DeepSeek-V2-Lite model is stored in the directory:

```text
DeepSeek-V2-Lite-w4a8dyn-mse-channelwise
```

You will load this quantized model directory with vLLM in the next step.
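
Before moving on, you can optionally confirm that the quantization metadata was written. llmcompressor typically records the scheme in the model's `config.json`; this is a quick, hedged check and key names may vary across versions:

```python
import json

# Directory produced by the quantization script above
out_dir = "DeepSeek-V2-Lite-w4a8dyn-mse-channelwise"

with open(f"{out_dir}/config.json") as f:
    cfg = json.load(f)

# Expect a "quantization_config" entry describing the compressed-tensors scheme
print(json.dumps(cfg.get("quantization_config", {}), indent=2)[:500])
```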
---
title: Serve high throughput inference with vLLM
weight: 4

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## About batch sizing in vLLM

vLLM enforces two limits to balance memory use and throughput: a per‑sequence length limit (`max_model_len`) and a per‑batch token limit (`max_num_batched_tokens`). No single request can exceed the sequence limit, and the sum of tokens scheduled in a batch must stay within the batch limit.
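
The interaction between the two limits can be illustrated with a simplified admission loop (a sketch only; the real scheduler also accounts for KV-cache space and decode tokens):

```python
MAX_MODEL_LEN = 4096           # per-sequence limit (prompt + generated tokens)
MAX_NUM_BATCHED_TOKENS = 4096  # per-step token budget across the whole batch

prompt_lens = [512, 1024, 2048, 1024]  # hypothetical queued prompt lengths

batch, used = [], 0
for n in prompt_lens:
    if n > MAX_MODEL_LEN:
        continue                           # request too long: rejected outright
    if used + n > MAX_NUM_BATCHED_TOKENS:
        break                              # budget exhausted: remaining prompts wait
    batch.append(n)
    used += n

print(batch, used)  # [512, 1024, 2048] 3584 -> the fourth prompt runs in a later step
```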

## Serve an OpenAI‑compatible API

Start the server with sensible CPU default parameters and a quantized model:

```bash
export VLLM_TARGET_DEVICE=cpu
export VLLM_CPU_KVCACHE_SPACE=32
export VLLM_CPU_OMP_THREADS_BIND="0-$(($(nproc)-1))"
export VLLM_MLA_DISABLE=1
export ONEDNN_DEFAULT_FPMATH_MODE=BF16
export OMP_NUM_THREADS="$(nproc)"
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4

vllm serve DeepSeek-V2-Lite-w4a8dyn-mse-channelwise \
--dtype float32 --max-model-len 4096 --max-num-batched-tokens 4096
```
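
Once the server reports it is ready, confirm that a single request succeeds before generating concurrent load. A minimal check with the OpenAI Python client (the `EMPTY` API key is a placeholder accepted by vLLM's server):

```python
from openai import OpenAI

# Point the client at the local vLLM server started above
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="DeepSeek-V2-Lite-w4a8dyn-mse-channelwise",
    messages=[{"role": "user", "content": "In one sentence, what is vLLM?"}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```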

## Run multi‑request batch

After confirming that a single request works (see the check above), simulate concurrent load with a small OpenAI‑compatible client. Save this as `batch_test.py`:

```python
import asyncio
import time
from openai import AsyncOpenAI

# vLLM's OpenAI-compatible server
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

model = "DeepSeek-V2-Lite-w4a8dyn-mse-channelwise"  # vLLM server model

# Batch of 8 prompts
messages_list = [
    [{"role": "user", "content": "Explain Big O notation with two examples."}],
    [{"role": "user", "content": "Show a simple recursive function and explain how it works."}],
    [{"role": "user", "content": "Draft a polite email requesting a project deadline extension."}],
    [{"role": "user", "content": "Explain what a hash function is and common uses."}],
    [{"role": "user", "content": "Explain binary search and its time complexity."}],
    [{"role": "user", "content": "Write a Python function that checks if a string is a palindrome."}],
    [{"role": "user", "content": "Explain how caching improves performance with a simple analogy."}],
    [{"role": "user", "content": "Explain the difference between supervised and unsupervised learning."}],
]

CONCURRENCY = 8

async def run_one(i: int, messages):
    resp = await client.chat.completions.create(
        model=model,
        messages=messages,
        max_tokens=128,  # Adjust to suit your configuration
    )
    return i, resp.choices[0].message.content

async def main():
    t0 = time.time()
    sem = asyncio.Semaphore(CONCURRENCY)

    async def guarded_run(i, msgs):
        async with sem:
            try:
                return await run_one(i, msgs)
            except Exception as e:
                return i, f"[ERROR] {type(e).__name__}: {e}"

    tasks = [asyncio.create_task(guarded_run(i, msgs)) for i, msgs in enumerate(messages_list, start=1)]
    results = await asyncio.gather(*tasks)  # order corresponds to tasks list

    # Print outputs in input order
    results.sort(key=lambda x: x[0])
    for idx, out in results:
        print(f"\n=== Output {idx} ===\n{out}\n")

    print(f"Batch completed in: {time.time() - t0:.2f}s")

if __name__ == "__main__":
    asyncio.run(main())
```

Run 8 concurrent requests against your server:

```bash
python3 batch_test.py
```

This validates multi‑request behavior and shows aggregate throughput in the server logs.

```output
(APIServer pid=4474) INFO 11-10 01:00:56 [loggers.py:221] Engine 000: Avg prompt throughput: 19.7 tokens/s, Avg generation throughput: 187.2 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.6%, Prefix cache hit rate: 0.0%
(APIServer pid=4474) INFO: 127.0.0.1:44060 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=4474) INFO: 127.0.0.1:44134 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=4474) INFO: 127.0.0.1:44076 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=4474) INFO: 127.0.0.1:44068 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=4474) INFO: 127.0.0.1:44100 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=4474) INFO: 127.0.0.1:44112 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=4474) INFO: 127.0.0.1:44090 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=4474) INFO: 127.0.0.1:44120 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=4474) INFO 11-10 01:01:06 [loggers.py:221] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 57.5 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
```
## Optional: Serve a non-quantized BF16 model

For the BF16 path on Arm, vLLM is accelerated by its direct oneDNN integration with the Arm Compute Library, which provides highly optimized aarch64 kernels for the non-quantized model:

```bash
vllm serve deepseek-ai/DeepSeek-V2-Lite \
--dtype bfloat16 --max-model-len 4096 \
--max-num-batched-tokens 4096
```

## Go Beyond: Power Up Your vLLM Workflow
Now that you’ve successfully quantized and served a model using vLLM on Arm, here are some further ways to explore:

* **Try different models:** Apply the same steps to other [Hugging Face models](https://huggingface.co/models) like Llama, Qwen or Gemma.

* **Connect a chat client:** Link your server with OpenAI-compatible UIs like [Open WebUI](https://github.com/open-webui/open-webui).