
Commit 0e9b405

Merge pull request #1815 from ranimandepudi/main
Quantize and Run a Large Language Model using vLLM on Arm Servers
2 parents 5661e42 + 0954bfe commit 0e9b405

File tree: 6 files changed (+554, -1 lines)
Lines changed: 112 additions & 0 deletions
@@ -0,0 +1,112 @@
---
title: Overview and Environment Setup
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Overview

[vLLM](https://github.com/vllm-project/vllm) is an open-source, high-throughput inference engine designed to efficiently serve large language models (LLMs). It offers an OpenAI-compatible API, supports dynamic batching, and is optimized for low-latency performance, making it suitable for both real-time and batch inference workloads.

This learning path walks through how to combine vLLM with INT8 quantization techniques to reduce memory usage and improve inference speed, enabling large models like Llama 3.1 to run effectively on Arm-based CPUs.

The model featured in this guide, [Llama 3.1 8B Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct), is sourced from Hugging Face, quantized using `llmcompressor`, and deployed using vLLM.

Testing for this learning path was performed on an AWS Graviton instance (c8g.16xlarge). The instructions are intended for Arm-based servers running Ubuntu 24.04 LTS.

## Learning Path Setup

This learning path uses a Python virtual environment (`venv`) to manage dependencies in an isolated workspace. This approach ensures a clean environment, avoids version conflicts, and makes it easy to reproduce results, especially when using custom-built packages like `vLLM` and `PyTorch`.

### Set up the Python environment

To get started, create a virtual environment and activate it as shown below:

```bash
sudo apt update
sudo apt install -y python3 python3-venv
python3 -m venv vllm_env
source vllm_env/bin/activate
pip install --upgrade pip
```
This creates a local Python environment named `vllm_env` and upgrades pip to the latest version.
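
To confirm that the environment is active, you can check that the `python` on your `PATH` now resolves inside `vllm_env`:

```bash
# Both commands should point at the interpreter inside vllm_env
which python
python --version
```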

### Install system dependencies

These packages are needed to build libraries like OpenBLAS and manage system-level performance:

```bash
sudo apt-get update -y
sudo apt-get install -y gcc-12 g++-12 libnuma-dev python3-pip
sudo apt install -y python-is-python3
```
Set the system default compilers to version 12:

```bash
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 \
  --slave /usr/bin/g++ g++ /usr/bin/g++-12
```
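
To confirm the new defaults, check the versions reported by the compilers:

```bash
# Both should report major version 12 after the update-alternatives step
gcc --version
g++ --version
```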
Next, install the [`tcmalloc`](https://docs.vllm.ai/en/latest/getting_started/installation/cpu.html?device=arm) memory allocator, which helps improve performance during inference:

```bash
sudo apt-get install -y libtcmalloc-minimal4
```
This library will be preloaded during model serving to reduce latency and improve memory efficiency.
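
If you want to confirm where the library landed, you can list it directly. On Ubuntu 24.04 for Arm it is typically installed under `/usr/lib/aarch64-linux-gnu/`, which is the path preloaded later when launching the server:

```bash
# Verify the tcmalloc shared library installed by the package
ls -l /usr/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4
```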

### Install OpenBLAS

OpenBLAS is an optimized linear algebra library that improves performance for matrix-heavy operations, which are common in LLM inference. To get the best performance on Arm CPUs, it's recommended to build OpenBLAS from source.

Run these commands to clone and build OpenBLAS:
```bash
git clone https://github.com/OpenMathLib/OpenBLAS.git
cd OpenBLAS
git checkout ef9e3f715
```
{{% notice Note %}}
This commit is known to work reliably with Arm CPU optimizations (BF16, OpenMP) and has been tested in this learning path. Using it ensures consistent behavior. You can try `main`, but newer commits may introduce changes that haven't been validated here.
{{% /notice %}}

```bash
make -j$(nproc) BUILD_BFLOAT16=1 USE_OPENMP=1 NO_SHARED=0 DYNAMIC_ARCH=1 TARGET=ARMV8 CFLAGS=-O3
make -j$(nproc) BUILD_BFLOAT16=1 USE_OPENMP=1 NO_SHARED=0 DYNAMIC_ARCH=1 TARGET=ARMV8 CFLAGS=-O3 PREFIX=/home/ubuntu/OpenBLAS/dist install
```
This will build and install OpenBLAS into `/home/ubuntu/OpenBLAS/dist` with optimizations for Arm CPUs.
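
As a quick sanity check, you can confirm the build artifacts exist. The shared library in the source tree is the one preloaded later when serving the model, and the `dist` prefix holds the installed copies:

```bash
# Library produced in the source tree by 'make'
ls -l /home/ubuntu/OpenBLAS/libopenblas.so
# Headers and libraries installed under the PREFIX passed to 'make install'
ls /home/ubuntu/OpenBLAS/dist
```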

### Install Python dependencies

Once the system libraries are in place, install the Python packages required for model quantization and serving. You’ll use prebuilt CPU wheels for vLLM and PyTorch, and install additional tools like `llmcompressor` and `torchvision`.

Before proceeding, make sure the following files are downloaded to your home directory:
```bash
vllm-0.7.3.dev151+gfaee222b.cpu-cp312-cp312-linux_aarch64.whl
torch-2.7.0.dev20250306-cp312-cp312-manylinux_2_28_aarch64.whl
```
These are required to complete the installation and model quantization steps.

Now, navigate to your home directory:
```bash
cd /home/ubuntu/
```

Install the vLLM wheel. This wheel contains the CPU-optimized version of `vLLM`, built specifically for Arm architecture. Installing it from a local `.whl` file ensures compatibility with the rest of your environment and avoids potential conflicts from nightly or default pip installations.

```bash
pip install vllm-0.7.3.dev151+gfaee222b.cpu-cp312-cp312-linux_aarch64.whl --force-reinstall
```
Install `llmcompressor`, which is used to quantize the model:
```bash
pip install llmcompressor
```
Install torchvision (nightly version for CPU):
```bash
pip install --force-reinstall torchvision==0.22.0.dev20250213 --extra-index-url https://download.pytorch.org/whl/nightly/cpu
```
Install the custom PyTorch CPU wheel. It is prebuilt for Arm CPU architectures and includes the necessary optimizations for running inference. Installing it locally ensures compatibility with your environment and avoids conflicts with default pip packages.
```bash
pip install torch-2.7.0.dev20250306-cp312-cp312-manylinux_2_28_aarch64.whl --force-reinstall --no-deps
```

You’re now ready to quantize the model and start serving it with `vLLM` on an Arm-based system.
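
You can optionally verify that the CPU wheels were picked up correctly. The exact version strings depend on the wheels you installed, but all of the imports should succeed without errors:

```bash
python -c "import torch; print(torch.__version__)"
python -c "import vllm; print(vllm.__version__)"
python -c "import llmcompressor; print('llmcompressor imported successfully')"
```
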
Lines changed: 178 additions & 0 deletions
@@ -0,0 +1,178 @@
---
title: Quantize and Launch the vLLM server
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Access the Model from Hugging Face

Before quantizing, authenticate with Hugging Face using a personal access token. You can generate one from your [Hugging Face Hub](https://huggingface.co/) account under Access Tokens:

```bash
huggingface-cli login --token $hf_token
```
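
If the login succeeds, you can optionally confirm that you are authenticated before continuing:

```bash
# Prints the username associated with the stored token
huggingface-cli whoami
```
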
## Quantization Script Template

Create the `vllm_quantize_model.py` script shown below to quantize the model:
```python
import argparse
import os
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modifiers.quantization import QuantizationModifier
from compressed_tensors.quantization import QuantizationScheme
from compressed_tensors.quantization.quant_args import (
    QuantizationArgs,
    QuantizationStrategy,
    QuantizationType,
)
from llmcompressor.transformers import oneshot


def main():
    parser = argparse.ArgumentParser(
        description="Quantize a model using LLM Compressor with customizable mode, scheme, and group size."
    )
    parser.add_argument(
        "model_id",
        type=str,
        help="Model identifier or path (e.g., 'meta-llama/Llama-2-13b-chat-hf' or '/path/to/model')",
    )
    parser.add_argument(
        "--mode",
        type=str,
        choices=["int4", "int8"],
        required=True,
        help="Quantization mode: int4 or int8",
    )
    parser.add_argument(
        "--scheme",
        type=str,
        choices=["channelwise", "groupwise"],
        required=True,
        help="Quantization scheme for weights (groupwise is only supported for int4)",
    )
    parser.add_argument(
        "--groupsize",
        type=int,
        default=32,
        help="Group size for groupwise quantization (only used when scheme is groupwise). Defaults to 32."
    )
    args = parser.parse_args()

    # Validate unsupported configuration
    if args.mode == "int8" and args.scheme == "groupwise":
        raise ValueError("Groupwise int8 is unsupported. Please use channelwise for int8.")

    # Extract a base model name from the model id or path for the output directory
    if "/" in args.model_id:
        base_model_name = args.model_id.split("/")[-1]
    else:
        base_model_name = os.path.basename(args.model_id)

    # Determine output directory based on mode and scheme
    if args.mode == "int4":
        output_dir = f"{base_model_name}-w4a8-{args.scheme}"
    else:  # int8
        output_dir = f"{base_model_name}-w8a8-{args.scheme}"

    print(f"Loading model '{args.model_id}'...")
    model = AutoModelForCausalLM.from_pretrained(
        args.model_id, device_map="auto", torch_dtype="auto", trust_remote_code=True
    )
    tokenizer = AutoTokenizer.from_pretrained(args.model_id)

    # Define quantization arguments based on mode and chosen scheme.
    if args.mode == "int8":
        # Only channelwise is supported for int8.
        weights_args = QuantizationArgs(
            num_bits=8,
            type=QuantizationType.INT,
            strategy=QuantizationStrategy.CHANNEL,
            symmetric=True,
            dynamic=False,
        )
    else:  # int4 mode
        if args.scheme == "channelwise":
            strategy = QuantizationStrategy.CHANNEL
            weights_args = QuantizationArgs(
                num_bits=4,
                type=QuantizationType.INT,
                strategy=strategy,
                symmetric=True,
                dynamic=False,
            )
        else:  # groupwise
            strategy = QuantizationStrategy.GROUP
            weights_args = QuantizationArgs(
                num_bits=4,
                type=QuantizationType.INT,
                strategy=strategy,
                group_size=args.groupsize,
                symmetric=True,
                dynamic=False
            )

    # Activation quantization remains the same for both modes.
    activations_args = QuantizationArgs(
        num_bits=8,
        type=QuantizationType.INT,
        strategy=QuantizationStrategy.TOKEN,
        symmetric=False,
        dynamic=True,
        observer=None,
    )

    # Create a quantization scheme for Linear layers.
    scheme = QuantizationScheme(
        targets=["Linear"],
        weights=weights_args,
        input_activations=activations_args,
    )

    # Create a quantization modifier. We ignore the "lm_head" layer.
    modifier = QuantizationModifier(config_groups={"group_0": scheme}, ignore=["lm_head"])

    # Apply quantization and save the quantized model.
    oneshot(
        model=model,
        recipe=modifier,
        tokenizer=tokenizer,
        output_dir=output_dir,
    )
    print(f"Quantized model saved to: {output_dir}")


if __name__ == "__main__":
    main()
```
Then run the quantization script `vllm_quantize_model.py`. This generates an INT8 quantized version of the model using channelwise weight quantization, which reduces memory usage while maintaining model accuracy:

```bash
cd /home/ubuntu/
python vllm_quantize_model.py meta-llama/Llama-3.1-8B-Instruct --mode int8 --scheme channelwise
```
The output model will be saved locally at `/home/ubuntu/Llama-3.1-8B-Instruct-w8a8-channelwise`.
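
You can confirm that the quantized checkpoint was written by listing the output directory. You should see the usual Hugging Face layout (model config, tokenizer files, and safetensors weight shards); exact filenames can vary with library versions:

```bash
ls /home/ubuntu/Llama-3.1-8B-Instruct-w8a8-channelwise
```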

## Launch the vLLM server

The vLLM server exposes the OpenAI-compatible `/v1/chat/completions` API. In this learning path it is used for single-prompt testing with `curl` and for batch testing with a custom Python script that simulates multiple concurrent requests.

Once the model is quantized, launch the vLLM server to enable CPU-based inference. This configuration uses `tcmalloc` and the optimized `OpenBLAS` build to improve performance and reduce latency:

```bash
LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4:/home/ubuntu/OpenBLAS/libopenblas.so \
ONEDNN_DEFAULT_FPMATH_MODE=BF16 \
VLLM_TARGET_DEVICE=cpu \
VLLM_CPU_KVCACHE_SPACE=32 \
VLLM_CPU_OMP_THREADS_BIND="0-$(($(nproc) - 1))" \
vllm serve /home/ubuntu/Llama-3.1-8B-Instruct-w8a8-channelwise \
  --dtype float32 --swap-space 16
```
This command starts the vLLM server using the quantized model. It preloads `tcmalloc` for efficient memory allocation and uses OpenBLAS for accelerated matrix operations. Thread binding is dynamically set based on the number of available cores to maximize parallelism on Arm CPUs.
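
Once the server reports that it is ready, you can send a single test prompt from another terminal. This is a minimal sketch that assumes the default port `8000` and that the model is addressed by the same path passed to `vllm serve`; adjust both if your setup differs:

```bash
# Send one chat completion request to the OpenAI-compatible endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/home/ubuntu/Llama-3.1-8B-Instruct-w8a8-channelwise",
        "messages": [
          {"role": "user", "content": "Summarize what vLLM does in one sentence."}
        ],
        "max_tokens": 64
      }'
```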