
Commit 137278a

Merge pull request #1859 from pareenaverma/content_review
Tech review of vLLM LP
2 parents 0e9b405 + 693c77f commit 137278a

File tree: 4 files changed, +32 −18 lines

content/learning-paths/servers-and-cloud-computing/vLLM-quant/1-overview.md

Lines changed: 6 additions & 7 deletions
@@ -40,8 +40,7 @@ These packages are needed to build libraries like OpenBLAS and manage system-lev

 ```bash
 sudo apt-get update -y
-sudo apt-get install -y gcc-12 g++-12 libnuma-dev python3-pip
-sudo apt install python-is-python3
+sudo apt-get install -y gcc-12 g++-12 libnuma-dev make
 ```
 Set the system default compilers to version 12:

@@ -82,12 +81,12 @@ Once the system libraries are in place, install the Python packages required for

 Before proceeding, make sure the following files are downloaded to your home directory:
 ```bash
-
+[PLACEHOLDER]
 ```
 These are required to complete the installation and model quantization steps.
 Now, navigate to your home directory:
 ```bash
-cd /home/ubuntu/
+cd $HOME
 ```

 Install the vLLM wheel. This wheel contains the CPU-optimized version of `vLLM`, built specifically for Arm architecture. Installing it from a local `.whl` file ensures compatibility with the rest of your environment and avoids potential conflicts from nightly or default pip installations.
@@ -101,12 +100,12 @@ pip install llmcompressor
 ```
 Install torchvision (nightly version for CPU):
 ```bash
-pip install --force-reinstall torchvision==0.22.0.dev20250213 --extra-index-url https://download.pytorch.org/whl/nightly/cpu
-```
+pip install --force-reinstall torchvision==0.22.0.dev20250223 --extra-index-url https://download.pytorch.org/whl/nightly/cpu```
+
 Install the custom PyTorch CPU wheel:<br>
 This custom PyTorch wheel is prebuilt for Arm CPU architectures and includes the necessary optimizations for running inference. Installing it locally ensures compatibility with your environment and avoids conflicts with default pip packages.
 ```bash
 pip install torch-2.7.0.dev20250306-cp312-cp312-manylinux_2_28_aarch64.whl --force-reinstall --no-deps
 ```

-You’re now ready to quantize the model and start serving it with `vLLM` on an Arm-based system.
+You’re now ready to quantize the model and start serving it with `vLLM` on an Arm-based system.
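
Not part of the diff, but after the install steps above a quick sanity check helps confirm that the Arm CPU wheels resolved in the active Python environment. This is a minimal sketch; the exact version strings depend on the wheels you installed:

```python
# Confirm the CPU wheels are importable and report their versions
# (expected roughly: torch 2.7.0.dev*, torchvision 0.22.0.dev*).
import torch
import torchvision
import vllm

print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("vllm:", vllm.__version__)
```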

content/learning-paths/servers-and-cloud-computing/vLLM-quant/2-quantize-model.md

Lines changed: 16 additions & 5 deletions
@@ -15,7 +15,7 @@ huggingface-cli login --token $hf_token
 ```
 ## Quantization Script Template

-Create the `vllm_quantize_model.py` script shown below to quantize the model :
+Using a file editor of your choice, create a file named `vllm_quantize_model.py` and copy the content shown below to quantize the model:
 ```bash
 import argparse
 import os
@@ -153,11 +153,10 @@ if __name__ == "__main__":
 Then run the quantization script using `vllm_quantize_model.py`. This generates an INT8 quantized version of the model using channelwise precision, which reduces memory usage while maintaining model accuracy:

 ```bash
-cd /home/ubuntu/
 python vllm_quantize_model.py meta-llama/Llama-3.1-8B-Instruct --mode int8 --scheme channelwise
 ```
-The output model will be saved locally at:
-`/home/ubuntu/Llama-3.1-8B-Instruct-w8a8-channelwise`.
+The quantized model will be saved at:
+`$HOME/Llama-3.1-8B-Instruct-w8a8-channelwise`.

 ## Launch the vLLM server

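The body of `vllm_quantize_model.py` is not included in this diff beyond its first lines. As context for the invocation above, the sketch below shows one way such a CLI could be wired up with llm-compressor's `oneshot` entry point; the recipe, calibration dataset, and sample counts are illustrative assumptions, not the script from the Learning Path:

```python
# Hypothetical skeleton matching the documented invocation:
#   python vllm_quantize_model.py meta-llama/Llama-3.1-8B-Instruct --mode int8 --scheme channelwise
# Recipe and calibration settings below are assumptions for illustration only.
import argparse
import os

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

parser = argparse.ArgumentParser(description="Quantize a Hugging Face model for vLLM CPU serving")
parser.add_argument("model_id", help="Hugging Face model ID, e.g. meta-llama/Llama-3.1-8B-Instruct")
parser.add_argument("--mode", default="int8", help="Quantization mode")
parser.add_argument("--scheme", default="channelwise", help="Weight quantization scheme")
args = parser.parse_args()

# W8A8: INT8 weights and activations, per-channel weight scales, skipping the output head.
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

output_dir = os.path.join(
    os.path.expanduser("~"),
    f"{args.model_id.split('/')[-1]}-w8a8-{args.scheme}",
)

# One-shot post-training quantization over a small calibration set.
oneshot(
    model=args.model_id,
    dataset="open_platypus",
    recipe=recipe,
    output_dir=output_dir,
    max_seq_length=2048,
    num_calibration_samples=512,
)
print(f"Quantized model written to {output_dir}")
```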
@@ -171,8 +170,20 @@ ONEDNN_DEFAULT_FPMATH_MODE=BF16 \
 VLLM_TARGET_DEVICE=cpu \
 VLLM_CPU_KVCACHE_SPACE=32 \
 VLLM_CPU_OMP_THREADS_BIND="0-$(($(nproc) - 1))" \
-vllm serve /home/ubuntu/Llama-3.1-8B-Instruct-w8a8-channelwise \
+vllm serve $HOME/Llama-3.1-8B-Instruct-w8a8-channelwise \
 --dtype float32 --swap-space 16
 ```
 This command starts the vLLM server using the quantized model. It preloads `tcmalloc` for efficient memory allocation and uses OpenBLAS for accelerated matrix operations. Thread binding is dynamically set based on the number of available cores to maximize parallelism on Arm CPUs.

+The output from launching the vLLM server with the quantized model should look like:
+
+```output
+INFO 04-23 21:13:59 launcher.py:31] Route: /rerank, Methods: POST
+INFO 04-23 21:13:59 launcher.py:31] Route: /v1/rerank, Methods: POST
+INFO 04-23 21:13:59 launcher.py:31] Route: /v2/rerank, Methods: POST
+INFO 04-23 21:13:59 launcher.py:31] Route: /invocations, Methods: POST
+INFO: Started server process [77356]
+INFO: Waiting for application startup.
+INFO: Application startup complete.
+```
+
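Once the server reports "Application startup complete", a small request to the OpenAI-compatible model list endpoint is an easy way to confirm it is accepting connections. This is an illustrative check that assumes the default port 8000 on localhost:

```python
# Minimal readiness check against the vLLM OpenAI-compatible server.
# Assumes the server launched above is listening on localhost:8000.
import requests

resp = requests.get("http://localhost:8000/v1/models", timeout=10)
resp.raise_for_status()
for model in resp.json().get("data", []):
    # The served model ID is the path that was passed to `vllm serve`.
    print(model["id"])
```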

content/learning-paths/servers-and-cloud-computing/vLLM-quant/3-run-benchmark.md

Lines changed: 5 additions & 5 deletions
@@ -8,7 +8,7 @@ layout: learningpathall

 ## Run Single Inference

-Once the server is running, start by verifying it with a basic single-prompt request using `curl`. This confirms the server is running correctly and that the OpenAI-compatible /v1/chat/completions API is responding as expected:
+Once the server is running, open another terminal and verify it is running as expected with a basic single-prompt request using `curl`. This confirms the server is running correctly and that the OpenAI-compatible /v1/chat/completions API is responding as expected:

 ```bash
 curl http://localhost:8000/v1/chat/completions \
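
The hunk above cuts off the body of the `curl` request. For reference, an equivalent single-prompt chat completion can be sent from Python with `requests`; the prompt text and token limit here are illustrative, and the model name must match the path given to `vllm serve`:

```python
# Single-prompt request to the OpenAI-compatible /v1/chat/completions endpoint.
import os

import requests

url = "http://localhost:8000/v1/chat/completions"
payload = {
    "model": os.path.expanduser("~/Llama-3.1-8B-Instruct-w8a8-channelwise"),
    "messages": [{"role": "user", "content": "Give me a one-line description of vLLM."}],
    "max_tokens": 64,
}

resp = requests.post(url, json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```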
@@ -51,8 +51,8 @@ After confirming single-prompt inference, run batch testing to simulate concurre

 Use the following Python script to simulate concurrent user interactions.

-Save it as `batch_test.py`:
-```bash
+Save the content shown below in a file named `batch_test.py`:
+```python
 import requests
 import json
 import os
@@ -151,10 +151,10 @@ if __name__ == "__main__":
 Then, run it using:

 ```bash
-python batch_test.py localhost 8000 --schema http --batch 16 -m /home/ubuntu/Llama-3.1-8B-Instruct-w8a8-channelwise
+python3 batch_test.py localhost 8000 --schema http --batch 16 -m $HOME/Llama-3.1-8B-Instruct-w8a8-channelwise
 ```
 This simulates multiple users interacting with the model in parallel and helps validate server-side performance under load.
-You can modify the number of requests using the --batch flag or review/edit batch_test.py to customize prompt content and concurrency logic.
+You can modify the number of requests using the --batch flag or review and edit `batch_test.py` to customize prompt content and concurrency logic.

 When the test completes, server logs will display a summary including average prompt throughput and generation throughput. This helps benchmark how well the model performs under concurrent load on your Arm-based system.
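
The diff shows only the first imports of `batch_test.py`, not its body. As a rough picture of the concurrent load it generates, the sketch below sends `--batch` parallel chat-completion requests from a thread pool; the argument names mirror the invocation above, but the implementation itself is an assumption, not the Learning Path's script:

```python
# Illustrative concurrent load generator mirroring the documented invocation:
#   python3 batch_test.py localhost 8000 --schema http --batch 16 -m $HOME/Llama-3.1-8B-Instruct-w8a8-channelwise
# This is a sketch, not the batch_test.py shipped with the Learning Path.
import argparse
from concurrent.futures import ThreadPoolExecutor

import requests


def send_prompt(url: str, model: str, prompt: str) -> str:
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }
    resp = requests.post(url, json=payload, timeout=300)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("host")
    parser.add_argument("port", type=int)
    parser.add_argument("--schema", default="http")
    parser.add_argument("--batch", type=int, default=16)
    parser.add_argument("-m", "--model", required=True)
    args = parser.parse_args()

    url = f"{args.schema}://{args.host}:{args.port}/v1/chat/completions"
    prompts = [f"Request {i}: summarize vLLM in one sentence." for i in range(args.batch)]

    # Fire all prompts in parallel to exercise the server under concurrent load.
    with ThreadPoolExecutor(max_workers=args.batch) as pool:
        results = pool.map(lambda p: send_prompt(url, args.model, p), prompts)
        for i, answer in enumerate(results):
            print(f"[{i}] {answer[:80]}")
```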

content/learning-paths/servers-and-cloud-computing/vLLM-quant/_index.md

Lines changed: 5 additions & 1 deletion
@@ -1,6 +1,9 @@
 ---
 title: Quantize and Run a Large Language Model using vLLM on Arm Servers

+draft: true
+cascade:
+    draft: true

 minutes_to_complete: 45

@@ -15,11 +18,12 @@ learning_objectives:


 prerequisites:
-- An Arm-based server or cloud instance running with at least 32 CPU cores, 64 GB RAM and 80 GB of available disk space.
+- An Arm-based server or cloud instance running with at least 32 CPU cores, 64 GB RAM and 32 GB of available disk space.
 - Familiarity with Python and machine learning concepts.
 - An active Hugging Face account with access to the target model.

 author: Rani Chowdary Mandepudi
+Phalani Paladugu

 ### Tags
 skilllevels: Introductory
