
Commit 0f845e3

Merge pull request #1459 from madeline-underwood/vLLM
vLLM_AP approved
2 parents 37c3ead + 78ba5bc commit 0f845e3

6 files changed: 43 additions, 47 deletions


content/learning-paths/servers-and-cloud-computing/vLLM/_index.md

Lines changed: 6 additions & 10 deletions
@@ -1,22 +1,18 @@
 ---
-title: Large language models (LLMs) on Arm servers with vLLM
-
-draft: true
-cascade:
-draft: true
+title: Build and Run a Virtual Large Language Model on Arm Servers

 minutes_to_complete: 45

-who_is_this_for: This is an introductory topic for software developers and AI engineers interested in learning how to use vLLM (Virtual Large Language Model) on Arm servers.
+who_is_this_for: This is an introductory topic for software developers and AI engineers interested in learning how to use a vLLM (Virtual Large Language Model) on Arm servers.

 learning_objectives:
-- Build vLLM from source on an Arm server.
+- Build a vLLM from source on an Arm server.
 - Download a Qwen LLM from Hugging Face.
-- Run local batch inference using vLLM.
-- Create and interact with an OpenAI compatible server provided by vLLM on your Arm server..
+- Run local batch inference using a vLLM.
+- Create and interact with an OpenAI-compatible server provided by a vLLM on your Arm server.

 prerequisites:
-- An [Arm-based instance](/learning-paths/servers-and-cloud-computing/csp/) from a cloud service provider or a local Arm Linux computer with at least 8 CPUs and 16 GB RAM.
+- An [Arm-based instance](/learning-paths/servers-and-cloud-computing/csp/) from a cloud service provider, or a local Arm Linux computer with at least 8 CPUs and 16 GB RAM.

 author_primary: Jason Andrews


content/learning-paths/servers-and-cloud-computing/vLLM/_next-steps.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 ---
 next_step_guidance: >
-Thank you for completing this learning path on how to build and run vLLM on Arm servers. You might be interested in learning how to further optimize and benchmark LLM performance on Arm-based platforms.
+Thank you for completing this Learning Path on how to build and run vLLM on Arm servers. You might be interested in learning how to further optimize and benchmark LLM performance on Arm-based platforms.

 recommended_path: "/learning-paths/servers-and-cloud-computing/benchmark-nlp/"


content/learning-paths/servers-and-cloud-computing/vLLM/_review.md

Lines changed: 10 additions & 10 deletions
@@ -5,9 +5,9 @@ review:
 question: >
 What is the primary purpose of vLLM?
 answers:
-- "Operating System Development"
-- "Large Language Model Inference and Serving"
-- "Database Management"
+- "Operating System Development."
+- "Large Language Model Inference and Serving."
+- "Database Management."
 correct_answer: 2
 explanation: >
 vLLM is designed for fast and efficient Large Language Model inference and serving.
@@ -16,10 +16,10 @@ review:
 question: >
 In addition to Python, which extra programming languages are required by the vLLM build system?
 answers:
-- "Java"
-- "Rust"
-- "C++"
-- "Rust and C++"
+- "Java."
+- "Rust."
+- "C++."
+- "Rust and C++."
 correct_answer: 4
 explanation: >
 The vLLM build system requires the Rust toolchain and GCC for its compilation.
@@ -28,9 +28,9 @@ review:
 question: >
 What is the VLLM_TARGET_DEVICE environment variable set to for building vLLM for Arm CPUs?
 answers:
-- "cuda"
-- "gpu"
-- "cpu"
+- "cuda."
+- "gpu."
+- "cpu."
 correct_answer: 3
 explanation: >
 The VLLM_TARGET_DEVICE environment variable needs to be set to cpu to target the Arm processor.

content/learning-paths/servers-and-cloud-computing/vLLM/vllm-run.md

Lines changed: 9 additions & 9 deletions
@@ -8,27 +8,27 @@ layout: learningpathall

 ## Use a model from Hugging Face

-vLLM is designed to work seamlessly with models from the Hugging Face Hub,
+vLLM is designed to work seamlessly with models from the Hugging Face Hub.

-The first time you run vLLM it downloads the required model. This means you don't have to explicitly download any models.
+The first time you run vLLM, it downloads the required model. This means that you do not have to explicitly download any models.

-If you want to use a model that requires you to request access or accept terms, you need to log in to Hugging Face using a token.
+If you want to use a model that requires you to request access or accept the terms, you need to log in to Hugging Face using a token.

 ```bash
 huggingface-cli login
 ```

-Enter your Hugging Face token. You can generate a token from [Hugging Face Hub](https://huggingface.co/) by clicking your profile on the top right corner and selecting `Access Tokens`.
+Enter your Hugging Face token. You can generate a token from [Hugging Face Hub](https://huggingface.co/) by clicking your profile on the top right corner and selecting **Access Tokens**.

-You also need to visit the Hugging Face link printed in the login output and accept the terms by clicking the "Agree and access repository" button or filling out the request for access form (depending on the model).
+You also need to visit the Hugging Face link printed in the login output and accept the terms by clicking the **Agree and access repository** button or filling out the request-for-access form, depending on the model.

 To run batched inference without the need for a login, you can use the `Qwen/Qwen2.5-0.5B-Instruct` model.

 ## Create a batch script

-To run inference with multiple prompts you can create a simple Python script to load a model and run the prompts.
+To run inference with multiple prompts, you can create a simple Python script to load a model and run the prompts.

-Use a text editor to save the Python script below in a file called `batch.py`.
+Use a text editor to save the Python script below in a file called `batch.py`:

 ```python
 import json
@@ -72,7 +72,7 @@ Run the Python script:
 python ./batch.py
 ```

-The output shows vLLM starting, the model loading, and the batch processing of the 3 prompts:
+The output shows vLLM starting, the model loading, and the batch processing of the three prompts:

 ```output
 INFO 12-12 22:52:57 config.py:441] This model supports multiple tasks: {'generate', 'reward', 'embed', 'score', 'classify'}. Defaulting to 'generate'.
@@ -107,4 +107,4 @@ Processed prompts: 100%|██████████████████

 You can try with other prompts and models such as `meta-llama/Llama-3.2-1B`.

-Continue to learn how to setup an OpenAI compatible server.
+Continue to learn how to set up an OpenAI-compatible server.
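The diff shows only the top of `batch.py` (`import json`); the rest of the script is outside this hunk. As a hedged sketch of what a vLLM batch-inference script of this shape can look like — the prompt list, sampling settings, and output format below are illustrative assumptions, not the file's actual contents:

```python
# Illustrative sketch only: the real batch.py is not fully shown in this diff.
# The prompts and sampling values are assumptions.
import json


def build_prompts():
    # Three prompts, matching the "batch processing of the three prompts"
    # described in the surrounding text.
    return [
        "Write a hello world program in C.",
        "What is an Arm server?",
        "What does vLLM stand for?",
    ]


def run_batch(model="Qwen/Qwen2.5-0.5B-Instruct"):
    # Imported lazily so the sketch can be read without vLLM installed.
    from vllm import LLM, SamplingParams  # assumed vLLM Python API names

    llm = LLM(model=model)
    params = SamplingParams(temperature=0.8, max_tokens=256)
    for out in llm.generate(build_prompts(), params):
        # Print one JSON object per prompt/completion pair.
        print(json.dumps({"prompt": out.prompt, "text": out.outputs[0].text}))


# run_batch()  # uncomment to run after building vLLM for CPU as shown earlier
```

The `LLM` and `SamplingParams` names follow vLLM's public Python API; check the version you built for the exact signatures before relying on them.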

content/learning-paths/servers-and-cloud-computing/vLLM/vllm-server.md

Lines changed: 9 additions & 9 deletions
@@ -1,20 +1,20 @@
 ---
-title: Run an OpenAI compatible server
+title: Run an OpenAI-compatible server
 weight: 4

 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---

-Instead of a batch run from Python, you can create an OpenAI compatible server. This allows you to leverage the power of large language models without relying on external APIs.
+Instead of a batch run from Python, you can create an OpenAI-compatible server. This allows you to leverage the power of Large Language Models without relying on external APIs.

 Running a local LLM offers several advantages:

-Cost-Effective: Avoids the costs associated with using external APIs, especially for high-usage scenarios.
-Privacy: Keeps your data and prompts within your local environment, enhancing privacy and security.
-Offline Capability: Enables operation without an internet connection, making it ideal for scenarios with limited or unreliable network access.
+* Cost-effective - it avoids the costs associated with using external APIs, especially for high-usage scenarios.
+* Privacy - it keeps your data and prompts within your local environment, which enhances privacy and security.
+* Offline Capability - it enables operation without an internet connection, making it ideal for scenarios with limited or unreliable network access.

-OpenAI compatibility means you can reuse existing software which was designed to communicate with OpenAI and have it talk to your local vLLM service.
+OpenAI compatibility means that you can reuse existing software which was designed to communicate with OpenAI and use it to communicate with your local vLLM service.

 Run vLLM with the same `Qwen/Qwen2.5-0.5B-Instruct` model:

@@ -72,12 +72,12 @@ curl http://0.0.0.0:8000/v1/chat/completions \
 }'
 ```

-The server processes the request and the output prints the results:
+The server processes the request, and the output prints the results:

 ```output
 "id":"chatcmpl-6677cb4263b34d18b436b9cb8c6a5a65","object":"chat.completion","created":1734044182,"model":"Qwen/Qwen2.5-0.5B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"Certainly! Here is a simple \"Hello, World!\" program in C:\n\n```c\n#include <stdio.h>\n\nint main() {\n printf(\"Hello, World!\\n\");\n return 0;\n}\n```\n\nThis program defines a function called `main` which contains the body of the program. Inside the `main` function, it calls the `printf` function to display the text \"Hello, World!\" to the console. The `return 0` statement indicates that the program was successful and the program has ended.\n\nTo compile and run this program:\n\n1. Save the code above to a file named `hello.c`.\n2. Open a terminal or command prompt.\n3. Navigate to the directory where you saved the file.\n4. Compile the program using the following command:\n ```\n gcc hello.c -o hello\n ```\n5. Run the compiled program using the following command:\n ```\n ./hello\n ```\n Or simply type `hello` in the terminal.\n\nYou should see the output:\n\n```\nHello, World!\n```","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":26,"total_tokens":241,"completion_tokens":215,"prompt_tokens_details":null},"prompt_logprobs":null}
 ```

-There are many other experiments you can try. Most Hugging Face models have a `Use this model` button on the top right of the model card with the instructions for vLLM. You can now use these instructions on your Arm Linux computer.
+There are many other experiments you can try. Most Hugging Face models have a **Use this model** button on the top-right of the model card with the instructions for vLLM. You can now use these instructions on your Arm Linux computer.

-You can also try out OpenAI compatible chat clients to connect to the served model.
+You can also try out OpenAI-compatible chat clients to connect to the served model.
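As a companion to the curl example, here is a hedged, standard-library-only sketch of building the same chat-completions request body in Python and pulling the assistant's reply out of an OpenAI-style response. The prompt is an illustrative assumption, and the network call is left commented out because it requires the vLLM server above to be running:

```python
import json
import urllib.request

# The same kind of request body the curl example sends. The model name comes
# from the text; the prompt is an illustrative assumption.
payload = {
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "messages": [
        {"role": "user", "content": "Write a hello world program in C."}
    ],
}
body = json.dumps(payload).encode()


def extract_reply(response_json: str) -> str:
    """Pull the assistant's text out of an OpenAI-style chat.completion."""
    data = json.loads(response_json)
    return data["choices"][0]["message"]["content"]


# To send the request (requires the local vLLM server to be running):
# req = urllib.request.Request(
#     "http://0.0.0.0:8000/v1/chat/completions",
#     data=body,
#     headers={"Content-Type": "application/json"},
# )
# with urllib.request.urlopen(req) as resp:
#     print(extract_reply(resp.read().decode()))
```

Because the server speaks the OpenAI wire format, the same `extract_reply` helper works on responses from any OpenAI-compatible endpoint.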

content/learning-paths/servers-and-cloud-computing/vLLM/vllm-setup.md

Lines changed: 8 additions & 8 deletions
@@ -1,5 +1,5 @@
 ---
-title: Build vLLM from source code
+title: Build a vLLM from Source Code
 weight: 2

 ### FIXED, DO NOT MODIFY
@@ -8,13 +8,13 @@ layout: learningpathall

 ## Before you begin

-You can follow the instructions for this Learning Path using an Arm server running Ubuntu 24.04 LTS with at least 8 cores, 16GB of RAM, and 50GB of disk storage.
+To follow the instructions for this Learning Path, you will need an Arm server running Ubuntu 24.04 LTS with at least 8 cores, 16GB of RAM, and 50GB of disk storage.

 ## What is vLLM?

 [vLLM](https://github.com/vllm-project/vllm) stands for Virtual Large Language Model, and is a fast and easy-to-use library for inference and model serving.

-vLLM can be used in batch mode or by running an OpenAI compatible server.
+You can use vLLM in batch mode, or by running an OpenAI-compatible server.

 In this Learning Path, you will learn how to build vLLM from source and run inference on an Arm-based server, highlighting its effectiveness.

@@ -33,7 +33,7 @@ Set the default GCC to version 12:
 sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12
 ```

-Install Rust, refer to the [Rust install guide](/install-guides/rust/) if necessary.
+Next, install Rust. For more information, see the [Rust install guide](/install-guides/rust/).

 ```bash
 curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
@@ -42,7 +42,7 @@ source "$HOME/.cargo/env"

 Four environment variables are required. You can enter these at the command line or add them to your `$HOME/.bashrc` file and source the file.

-To add them at the command line:
+To add them at the command line, use the following:

 ```bash
 export CCACHE_DIR=/home/ubuntu/.cache/ccache
@@ -58,9 +58,9 @@ python -m venv env
 source env/bin/activate
 ```

-Your command line prompt has `(env)` in front of it indicating you are in the Python virtual environment.
+Your command-line prompt is prefixed by `(env)`, which indicates that you are in the Python virtual environment.

-Update Pip and install Python packages:
+Now update Pip and install Python packages:

 ```bash
 pip install --upgrade pip
@@ -69,7 +69,7 @@ pip install py-cpuinfo

 ### How do I download vLLM and build it?

-Clone the vLLM repository from GitHub:
+First, clone the vLLM repository from GitHub:

 ```bash
 git clone https://github.com/vllm-project/vllm.git
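The hunks above show only the first of the four required environment variables; the others are elided in this diff. As a sketch of the two values that are visible elsewhere in this commit (`CCACHE_DIR` from the hunk above, and `VLLM_TARGET_DEVICE=cpu` from the review questions) — the commented build commands are assumptions about vLLM's CPU build flow, not taken from this diff:

```shell
# CCACHE_DIR is the value shown in the diff; VLLM_TARGET_DEVICE=cpu is the
# value the review section gives for targeting Arm CPUs. The remaining two
# variables are elided in this diff, so they are not reproduced here.
export CCACHE_DIR=/home/ubuntu/.cache/ccache
export VLLM_TARGET_DEVICE=cpu

echo "building vLLM for target device: $VLLM_TARGET_DEVICE"

# Assumed build steps (run inside the cloned vllm directory with the Python
# virtual environment active); check the vLLM CPU installation docs for the
# exact commands for your version:
# pip install -r requirements-cpu.txt
# pip install -e . --no-build-isolation
```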
