
Commit 7a8c687

Merge pull request #1377 from madeline-underwood/RTP-LLM-chatbot
Rtp llm chatbot_KB to review
2 parents 4cc83ec + f77a1c1 commit 7a8c687

6 files changed: +106 −50 lines changed

content/learning-paths/servers-and-cloud-computing/rtp-llm/_index.md

Lines changed: 5 additions & 4 deletions

@@ -1,17 +1,18 @@
 ---
-title: Run a Large Language Model (LLM) chatbot with rtp-llm on Arm servers
+title: Run an LLM chatbot with rtp-llm on Arm-based servers

 minutes_to_complete: 30

-who_is_this_for: This is an introductory topic for developers interested in running LLMs on Arm-based servers.
+who_is_this_for: This is an introductory topic for developers who are interested in running a Large Language Model (LLM) with rtp-llm on Arm-based servers.

 learning_objectives:
-    - Build rtp-llm on your Arm server.
+    - Build rtp-llm on an Arm-based server.
     - Download a Qwen model from Hugging Face.
     - Run a Large Language Model with rtp-llm.

 prerequisites:
-    - An Arm Neoverse N2 or Neoverse V2 [based instance](/learning-paths/servers-and-cloud-computing/csp/) from a cloud service provider or an on-premise Arm server. This Learning Path was tested on an AliCloud Yitian710 g8y.8xlarge instance and an AWS Graviton4 r8g.8xlarge instance to test Arm performance optimizations.
+    - Any Arm Neoverse N2-based or Arm Neoverse V2-based instance running Ubuntu 22.04 LTS from a cloud service provider or an on-premise Arm server.
+    - For the server, at least four cores and 16GB of RAM, with disk storage configured up to at least 32 GB.

 author_primary: Tianyu Li
content/learning-paths/servers-and-cloud-computing/rtp-llm/_next-steps.md

Lines changed: 8 additions & 2 deletions

@@ -4,7 +4,12 @@ next_step_guidance: >

 recommended_path: "/learning-paths/servers-and-cloud-computing/nlp-hugging-face/"

+
 further_reading:
+    - resource:
+        title: Qwen2-0.5B-Instruct
+        link: https://huggingface.co/Qwen/Qwen2-0.5B-Instruct
+        type: website
     - resource:
         title: Getting started with RTP-LLM
         link: https://github.com/alibaba/rtp-llm
@@ -18,9 +23,10 @@ further_reading:
         link: https://blogs.oracle.com/ai-and-datascience/post/democratizing-generative-ai-with-cpu-based-inference
         type: blog
     - resource:
-        title: Qwen2-0.5B-Instruct
-        link: https://huggingface.co/Qwen/Qwen2-0.5B-Instruct
+        title: Get started with Arm-based cloud instances
+        link: https://learn.arm.com/learning-paths/servers-and-cloud-computing/csp/
         type: website
+


 # ================================================================================

content/learning-paths/servers-and-cloud-computing/rtp-llm/_review.md

Lines changed: 14 additions & 4 deletions

@@ -2,23 +2,33 @@
 review:
     - questions:
         question: >
-            Can you run LLMs on Arm CPUs?
+            Are at least four cores, 16GB of RAM, and 32GB of disk storage required to run the LLM chatbot using rtp-llm on an Arm-based server?
         answers:
            - "Yes"
            - "No"
         correct_answer: 1
         explanation: >
-            Yes. The advancements made in the Generative AI space with smaller parameter models make LLM inference on CPUs very efficient.
+            It depends on the size of the LLM. The higher the number of parameters of the model, the greater the system requirements.

     - questions:
         question: >
-            Can rtp-llm be built and run on CPU?
+            Does the rtp-llm project use the --config=arm option to optimize LLM inference for Arm CPUs?
         answers:
            - "Yes"
            - "No"
         correct_answer: 1
         explanation: >
-            Yes. rtp-llm not only support built and run on GPU, but also it can be run on Arm CPU.
+            rtp-llm uses the GPU for inference by default. rtp-llm optimizes LLM inference on Arm architecture by providing a configuration option --config=arm during the build process.
+
+    - questions:
+        question: >
+            Is the given Python script the only way to run the LLM chatbot on an Arm AArch64 CPU and output a response from the model?
+        answers:
+           - "Yes"
+           - "No"
+        correct_answer: 2
+        explanation: >
+            rtp-llm can also be deployed as an API server, and the user can use curl or another client to generate an LLM chatbot response.

 # ================================================================================
 # FIXED, DO NOT MODIFY
Lines changed: 33 additions & 0 deletions

@@ -0,0 +1,33 @@
+---
+title: Background
+weight: 2
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+Arm CPUs are widely used in ML and AI use cases. In this Learning Path, you will learn how to run the generative AI inference-based use case of an LLM chatbot on an Arm-based CPU. You will do this by deploying the [Qwen2-0.5B-Instruct model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) on an Arm-based CPU using `rtp-llm`.
+
+
+{{% notice Note %}}
+This Learning Path has been tested on an Alibaba Cloud g8y.8xlarge instance and an AWS Graviton4 r8g.8xlarge instance.
+{{% /notice %}}
+
+
+[rtp-llm](https://github.com/alibaba/rtp-llm) is an open-source C/C++ project developed by Alibaba that enables efficient LLM inference on a variety of hardware.
+
+RTP-LLM is a Large Language Model inference acceleration engine developed by Alibaba. Qwen is the name given to a series of Large Language Models developed by Alibaba Cloud that are capable of performing a variety of tasks.
+
+Alibaba Cloud offer a wide range of models, each suitable for different tasks and use cases.
+
+Besides generating text, they are also able to perform actions such as:
+
+* Answering questions, through information retrieval, and analysis.
+* Processing images, and producing written descriptions of visual content.
+* Processing audio content.
+* Provide multilingual support, with over 27 additional languages, on top of the core languages of English and Chinese.
+
+Qwen is open source, flexible, and encourages contribution from the software development community.
+
+
+

content/learning-paths/servers-and-cloud-computing/rtp-llm/rtp-llm-chatbot.md

Lines changed: 15 additions & 21 deletions

@@ -1,23 +1,13 @@
 ---
-title: Run a Large Language model (LLM) chatbot with rtp-llm on Arm servers
+title: Run an LLM chatbot with rtp-llm on an Arm server
 weight: 3

 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
-
-## Before you begin
-The instructions in this Learning Path are for any Arm Neoverse N2 or Neoverse V2 based server running Ubuntu 22.04 LTS. You need an Arm server instance with at least four cores and 16GB of RAM to run this example. Configure disk storage up to at least 32 GB. The instructions have been tested on an Alibaba Cloud g8y.8xlarge instance and an AWS Graviton4 r8g.8xlarge instance.
-
-## Overview
-
-Arm CPUs are widely used in traditional ML and AI use cases. In this Learning Path, you will learn how to run generative AI inference-based use case like a LLM chatbot on Arm-based CPUs. You do this by deploying the [Qwen2-0.5B-Instruct model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) on your Arm-based CPU using `rtp-llm`.
-
-[rtp-llm](https://github.com/alibaba/rtp-llm) is an open source C/C++ project developed by Alibaba that enables efficient LLM inference on a variety of hardware.
-
 ## Install dependencies

-Install `micromamba` to setup python 3.10 at path `/opt/conda310`, required by `rtp-llm` build system:
+Install `micromamba` to set up python 3.10 at path `/opt/conda310`, as required by the `rtp-llm` build system:

 ```bash
 "${SHELL}" <(curl -L micro.mamba.pm/install.sh)
@@ -34,14 +24,14 @@ chmod +x bazelisk-linux-arm64
 sudo mv bazelisk-linux-arm64 /usr/bin/bazelisk
 ```

-Install `git/gcc/g++` on your machine:
+Install `git/gcc/g++`:

 ```bash
 sudo apt install git -y
 sudo apt install build-essential -y
 ```

-Install `openblas` developmwnt package and fix the header paths:
+Install the `openblas` development package and fix the header paths:

 ```bash
 sudo apt install libopenblas-dev
@@ -53,28 +43,28 @@ sudo ln -sf /usr/include/aarch64-linux-gnu/cblas.h /usr/include/openblas/cblas.h

 You are now ready to start building `rtp-llm`.

-Clone the source repository for rtp-llm:
+Start by cloning the source repository for rtp-llm:

 ```bash
 git clone https://github.com/alibaba/rtp-llm
 cd rtp-llm
 git checkout 4656265
 ```

-Comment out the lines 7-10 in `deps/requirements_lock_torch_arm.txt` as some hosts are not accessible from the Internet.
+Next, comment out lines 7-10 in `deps/requirements_lock_torch_arm.txt` as some hosts are not accessible from the web:

 ```bash
 sed -i '7,10 s/^/#/' deps/requirements_lock_torch_arm.txt
 ```

-By default, `rtp-llm` builds for GPU only on Linux. You need to provide extra config `--config=arm` to build it for the Arm CPU that you will run it on:
+By default, `rtp-llm` builds for GPU only on Linux. You need to provide the additional flag `--config=arm` to build it for the Arm CPU that you will run it on.

 Configure and build:

 ```bash
 bazelisk build --config=arm //maga_transformer:maga_transformer_aarch64
 ```
-The output from your build should look like:
+The output from your build should look like this:

 ```output
 INFO: 10094 processes: 8717 internal, 1377 local.
@@ -87,7 +77,7 @@ Install the built wheel package:
 pip install bazel-bin/maga_transformer/maga_transformer-0.2.0-cp310-cp310-linux_aarch64.whl
 ```

-Create a file named `python-test.py` in your `/tmp` directory with the contents below:
+Create a file named `python-test.py` in your `/tmp` directory with the contents shown below:

 ```python
 from maga_transformer.pipeline import Pipeline
@@ -140,7 +130,9 @@ Now run this file:
 python /tmp/python-test.py
 ```

-If `rtp-llm` has built correctly on your machine, you will see the LLM model response for the prompt input. A snippet of the output is shown below:
+If `rtp-llm` has built correctly on your machine, you will see the LLM model response for the prompt input.
+
+A snippet of the output is shown below:

 ```output
 ['I am a large language model created by Alibaba Cloud. My name is Qwen.']
@@ -174,5 +166,7 @@ If `rtp-llm` has built correctly on your machine, you will see the LLM model res
 ```


-You have successfully run a LLM chatbot with Arm optimizations, all running on your Arm AArch64 CPU on your server. You can continue experimenting and trying out the model with different prompts.
+You have successfully run a LLM chatbot with Arm optimizations, running on an Arm AArch64 CPU on your server.
+
+You can continue to experiment with the chatbot by trying out different prompts on the model.

content/learning-paths/servers-and-cloud-computing/rtp-llm/rtp-llm-server.md

Lines changed: 31 additions & 19 deletions

@@ -5,25 +5,32 @@ weight: 4
 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
+## Setup

-You can use the `rtp-llm` server program and submit requests using an OpenAI-compatible API.
-This enables applications to be created which access the LLM multiple times without starting and stopping it. You can also access the server over the network to another machine hosting the LLM.
+You can now move on to using the `rtp-llm` server program and submitting requests using an OpenAI-compatible API.

-One additional software package is required for this section. Install `jq` on your computer using:
+This enables applications to be created which access the LLM multiple times without starting and stopping it.
+
+You can also access the server over the network to another machine hosting the LLM.
+
+One additional software package is required for this section.
+
+Install `jq` on your computer using the following commands:

 ```bash
 sudo apt install jq -y
 ```

-# Running the Server
-## Install Hugging Face Hub
+## Running the Server

-There are a few different ways you can download the Qwen2 0.5B model. In this Learning Path, you download the model from Hugging Face.
+There are a few different ways you can download the Qwen2 0.5B model. In this Learning Path, you will download the model from Hugging Face.

-[Hugging Face](https://huggingface.co/) is an open source AI community where you can host your own AI models, train them and collaborate with others in the community. You can browse through the thousands of models that are available for a variety of use cases like NLP, audio, and computer vision.
+[Hugging Face](https://huggingface.co/) is an open source AI community where you can host your own AI models, train them, and collaborate with others in the community. You can browse through thousands of models that are available for a variety of use cases such as Natural Language Processing (NLP), audio, and computer vision.

 The `huggingface_hub` library provides APIs and tools that let you easily download and fine-tune pre-trained models. You will use `huggingface-cli` to download the [Qwen2 0.5B model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct).

+## Install Hugging Face Hub
+
 Install the required Python packages:

 ```bash
@@ -51,14 +58,18 @@ You can now download the model using the huggingface cli:
 huggingface-cli download Qwen/Qwen2-0.5B-Instruct
 ```

-## Start rtp-llm server
-The server executable has already compiled during the stage detailed in the previous section, when you ran `bazelisk build`. Install the pip wheel in your active virtual environment:
+## Start the rtp-llm server
+
+{{% notice Note %}}
+The server executable compiled during the previous stage, when you ran `bazelisk build`. {{% /notice %}}
+
+Install the pip wheel in your active virtual environment:

 ```bash
 pip install bazel-bin/maga_transformer/maga_transformer-0.2.0-cp310-cp310-linux_aarch64.whl
 pip install grpcio-tools
 ```
-Start the server from the command line, it listens on port 8088:
+Start the server from the command line. It listens on port 8088:

 ```bash
 export CHECKPOINT_PATH=${HOME}/.cache/huggingface/hub/models--Qwen--Qwen2-0.5B-Instruct/snapshots/c540970f9e29518b1d8f06ab8b24cba66ad77b6d/
@@ -67,8 +78,9 @@ export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python
 MODEL_TYPE=qwen_2 FT_SERVER_TEST=1 python3 -m maga_transformer.start_server
 ```

-# Client
-## Use curl
+## Client
+
+### Using curl

 You can access the API using the `curl` command.

@@ -90,15 +102,15 @@ curl http://localhost:8088/v1/chat/completions -H "Content-Type: application/jso
 }' 2>/dev/null | jq -C
 ```

-The `model` value in the API is not used, you can enter any value. This is because there is only one model loaded in the server.
+The `model` value in the API is not used, and you can enter any value. This is because there is only one model loaded in the server.

 Run the script:

 ```bash
 bash ./curl-test.sh
 ```

-The `curl` command accesses the LLM and you see the output:
+The `curl` command accesses the LLM and you should see the output:

 ```output
 {
@@ -124,9 +136,9 @@ The `curl` command accesses the LLM and you see the output:
 }
 ```

-In the returned JSON data you see the LLM output, including the content created from the prompt.
+In the returned JSON data, you will see the LLM output, including the content created from the prompt.

-## Use Python
+### Using Python

 You can also use a Python program to access the OpenAI-compatible API.
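The full `python-test.py` client is elided from this diff, so for reference, here is a minimal sketch of such a client. It assumes the standard `openai` Python package (installed separately with `pip install openai`, not part of this commit) and the server defaults shown earlier: port 8088 and an arbitrary `model` value.

```python
# Sketch of an OpenAI-compatible client for the rtp-llm server; the exact
# python-test.py from this commit is elided, so this is an illustrative
# reconstruction, not the committed script.
from openai import OpenAI

# The API key is unused by the local server, but the client library
# requires a non-empty value.
client = OpenAI(base_url="http://localhost:8088/v1", api_key="unused")

completion = client.chat.completions.create(
    model="anything",  # only one model is loaded, so this value is ignored
    messages=[
        {"role": "user", "content": "Write a hello world program in C++."}
    ],
    stream=True,
)

# Stream tokens as they arrive.
for chunk in completion:
    print(chunk.choices[0].delta.content or "", end="")
```

The streaming loop at the end mirrors the fragment of `python-test.py` that appears in the hunk below.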

@@ -165,13 +177,13 @@ for chunk in completion:
 print(chunk.choices[0].delta.content or "", end="")
 ```

-Run the Python file (make sure the server is still running):
+Ensure that the server is still running, and then run the Python file:

 ```bash
 python ./python-test.py
 ```

-You see the output generated by the LLM:
+You should see the output generated by the LLM:

 ```output
 Sure, here's a simple C++ program that prints "Hello, World!" to the console:
@@ -187,4 +199,4 @@ int main() {
 This program includes the `iostream` library, which is used for input/output operations. The `main` function is the entry point of the program, and it calls the `cout` object to print the message "Hello, World!" to the console.
 ```

-You can continue to experiment with different large language models and write scripts to access them.
+Now you can continue to experiment with different large language models, and have a go at writing scripts to access them.
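As a starting point for such experiments, one way to fetch a different model is the `huggingface_hub` Python API rather than `huggingface-cli`. The sketch below uses `Qwen/Qwen2-1.5B-Instruct` purely as a hypothetical example; whether a larger model performs acceptably on a given instance, and whether it loads with the same `MODEL_TYPE=qwen_2` setting, is not verified by this commit.

```python
# Sketch: download an alternative Qwen model with the huggingface_hub API
# instead of huggingface-cli. Assumes `pip install huggingface_hub`; the
# model choice is a hypothetical example, untested with rtp-llm here.
from huggingface_hub import snapshot_download

# snapshot_download stores files under ~/.cache/huggingface/hub and returns
# the local snapshot directory, which can then be exported as
# CHECKPOINT_PATH before starting the rtp-llm server.
checkpoint_path = snapshot_download(repo_id="Qwen/Qwen2-1.5B-Instruct")
print(checkpoint_path)
```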
