Commit a88f070

Merge pull request #1363 from TianyuLi0/add_rtp_llm
add rtp-llm LLM chatbot LP
2 parents 548c1ef + 94ce062

File tree

5 files changed: +465 −0 lines changed
Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
---
title: Run a Large Language Model (LLM) chatbot with rtp-llm on Arm servers

minutes_to_complete: 30

who_is_this_for: This is an introductory topic for developers interested in running LLMs on Arm-based servers.

learning_objectives:
- Build rtp-llm on your Arm server.
- Download a Qwen model from Hugging Face.
- Run a Large Language Model with rtp-llm.

prerequisites:
- An Arm Neoverse N2 or Neoverse V2 [based instance](/learning-paths/servers-and-cloud-computing/csp/) from a cloud service provider, or an on-premise Arm server. This Learning Path was tested on an Alibaba Cloud Yitian 710 g8y.8xlarge instance and an AWS Graviton4 r8g.8xlarge instance to verify the Arm performance optimizations.

author_primary: Tianyu Li

### Tags
skilllevels: Introductory
subjects: ML
armips:
- Neoverse
operatingsystems:
- Linux
tools_software_languages:
- LLM
- GenAI
- Python


### FIXED, DO NOT MODIFY
# ================================================================================
weight: 1                       # _index.md always has weight of 1 to order correctly
layout: "learningpathall"       # All files under learning paths have this same wrapper
learning_path_main_page: "yes"  # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
---
Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
---
next_step_guidance: >
    Thank you for completing this Learning Path on how to run an LLM chatbot on an Arm-based server. You might be interested in learning how to run an NLP sentiment analysis model on an Arm-based server.

recommended_path: "/learning-paths/servers-and-cloud-computing/nlp-hugging-face/"

further_reading:
    - resource:
        title: Getting started with RTP-LLM
        link: https://github.com/alibaba/rtp-llm
        type: documentation
    - resource:
        title: Hugging Face Documentation
        link: https://huggingface.co/docs
        type: documentation
    - resource:
        title: Democratizing Generative AI with CPU-based inference
        link: https://blogs.oracle.com/ai-and-datascience/post/democratizing-generative-ai-with-cpu-based-inference
        type: blog
    - resource:
        title: Qwen2-0.5B-Instruct
        link: https://huggingface.co/Qwen/Qwen2-0.5B-Instruct
        type: website


# ================================================================================
# FIXED, DO NOT MODIFY
# ================================================================================
weight: 21                  # set to always be larger than the content in this path, and one more than 'review'
title: "Next Steps"         # Always the same
layout: "learningpathall"   # All files under learning paths have this same wrapper
---
Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
---
review:
    - questions:
        question: >
            Can you run LLMs on Arm CPUs?
        answers:
            - "Yes"
            - "No"
        correct_answer: 1
        explanation: >
            Yes. The advancements made in the Generative AI space with smaller parameter models make LLM inference on CPUs very efficient.

    - questions:
        question: >
            Can rtp-llm be built and run on CPUs?
        answers:
            - "Yes"
            - "No"
        correct_answer: 1
        explanation: >
            Yes. rtp-llm not only supports building and running on GPUs, it can also be built and run on Arm CPUs.

# ================================================================================
# FIXED, DO NOT MODIFY
# ================================================================================
title: "Review"             # Always the same title
weight: 20                  # Set to always be larger than the content in this path
layout: "learningpathall"   # All files under learning paths have this same wrapper
---
Lines changed: 178 additions & 0 deletions
@@ -0,0 +1,178 @@
---
title: Run a Large Language Model (LLM) chatbot with rtp-llm on Arm servers
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Before you begin
The instructions in this Learning Path are for any Arm Neoverse N2 or Neoverse V2 based server running Ubuntu 22.04 LTS. You need an Arm server instance with at least four cores and 16GB of RAM to run this example. Configure at least 32 GB of disk storage. The instructions have been tested on an Alibaba Cloud g8y.8xlarge instance and an AWS Graviton4 r8g.8xlarge instance.

## Overview

Arm CPUs are widely used in traditional ML and AI use cases. In this Learning Path, you will learn how to run a generative AI inference use case, an LLM chatbot, on Arm-based CPUs. You do this by deploying the [Qwen2-0.5B-Instruct model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) on your Arm-based CPU using `rtp-llm`.

[rtp-llm](https://github.com/alibaba/rtp-llm) is an open-source C/C++ project developed by Alibaba that enables efficient LLM inference on a variety of hardware.

## Install dependencies

Install `micromamba` and use it to set up Python 3.10 at the path `/opt/conda310`, which the `rtp-llm` build system requires:

```bash
"${SHELL}" <(curl -L micro.mamba.pm/install.sh)
source ~/.bashrc
sudo ${HOME}/.local/bin/micromamba -r /opt/conda310 install python=3.10
micromamba -r /opt/conda310 shell
```
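
As an optional sanity check, you can confirm that the interpreter the build system expects is in place; the path below assumes the `/opt/conda310` prefix used above:

```bash
# Verify Python 3.10 was installed under the micromamba prefix
/opt/conda310/bin/python --version   # expected output: Python 3.10.x
```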

Install `bazelisk` to build `rtp-llm`:

```bash
wget https://github.com/bazelbuild/bazelisk/releases/download/v1.22.1/bazelisk-linux-arm64
chmod +x bazelisk-linux-arm64
sudo mv bazelisk-linux-arm64 /usr/bin/bazelisk
```
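
Optionally, confirm the binary is installed and executable (on first use, `bazelisk` downloads the Bazel release the project requests):

```bash
# bazelisk delegates to the appropriate Bazel version
bazelisk version
```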

Install `git`, `gcc`, and `g++` on your machine:

```bash
sudo apt install git -y
sudo apt install build-essential -y
```
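
If you want to double-check that you are building on an Arm machine with a working toolchain, you can run:

```bash
uname -m        # expected output: aarch64
gcc --version   # Ubuntu 22.04 ships GCC 11 by default
```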

Install the `openblas` development package and fix the header paths:

```bash
sudo apt install libopenblas-dev
sudo mkdir -p /usr/include/openblas
sudo ln -sf /usr/include/aarch64-linux-gnu/cblas.h /usr/include/openblas/cblas.h
```
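
Optionally verify that the symlink resolves; on arm64 Ubuntu, `libopenblas-dev` installs the header under the multiarch include directory referenced above:

```bash
# The link should point at the aarch64 multiarch copy of cblas.h
ls -l /usr/include/openblas/cblas.h
```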

## Download and build rtp-llm

You are now ready to start building `rtp-llm`.

Clone the source repository for rtp-llm and check out the commit that this Learning Path was tested with:

```bash
git clone https://github.com/alibaba/rtp-llm
cd rtp-llm
git checkout 4656265
```
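
To confirm that you are on the pinned commit:

```bash
# HEAD should now point at the commit this Learning Path was tested with
git rev-parse --short HEAD   # expected output: 4656265
```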

Comment out lines 7-10 in `deps/requirements_lock_torch_arm.txt`, because some of the package hosts they point to are not publicly accessible:

```bash
sed -i '7,10 s/^/#/' deps/requirements_lock_torch_arm.txt
```
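
You can confirm the edit took effect; after the `sed` command, lines 7-10 should start with `#`:

```bash
# Print the first 10 lines; lines 7-10 should now be commented out
sed -n '1,10p' deps/requirements_lock_torch_arm.txt
```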

By default, `rtp-llm` builds for GPU only on Linux. You need to pass the extra configuration flag `--config=arm` to build it for the Arm CPU that you will run it on.

Configure and build:

```bash
bazelisk build --config=arm //maga_transformer:maga_transformer_aarch64
```
The output from your build should look like:

```output
INFO: 10094 processes: 8717 internal, 1377 local.
INFO: Build completed successfully, 10094 total actions
```

Install the built wheel package:

```bash
pip install bazel-bin/maga_transformer/maga_transformer-0.2.0-cp310-cp310-linux_aarch64.whl
```
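
Optionally, check that the package imports cleanly in the Python 3.10 environment you set up earlier:

```bash
# A silent exit (status 0) confirms the wheel and its dependencies installed correctly
python -c "import maga_transformer"
```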

Create a file named `python-test.py` in your `/tmp` directory with the contents below:

```python
from maga_transformer.pipeline import Pipeline
from maga_transformer.model_factory import ModelFactory
from maga_transformer.openai.openai_endpoint import OpenaiEndopoint
from maga_transformer.openai.api_datatype import ChatCompletionRequest, ChatMessage, RoleEnum
from maga_transformer.distribute.worker_info import update_master_info

import asyncio
import json
import os

async def main():
    update_master_info('127.0.0.1', 42345)
    os.environ["MODEL_TYPE"] = os.environ.get("MODEL_TYPE", "qwen2")
    os.environ["CHECKPOINT_PATH"] = os.environ.get("CHECKPOINT_PATH", "Qwen/Qwen2-0.5B-Instruct")
    os.environ["RESERVER_RUNTIME_MEM_MB"] = "0"
    os.environ["DEVICE_RESERVE_MEMORY_BYTES"] = f"{128 * 1024 ** 2}"
    model_config = ModelFactory.create_normal_model_config()
    model = ModelFactory.from_huggingface(model_config.ckpt_path, model_config=model_config)
    pipeline = Pipeline(model, model.tokenizer)

    # usual request
    for res in pipeline("<|im_start|>user\nhello, what's your name<|im_end|>\n<|im_start|>assistant\n", max_new_tokens = 100):
        print(res.generate_texts)

    # openai request
    openai_endpoint = OpenaiEndopoint(model)
    messages = [
        ChatMessage(**{
            "role": RoleEnum.user,
            "content": "Who are you?",
        }),
    ]
    request = ChatCompletionRequest(messages=messages, stream=False)
    response = openai_endpoint.chat_completion(request_id=0, chat_request=request, raw_request=None)
    async for res in response:
        pass
    print((await response.gen_complete_response_once()).model_dump_json(indent=4))

    pipeline.stop()

if __name__ == '__main__':
    asyncio.run(main())
```

Now run this file:

```bash
python /tmp/python-test.py
```
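
Because the script reads `MODEL_TYPE` and `CHECKPOINT_PATH` from the environment before falling back to its defaults, you can point it at a different checkpoint without editing the file. The larger `Qwen/Qwen2-1.5B-Instruct` model named below only illustrates the override; it is not something this Learning Path was tested with:

```bash
# Hypothetical example: override the default model through environment variables
MODEL_TYPE=qwen2 CHECKPOINT_PATH=Qwen/Qwen2-1.5B-Instruct python /tmp/python-test.py
```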

If `rtp-llm` has built correctly on your machine, you will see the LLM model's response to the prompt input. A snippet of the output is shown below:

```output
['I am a large language model created by Alibaba Cloud. My name is Qwen.']
{
    "id": "chat-",
    "object": "chat.completion",
    "created": 1730272196,
    "model": "AsyncModel",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "I am a large language model created by Alibaba Cloud. I am called Qwen.",
                "function_call": null,
                "tool_calls": null
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 23,
        "total_tokens": 40,
        "completion_tokens": 17,
        "completion_tokens_details": null,
        "prompt_tokens_details": null
    },
    "debug_info": null,
    "aux_info": null
}
```

You have successfully run an LLM chatbot with Arm optimizations, running entirely on the Arm AArch64 CPU of your server. You can continue experimenting with the model and trying out different prompts.