---
title: Run a Large Language Model (LLM) chatbot with rtp-llm on Arm servers
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Before you begin
The instructions in this Learning Path are for any Arm Neoverse N2 or Neoverse V2 based server running Ubuntu 22.04 LTS. You need an Arm server instance with at least four cores and 16 GB of RAM to run this example. Configure at least 32 GB of disk storage. The instructions have been tested on an Alibaba Cloud g8y.8xlarge instance and an AWS Graviton4 r8g.8xlarge instance.
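If you want to confirm that your instance meets these requirements before you start, you can check the core count, memory, disk space, and architecture with standard Linux tools:

```bash
# Check core count, available memory, and free disk space
nproc
free -h
df -h /
# Confirm you are on an AArch64 (Arm 64-bit) machine
uname -m
```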

## Overview

Arm CPUs are widely used in traditional ML and AI use cases. In this Learning Path, you will learn how to run a generative AI inference use case, such as an LLM chatbot, on Arm-based CPUs. You do this by deploying the [Qwen2-0.5B-Instruct model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) on your Arm-based CPU using `rtp-llm`.

[rtp-llm](https://github.com/alibaba/rtp-llm) is an open-source C/C++ project developed by Alibaba that enables efficient LLM inference on a variety of hardware.

## Install dependencies

Install `micromamba` and use it to set up Python 3.10 at `/opt/conda310`, the path required by the `rtp-llm` build system:

```bash
"${SHELL}" <(curl -L micro.mamba.pm/install.sh)
source ~/.bashrc
sudo ${HOME}/.local/bin/micromamba -r /opt/conda310 install python=3.10
micromamba -r /opt/conda310 shell
```
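
To confirm that the Python 3.10 environment was created where the build system expects it, you can check the interpreter version directly. This assumes the micromamba base environment is created directly at the root prefix `/opt/conda310`:

```bash
# The rtp-llm build system expects Python 3.10 under /opt/conda310
/opt/conda310/bin/python --version
```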

Install `bazelisk` to build `rtp-llm`:

```bash
wget https://github.com/bazelbuild/bazelisk/releases/download/v1.22.1/bazelisk-linux-arm64
chmod +x bazelisk-linux-arm64
sudo mv bazelisk-linux-arm64 /usr/bin/bazelisk
```
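
You can confirm that `bazelisk` is on your path with a version check. The first invocation downloads a Bazel release, so it can take a moment:

```bash
# bazelisk downloads and runs an appropriate Bazel version on first use
bazelisk version
```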

Install `git`, `gcc`, and `g++` on your machine:

```bash
sudo apt install git -y
sudo apt install build-essential -y
```
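
You can check that the toolchain installed correctly:

```bash
# Confirm the git and compiler versions
git --version
gcc --version
g++ --version
```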

Install the `openblas` development package and fix the header paths:

```bash
sudo apt install libopenblas-dev
sudo mkdir -p /usr/include/openblas
sudo ln -sf /usr/include/aarch64-linux-gnu/cblas.h /usr/include/openblas/cblas.h
```
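
To check that the header is now visible at the path the build expects, list the symlink you just created:

```bash
# The symlink should point at the Ubuntu multiarch cblas.h
ls -l /usr/include/openblas/cblas.h
```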

## Download and build rtp-llm

You are now ready to start building `rtp-llm`.

Clone the source repository for rtp-llm:

```bash
git clone https://github.com/alibaba/rtp-llm
cd rtp-llm
git checkout 4656265
```
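
You can confirm that you are on the expected commit:

```bash
# Should report commit 4656265
git log -1 --oneline
```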

Comment out lines 7-10 in `deps/requirements_lock_torch_arm.txt`, because some of the hosts referenced there are not accessible from the internet:

```bash
sed -i '7,10 s/^/#/' deps/requirements_lock_torch_arm.txt
```
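
If you want to confirm the edit, print the first lines of the lock file; lines 7-10 should now begin with `#`:

```bash
sed -n '1,12p' deps/requirements_lock_torch_arm.txt
```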

By default, `rtp-llm` builds only for GPU on Linux. You need to pass the extra configuration option `--config=arm` to build it for the Arm CPU that you will run it on.

Configure and build:

```bash
bazelisk build --config=arm //maga_transformer:maga_transformer_aarch64
```
The output from your build should look like:

```output
INFO: 10094 processes: 8717 internal, 1377 local.
INFO: Build completed successfully, 10094 total actions
```
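
The build produces a Python wheel under `bazel-bin`. You can locate it before installing; the exact filename may differ if the project version has changed:

```bash
# List the wheel file produced by the build
ls bazel-bin/maga_transformer/*.whl
```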

Install the built wheel package:

```bash
pip install bazel-bin/maga_transformer/maga_transformer-0.2.0-cp310-cp310-linux_aarch64.whl
```
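
You can verify that the package is installed and importable from the same Python environment; if the wheel installed correctly, both commands should succeed:

```bash
# Show package metadata and try importing the module
pip show maga_transformer
python -c "import maga_transformer; print('maga_transformer imported successfully')"
```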

Create a file named `python-test.py` in your `/tmp` directory with the contents below:

```python
from maga_transformer.pipeline import Pipeline
from maga_transformer.model_factory import ModelFactory
from maga_transformer.openai.openai_endpoint import OpenaiEndopoint
from maga_transformer.openai.api_datatype import ChatCompletionRequest, ChatMessage, RoleEnum
from maga_transformer.distribute.worker_info import update_master_info

import asyncio
import json
import os

async def main():
    # Point the worker at a local master address and port for this single-process run
    update_master_info('127.0.0.1', 42345)
    # Model type and checkpoint can be overridden through environment variables
    os.environ["MODEL_TYPE"] = os.environ.get("MODEL_TYPE", "qwen2")
    os.environ["CHECKPOINT_PATH"] = os.environ.get("CHECKPOINT_PATH", "Qwen/Qwen2-0.5B-Instruct")
    os.environ["RESERVER_RUNTIME_MEM_MB"] = "0"
    os.environ["DEVICE_RESERVE_MEMORY_BYTES"] = f"{128 * 1024 ** 2}"
    model_config = ModelFactory.create_normal_model_config()
    model = ModelFactory.from_huggingface(model_config.ckpt_path, model_config=model_config)
    pipeline = Pipeline(model, model.tokenizer)

    # usual request
    for res in pipeline("<|im_start|>user\nhello, what's your name<|im_end|>\n<|im_start|>assistant\n", max_new_tokens = 100):
        print(res.generate_texts)

    # openai request
    openai_endpoint = OpenaiEndopoint(model)
    messages = [
        ChatMessage(**{
            "role": RoleEnum.user,
            "content": "Who are you?",
        }),
    ]
    request = ChatCompletionRequest(messages=messages, stream=False)
    response = openai_endpoint.chat_completion(request_id=0, chat_request=request, raw_request=None)
    async for res in response:
        pass
    print((await response.gen_complete_response_once()).model_dump_json(indent=4))

    pipeline.stop()

if __name__ == '__main__':
    asyncio.run(main())
```

Now run this file:

```bash
python /tmp/python-test.py
```
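
The script reads `MODEL_TYPE` and `CHECKPOINT_PATH` from the environment, falling back to the Qwen2 defaults shown above, so you can set them explicitly on the command line. For example, passing the same defaults explicitly looks like this (any alternative checkpoint must be a model type that `rtp-llm` supports):

```bash
# MODEL_TYPE and CHECKPOINT_PATH override the defaults baked into python-test.py
MODEL_TYPE=qwen2 CHECKPOINT_PATH=Qwen/Qwen2-0.5B-Instruct python /tmp/python-test.py
```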

If `rtp-llm` has built correctly on your machine, you will see the LLM's response to the prompt input. A snippet of the output is shown below:

```output
['I am a large language model created by Alibaba Cloud. My name is Qwen.']
{
  "id": "chat-",
  "object": "chat.completion",
  "created": 1730272196,
  "model": "AsyncModel",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "I am a large language model created by Alibaba Cloud. I am called Qwen.",
        "function_call": null,
        "tool_calls": null
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 23,
    "total_tokens": 40,
    "completion_tokens": 17,
    "completion_tokens_details": null,
    "prompt_tokens_details": null
  },
  "debug_info": null,
  "aux_info": null
}
```


You have successfully run an LLM chatbot with Arm optimizations, running entirely on the Arm AArch64 CPU of your server. You can continue experimenting with the model and trying out different prompts.