
Commit 7a8c687

Merge pull request #1377 from madeline-underwood/RTP-LLM-chatbot
Rtp llm chatbot_KB to review
2 parents 4cc83ec + f77a1c1 commit 7a8c687

6 files changed: +106 −50 lines changed

content/learning-paths/servers-and-cloud-computing/rtp-llm/_index.md

Lines changed: 5 additions & 4 deletions

@@ -1,17 +1,18 @@
 ---
-title: Run a Large Language Model (LLM) chatbot with rtp-llm on Arm servers
+title: Run an LLM chatbot with rtp-llm on Arm-based servers

 minutes_to_complete: 30

-who_is_this_for: This is an introductory topic for developers interested in running LLMs on Arm-based servers.
+who_is_this_for: This is an introductory topic for developers who are interested in running a Large Language Model (LLM) with rtp-llm on Arm-based servers.

 learning_objectives:
-    - Build rtp-llm on your Arm server.
+    - Build rtp-llm on an Arm-based server.
     - Download a Qwen model from Hugging Face.
     - Run a Large Language Model with rtp-llm.

 prerequisites:
-    - An Arm Neoverse N2 or Neoverse V2 [based instance](/learning-paths/servers-and-cloud-computing/csp/) from a cloud service provider or an on-premise Arm server. This Learning Path was tested on an AliCloud Yitian710 g8y.8xlarge instance and an AWS Graviton4 r8g.8xlarge instance to test Arm performance optimizations.
+    - Any Arm Neoverse N2-based or Arm Neoverse V2-based instance running Ubuntu 22.04 LTS from a cloud service provider or an on-premise Arm server.
+    - For the server, at least four cores and 16GB of RAM, with disk storage configured up to at least 32 GB.

 author_primary: Tianyu Li
content/learning-paths/servers-and-cloud-computing/rtp-llm/_next-steps.md

Lines changed: 8 additions & 2 deletions

@@ -4,7 +4,12 @@ next_step_guidance: >

 recommended_path: "/learning-paths/servers-and-cloud-computing/nlp-hugging-face/"

+
 further_reading:
+    - resource:
+        title: Qwen2-0.5B-Instruct
+        link: https://huggingface.co/Qwen/Qwen2-0.5B-Instruct
+        type: website
     - resource:
         title: Getting started with RTP-LLM
         link: https://github.com/alibaba/rtp-llm
@@ -18,9 +23,10 @@ further_reading:
         link: https://blogs.oracle.com/ai-and-datascience/post/democratizing-generative-ai-with-cpu-based-inference
         type: blog
     - resource:
-        title: Qwen2-0.5B-Instruct
-        link: https://huggingface.co/Qwen/Qwen2-0.5B-Instruct
+        title: Get started with Arm-based cloud instances
+        link: https://learn.arm.com/learning-paths/servers-and-cloud-computing/csp/
         type: website
+


 # ================================================================================

content/learning-paths/servers-and-cloud-computing/rtp-llm/_review.md

Lines changed: 14 additions & 4 deletions

@@ -2,23 +2,33 @@
 review:
     - questions:
         question: >
-            Can you run LLMs on Arm CPUs?
+            Are at least four cores, 16GB of RAM, and 32GB of disk storage required to run the LLM chatbot using rtp-llm on an Arm-based server?
         answers:
            - "Yes"
            - "No"
         correct_answer: 1
         explanation: >
-            Yes. The advancements made in the Generative AI space with smaller parameter models make LLM inference on CPUs very efficient.
+            It depends on the size of the LLM. The higher the number of parameters of the model, the greater the system requirements.

     - questions:
         question: >
-            Can rtp-llm be built and run on CPU?
+            Does the rtp-llm project use the --config=arm option to optimize LLM inference for Arm CPUs?
         answers:
            - "Yes"
            - "No"
         correct_answer: 1
         explanation: >
-            Yes. rtp-llm not only support built and run on GPU, but also it can be run on Arm CPU.
+            rtp-llm uses the GPU for inference by default. rtp-llm optimizes LLM inference on Arm architecture by providing a configuration option --config=arm during the build process.
+
+    - questions:
+        question: >
+            Is the given Python script the only way to run the LLM chatbot on an Arm AArch64 CPU and output a response from the model?
+        answers:
+           - "Yes"
+           - "No"
+        correct_answer: 2
+        explanation: >
+            rtp-llm can also be deployed as an API server, and the user can use curl or another client to generate an LLM chatbot response.

 # ================================================================================
 # FIXED, DO NOT MODIFY
Lines changed: 33 additions & 0 deletions

@@ -0,0 +1,33 @@
+---
+title: Background
+weight: 2
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+Arm CPUs are widely used in ML and AI use cases. In this Learning Path, you will learn how to run the generative AI inference-based use case of an LLM chatbot on an Arm-based CPU. You will do this by deploying the [Qwen2-0.5B-Instruct model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) on an Arm-based CPU using `rtp-llm`.
+
+
+{{% notice Note %}}
+This Learning Path has been tested on an Alibaba Cloud g8y.8xlarge instance and an AWS Graviton4 r8g.8xlarge instance.
+{{% /notice %}}
+
+
+[rtp-llm](https://github.com/alibaba/rtp-llm) is an open-source C/C++ project developed by Alibaba that enables efficient LLM inference on a variety of hardware.
+
+RTP-LLM is a Large Language Model inference acceleration engine developed by Alibaba. Qwen is the name given to a series of Large Language Models developed by Alibaba Cloud that are capable of performing a variety of tasks.
+
+Alibaba Cloud offer a wide range of models, each suitable for different tasks and use cases.
+
+Besides generating text, they are also able to perform actions such as:
+
+* Answering questions, through information retrieval, and analysis.
+* Processing images, and producing written descriptions of visual content.
+* Processing audio content.
+* Provide multilingual support, with over 27 additional languages, on top of the core languages of English and Chinese.
+
+Qwen is open source, flexible, and encourages contribution from the software development community.
+
+
+

content/learning-paths/servers-and-cloud-computing/rtp-llm/rtp-llm-chatbot.md

Lines changed: 15 additions & 21 deletions

@@ -1,23 +1,13 @@
 ---
-title: Run a Large Language model (LLM) chatbot with rtp-llm on Arm servers
+title: Run an LLM chatbot with rtp-llm on an Arm server
 weight: 3

 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
-
-## Before you begin
-The instructions in this Learning Path are for any Arm Neoverse N2 or Neoverse V2 based server running Ubuntu 22.04 LTS. You need an Arm server instance with at least four cores and 16GB of RAM to run this example. Configure disk storage up to at least 32 GB. The instructions have been tested on an Alibaba Cloud g8y.8xlarge instance and an AWS Graviton4 r8g.8xlarge instance.
-
-## Overview
-
-Arm CPUs are widely used in traditional ML and AI use cases. In this Learning Path, you will learn how to run generative AI inference-based use case like a LLM chatbot on Arm-based CPUs. You do this by deploying the [Qwen2-0.5B-Instruct model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) on your Arm-based CPU using `rtp-llm`.
-
-[rtp-llm](https://github.com/alibaba/rtp-llm) is an open source C/C++ project developed by Alibaba that enables efficient LLM inference on a variety of hardware.
-
 ## Install dependencies

-Install `micromamba` to setup python 3.10 at path `/opt/conda310`, required by `rtp-llm` build system:
+Install `micromamba` to set up python 3.10 at path `/opt/conda310`, as required by the `rtp-llm` build system:

 ```bash
 "${SHELL}" <(curl -L micro.mamba.pm/install.sh)
@@ -34,14 +24,14 @@ chmod +x bazelisk-linux-arm64
 sudo mv bazelisk-linux-arm64 /usr/bin/bazelisk
 ```

-Install `git/gcc/g++` on your machine:
+Install `git/gcc/g++`:

 ```bash
 sudo apt install git -y
 sudo apt install build-essential -y
 ```

-Install `openblas` developmwnt package and fix the header paths:
+Install the `openblas` development package and fix the header paths:

 ```bash
 sudo apt install libopenblas-dev
@@ -53,28 +43,28 @@ sudo ln -sf /usr/include/aarch64-linux-gnu/cblas.h /usr/include/openblas/cblas.h

 You are now ready to start building `rtp-llm`.

-Clone the source repository for rtp-llm:
+Start by cloning the source repository for rtp-llm:

 ```bash
 git clone https://github.com/alibaba/rtp-llm
 cd rtp-llm
 git checkout 4656265
 ```

-Comment out the lines 7-10 in `deps/requirements_lock_torch_arm.txt` as some hosts are not accessible from the Internet.
+Next, comment out lines 7-10 in `deps/requirements_lock_torch_arm.txt` as some hosts are not accessible from the web:

 ```bash
 sed -i '7,10 s/^/#/' deps/requirements_lock_torch_arm.txt
 ```

-By default, `rtp-llm` builds for GPU only on Linux. You need to provide extra config `--config=arm` to build it for the Arm CPU that you will run it on:
+By default, `rtp-llm` builds for GPU only on Linux. You need to provide the additional flag `--config=arm` to build it for the Arm CPU that you will run it on.

 Configure and build:

 ```bash
 bazelisk build --config=arm //maga_transformer:maga_transformer_aarch64
 ```
-The output from your build should look like:
+The output from your build should look like this:

 ```output
 INFO: 10094 processes: 8717 internal, 1377 local.
@@ -87,7 +77,7 @@ Install the built wheel package:
 pip install bazel-bin/maga_transformer/maga_transformer-0.2.0-cp310-cp310-linux_aarch64.whl
 ```

-Create a file named `python-test.py` in your `/tmp` directory with the contents below:
+Create a file named `python-test.py` in your `/tmp` directory with the contents shown below:

 ```python
 from maga_transformer.pipeline import Pipeline
@@ -140,7 +130,9 @@ Now run this file:
 python /tmp/python-test.py
 ```

-If `rtp-llm` has built correctly on your machine, you will see the LLM model response for the prompt input. A snippet of the output is shown below:
+If `rtp-llm` has built correctly on your machine, you will see the LLM model response for the prompt input.
+
+A snippet of the output is shown below:

 ```output
 ['I am a large language model created by Alibaba Cloud. My name is Qwen.']
@@ -174,5 +166,7 @@ If `rtp-llm` has built correctly on your machine, you will see the LLM model res
 ```


-You have successfully run a LLM chatbot with Arm optimizations, all running on your Arm AArch64 CPU on your server. You can continue experimenting and trying out the model with different prompts.
+You have successfully run a LLM chatbot with Arm optimizations, running on an Arm AArch64 CPU on your server.
+
+You can continue to experiment with the chatbot by trying out different prompts on the model.

content/learning-paths/servers-and-cloud-computing/rtp-llm/rtp-llm-server.md

Lines changed: 31 additions & 19 deletions

@@ -5,25 +5,32 @@ weight: 4
 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
+## Setup

-You can use the `rtp-llm` server program and submit requests using an OpenAI-compatible API.
-This enables applications to be created which access the LLM multiple times without starting and stopping it. You can also access the server over the network to another machine hosting the LLM.
+You can now move on to using the `rtp-llm` server program and submitting requests using an OpenAI-compatible API.

-One additional software package is required for this section. Install `jq` on your computer using:
+This enables applications to be created which access the LLM multiple times without starting and stopping it.
+
+You can also access the server over the network to another machine hosting the LLM.
+
+One additional software package is required for this section.
+
+Install `jq` on your computer using the following commands:

 ```bash
 sudo apt install jq -y
 ```

-# Running the Server
-## Install Hugging Face Hub
+## Running the Server

-There are a few different ways you can download the Qwen2 0.5B model. In this Learning Path, you download the model from Hugging Face.
+There are a few different ways you can download the Qwen2 0.5B model. In this Learning Path, you will download the model from Hugging Face.

-[Hugging Face](https://huggingface.co/) is an open source AI community where you can host your own AI models, train them and collaborate with others in the community. You can browse through the thousands of models that are available for a variety of use cases like NLP, audio, and computer vision.
+[Hugging Face](https://huggingface.co/) is an open source AI community where you can host your own AI models, train them, and collaborate with others in the community. You can browse through thousands of models that are available for a variety of use cases such as Natural Language Processing (NLP), audio, and computer vision.

 The `huggingface_hub` library provides APIs and tools that let you easily download and fine-tune pre-trained models. You will use `huggingface-cli` to download the [Qwen2 0.5B model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct).

+## Install Hugging Face Hub
+
 Install the required Python packages:

 ```bash
@@ -51,14 +58,18 @@ You can now download the model using the huggingface cli:
 huggingface-cli download Qwen/Qwen2-0.5B-Instruct
 ```

-## Start rtp-llm server
-The server executable has already compiled during the stage detailed in the previous section, when you ran `bazelisk build`. Install the pip wheel in your active virtual environment:
+## Start the rtp-llm server
+
+{{% notice Note %}}
+The server executable compiled during the previous stage, when you ran `bazelisk build`. {{% /notice %}}
+
+Install the pip wheel in your active virtual environment:

 ```bash
 pip install bazel-bin/maga_transformer/maga_transformer-0.2.0-cp310-cp310-linux_aarch64.whl
 pip install grpcio-tools
 ```
-Start the server from the command line, it listens on port 8088:
+Start the server from the command line. It listens on port 8088:

 ```bash
 export CHECKPOINT_PATH=${HOME}/.cache/huggingface/hub/models--Qwen--Qwen2-0.5B-Instruct/snapshots/c540970f9e29518b1d8f06ab8b24cba66ad77b6d/
@@ -67,8 +78,9 @@ export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python
 MODEL_TYPE=qwen_2 FT_SERVER_TEST=1 python3 -m maga_transformer.start_server
 ```

-# Client
-## Use curl
+## Client
+
+### Using curl

 You can access the API using the `curl` command.

@@ -90,15 +102,15 @@ curl http://localhost:8088/v1/chat/completions -H "Content-Type: application/jso
 }' 2>/dev/null | jq -C
 ```

-The `model` value in the API is not used, you can enter any value. This is because there is only one model loaded in the server.
+The `model` value in the API is not used, and you can enter any value. This is because there is only one model loaded in the server.

 Run the script:

 ```bash
 bash ./curl-test.sh
 ```

-The `curl` command accesses the LLM and you see the output:
+The `curl` command accesses the LLM and you should see the output:

 ```output
 {
@@ -124,9 +136,9 @@ The `curl` command accesses the LLM and you see the output:
 }
 ```

-In the returned JSON data you see the LLM output, including the content created from the prompt.
+In the returned JSON data, you will see the LLM output, including the content created from the prompt.

-## Use Python
+### Using Python

 You can also use a Python program to access the OpenAI-compatible API.
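The full `python-test.py` client is elided from this diff, so for reference, here is a minimal sketch of such a client. It assumes the standard `openai` Python package (installed separately with `pip install openai`, not part of this commit) and the server defaults shown earlier: port 8088 and an arbitrary `model` value.

```python
# Sketch of an OpenAI-compatible client for the rtp-llm server; the exact
# python-test.py from this commit is elided, so this is an illustrative
# reconstruction, not the committed script.
from openai import OpenAI

# The API key is unused by the local server, but the client library
# requires a non-empty value.
client = OpenAI(base_url="http://localhost:8088/v1", api_key="unused")

completion = client.chat.completions.create(
    model="anything",  # only one model is loaded, so this value is ignored
    messages=[
        {"role": "user", "content": "Write a hello world program in C++."}
    ],
    stream=True,
)

# Stream tokens as they arrive.
for chunk in completion:
    print(chunk.choices[0].delta.content or "", end="")
```

The streaming loop at the end mirrors the fragment of `python-test.py` that appears in the hunk below.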

@@ -165,13 +177,13 @@ for chunk in completion:
 print(chunk.choices[0].delta.content or "", end="")
 ```

-Run the Python file (make sure the server is still running):
+Ensure that the server is still running, and then run the Python file:

 ```bash
 python ./python-test.py
 ```

-You see the output generated by the LLM:
+You should see the output generated by the LLM:

 ```output
 Sure, here's a simple C++ program that prints "Hello, World!" to the console:
@@ -187,4 +199,4 @@ int main() {
 This program includes the `iostream` library, which is used for input/output operations. The `main` function is the entry point of the program, and it calls the `cout` object to print the message "Hello, World!" to the console.
 ```

-You can continue to experiment with different large language models and write scripts to access them.
+Now you can continue to experiment with different large language models, and have a go at writing scripts to access them.
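As a starting point for such experiments, one way to fetch a different model is the `huggingface_hub` Python API rather than `huggingface-cli`. The sketch below uses `Qwen/Qwen2-1.5B-Instruct` purely as a hypothetical example; whether a larger model performs acceptably on a given instance, and whether it loads with the same `MODEL_TYPE=qwen_2` setting, is not verified by this commit.

```python
# Sketch: download an alternative Qwen model with the huggingface_hub API
# instead of huggingface-cli. Assumes `pip install huggingface_hub`; the
# model choice is a hypothetical example, untested with rtp-llm here.
from huggingface_hub import snapshot_download

# snapshot_download stores files under ~/.cache/huggingface/hub and returns
# the local snapshot directory, which can then be exported as
# CHECKPOINT_PATH before starting the rtp-llm server.
checkpoint_path = snapshot_download(repo_id="Qwen/Qwen2-1.5B-Instruct")
print(checkpoint_path)
```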
