Commit af49cac

Merge pull request #1570 from madeline-underwood/Google_RAG_Final

Google RAG Final_AP to approve

2 parents 7f6ae9a + 5c607ae commit af49cac

File tree

5 files changed (+21, -18 lines)

content/learning-paths/servers-and-cloud-computing/rag/_demo.md

Lines changed: 8 additions & 5 deletions
@@ -1,18 +1,21 @@
 ---
 title: Run a llama.cpp chatbot powered by Arm Kleidi technology
+weight: 2

 overview: |
-  This Arm learning path shows how to use a single c4a-highcpu-72 Google Axion instance -- powered by an Arm Neoverse CPU -- to build a simple "Token as a Service" RAG-enabled server, used below to provide a chatbot to serve a small number of concurrent users.
+  This Learning Path shows you how to use a c4a-highcpu-72 Google Axion instance powered by an Arm Neoverse CPU to build a simple Token-as-a-Service (TaaS) RAG-enabled server that you can then use to provide a chatbot to serve a small number of concurrent users.

-  This architecture would be suitable for businesses looking to deploy the latest Generative AI technologies with RAG capabilities using their existing CPU compute capacity and deployment pipelines. It enables semantic search over chunked documents using FAISS vector store. The demo uses the open source llama.cpp framework, which Arm has enhanced by contributing the latest Arm Kleidi technologies. Further optimizations are achieved by using the smaller 8 billion parameter Llama 3.1 model, which has been quantized to optimize memory usage.
+  This architecture is suitable for businesses looking to deploy the latest Generative AI technologies with RAG capabilities using their existing CPU compute capacity and deployment pipelines.
+
+  It enables semantic search over chunked documents using the FAISS vector store. The demo uses the open source llama.cpp framework, which Arm has enhanced with its own Kleidi technologies. Further optimizations are achieved by using the smaller 8 billion parameter Llama 3.1 model, which has been quantized to optimize memory usage.

-  Chat with the Llama-3.1-8B RAG-enabled LLM below to see the performance for yourself, then follow the learning path to build your own Generative AI service on Arm Neoverse.
+  Chat with the Llama-3.1-8B RAG-enabled LLM below to see the performance for yourself, and then follow the Learning Path to build your own Generative AI service on Arm Neoverse.

 demo_steps:
-  - Type & send a message to the chatbot.
+  - Type and send a message to the chatbot.
   - Receive the chatbot's reply, including references from RAG data.
-  - View stats showing how well Google Axion runs LLMs.
+  - View performance statistics demonstrating how well Google Axion runs LLMs.

 diagram: config-diagram-dark.png
 diagram_blowup: config-diagram.png
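The overview above mentions semantic search over chunked documents using a FAISS vector store. As a rough illustration of the retrieval step only (this is not code from the Learning Path; the chunks, toy 4-dimensional embeddings, and function names are invented), a nearest-neighbor lookup of the kind FAISS's `IndexFlatL2` performs can be sketched in plain Python:

```python
import math

# Toy stand-ins for real embeddings: in a RAG pipeline, document chunks
# are embedded by a sentence-embedding model and stored in a FAISS index.
# These vectors are invented purely for illustration.
chunks = {
    "Cortex-M0 is the smallest Arm processor.": [0.9, 0.1, 0.0, 0.0],
    "Cortex-M55 adds Helium vector extensions.": [0.1, 0.8, 0.3, 0.0],
    "llama.cpp runs LLM inference on CPUs.": [0.0, 0.2, 0.9, 0.3],
}

def l2(a, b):
    """Euclidean distance, the metric behind an exact L2 FAISS index."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def retrieve(query_vec, k=1):
    """Return the k chunk texts whose embeddings lie closest to the query."""
    ranked = sorted(chunks, key=lambda text: l2(chunks[text], query_vec))
    return ranked[:k]

# A query embedding that lies near the llama.cpp chunk.
print(retrieve([0.0, 0.1, 0.95, 0.2]))
```

The retrieved chunk texts are what get folded into the LLM prompt so the model can answer with document-grounded context.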

content/learning-paths/servers-and-cloud-computing/rag/backend.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 ---
 title: Deploy a RAG-based LLM backend server
-weight: 3
+weight: 4

 layout: learningpathall
 ---

content/learning-paths/servers-and-cloud-computing/rag/chatbot.md

Lines changed: 8 additions & 8 deletions
@@ -1,6 +1,6 @@
 ---
 title: The RAG Chatbot and its Performance
-weight: 5
+weight: 6

 layout: learningpathall
 ---
@@ -15,9 +15,9 @@ http://[your instance ip]:8501

 {{% notice Note %}}

-To access the links you may need to allow inbound TCP traffic in your instance's security rules. Always review these permissions with caution as they may introduce security vulnerabilities.
+To access the links you might need to allow inbound TCP traffic in your instance's security rules. Always review these permissions with caution as they might introduce security vulnerabilities.

-For an Axion instance, this can be done as follows from the gcloud cli:
+For an Axion instance, you can do this from the gcloud CLI:

 gcloud compute firewall-rules create allow-my-ip \
 --direction=INGRESS \
@@ -43,7 +43,7 @@ Follow these steps to create a new index:
 5. Enter a name for your vector index.
 6. Click the **Create Index** button.

-Upload the Cortex-M processor comparison document, which can be downloaded from [this website](https://developer.arm.com/documentation/102787/latest/).
+Upload the Cortex-M processor comparison document, which can be downloaded from [the Arm developer website](https://developer.arm.com/documentation/102787/latest/).

 You should see a confirmation message indicating that the vector index has been created successfully. Refer to the image below for guidance:

@@ -56,15 +56,15 @@ After creating the index, you can switch to the **Load Existing Store** option a
 Follow these steps:

 1. Switch to the **Load Existing Store** option in the sidebar.
-2. Select the index you created. It should be auto-selected if it's the only one available.
+2. Select the index you created. It should be auto-selected if it is the only one available.

-This will allow you to use the uploaded document for generating contextually-relevant responses. Refer to the image below for guidance:
+This allows you to use the uploaded document for generating contextually relevant responses. Refer to the image below for guidance:

 ![RAG_IMG2](rag_img2.png)

 ## Interact with the LLM

-You can now start asking various queries to the LLM using the prompt in the web application. The responses will be streamed both to the frontend and the backend server terminal.
+You can now start issuing various queries to the LLM using the prompt in the web application. The responses are streamed both to the frontend and the backend server terminal.

 Follow these steps:

@@ -73,7 +73,7 @@ Follow these steps:

 ![RAG_IMG3](rag_img3.png)

-While the response is streamed to the frontend for immediate viewing, you can monitor the performance metrics on the backend server terminal. This gives you insights into the processing speed and efficiency of the LLM.
+While the response is streamed to the frontend for immediate viewing, you can monitor the performance metrics on the backend server terminal. This provides insights into the processing speed and efficiency of the LLM.

 ![RAG_IMG4](rag_img4.png)
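The chatbot page above describes a frontend that streams answers from a llama.cpp-based backend. As a hedged sketch of how such a request could be assembled (the prompt template and helper function are invented for illustration; the `prompt`, `n_predict`, and `stream` fields follow llama.cpp's commonly used `/completion` server endpoint):

```python
import json

def build_rag_payload(question, retrieved_chunks, n_predict=256):
    """Fold retrieved document chunks into the prompt sent to the model,
    in the usual RAG pattern: context first, then the user's question.
    The prompt template here is an invented example, not the one used
    by this Learning Path's backend."""
    context = "\n".join(f"- {c}" for c in retrieved_chunks)
    prompt = (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return {"prompt": prompt, "n_predict": n_predict, "stream": True}

payload = build_rag_payload(
    "Which Cortex-M cores support Helium?",
    ["Cortex-M55 and Cortex-M85 implement the Helium vector extension."],
)
# A client would then stream the response, e.g. (assumed host/port):
# requests.post("http://localhost:8080/completion", json=payload, stream=True)
print(json.dumps(payload)[:60])
```

With `stream` enabled the server returns tokens incrementally, which is what lets the frontend display the answer while the backend terminal reports per-token performance metrics.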

content/learning-paths/servers-and-cloud-computing/rag/frontend.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 ---
 title: Deploy RAG-based LLM frontend server
-weight: 4
+weight: 5

 layout: learningpathall
 ---

content/learning-paths/servers-and-cloud-computing/rag/rag_llm.md

Lines changed: 3 additions & 3 deletions
@@ -2,15 +2,15 @@
 # User change
 title: "Set up a RAG based LLM Chatbot"

-weight: 2 # 1 is first, 2 is second, etc.
+weight: 3

 # Do not modify these elements
 layout: "learningpathall"
 ---

 ## Before you begin

-This learning path demonstrates how to build and deploy a Retrieval Augmented Generation (RAG) enabled chatbot using open-source Large Language Models (LLMs) optimized for Arm architecture. The chatbot processes documents, stores them in a vector database, and generates contextually-relevant responses by combining the LLM's capabilities with retrieved information. The instructions in this Learning Path have been designed for Arm servers running Ubuntu 22.04 LTS. You need an Arm server instance with at least 16 cores, 8GB of RAM, and a 32GB disk to run this example. The instructions have been tested on a GCP c4a-standard-64 instance.
+This Learning Path demonstrates how to build and deploy a Retrieval Augmented Generation (RAG) enabled chatbot using open-source Large Language Models (LLMs) optimized for Arm architecture. The chatbot processes documents, stores them in a vector database, and generates contextually relevant responses by combining the LLM's capabilities with retrieved information. The instructions in this Learning Path have been designed for Arm servers running Ubuntu 22.04 LTS. You need an Arm server instance with at least 16 cores, 8GB of RAM, and a 32GB disk to run this example. The instructions have been tested on a GCP c4a-standard-64 instance.

 ## Overview

@@ -100,7 +100,7 @@ Download the Hugging Face model:
 wget https://huggingface.co/chatpdflocal/llama3.1-8b-gguf/resolve/main/ggml-model-Q4_K_M.gguf
 ```

-## Build llama.cpp & Quantize the Model
+## Build llama.cpp and Quantize the Model

 Navigate to your home directory:
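The hunk above downloads a Q4_K_M quantized GGUF of Llama 3.1 8B, and the demo overview notes that quantization is used to optimize memory usage. A back-of-envelope estimate shows why (the ~4.5 bits-per-weight figure for Q4_K_M is an approximation, and the calculation ignores the KV cache and runtime buffers):

```python
def model_bytes(n_params, bits_per_weight):
    """Rough weight-storage estimate: parameter count times bits per
    weight, converted to bytes. Ignores KV cache and runtime buffers."""
    return n_params * bits_per_weight / 8

params = 8e9  # Llama 3.1 8B
fp16_gb = model_bytes(params, 16) / 1e9
q4_gb = model_bytes(params, 4.5) / 1e9  # Q4_K_M averages ~4.5 bits/weight (approximate)
print(f"FP16 weights: ~{fp16_gb:.1f} GB, Q4_K_M weights: ~{q4_gb:.1f} GB")
```

Roughly a 3-4x reduction in weight memory, which is what makes serving an 8B model comfortable on a CPU instance of this size.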
