content/learning-paths/servers-and-cloud-computing/rag/_demo.md (8 additions, 5 deletions)

@@ -1,18 +1,21 @@
 ---
 title: Run a llama.cpp chatbot powered by Arm Kleidi technology
+weight: 2
 
 overview: |
-  This Arm learning path shows how to use a single c4a-highcpu-72 Google Axion instance -- powered by an Arm Neoverse CPU -- to build a simple "Token as a Service" RAG-enabled server, used below to provide a chatbot to serve a small number of concurrent users.
+  This Learning Path shows you how to use a c4a-highcpu-72 Google Axion instance powered by an Arm Neoverse CPU to build a simple Token-as-a-Service (TaaS) RAG-enabled server that you can then use to provide a chatbot to serve a small number of concurrent users.
 
-  This architecture would be suitable for businesses looking to deploy the latest Generative AI technologies with RAG capabilities using their existing CPU compute capacity and deployment pipelines. It enables semantic search over chunked documents using FAISS vector store. The demo uses the open source llama.cpp framework, which Arm has enhanced by contributing the latest Arm Kleidi technologies. Further optimizations are achieved by using the smaller 8 billion parameter Llama 3.1 model, which has been quantized to optimize memory usage.
+  This architecture is suitable for businesses looking to deploy the latest Generative AI technologies with RAG capabilities using their existing CPU compute capacity and deployment pipelines.
+
+  It enables semantic search over chunked documents using the FAISS vector store. The demo uses the open source llama.cpp framework, which Arm has enhanced with its own Kleidi technologies. Further optimizations are achieved by using the smaller 8 billion parameter Llama 3.1 model, which has been quantized to optimize memory usage.
 
-  Chat with the Llama-3.1-8B RAG-enabled LLM below to see the performance for yourself, then follow the learning path to build your own Generative AI service on Arm Neoverse.
+  Chat with the Llama-3.1-8B RAG-enabled LLM below to see the performance for yourself, and then follow the Learning Path to build your own Generative AI service on Arm Neoverse.
 
 
 demo_steps:
-  - Type & send a message to the chatbot.
+  - Type and send a message to the chatbot.
   - Receive the chatbot's reply, including references from RAG data.
-  - View stats showing how well Google Axion runs LLMs.
+  - View performance statistics demonstrating how well Google Axion runs LLMs.
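The overview above mentions semantic search over chunked documents using a FAISS vector store. As an illustrative sketch of what that retrieval step does, here is the core idea in plain Python: normalize chunk vectors so an inner product equals cosine similarity, then return the top-k matches. This is a toy stand-in, not the Learning Path's code; the real setup uses embeddings from a sentence-embedding model and FAISS rather than hand-written loops.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def build_index(chunk_embeddings):
    # Normalize every chunk vector so a dot product equals cosine similarity.
    # (FAISS performs the equivalent inner-product search at scale.)
    return [normalize(v) for v in chunk_embeddings]

def search(index, query_embedding, k=3):
    q = normalize(query_embedding)
    scores = [sum(a * b for a, b in zip(v, q)) for v in index]
    top = sorted(range(len(index)), key=lambda i: -scores[i])[:k]
    return top, [scores[i] for i in top]

# Toy demo: five 4-dimensional "chunk embeddings".
chunks = [[1, 0, 0, 0], [0, 1, 0, 0], [0.9, 0.1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
index = build_index(chunks)
ids, scores = search(index, chunks[2], k=2)
print(ids)  # chunk 2 matches itself best; chunk 0 is its nearest neighbour
```

In the real pipeline, each retrieved chunk's text is then passed to the LLM as context for the answer.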
content/learning-paths/servers-and-cloud-computing/rag/chatbot.md (8 additions, 8 deletions)

@@ -1,6 +1,6 @@
 ---
 title: The RAG Chatbot and its Performance
-weight: 5
+weight: 6
 
 layout: learningpathall
 ---
@@ -15,9 +15,9 @@ http://[your instance ip]:8501
 
 {{% notice Note %}}
 
-To access the links you may need to allow inbound TCP traffic in your instance's security rules. Always review these permissions with caution as they may introduce security vulnerabilities.
+To access the links you might need to allow inbound TCP traffic in your instance's security rules. Always review these permissions with caution as they might introduce security vulnerabilities.
 
-For an Axion instance, this can be done as follows from the gcloud cli:
+For an Axion instance, you can do this from the gcloud cli:
@@ -43,7 +43,7 @@ Follow these steps to create a new index:
 5. Enter a name for your vector index.
 6. Click the **Create Index** button.
 
-Upload the Cortex-M processor comparison document, which can be downloaded from [this website](https://developer.arm.com/documentation/102787/latest/).
+Upload the Cortex-M processor comparison document, which can be downloaded from [the Arm developer website](https://developer.arm.com/documentation/102787/latest/).
 
 You should see a confirmation message indicating that the vector index has been created successfully. Refer to the image below for guidance:
 
@@ -56,15 +56,15 @@ After creating the index, you can switch to the **Load Existing Store** option a
 Follow these steps:
 
 1. Switch to the **Load Existing Store** option in the sidebar.
-2. Select the index you created. It should be auto-selected if it's the only one available.
+2. Select the index you created. It should be auto-selected if it is the only one available.
 
-This will allow you to use the uploaded document for generating contextually-relevant responses. Refer to the image below for guidance:
+This allows you to use the uploaded document for generating contextually-relevant responses. Refer to the image below for guidance:
 
 
 
 ## Interact with the LLM
 
-You can now start asking various queries to the LLM using the prompt in the web application. The responses will be streamed both to the frontend and the backend server terminal.
+You can now start issuing various queries to the LLM using the prompt in the web application. The responses will be streamed both to the frontend and the backend server terminal.
 
 Follow these steps:
 
@@ -73,7 +73,7 @@ Follow these steps:
 
 
 
-While the response is streamed to the frontend for immediate viewing, you can monitor the performance metrics on the backend server terminal. This gives you insights into the processing speed and efficiency of the LLM.
+While the response is streamed to the frontend for immediate viewing, you can monitor the performance metrics on the backend server terminal. This provides insights into the processing speed and efficiency of the LLM.
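The chatbot.md change above rewords the line "For an Axion instance, you can do this from the gcloud cli:", but the command itself falls outside the diff hunks shown. For reference, a typical firewall rule opening the Streamlit port 8501 mentioned earlier in that file might look like the following; the rule name and target tag are placeholders, not the Learning Path's actual values:

```bash
# Allow inbound TCP on the Streamlit port (8501) to instances tagged "rag-server".
# Placeholder name and tag; restrict --source-ranges more tightly than
# 0.0.0.0/0 for anything beyond a short-lived demo.
gcloud compute firewall-rules create allow-streamlit-8501 \
    --direction=INGRESS \
    --allow=tcp:8501 \
    --source-ranges=0.0.0.0/0 \
    --target-tags=rag-server
```

As the note in the file warns, review such rules carefully before opening ports to the internet.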
content/learning-paths/servers-and-cloud-computing/rag/rag_llm.md (3 additions, 3 deletions)

@@ -2,15 +2,15 @@
 # User change
 title: "Set up a RAG based LLM Chatbot"
 
-weight: 2 # 1 is first, 2 is second, etc.
+weight: 3
 
 # Do not modify these elements
 layout: "learningpathall"
 ---
 
 ## Before you begin
 
-This learning path demonstrates how to build and deploy a Retrieval Augmented Generation (RAG) enabled chatbot using open-source Large Language Models (LLMs) optimized for Arm architecture. The chatbot processes documents, stores them in a vector database, and generates contextually-relevant responses by combining the LLM's capabilities with retrieved information. The instructions in this Learning Path have been designed for Arm servers running Ubuntu 22.04 LTS. You need an Arm server instance with at least 16 cores, 8GB of RAM, and a 32GB disk to run this example. The instructions have been tested on a GCP c4a-standard-64 instance.
+This Learning Path demonstrates how to build and deploy a Retrieval Augmented Generation (RAG) enabled chatbot using open-source Large Language Models (LLMs) optimized for Arm architecture. The chatbot processes documents, stores them in a vector database, and generates contextually-relevant responses by combining the LLM's capabilities with retrieved information. The instructions in this Learning Path have been designed for Arm servers running Ubuntu 22.04 LTS. You need an Arm server instance with at least 16 cores, 8GB of RAM, and a 32GB disk to run this example. The instructions have been tested on a GCP c4a-standard-64 instance.
 
 ## Overview
 
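The rag_llm.md introduction describes the chatbot combining the LLM's capabilities with retrieved information. The retrieve-then-prompt flow can be sketched roughly as follows; the function names, toy word-overlap retriever, and prompt template are illustrative only, since the Learning Path's real code uses FAISS embeddings for retrieval and llama.cpp for generation.

```python
def retrieve(query: str, store: dict[str, str], k: int = 2) -> list[str]:
    # Toy retriever: rank stored chunks by word overlap with the query.
    # (The real pipeline ranks by embedding similarity in FAISS instead.)
    def overlap(chunk: str) -> int:
        return len(set(query.lower().split()) & set(chunk.lower().split()))
    return sorted(store.values(), key=overlap, reverse=True)[:k]

def build_prompt(query: str, chunks: list[str]) -> str:
    # Inline the retrieved chunks as context ahead of the user's question.
    context = "\n".join(f"- {c}" for c in chunks)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

store = {
    "1": "Cortex-M0 is the smallest Arm processor.",
    "2": "FAISS stores dense vectors for similarity search.",
}
query = "Which is the smallest Arm processor?"
prompt = build_prompt(query, retrieve(query, store))
print("Cortex-M0" in prompt)  # the relevant chunk was retrieved into the prompt
```

The resulting prompt is what would be sent to the llama.cpp server, which streams the completion back to the frontend.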
@@ -100,7 +100,7 @@ Download the Hugging Face model: