Commit 379ed09

Merge branch 'ArmDeveloperEcosystem:main' into main
2 parents c5a24a2 + 631edaf commit 379ed09

12 files changed

+634
-0
lines changed

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
---
title: Deploy a RAG-based Chatbot with llama-cpp-python using KleidiAI on Arm Servers

minutes_to_complete: 45

who_is_this_for: This Learning Path is for software developers, ML engineers, and those looking to deploy production-ready LLM chatbots with RAG capabilities, knowledge base integration, and performance optimization for Arm Architecture.

learning_objectives:
    - Set up llama-cpp-python optimized for Arm servers.
    - Implement RAG architecture using the FAISS vector database.
    - Optimize model performance through 4-bit quantization.
    - Build a web interface for document upload and chat.
    - Monitor and analyze inference performance metrics.

prerequisites:
    - Basic understanding of Python and ML concepts.
    - Familiarity with REST APIs and web services.
    - Basic knowledge of vector databases.
    - Understanding of LLM fundamentals.

author_primary: Nobel Chowdary Mandepudi

### Tags
skilllevels: Advanced
armips:
    - Neoverse
subjects: LLM
operatingsystems:
    - Linux
tools_software_languages:
    - Python
    - Streamlit

### FIXED, DO NOT MODIFY
# ================================================================================
weight: 1                       # _index.md always has weight of 1 to order correctly
layout: "learningpathall"       # All files under learning paths have this same wrapper
learning_path_main_page: "yes"  # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
---
Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
---
review:
    - questions:
        question: >
            What is the primary purpose of using RAG in an LLM chatbot?
        answers:
            - To reduce the size of the model.
            - To enhance the chatbot's responses with contextually relevant information.
            - To increase the training speed of the model.
            - To simplify the deployment process.
        correct_answer: 2
        explanation: >
            RAG (Retrieval-Augmented Generation) enhances the chatbot's responses by retrieving and incorporating contextually relevant information from a vector database.

    - questions:
        question: >
            Which framework is used to create the web interface for the RAG-based LLM server?
        answers:
            - Django.
            - Flask.
            - Streamlit.
            - FastAPI.
        correct_answer: 3
        explanation: >
            Streamlit is used to create the web interface for the RAG-based LLM server, allowing users to interact with the backend.

    - questions:
        question: >
            What is the role of FAISS in the RAG-based LLM server?
        answers:
            - To train the LLM.
            - To store and retrieve vectorized documents.
            - To handle HTTP requests.
            - To manage user authentication.
        correct_answer: 2
        explanation: >
            FAISS is used to store and retrieve vectorized documents, enabling the RAG-based LLM server to provide contextually relevant responses.

# ================================================================================
#       FIXED, DO NOT MODIFY
# ================================================================================
title: "Review"                 # Always the same title
weight: 6                       # Set to always be larger than the content in this path
layout: "learningpathall"       # All files under learning paths have this same wrapper
---
Lines changed: 196 additions & 0 deletions
@@ -0,0 +1,196 @@
---
title: Deploy a RAG-based LLM backend server
weight: 3

layout: learningpathall
---

## Backend Script for RAG-based LLM Server
Once the virtual environment is activated, create a `backend.py` script with the following content. This script integrates the LLM with the FAISS vector database for RAG:

```python
import os
import time
import logging
from flask import Flask, request, jsonify
from flask_cors import CORS
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import LlamaCpp
from langchain_core.callbacks import StreamingStdOutCallbackHandler
from langchain_core.prompts import PromptTemplate
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain_text_splitters import HTMLHeaderTextSplitter, CharacterTextSplitter
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import ConfigurableField, RunnablePassthrough

# Configure logging
logging.getLogger('watchdog').setLevel(logging.ERROR)
logger = logging.getLogger(__name__)

# Initialize Flask app
app = Flask(__name__)
CORS(app)

# Configure paths
BASE_PATH = "/home/ubuntu"
TEMP_DIR = os.path.join(BASE_PATH, "temp")
VECTOR_DIR = os.path.join(BASE_PATH, "vector")
MODEL_PATH = os.path.join(BASE_PATH, "models/llama3.1-8b-instruct.Q4_0_arm.gguf")

# Ensure directories exist
os.makedirs(TEMP_DIR, exist_ok=True)
os.makedirs(VECTOR_DIR, exist_ok=True)

# Token Streaming
class StreamingCallback(StreamingStdOutCallbackHandler):
    def __init__(self):
        super().__init__()
        self.tokens = []
        self.start_time = None

    def on_llm_start(self, *args, **kwargs):
        self.start_time = time.time()
        self.tokens = []
        print("\nLLM Started generating response...", flush=True)

    def on_llm_new_token(self, token: str, **kwargs):
        self.tokens.append(token)
        print(token, end="", flush=True)

    def on_llm_end(self, *args, **kwargs):
        end_time = time.time()
        duration = end_time - self.start_time
        print(f"\nLLM finished generating response in {duration:.2f} seconds", flush=True)

def format_docs(docs):
    # Join the retrieved chunks into a single context string
    return "\n\n".join(doc.page_content for doc in docs).replace("Context:", "").strip()

# Vectordb creating API
@app.route('/create_vectordb', methods=['POST'])
def create_vectordb():
    try:
        data = request.json
        vector_name = data['vector_name']
        chunk_size = int(data['chunk_size'])
        doc_type = data['doc_type']
        vector_path = os.path.join(VECTOR_DIR, vector_name)

        # Process document
        chunk_overlap = 30
        if doc_type == "PDF":
            loader = DirectoryLoader(TEMP_DIR, glob='*.pdf', loader_cls=PyPDFLoader)
            docs = loader.load()
        elif doc_type == "HTML":
            url = data['url']
            splitter = HTMLHeaderTextSplitter([
                ("h1", "Header 1"), ("h2", "Header 2"),
                ("h3", "Header 3"), ("h4", "Header 4")
            ])
            docs = splitter.split_text_from_url(url)
        else:
            return jsonify({"error": "Unsupported document type"}), 400

        # Create vectorstore
        text_splitter = CharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap
        )
        split_docs = text_splitter.split_documents(docs)
        embedding = HuggingFaceEmbeddings(model_name="thenlper/gte-base")
        vectorstore = FAISS.from_documents(documents=split_docs, embedding=embedding)
        vectorstore.save_local(vector_path)

        return jsonify({"status": "success", "path": vector_path})
    except Exception as e:
        logger.exception("Error creating vector database")
        return jsonify({"error": str(e)}), 500

# Query API
@app.route('/query', methods=['POST'])
def query():
    try:
        data = request.json
        question = data['question']
        vector_path = data.get('vector_path')
        use_vectordb = data.get('use_vectordb', False)

        # Initialize LLM
        callbacks = [StreamingCallback()]
        model = LlamaCpp(
            model_path=MODEL_PATH,
            temperature=0.1,
            max_tokens=1024,
            n_batch=2048,
            callbacks=callbacks,
            n_ctx=10000,
            n_threads=64,
            n_threads_batch=64
        )

        # Create chain
        if use_vectordb and vector_path:
            embedding = HuggingFaceEmbeddings(model_name="thenlper/gte-base")
            vectorstore = FAISS.load_local(vector_path, embedding, allow_dangerous_deserialization=True)
            # Retrieve the 5 most similar chunks for each question
            retriever = vectorstore.as_retriever().configurable_fields(
                search_kwargs=ConfigurableField(id="search_kwargs")
            ).with_config({"search_kwargs": {"k": 5}})

            template = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful assistant. Use the following context to answer the question.
Context: {context}
Question: {question}
Answer: <|eot_id|>"""

            prompt = PromptTemplate(template=template, input_variables=["context", "question"])
            chain = (
                {"context": retriever | format_docs, "question": RunnablePassthrough()}
                | prompt
                | model
                | StrOutputParser()
            )
        else:
            template = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
Question: {question}
Answer: <|eot_id|>"""

            prompt = PromptTemplate(template=template, input_variables=["question"])
            # Map the raw question string to the prompt's "question" variable.
            # RunnablePassthrough.assign() only accepts dict input, so it cannot
            # be used here because the chain is invoked with a plain string.
            chain = {"question": RunnablePassthrough()} | prompt | model | StrOutputParser()

        # Generate response
        response = chain.invoke(question)
        return jsonify({"answer": response})
    except Exception as e:
        logger.exception("Error processing query")
        return jsonify({"error": str(e)}), 500

# File Upload API
@app.route('/upload_file', methods=['POST'])
def upload_file():
    try:
        file = request.files['file']
        if file and file.filename.endswith('.pdf'):
            filename = os.path.join(TEMP_DIR, "uploaded.pdf")
            file.save(filename)
            return jsonify({"status": "success", "path": filename})
        return jsonify({"error": "Invalid file"}), 400
    except Exception as e:
        logger.exception("Error uploading file")
        return jsonify({"error": str(e)}), 500

if __name__ == '__main__':
    # debug=True is convenient during development; turn it off on any
    # server that is reachable from untrusted networks.
    app.run(host='0.0.0.0', port=5000, debug=True)
```

## Run the Backend Server

You are now ready to run the backend server for the RAG Chatbot.
Use the following command in a terminal to start the backend server:

```bash
python3 backend.py
```

You should see output similar to the image below when the backend server starts successfully:
![backend](backend_output.png)
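
Once the server is running, one way to confirm that it responds is to post a question to the `/query` route defined in `backend.py`. The following is a minimal sketch using only the Python standard library. It assumes the server is reachable at `http://localhost:5000` (the port passed to `app.run`), and the helper names `build_query_request` and `ask` are illustrative, not part of this Learning Path.

```python
import json
import urllib.request

BACKEND_URL = "http://localhost:5000"  # assumption: client runs on the same machine as the server


def build_query_request(question, vector_path=None, base_url=BACKEND_URL):
    """Build a POST request carrying the JSON fields that the /query route reads."""
    payload = {
        "question": question,
        "use_vectordb": vector_path is not None,
        "vector_path": vector_path,
    }
    return urllib.request.Request(
        url=f"{base_url}/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question, vector_path=None, timeout=600):
    """Send the question and return the "answer" field of the JSON response."""
    req = build_query_request(question, vector_path)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["answer"]
```

Calling `ask("What is Arm Neoverse?")` sends a plain LLM query. Passing a `vector_path` that was returned earlier by `/create_vectordb` switches the request to the RAG branch. The long timeout is a guess to allow for model loading on the first request.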
Lines changed: 76 additions & 0 deletions
@@ -0,0 +1,76 @@
---
title: The RAG Chatbot and its Performance
weight: 5

layout: learningpathall
---

## Access the Web Application

Open the web application in your browser using either the local URL or the external URL. The external address shown here is an example; substitute your own server's public IP address:

```bash
http://localhost:8501 or http://75.101.253.177:8501
```

## Upload a PDF File and Create a New Index

Now you can upload a PDF file in the web browser by selecting the **Create New Store** option.

Follow these steps to create a new index:

1. Open the web browser and navigate to the Streamlit frontend.
2. In the sidebar, select **Create New Store** under the **Vector Database** section.
3. Leave the source type set to **PDF**, which is selected by default.
4. Upload your PDF file using the file uploader.
5. Enter a name for your vector index.
6. Click the **Create Index** button.

Upload the Cortex-M processor comparison document, which can be downloaded from [this website](https://developer.arm.com/documentation/102787/latest/).

You should see a confirmation message indicating that the vector index has been created successfully. Refer to the image below for guidance:

![RAG_IMG1](rag_img1.png)
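
When the index is created, the backend splits the document into chunks before embedding them, using the `chunk_size` value from the request and a fixed overlap of 30 characters. The sketch below illustrates the idea of overlapping chunks with a plain sliding window. It is a simplification for illustration only, not the algorithm used by LangChain's `CharacterTextSplitter` (which splits on a separator first), and `sliding_chunks` is a hypothetical helper.

```python
def sliding_chunks(text, chunk_size, chunk_overlap=30):
    """Split text into fixed-size windows whose neighbours share chunk_overlap characters."""
    if chunk_size <= chunk_overlap:
        raise ValueError("chunk_size must be larger than chunk_overlap")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# 100 characters of repeating alphabet as stand-in document text
sample = "".join(chr(ord("a") + i % 26) for i in range(100))
chunks = sliding_chunks(sample, chunk_size=50, chunk_overlap=30)
```

The overlap means the end of one chunk is repeated at the start of the next, so a sentence that straddles a chunk boundary still appears whole in at least one chunk.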

## Load Existing Store

After creating the index, switch to the **Load Existing Store** option and select the index you created. Initially, it will be the only available index and will be auto-selected.

Follow these steps:

1. Switch to the **Load Existing Store** option in the sidebar.
2. Select the index you created. It should be auto-selected if it's the only one available.

This allows the chatbot to use the uploaded document to generate contextually relevant responses. Refer to the image below for guidance:

![RAG_IMG2](rag_img2.png)
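
With a store loaded, the backend retrieves the five chunks most similar to the question (`k` is set to 5 in `backend.py`) and passes them to the LLM as context. As a rough illustration of what "most similar" means, the sketch below ranks a few toy vectors in plain Python. The vectors, the `top_k` helper, and the use of cosine similarity are all assumptions made for illustration; FAISS uses its own optimized index structures, and the index built by LangChain may use a different metric such as Euclidean distance.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=5):
    """Return the indices of the k chunks most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


# Toy 2-D "embeddings"; real embedding vectors have hundreds of dimensions.
chunk_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
query_vec = [0.9, 0.1]
best = top_k(query_vec, chunk_vecs, k=2)
```

Here the query points mostly along the first axis, so the chunk at index 0 ranks first and the diagonal chunk at index 2 ranks second.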

## Interact with the LLM

You can now send queries to the LLM using the prompt field in the web application. The responses are streamed to both the frontend and the backend server terminal.

Follow these steps:

1. Enter your query in the prompt field of the web application.
2. Submit the query to receive a response from the LLM.

![RAG_IMG3](rag_img3.png)

While the response is streamed to the frontend for immediate viewing, you can monitor the performance metrics in the backend server terminal. This gives you insight into the processing speed of the LLM.

![RAG_IMG4](rag_img4.png)

## Observe Performance Metrics

As shown in the image above, the RAG LLM Chatbot completed generation in 4.65 seconds, processing and generating a total of `1183` tokens.

This demonstrates the efficiency and speed of the RAG LLM Chatbot in handling queries and generating responses.
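
To put these figures in perspective, the combined throughput can be estimated by dividing the token count by the elapsed time. This is simple arithmetic on the two numbers quoted above. It mixes prompt processing with generation, so treat it as a rough indicator rather than a benchmark. The function name is illustrative.

```python
def tokens_per_second(total_tokens, seconds):
    """Average throughput over a run; rejects non-positive durations."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return total_tokens / seconds


rate = tokens_per_second(1183, 4.65)
print(f"{rate:.1f} tokens/second")  # prints "254.4 tokens/second"
```

You can apply the same calculation to the numbers printed in your own backend terminal to compare runs with different prompts or documents.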

## Further Interaction and Custom Applications

You can continue to ask follow-up prompts and observe the performance metrics in the backend terminal.

This setup demonstrates how you can build applications that connect an LLM backend to RAG for custom text generation from specific documents. This Learning Path serves as a guide and example of RAG-based LLM inference on Arm CPUs, highlighting the performance gains from optimization.
