Commit 23fbfa8: add rag guide

docs/guides/python/llama-rag.mdx

---
description: 'Transform your LLMs with Retrieval Augmented Generation'
tags:
  - API
  - AI & Machine Learning
languages:
  - python
published_at: 2024-11-16
updated_at: 2024-11-16
---

# Using Retrieval Augmented Generation to enhance your LLMs

This guide shows how to use Retrieval Augmented Generation (RAG) to enhance a large language model (LLM). RAG is the process of enabling an LLM to reference context outside of its initial training data before generating its response. Training a model that is useful for your own domain-specific purposes can be extremely expensive in both time and computing power, so RAG is a cost-effective alternative for extending the capabilities of an existing LLM.

## Prerequisites

- [uv](https://docs.astral.sh/uv/#getting-started) - for Python dependency management
- The [Nitric CLI](/get-started/installation)
- _(optional)_ An [AWS](https://aws.amazon.com) account

## Getting started

We'll start by creating a new project using Nitric's python starter template.

<Note>
If you want to take a look at the finished code, it can be found
[here](https://github.com/nitrictech/examples/tree/main/v1/llama-rag).
</Note>

```bash
nitric new llama-rag py-starter
cd llama-rag
```

Next, let's install our base dependencies, then add the `llama-index` libraries. We'll be using [Llama Index](https://docs.llamaindex.ai/en/stable/) as it makes building RAG applications extremely simple and has support for running our own local Llama 3.2 models.

```bash
# Install the base dependencies
uv sync
# Add Llama Index dependencies
uv add llama-index llama-index-embeddings-huggingface llama-index-llms-llama-cpp
```

We'll organize our project structure like so:

```text
+--common/
|  +-- __init__.py
|  +-- model_parameters.py
+--model/
|  +-- Llama-3.2-1B-Instruct-Q4_K_M.gguf
+--services/
|  +-- api.py
+--.gitignore
+--.python-version
+-- build_query_engine.py
+-- pyproject.toml
+-- python.dockerfile
+-- python.dockerignore
+-- nitric.yaml
+-- README.md
```

## Setting up our LLM

Before we start writing any code for our LLM, we'll want to download the model into our project. For this project we'll be using Llama 3.2 1B Instruct with a Q4_K_M quant.

```bash
mkdir model
cd model
curl -OL https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_K_M.gguf
cd ..
```

Now that we have our model, we can load it into our code. We'll also define our [embed model](https://docs.llamaindex.ai/en/stable/module_guides/models/embeddings/) - for vectorising our documentation - using a recommended [embed model](https://huggingface.co/BAAI/bge-large-en-v1.5) from Hugging Face. At this point we can also create a prompt template for queries made against our query engine. It helps curb hallucinations, so that if the model does not know an answer it won't pretend that it does.

```python title:common/model_parameters.py
from llama_index.core import ChatPromptTemplate
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.llama_cpp import LlamaCPP


# Load the locally stored Llama model
llm = LlamaCPP(
    model_url=None,
    model_path="./model/Llama-3.2-1B-Instruct-Q4_K_M.gguf",
    temperature=0.7,
    verbose=False,
)

# Load the embed model from Hugging Face
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-large-en-v1.5", trust_remote_code=True)

# Set the location where we will persist our embeds
persist_dir = "query_engine_vectors"

# Create the prompt template used to reduce hallucinations
text_qa_template = ChatPromptTemplate.from_messages([
    (
        "system",
        "If the context is not useful, respond with 'I'm not sure'.",
    ),
    (
        "user",
        (
            "Context information is below.\n"
            "---------------------\n"
            "{context_str}\n"
            "---------------------\n"
            "Given the context information and not prior knowledge "
            "answer the question: {query_str}\n"
        )
    ),
])
```
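
If you want to see exactly what the model will receive, you can render the template yourself. This is just a quick illustrative snippet (not one of the project files), assuming Llama Index's `ChatPromptTemplate.format_messages` is used to fill in the template variables:

```python
from common.model_parameters import text_qa_template

# Render the template with sample values to preview the final chat messages
messages = text_qa_template.format_messages(
    context_str="Nitric is a framework for building cloud applications.",
    query_str="What is Nitric?",
)

for message in messages:
    print(f"{message.role}: {message.content}")
```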

## Building a Query Engine

The next step is to embed our context into a searchable index for the LLM. For this example we'll embed the Nitric documentation so it can be searched using the LLM. It's open-source on [GitHub](https://github.com/nitrictech/docs), so we can clone it into our project.

```bash
git clone https://github.com/nitrictech/docs.git nitric-docs
```

We can then create our embeddings and store them locally.

```python title:build_query_engine.py
from common.model_parameters import llm, embed_model, persist_dir

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, Settings


# Set global settings for Llama Index
Settings.llm = llm
Settings.embed_model = embed_model

# Load data from the documents directory
loader = SimpleDirectoryReader(
    # The location of the documents you want to embed
    input_dir="./nitric-docs/",
    # Set the extension to match the format your documents are in
    required_exts=[".mdx"],
    # Search through documents recursively
    recursive=True
)
docs = loader.load_data()

# Embed the docs into a vector index
index = VectorStoreIndex.from_documents(docs, show_progress=True)

# Save the query engine index to the local machine
index.storage_context.persist(persist_dir)
```

You can then run this using the following command. It should output the embeds into your `persist_dir`.

```bash
uv run build_query_engine.py
```
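
Before wiring the model into an API, it can be worth a quick local check that the persisted index loads and answers queries. The script below is an optional, hypothetical helper (e.g. `check_query_engine.py`, not part of the project structure above) that reuses the same pieces the API will use in the next section:

```python
from common.model_parameters import llm, embed_model, text_qa_template, persist_dir

from llama_index.core import StorageContext, load_index_from_storage, Settings

# Use the same local LLM and embed model as the rest of the project
Settings.llm = llm
Settings.embed_model = embed_model

# Reload the index that build_query_engine.py persisted to disk
storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
index = load_index_from_storage(storage_context)

# Ask a test question against the embedded docs
query_engine = index.as_query_engine(similarity_top_k=4, text_qa_template=text_qa_template)
print(query_engine.query("What resources does Nitric provide?"))
```

You could run this with `uv run check_query_engine.py` and confirm the answer draws on the embedded docs.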

## Creating an API for querying our model

With our LLM ready for querying, we can create an API to handle prompts.

```python title:services/api.py
import os

from common.model_parameters import embed_model, llm, text_qa_template, persist_dir

from nitric.resources import api
from nitric.context import HttpContext
from nitric.application import Nitric
from llama_index.core import StorageContext, load_index_from_storage, Settings

# Set global settings for Llama Index
Settings.llm = llm
Settings.embed_model = embed_model

main_api = api("main")

@main_api.post("/prompt")
async def query_model(ctx: HttpContext):
    # Pull the data from the request body
    query = str(ctx.req.data)

    print(f"Querying model: \"{query}\"")

    # Get the model from the stored local context
    if os.path.exists(persist_dir):
        storage_context = StorageContext.from_defaults(persist_dir=persist_dir)

        index = load_index_from_storage(storage_context)

        # Get the query engine from the index, and use the prompt template for sanitisation
        query_engine = index.as_query_engine(streaming=False, similarity_top_k=4, text_qa_template=text_qa_template)
    else:
        print("model does not exist")
        ctx.res.success = False
        return ctx

    # Query the model
    response = query_engine.query(query)

    ctx.res.body = f"{response}"

    print(f"Response: \n{response}")

    return ctx

Nitric.run()
```

## Test it locally

Now that we have an API defined, we can test it locally. Run `nitric start`, then make a request to the API either through the [Nitric Dashboard](/get-started/foundations/projects/local-development#local-dashboard) or another HTTP client like cURL.

```bash
curl -X POST http://localhost:4001/prompt -d "What is Nitric?"
```

This should produce an output similar to:

```text
Nitric is a cloud-agnostic framework designed to aid developers in building full cloud applications, including infrastructure. It is a declarative cloud framework with common resources like APIs, websockets, databases, queues, topics, buckets, and more. The framework provides tools for locally simulating a cloud environment, to allow an application to be tested locally, and it makes it possible to interact with resources at runtime. It is a lightweight and flexible framework that allows developers to structure their projects according to their preferences and needs. Nitric is not a replacement for IaC tools like Terraform but rather introduces a method of bringing developer self-service for infrastructure directly into the developer application. Nitric can be augmented through use of tools like Pulumi or Terraform and even be fully customized using such tools. The framework supports multiple programming languages, and its default deployment engines are built with Pulumi. Nitric provides tools for defining services in your project's `nitric.yaml` file, and each service can be run independently, allowing your app to scale and manage different workloads efficiently. Services are the heart of Nitric apps, they're the entrypoints to your code. They can serve as APIs, websockets, schedule handlers, subscribers and a lot more.
```
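
If you'd rather script your test requests than use cURL, a small client using only the Python standard library could look something like this (a hypothetical helper, assuming the local API is listening on port 4001 as above):

```python
import urllib.request

def prompt(question: str, base_url: str = "http://localhost:4001") -> str:
    # POST the raw question text to the /prompt endpoint, mirroring the cURL example
    req = urllib.request.Request(
        f"{base_url}/prompt",
        data=question.encode("utf-8"),
        method="POST",
    )
    with urllib.request.urlopen(req) as res:
        return res.read().decode("utf-8")

if __name__ == "__main__":
    print(prompt("What is Nitric?"))
```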

## Get ready for deployment

Now that it's tested locally, we can get our project ready for containerization. The default Python dockerfile uses `python3.11-bookworm-slim` as its base container image, which doesn't have the dependencies needed to load the Llama model. All we need to do is update the Dockerfile to use `python3.11-bookworm` (the non-slim version) instead.

Update line 2:

```dockerfile title:python.dockerfile
# !diff -
FROM ghcr.io/astral-sh/uv:python3.11-bookworm-slim AS builder
# !diff +
FROM ghcr.io/astral-sh/uv:python3.11-bookworm AS builder
```

And line 18:

```dockerfile title:python.dockerfile
# !diff -
FROM python:3.11-slim-bookworm
# !diff +
FROM python:3.11-bookworm
```

When you're ready to deploy the project, we can create a new Nitric stack file that will target AWS:

```bash
nitric stack new dev aws
```

Update the stack file `nitric.dev.yaml` with the appropriate AWS region and memory allocation to handle the model:

```yaml title:nitric.dev.yaml
provider: nitric/[email protected]
region: us-east-1
config:
  # How services will be deployed by default, if you have other services not running models
  # you can add them here too so they don't use the same configuration
  default:
    lambda:
      # Set the memory to 6GB to handle the model, this automatically sets additional CPU allocation
      memory: 6144
      # Set a timeout of 30 seconds (this is the most API Gateway will wait for a response)
      timeout: 30
      # We add more storage to the lambda function, so it can store the model
      ephemeral-storage: 1024
```

We can then deploy using the following command:

```bash
nitric up
```

Testing on AWS works the same as it did locally; we'll just use cURL to make a request to the API URL that was output at the end of the deployment.

```bash
curl -X POST {your AWS endpoint URL here}/prompt -d "What is Nitric?"
```

Once you're finished querying the model, you can destroy the deployment using `nitric down`.

## Summary

In this project we've successfully augmented an LLM using Retrieval Augmented Generation (RAG) techniques with Llama Index and Nitric. You can modify this project to use any LLM, change the prompt template to be more specific in its responses, or swap in your own context to suit your requirements. We could also extend this project to maintain context between requests using WebSockets, for more of a chat-like experience with the model.
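
As a rough illustration of that WebSocket idea, the sketch below keeps one Llama Index chat engine per connection so each client gets its own conversation memory. It's only a starting point under a few assumptions: the exact shape of Nitric's Python `websocket` resource (the `on("connect")`/`on("message")` handlers and `send` method) and of `index.as_chat_engine` should be confirmed against the Nitric and Llama Index docs before relying on it.

```python
from common.model_parameters import embed_model, llm, persist_dir

from nitric.resources import websocket
from nitric.application import Nitric
from llama_index.core import StorageContext, load_index_from_storage, Settings

Settings.llm = llm
Settings.embed_model = embed_model

# Load the same persisted index used by the HTTP API
index = load_index_from_storage(StorageContext.from_defaults(persist_dir=persist_dir))

# Assumed Nitric websocket resource; check the Nitric docs for the current API
socket = websocket("chat")

# One chat engine (with its own conversation memory) per websocket connection
sessions = {}

@socket.on("connect")
async def on_connect(ctx):
    sessions[ctx.req.connection_id] = index.as_chat_engine(chat_mode="context")
    return ctx

@socket.on("disconnect")
async def on_disconnect(ctx):
    sessions.pop(ctx.req.connection_id, None)
    return ctx

@socket.on("message")
async def on_message(ctx):
    engine = sessions.get(ctx.req.connection_id)
    if engine is not None:
        # Message payloads may arrive as bytes; normalise to text before chatting
        data = ctx.req.data
        message = data.decode("utf-8") if isinstance(data, bytes) else str(data)
        response = engine.chat(message)
        # Assumed send signature; confirm against the Nitric Python SDK
        await socket.send(ctx.req.connection_id, f"{response}".encode("utf-8"))
    return ctx

Nitric.run()
```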
