
Commit 538d521

Merge pull request #24 from oracle-devrel/update
Update
2 parents 4f986f8 + d3261e3 commit 538d521

30 files changed: +2859 -434 lines changed

agentic_rag/README.md

Lines changed: 64 additions & 10 deletions
@@ -4,6 +4,8 @@

An intelligent RAG (Retrieval Augmented Generation) system that uses an LLM agent to make decisions about information retrieval and response generation. The system processes PDF documents and can intelligently decide which knowledge base to query based on the user's question.

<img src="img/architecture.png" alt="CoT output" width="80%">

The system has the following features:

- Intelligent query routing
@@ -12,8 +14,19 @@ The system has the following features:

- Smart context retrieval and response generation
- FastAPI-based REST API for document upload and querying
- Support for both OpenAI-based agents or local, transformer-based agents (`Mistral-7B` by default)
- Support for quantized models (4-bit/8-bit) and Ollama models for faster inference
- Optional Chain of Thought (CoT) reasoning for more detailed and structured responses

<img src="img/gradio_1.png" alt="Gradio Interface" width="80%">

<img src="img/gradio_2.png" alt="Gradio Interface" width="80%">

<img src="img/gradio_3.png" alt="Gradio Interface" width="80%">

Here you can find a result of using Chain of Thought (CoT) reasoning:

<img src="img/cot_final_answer.png" alt="CoT output" width="80%">
## 0. Prerequisites and setup

### Prerequisites

@@ -29,18 +42,20 @@ The system has the following features:

- Minimum 16GB RAM (recommended >24GB)
- GPU with 8GB VRAM recommended for better performance
- Will run on CPU if GPU is not available, but will be significantly slower
- For quantized models (4-bit/8-bit): reduced VRAM requirements (4-6GB) with minimal performance impact
- For Ollama models: requires Ollama to be installed and running, with significantly reduced memory requirements

### Setup

1. Clone the repository and install dependencies:

```bash
git clone https://github.com/oracle-devrel/devrel-labs.git
cd devrel-labs/agentic_rag
pip install -r requirements.txt
```

2. Authenticate with HuggingFace (for Hugging Face models only):

The system uses `Mistral-7B` by default, which requires authentication with HuggingFace:

@@ -63,6 +78,30 @@ The system has the following features:

If no API key is provided, the system will automatically download and use `Mistral-7B-Instruct-v0.2` for text generation when using the local model. No additional configuration is needed.

4. For quantized models, ensure bitsandbytes is installed:

```bash
pip install "bitsandbytes>=0.41.0"
```

5. For Ollama models, install Ollama:

a. Download and install Ollama from [ollama.com/download](https://ollama.com/download) for Windows, or run the following command on Linux:

```bash
curl -fsSL https://ollama.com/install.sh | sh
```

b. Start the Ollama service.

c. Pull the models you want to use beforehand:

```bash
ollama pull llama3
ollama pull phi3
ollama pull qwen2
```
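Before moving on, a couple of optional sanity checks can save time later: one confirms the quantization backend imports and sees a GPU (this assumes PyTorch was already installed via `requirements.txt`), and one confirms the Ollama service is up and a pulled model answers. The prompt below is only an example; Ollama's REST API listens on port 11434 by default:

```bash
# Check that bitsandbytes imports and whether PyTorch sees a CUDA device
python -c "import bitsandbytes, torch; print(bitsandbytes.__version__, torch.cuda.is_available())"

# List locally pulled Ollama models
ollama list

# Send a single non-streaming prompt to llama3 through the local REST API
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "In one sentence, what is Retrieval Augmented Generation?",
  "stream": false
}'
```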
## 1. Getting Started

You can launch this solution in three ways:

@@ -93,19 +132,29 @@ python gradio_app.py

This will start the Gradio server and automatically open the interface in your default browser at `http://localhost:7860`. The interface has three main tabs:

1. **Model Management**:
   - Download models in advance to prepare them for use
   - View model information including size and VRAM requirements
   - Check download status and error messages

2. **Document Processing**:
   - Upload PDFs using the file uploader
   - Process web content by entering URLs
   - View processing status and results

3. **Chat Interface**:
   - Select between different model options:
     - Local (Mistral): default Mistral-7B model (recommended)
     - Local (Mistral) with 4-bit or 8-bit quantization for faster inference
     - Ollama models (llama3, phi-3, qwen2) as alternative options
     - OpenAI (if an API key is configured)
   - Toggle Chain of Thought reasoning for more detailed responses
   - Chat with your documents using natural language
   - Clear chat history as needed

Note: The interface will automatically detect available models based on your configuration:

- Local Mistral model requires a HuggingFace token in `config.yaml` (default option)
- Ollama models require Ollama to be installed and running (alternative options)
- OpenAI model requires an API key in the `.env` file

### 3. Using Individual Python Components via Command Line
@@ -301,14 +350,19 @@ This endpoint processes a query through the agentic RAG pipeline and returns a r
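As an illustration only: a query to a FastAPI endpoint like this one would typically be sent as a JSON POST. The host, route, and field names below are assumptions (they are not shown in this excerpt), so check the interactive docs that FastAPI serves at `/docs` for the actual signature:

```bash
# Hypothetical example; adjust the host, route and payload to match the actual API
curl -X POST "http://localhost:8000/query" \
  -H "Content-Type: application/json" \
  -d '{"query": "What does the uploaded PDF say about deployment?"}'
```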
## Annex: Architecture

<img src="img/architecture.png" alt="Architecture" width="80%">

The system consists of several key components:

1. **PDF Processor**: we use `docling` to extract and chunk text from PDF documents
2. **Web Processor**: we use `trafilatura` to extract and chunk text from websites
3. **GitHub Repository Processor**: we use `gitingest` to extract and chunk text from repositories
4. **Vector Store**: Manages document embeddings and similarity search using `ChromaDB`
5. **RAG Agent**: Makes intelligent decisions about query routing and response generation
   - OpenAI Agent: Uses `gpt-4-turbo-preview` for high-quality responses, but requires an OpenAI API key
   - Local Agent: Uses `Mistral-7B` as an open-source alternative
6. **FastAPI Server**: Provides REST API endpoints for document upload and querying
7. **Gradio Interface**: Provides a user-friendly web interface for interacting with the RAG system

The RAG Agent flow is the following:

agentic_rag/agents/agent_factory.py

Lines changed: 23 additions & 2 deletions
@@ -27,11 +27,30 @@ class Agent(BaseModel):

    def log_prompt(self, prompt: str, prefix: str = ""):
        """Log a prompt being sent to the LLM"""
        # Check if the prompt contains context
        if "Context:" in prompt:
            # Split the prompt at "Context:" and keep only the first part
            parts = prompt.split("Context:")
            # Keep the first part and add a note that context is omitted
            truncated_prompt = parts[0] + "Context: [Context omitted for brevity]"
            if len(parts) > 2 and "Key Findings:" in parts[1]:
                # For researcher prompts, keep the "Key Findings:" part
                key_findings_part = parts[1].split("Key Findings:")
                if len(key_findings_part) > 1:
                    truncated_prompt += "\nKey Findings:" + key_findings_part[1]
            logger.info(f"\n{'='*80}\n{prefix} Prompt:\n{'-'*40}\n{truncated_prompt}\n{'='*80}")
        else:
            # If no context, log the full prompt
            logger.info(f"\n{'='*80}\n{prefix} Prompt:\n{'-'*40}\n{prompt}\n{'='*80}")

    def log_response(self, response: str, prefix: str = ""):
        """Log a response received from the LLM"""
        # Log the response but truncate if it's too long
        if len(response) > 500:
            truncated_response = response[:500] + "... [response truncated]"
            logger.info(f"\n{'='*80}\n{prefix} Response:\n{'-'*40}\n{truncated_response}\n{'='*80}")
        else:
            logger.info(f"\n{'='*80}\n{prefix} Response:\n{'-'*40}\n{response}\n{'='*80}")

class PlannerAgent(Agent):
    """Agent responsible for breaking down problems and planning steps"""
@@ -108,6 +127,7 @@ def research(self, query: str, step: str) -> List[Dict[str, Any]]:

        Key Findings:"""

        # Create context string but don't log it
        context_str = "\n\n".join([f"Source {i+1}:\n{item['content']}" for i, item in enumerate(all_results)])
        prompt = ChatPromptTemplate.from_template(template)
        messages = prompt.format_messages(step=step, context=context_str)
@@ -140,6 +160,7 @@ def reason(self, query: str, step: str, context: List[Dict[str, Any]]) -> str:

        Conclusion:"""

        # Create context string but don't log it
        context_str = "\n\n".join([f"Context {i+1}:\n{item['content']}" for i, item in enumerate(context)])
        prompt = ChatPromptTemplate.from_template(template)
        messages = prompt.format_messages(step=step, query=query, context=context_str)
Lines changed: 184 additions & 0 deletions
@@ -0,0 +1,184 @@

# Agentic RAG: Enterprise-Scale Multi-Agent AI System on Oracle Cloud Infrastructure

## Introduction

<img src="../img/architecture.png" width="100%">

Agentic RAG is an advanced Retrieval-Augmented Generation system that employs a multi-agent architecture with Chain-of-Thought reasoning, designed for enterprise-scale deployment on Oracle Cloud Infrastructure (OCI).

The system leverages specialized AI agents for complex document analysis and query processing, while taking advantage of OCI's managed Kubernetes service and security features for production-grade deployment.

In this article, we want to show you how you can get started in a few steps to install and deploy this multi-agent RAG system using Oracle Kubernetes Engine (OKE) and OCI.

## Features

This Agentic RAG system is based on the following technologies:

- Oracle Kubernetes Engine (OKE)
- Oracle Cloud Infrastructure (OCI)
- `ollama` as the inference server for most Large Language Models (LLMs) available in the solution (`llama3`, `phi3`, `qwen2`)
- `Mistral-7B` language model, with optional multi-agent Chain of Thought reasoning
- `ChromaDB` as vector store and retrieval system
- `Trafilatura`, `docling` and `gitingest` to extract the content from PDFs, web pages and repositories, and have it ready to be used by the RAG system
- Multi-agent architecture with specialized agents:
  - Planner Agent: Strategic decomposition of complex queries
  - Research Agent: Intelligent information retrieval (from the vector database)
  - Reasoning Agent: Logical analysis and conclusion drawing
  - Synthesis Agent: Comprehensive response generation
- Support for both cloud-based (OpenAI) and local (Mistral-7B) language models
- Step-by-step reasoning visualization
- `Gradio` web interface for easy interaction with the RAG system

There are several benefits to using containerized LLMs over running the LLMs directly on cloud instances. For example:

- **Scalability**: you can easily scale the LLM workloads across Kubernetes clusters. In our case, we're deploying the solution with 4 agents in the same cluster, but you could deploy each agent in a different cluster if you wanted to accelerate the Chain-of-Thought reasoning processing time (horizontal scaling). You could also use vertical scaling by adding more resources to the same agent.
- **Resource Optimization**: you can efficiently allocate GPU and memory resources for each agent
- **Isolation**: each agent runs in its own container for better resource management
- **Version Control**: easily update and roll back LLM versions and configurations
- **Reproducibility**: have a consistent environment across development and production, which is crucial when you're working with complex LLM applications
- **Cost Efficiency**: you pay only for the resources you need, and when you're done with your work, you can simply stop the Kubernetes cluster and you won't be charged for the resources anymore
- **Integration**: you can easily integrate the RAG system with other programming languages or frameworks, as we also make available a REST-based API to interact with the system, apart from the standard web interface

In conclusion, it's really easy to scale your system up and down with Kubernetes, without having to worry about the underlying infrastructure, installation, configuration, and so on.

Note that the way we've planned the infrastructure is important because it allows us to:

1. Scale the `chromadb` vector store independently
2. Share the LLM container across agents, meaning the LLM is deployed only once and then used by all the agents
3. Scale the `Research Agent` separately for parallel document processing, if needed
4. Optimize memory and GPU resources, since there's only one LLM instance running
## Deployment in Kubernetes

We have devised two different ways to deploy in Kubernetes: either as a local or a distributed system, each offering its own advantages.

### Local Deployment

This method is the easiest way to implement and deploy. We call it local because every resource is deployed in the same pod. The advantages are the following:

- **Simplicity**: All components run in a single pod, making deployment and management straightforward
- **Easier debugging**: Troubleshooting is simpler when all logs and components are in one place (we're looking to expand the standard logging mechanism that we have right now with `fluentd`)
- **Quick setup**: Ideal for testing, development, or smaller-scale deployments
- **Lower complexity**: No need to configure inter-service communication or networking mechanisms like port forwarding

### Distributed System Deployment

By decoupling the `ollama` LLM inference system into its own pod, we can easily ready our system for **vertical scaling**: if we're ever running out of resources, or we need to use a bigger model, we don't have to worry about the other solution components not having enough resources for processing and logging. We can simply scale up our inference pod and connect it via FastAPI (or a similar mechanism) so that the Gradio interface can make calls to the model, following a distributed system architecture. A rough sketch of what such a dedicated inference pod could look like is shown after the list of advantages below.

The advantages are:

- **Independent Scaling**: Each component can be scaled according to its specific resource needs
- **Resource Optimization**: Dedicated resources for compute-intensive LLM inference, separate from other components
- **High Availability**: The system remains operational even if individual components fail, and we can have multiple pods running failover LLMs to help us with disaster recovery
- **Flexible Model Deployment**: Easily swap or upgrade LLM models without affecting the rest of the system (also, with virtually zero downtime!)
- **Load Balancing**: Distribute inference requests across multiple LLM pods for better performance, thus allowing concurrent users in our Gradio interface
- **Isolation**: Performance issues on the LLM side won't impact the interface
- **Cost Efficiency**: Allocate expensive GPU resources only where needed (inference) while using cheaper CPU resources for other components (e.g. we use a GPU for Chain of Thought reasoning, while keeping a quantized CPU LLM for standard chatting)
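The repository currently ships manifests for the local, single-pod deployment, so the following is only a rough sketch of what a dedicated `ollama` inference pod could look like under this distributed approach. The names, image tag, and resource figures are assumptions for illustration; they are not taken from the repository:

```bash
# Hypothetical manifests for a dedicated Ollama inference pod (sketch only)
cat <<EOF | kubectl apply -n agentic-rag -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama-inference
  template:
    metadata:
      labels:
        app: ollama-inference
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
          resources:
            limits:
              nvidia.com/gpu: 1
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-inference
spec:
  selector:
    app: ollama-inference
  ports:
    - port: 11434
      targetPort: 11434
EOF
```

The rest of the solution would then reach the model at `http://ollama-inference:11434` inside the cluster, instead of talking to a co-located Ollama process.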
## Quick Start

For this solution, we have currently implemented the local system deployment, which is what we'll cover in this section.

First, we need to create a GPU OKE cluster with `zx` and Terraform. For this, you can follow the steps in [this repository](https://github.com/vmleon/oci-oke-gpu), or reuse your own Kubernetes cluster if you happen to already have one.

Then, we can start setting up the solution in our cluster by following these steps.

1. Clone the repository containing the Kubernetes manifests:

```bash
git clone https://github.com/oracle-devrel/devrel-labs.git
cd devrel-labs/agentic_rag/k8s
```

2. Create a namespace:

```bash
kubectl create namespace agentic-rag
```

3. Create a ConfigMap:

This step will help our deployment for several reasons:

1. **Externalized Configuration**: It separates configuration from application code, following best practices for containerized applications
2. **Environment-specific Settings**: Allows us to maintain different configurations for development, testing, and production environments
3. **Credential Management**: Provides a way to inject API tokens (like Hugging Face) without hardcoding them in the image
4. **Runtime Configuration**: Enables changing configuration without rebuilding or redeploying the application container
5. **Consistency**: Ensures all pods use the same configuration when scaled horizontally

In our specific case, the ConfigMap stores:

- The Hugging Face Hub token for accessing (and downloading) the `mistral-7b` model (and CPU-quantized variants)
- Optionally, OpenAI API keys if using those models
- Any other environment-specific variables needed by the application, in case we want to do further development and extend the capabilities of the system with external API keys, authentication tokens, etc.

Let's run the following command to create the ConfigMap:

```bash
# With a Hugging Face token
cat <<EOF | kubectl apply -n agentic-rag -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: agentic-rag-config
data:
  config.yaml: |
    HUGGING_FACE_HUB_TOKEN: "your-huggingface-token"
EOF

# Or without a Hugging Face token
cat <<EOF | kubectl apply -n agentic-rag -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: agentic-rag-config
data:
  config.yaml: |
    # No Hugging Face token provided
    # You can still use Ollama models
EOF
```

This approach makes our deployment more flexible, secure, and maintainable compared to hardcoding configuration values.
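If you want to double-check the result before moving on, you can read the ConfigMap back from the cluster (purely optional):

```bash
# Show the ConfigMap as stored in the agentic-rag namespace
kubectl get configmap agentic-rag-config -n agentic-rag -o yaml
```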
4. Apply the manifests:

```bash
kubectl apply -n agentic-rag -f local-deployment/pvcs.yaml
kubectl apply -n agentic-rag -f local-deployment/deployment.yaml
kubectl apply -n agentic-rag -f local-deployment/service.yaml
```

5. Monitor the Deployment

With the following commands, we can check the status of our pod:

```bash
kubectl get pods -n agentic-rag
```

And view the internal logs of the pod:

```bash
kubectl logs -f deployment/agentic-rag -n agentic-rag
```
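If the pod gets stuck in `Pending` or `CrashLoopBackOff`, the standard Kubernetes troubleshooting commands below are a good first stop (nothing here is specific to this solution):

```bash
# Recent events in the namespace, useful for scheduling or image-pull problems
kubectl get events -n agentic-rag --sort-by=.lastTimestamp

# Detailed view of the deployment: resource requests, volumes, container status
kubectl describe deployment agentic-rag -n agentic-rag
```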
6. Access the Application

Get the external IP address of the service:

```bash
kubectl get service agentic-rag -n agentic-rag
```

Access the application in your browser at `http://<EXTERNAL-IP>`.
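If the service has no external IP yet (for example, while the load balancer is still provisioning, or on a cluster without one), port-forwarding is a handy fallback. The port numbers below assume the service exposes the Gradio port (7860) and may need adjusting to your manifests:

```bash
# Forward local port 7860 to the agentic-rag service, then open http://localhost:7860
kubectl port-forward service/agentic-rag 7860:7860 -n agentic-rag
```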
## Resource Requirements

The deployment of this solution requires the following minimum resources:

- **CPU**: 4+ cores
- **Memory**: 16GB+ RAM
- **Storage**: 50GB+
- **GPU**: recommended for faster inference. In theory, you can use `mistral-7b` CPU-quantized models, but it will be sub-optimal.
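To check whether the nodes in your cluster actually have this headroom before deploying, you can look at their allocatable resources, and, if the metrics-server add-on is installed, their current usage:

```bash
# Allocatable CPU and memory per node
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEM:.status.allocatable.memory

# Current usage (requires the metrics-server add-on)
kubectl top nodes
```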
## Conclusion
You can check out the full AI solution and the deployment options we mention in this article in [the official GitHub repository](https://github.com/oracle-devrel/devrel-labs/tree/main/agentic_rag).

0 commit comments
