Merged

39 commits
378b28b
feat: minor changes, created carbon png for pull commands and inserte…
jasperan Feb 26, 2025
f11c915
Fix agentic_rag collection selection to respect user choice. When PDF…
jasperan Feb 26, 2025
0e2b758
feat: added gradio images
jasperan Feb 26, 2025
3824d5a
Optimize agentic_rag performance by skipping query analysis when Gene…
jasperan Feb 26, 2025
163afba
Fix query analysis in agentic_rag to completely skip analysis for Gen…
jasperan Feb 26, 2025
dc08bca
Fix query analysis in standard chat interface to completely skip anal…
jasperan Feb 26, 2025
db3717c
Improve user experience by removing information disclaimers and Spani…
jasperan Feb 26, 2025
52c1ff7
Fix query analysis to ensure query_type matches selected collection
jasperan Feb 26, 2025
ac10f5f
Remove query analysis completely to always use selected collection
jasperan Feb 26, 2025
60749b3
Add numbered source citations and reduce verbosity in console logs
jasperan Feb 26, 2025
6db78ce
feat: added gradio images to intro, updated readme, added architecture
jasperan Feb 26, 2025
59966c9
feat: added cot result and architecture to introduction
jasperan Feb 27, 2025
ad4175b
Add Coqui-TTS support to root tts_generator.py
jasperan Feb 27, 2025
f543b0a
Fix TTS generator issues: Parler import path and Bark max_new_tokens …
jasperan Feb 27, 2025
be6eac8
Fix max_new_tokens parameter conflict in Bark TTS generator
jasperan Feb 27, 2025
5f18487
Enhance warning suppression for transformers attention mask warnings
jasperan Feb 27, 2025
36be478
Fix max_new_tokens parameter conflict in Bark TTS generator
jasperan Feb 27, 2025
651a68b
Add new projects and files including planeLLM, OCI subtitle translati…
jasperan Feb 27, 2025
be1f104
feat: fixed tts generator, cleanup repo
jasperan Feb 27, 2025
f6c4383
feat: added new image to display streamlit interface, updated readme …
jasperan Mar 4, 2025
3dd3d09
feat: added output JSON translated file
jasperan Mar 4, 2025
6746b12
feat: added readme instructions for new model downloads and GGUF models
jasperan Mar 7, 2025
1e8b9dc
feat: added model downloads chapter and new available GGUF and quanti…
jasperan Mar 7, 2025
b24603b
feat: added model downloads chapter and new available GGUF and quanti…
jasperan Mar 7, 2025
f36186c
feat: updated reqs for bitsandbytes, huggingface and llama-cpp-python
jasperan Mar 7, 2025
08fb3ee
feat: added ollama official support
jasperan Mar 7, 2025
ba48033
feat: enhanced model downloads
jasperan Mar 8, 2025
1c7dd2e
fix: Error processing query: 'str' object does not support item assig…
jasperan Mar 11, 2025
0b713f7
feat: removed context printing in console
jasperan Mar 11, 2025
9fd6437
feat: updated architecture page with ollama progress
jasperan Mar 16, 2025
fc67c11
feat: added test kubernetes files
jasperan Mar 16, 2025
f24bebf
feat: removed nvidia gpu requirements
jasperan Mar 16, 2025
9055802
feat: added draft version of article
jasperan Mar 17, 2025
5240694
Fix repo URL in deployment.yaml
vmleon Mar 18, 2025
187250e
Merge pull request #23 from vmleon/patch-1
jasperan Mar 18, 2025
2796404
feat: added pvcs
jasperan Mar 18, 2025
783920a
fix: fatal: destination path '.' already exists and is not an empty d…
jasperan Mar 18, 2025
7c8a03f
fix: /bin/bash: line 41: cd: agentic_rag: No such file or directory
jasperan Mar 19, 2025
d3261e3
feat: added final version of kubernetes rag article and readme for k8…
jasperan Mar 19, 2025
74 changes: 64 additions & 10 deletions agentic_rag/README.md
@@ -4,6 +4,8 @@

An intelligent RAG (Retrieval Augmented Generation) system that uses an LLM agent to make decisions about information retrieval and response generation. The system processes PDF documents and can intelligently decide which knowledge base to query based on the user's question.

<img src="img/architecture.png" alt="CoT output" width="80%">

The system has the following features:

- Intelligent query routing
@@ -12,8 +14,19 @@ The system has the following features:
- Smart context retrieval and response generation
- FastAPI-based REST API for document upload and querying
- Support for both OpenAI-based agents and local, transformer-based agents (`Mistral-7B` by default)
- Support for quantized models (4-bit/8-bit) and Ollama models for faster inference
- Optional Chain of Thought (CoT) reasoning for more detailed and structured responses

<img src="img/gradio_1.png" alt="Gradio Interface" width="80%">

<img src="img/gradio_2.png" alt="Gradio Interface" width="80%">

<img src="img/gradio_3.png" alt="Gradio Interface" width="80%">

Here you can find a result of using Chain of Thought (CoT) reasoning:

<img src="img/cot_final_answer.png" alt="CoT output" width="80%">

## 0. Prerequisites and setup

### Prerequisites
@@ -29,18 +42,20 @@ The system has the following features:
- Minimum 16GB RAM (recommended >24GBs)
- GPU with 8GB VRAM recommended for better performance
- Will run on CPU if GPU is not available, but will be significantly slower.
- For quantized models (4-bit/8-bit): Reduced VRAM requirements (4-6GB) with minimal performance impact
- For Ollama models: Requires Ollama to be installed and running, with significantly reduced memory requirements

### Setup

1. Clone the repository and install dependencies:

```bash
git clone https://github.com/oracle-devrel/devrel-labs.git
cd agentic-rag
cd devrel-labs/agentic_rag
pip install -r requirements.txt
```

2. Authenticate with HuggingFace:
2. Authenticate with HuggingFace (for Hugging Face models only):

The system uses `Mistral-7B` by default, which requires authentication with HuggingFace:

@@ -63,6 +78,30 @@

If no API key is provided, the system will automatically download and use `Mistral-7B-Instruct-v0.2` for text generation when using the local model. No additional configuration is needed.
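
For reference, here is a minimal sketch of the two configuration files involved — the `config.yaml` key name matches the Kubernetes ConfigMap used elsewhere in this PR, while the `.env` variable name for OpenAI is an assumption:

```bash
# Sketch only — adjust to your own keys; the OPENAI_API_KEY name is an assumption.
# config.yaml is used by the local HuggingFace (Mistral) model:
cat > config.yaml <<EOF
HUGGING_FACE_HUB_TOKEN: "your-huggingface-token"
EOF

# .env is only needed if you want to use the OpenAI agent:
cat > .env <<EOF
OPENAI_API_KEY=your-openai-api-key
EOF
```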

4. For quantized models, ensure bitsandbytes is installed:

```bash
pip install bitsandbytes>=0.41.0
```

5. For Ollama models, install Ollama:

a. Download and install Ollama from [ollama.com/download](https://ollama.com/download) on Windows, or run the following command on Linux:

```bash
curl -fsSL https://ollama.com/install.sh | sh
```

b. Start the Ollama service

c. Pull the models you want to use beforehand:

```bash
ollama pull llama3
ollama pull phi3
ollama pull qwen2
```
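
If the Ollama service was not started automatically by the installer, `ollama serve` runs it in the foreground. You can then confirm that the models are available before launching the app (assuming Ollama listens on its default port, 11434):

```bash
# List the locally downloaded models
ollama list

# Verify the Ollama API is reachable
curl http://localhost:11434/api/tags
```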

## 1. Getting Started

You can launch this solution in three ways:
@@ -93,19 +132,29 @@ python gradio_app.py

This will start the Gradio server and automatically open the interface in your default browser at `http://localhost:7860`. The interface has three main tabs:

1. **Document Processing**:
1. **Model Management**:
- Download models in advance to prepare them for use
- View model information including size and VRAM requirements
- Check download status and error messages

2. **Document Processing**:
- Upload PDFs using the file uploader
- Process web content by entering URLs
- View processing status and results

2. **Chat Interface**:
- Select between Local (Mistral) and OpenAI models
3. **Chat Interface**:
- Select between different model options:
- Local (Mistral) - Default Mistral-7B model (recommended)
- Local (Mistral) with 4-bit or 8-bit quantization for faster inference
- Ollama models (llama3, phi-3, qwen2) as alternative options
- OpenAI (if API key is configured)
- Toggle Chain of Thought reasoning for more detailed responses
- Chat with your documents using natural language
- Clear chat history as needed

Note: The interface will automatically detect available models based on your configuration:
- Local Mistral model requires HuggingFace token in `config.yaml`
- Local Mistral model requires HuggingFace token in `config.yaml` (default option)
- Ollama models require Ollama to be installed and running (alternative options)
- OpenAI model requires API key in `.env` file

### 3. Using Individual Python Components via Command Line
@@ -301,14 +350,19 @@ This endpoint processes a query through the agentic RAG pipeline and returns a r

## Annex: Architecture

<img src="img/architecture.png" alt="Architecture" width="80%">

The system consists of several key components:

1. **PDF Processor**: we use Docling to extract and chunk text from PDF documents
2. **Vector Store**: Manages document embeddings and similarity search using ChromaDB
3. **RAG Agent**: Makes intelligent decisions about query routing and response generation
1. **PDF Processor**: we use `docling` to extract and chunk text from PDF documents
2. **Web Processor**: we use `trafilatura` to extract and chunk text from websites
3. **GitHub Repository Processor**: we use `gitingest` to extract and chunk text from repositories
4. **Vector Store**: Manages document embeddings and similarity search using `ChromaDB`
5. **RAG Agent**: Makes intelligent decisions about query routing and response generation
- OpenAI Agent: Uses `gpt-4-turbo-preview` for high-quality responses, but requires an OpenAI API key
- Local Agent: Uses `Mistral-7B` as an open-source alternative
4. **FastAPI Server**: Provides REST API endpoints for document upload and querying
6. **FastAPI Server**: Provides REST API endpoints for document upload and querying
7. **Gradio Interface**: Provides a user-friendly web interface for interacting with the RAG system

The RAG Agent flow is the following:

25 changes: 23 additions & 2 deletions agentic_rag/agents/agent_factory.py
@@ -27,11 +27,30 @@ class Agent(BaseModel):

    def log_prompt(self, prompt: str, prefix: str = ""):
        """Log a prompt being sent to the LLM"""
        logger.info(f"\n{'='*80}\n{prefix} Prompt:\n{'-'*40}\n{prompt}\n{'='*80}")
        # Check if the prompt contains context
        if "Context:" in prompt:
            # Split the prompt at "Context:" and keep only the first part
            parts = prompt.split("Context:")
            # Keep the first part and add a note that context is omitted
            truncated_prompt = parts[0] + "Context: [Context omitted for brevity]"
            if len(parts) > 2 and "Key Findings:" in parts[1]:
                # For researcher prompts, keep the "Key Findings:" part
                key_findings_part = parts[1].split("Key Findings:")
                if len(key_findings_part) > 1:
                    truncated_prompt += "\nKey Findings:" + key_findings_part[1]
            logger.info(f"\n{'='*80}\n{prefix} Prompt:\n{'-'*40}\n{truncated_prompt}\n{'='*80}")
        else:
            # If no context, log the full prompt
            logger.info(f"\n{'='*80}\n{prefix} Prompt:\n{'-'*40}\n{prompt}\n{'='*80}")

    def log_response(self, response: str, prefix: str = ""):
        """Log a response received from the LLM"""
        logger.info(f"\n{'='*80}\n{prefix} Response:\n{'-'*40}\n{response}\n{'='*80}")
        # Log the response but truncate if it's too long
        if len(response) > 500:
            truncated_response = response[:500] + "... [response truncated]"
            logger.info(f"\n{'='*80}\n{prefix} Response:\n{'-'*40}\n{truncated_response}\n{'='*80}")
        else:
            logger.info(f"\n{'='*80}\n{prefix} Response:\n{'-'*40}\n{response}\n{'='*80}")

class PlannerAgent(Agent):
"""Agent responsible for breaking down problems and planning steps"""
@@ -108,6 +127,7 @@ def research(self, query: str, step: str) -> List[Dict[str, Any]]:

Key Findings:"""

        # Create context string but don't log it
        context_str = "\n\n".join([f"Source {i+1}:\n{item['content']}" for i, item in enumerate(all_results)])
        prompt = ChatPromptTemplate.from_template(template)
        messages = prompt.format_messages(step=step, context=context_str)
@@ -140,6 +160,7 @@ def reason(self, query: str, step: str, context: List[Dict[str, Any]]) -> str:

Conclusion:"""

        # Create context string but don't log it
        context_str = "\n\n".join([f"Context {i+1}:\n{item['content']}" for i, item in enumerate(context)])
        prompt = ChatPromptTemplate.from_template(template)
        messages = prompt.format_messages(step=step, query=query, context=context_str)
184 changes: 184 additions & 0 deletions agentic_rag/articles/kubernetes_rag.md
@@ -0,0 +1,184 @@
# Agentic RAG: Enterprise-Scale Multi-Agent AI System on Oracle Cloud Infrastructure

## Introduction

<img src="../img/architecture.png" width="100%">

Agentic RAG is an advanced Retrieval-Augmented Generation system that employs a multi-agent architecture with Chain-of-Thought reasoning, designed for enterprise-scale deployment on Oracle Cloud Infrastructure (OCI).

The system leverages specialized AI agents for complex document analysis and query processing, while taking advantage of OCI's managed Kubernetes service and security features for production-grade deployment.

In this article, we show you how to install and deploy this multi-agent RAG system on Oracle Kubernetes Engine (OKE) and OCI in just a few steps.

## Features

This Agentic RAG system is based on the following technologies:

- Oracle Kubernetes Engine (OKE)
- Oracle Cloud Infrastructure (OCI)
- `ollama` as the inference server for most Large Language Models (LLMs) available in the solution (`llama3`, `phi3`, `qwen2`)
- `Mistral-7B` language model, with an optional multi-agent Chain of Thought reasoning
- `ChromaDB` as vector store and retrieval system
- `Trafilatura`, `docling` and `gitingest` to extract content from PDFs, web pages, and GitHub repositories, and prepare it for use by the RAG system
- Multi-agent architecture with specialized agents:
- Planner Agent: Strategic decomposition of complex queries
- Research Agent: Intelligent information retrieval (from vector database)
- Reasoning Agent: Logical analysis and conclusion drawing
- Synthesis Agent: Comprehensive response generation
- Support for both cloud-based (OpenAI) and local (Mistral-7B) language models
- Step-by-step reasoning visualization
- `Gradio` web interface for easy interaction with the RAG system

There are several benefits to using containerized LLMs over running them directly on cloud instances. For example:

- **Scalability**: you can easily scale the LLM workloads across Kubernetes clusters. In our case, we're deploying the solution with 4 agents in the same cluster, but you could deploy each agent in a different cluster if you wanted to accelerate the Chain-of-Thought reasoning processing time (horizontal scaling). You could also use vertical scaling by adding more resources to the same agent.
- **Resource Optimization**: you can efficiently allocate GPU and memory resources for each agent
- **Isolation**: Each agent runs in its own container for better resource management
- **Version Control**: easily update and rollback LLM versions and configurations
- **Reproducibility**: have a consistent environment across development and production, which is crucial when you're working with complex LLM applications
- **Cost Efficiency**: you pay only for the resources you need, and when you're done with your work, you can simply stop the Kubernetes cluster and stop being charged for those resources.
- **Integration**: you can easily integrate the RAG system with other programming languages or frameworks, since we also provide a REST API for interacting with the system in addition to the standard web interface.

In conclusion, it's really easy to scale your system up and down with Kubernetes, without having to worry about the underlying infrastructure, installation, configuration, etc.

Note that the way we've planned the infrastructure is important because it allows us to:
1. Scale the `chromadb` vector store independently
2. Share the LLM container across agents, so the LLM is deployed only once and reused by every agent
3. Scale the `Research Agent` separately for parallel document processing, if needed
4. Optimize memory and GPU resources, since only one LLM instance is running

## Deployment in Kubernetes

We have devised two different ways to deploy in Kubernetes: either through a local or distributed system, each offering its own advantages.

### Local Deployment

This method is the easiest way to implement and deploy. We call it local because every resource is deployed in the same pod. The advantages are the following:

- **Simplicity**: All components run in a single pod, making deployment and management straightforward
- **Easier debugging**: Troubleshooting is simpler when all logs and components are in one place (we're looking to expand the standard logging mechanism that we have right now with `fluentd`)
- **Quick setup**: Ideal for testing, development, or smaller-scale deployments
- **Lower complexity**: No need to configure inter-service communication or mechanisms such as port forwarding or network policies.

### Distributed System Deployment

By decoupling the `ollama` LLM inference system into its own pod, we can easily ready our system for **vertical scaling**: if we ever run out of resources or need a bigger model, we don't have to worry about the other solution components not having enough resources for processing and logging. We simply scale up the inference pod and connect it to the Gradio interface via FastAPI or a similar layer, following a distributed system architecture (a minimal manifest sketch follows the list of advantages below).

The advantages are:

- **Independent Scaling**: Each component can be scaled according to its specific resource needs
- **Resource Optimization**: Dedicated resources for compute-intensive LLM inference separate from other components
- **High Availability**: System remains operational even if individual components fail, and we can have multiple pods running failover LLMs to help us with disaster recovery.
- **Flexible Model Deployment**: Easily swap or upgrade LLM models without affecting the rest of the system (also, with virtually zero downtime!)
- **Load Balancing**: Distribute inference requests across multiple LLM pods for better performance, thus allowing concurrent users in our Gradio interface.
- **Isolation**: Performance issues on the LLM side won't impact the interface
- **Cost Efficiency**: Allocate expensive GPU resources only where needed (inference) while using cheaper CPU resources for other components (e.g. we use GPU for Chain of Thought reasoning, while keeping a quantized CPU LLM for standard chatting).
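
As a reference, here is a minimal sketch of what decoupling the inference server could look like: a dedicated `ollama` Deployment plus a ClusterIP Service that the application pod reaches over HTTP. The image tag, GPU request, and service name below are assumptions rather than manifests from the repository:

```bash
cat <<EOF | kubectl apply -n agentic-rag -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest   # assumption: upstream Ollama image
        ports:
        - containerPort: 11434
        resources:
          limits:
            nvidia.com/gpu: 1         # assumption: one GPU per inference pod
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
spec:
  selector:
    app: ollama
  ports:
  - port: 11434
    targetPort: 11434
EOF
```

With something like this in place, the application pods would point their Ollama client at `http://ollama:11434` inside the cluster, and the inference Deployment could be scaled or upgraded independently of the Gradio interface.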

## Quick Start

For this solution, we have currently implemented the local system deployment, which is what we'll cover in this section.

First, we need to create a GPU OKE cluster with `zx` and Terraform. For this, you can follow the steps in [this repository](https://github.com/vmleon/oci-oke-gpu), or reuse your own Kubernetes cluster if you happen to already have one.

Then, we can start setting up the solution in our cluster by following these steps.

1. Clone the repository containing the Kubernetes manifests:

```bash
git clone https://github.com/oracle-devrel/devrel-labs.git
cd devrel-labs/agentic_rag/k8s
```

2. Create a namespace:

```bash
kubectl create namespace agentic-rag
```

3. Create a ConfigMap:

This step will help our deployment for several reasons:

1. **Externalized Configuration**: It separates configuration from application code, following best practices for containerized applications
2. **Environment-specific Settings**: Allows us to maintain different configurations for development, testing, and production environments
3. **Credential Management**: Provides a way to inject API tokens (like Hugging Face) without hardcoding them in the image
4. **Runtime Configuration**: Enables changing configuration without rebuilding or redeploying the application container
5. **Consistency**: Ensures all pods use the same configuration when scaled horizontally

In our specific case, the ConfigMap stores:

- The Hugging Face Hub token for accessing (and downloading) the `mistral-7b` model (and its CPU-quantized variants)
- Optionally, the OpenAI API key if using OpenAI models
- Any other environment-specific variables the application needs, in case we extend the system with additional external API keys, authentication tokens, etc.

Let's run the following command to create the config map:

```bash
# With a Hugging Face token
cat <<EOF | kubectl apply -n agentic-rag -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: agentic-rag-config
data:
  config.yaml: |
    HUGGING_FACE_HUB_TOKEN: "your-huggingface-token"
EOF

# Or without a Hugging Face token
cat <<EOF | kubectl apply -n agentic-rag -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: agentic-rag-config
data:
  config.yaml: |
    # No Hugging Face token provided
    # You can still use Ollama models
EOF
```

This approach makes our deployment more flexible, secure, and maintainable compared to hardcoding configuration values.
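
You can verify that the ConfigMap exists and contains the expected `config.yaml` content before moving on:

```bash
kubectl get configmap agentic-rag-config -n agentic-rag -o yaml
```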

4. Apply the manifests:

```bash
kubectl apply -n agentic-rag -f local-deployment/pvcs.yaml
kubectl apply -n agentic-rag -f local-deployment/deployment.yaml
kubectl apply -n agentic-rag -f local-deployment/service.yaml
```

5. Monitor the Deployment

With the following commands, we can check the status of our pod:

```bash
kubectl get pods -n agentic-rag
```

And view the internal logs of the pod:

```bash
kubectl logs -f deployment/agentic-rag -n agentic-rag
```

6. Access the Application

Get the external IP address of the service:

```bash
kubectl get service agentic-rag -n agentic-rag
```

Access the application in your browser at `http://<EXTERNAL-IP>`.
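
If the LoadBalancer address is still pending, or you simply want to test from your workstation, a port-forward to the service also works. This is a sketch that assumes the service exposes the Gradio port `7860`; adjust the ports to match `service.yaml`:

```bash
kubectl port-forward -n agentic-rag service/agentic-rag 7860:7860
# Then open http://localhost:7860 in your browser
```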

## Resource Requirements

The deployment of this solution requires the following minimum resources (one way to set them on the deployment is sketched after this list):

- **CPU**: 4+ cores
- **Memory**: 16GB+ RAM
- **Storage**: 50GB+
- **GPU**: recommended for faster inference. In theory, you can use `mistral-7b` CPU-quantized models, but performance will be sub-optimal.
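
One way to align the pod with these figures is to set resource requests on the deployment after it is created; this is a sketch, and the values should be tuned to your node shape (the deployment name matches the manifests above):

```bash
# Sketch: align the pod's CPU and memory requests with the minimums listed above
kubectl set resources deployment/agentic-rag -n agentic-rag \
  --requests=cpu=4,memory=16Gi
```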

## Conclusion

You can check out the full AI solution and the deployment options we mention in this article in [the official GitHub repository](https://github.com/oracle-devrel/devrel-labs/tree/main/agentic_rag).