# Agentic RAG: Enterprise-Scale Multi-Agent AI System on Oracle Cloud Infrastructure

## Introduction

<img src="../img/architecture.png" width="100%">

Agentic RAG is an advanced Retrieval-Augmented Generation system that employs a multi-agent architecture with Chain-of-Thought reasoning, designed for enterprise-scale deployment on Oracle Cloud Infrastructure (OCI).

The system leverages specialized AI agents for complex document analysis and query processing, while taking advantage of OCI's managed Kubernetes service and security features for production-grade deployment.

In this article, we show you how to install and deploy this multi-agent RAG system in a few steps using Oracle Kubernetes Engine (OKE) and OCI.

## Features

This Agentic RAG system is based on the following technologies:

- Oracle Kubernetes Engine (OKE)
- Oracle Cloud Infrastructure (OCI)
- `ollama` as the inference server for most Large Language Models (LLMs) available in the solution (`llama3`, `phi3`, `qwen2`); see the example right after this list for pre-pulling these models
- `Mistral-7B` language model, with an optional multi-agent Chain of Thought reasoning
- `ChromaDB` as the vector store and retrieval system
- `Trafilatura`, `docling` and `gitingest` to extract the content from PDFs and web pages, so it's ready to be used by the RAG system
- Multi-agent architecture with specialized agents:
  - Planner Agent: Strategic decomposition of complex queries
  - Research Agent: Intelligent information retrieval (from the vector database)
  - Reasoning Agent: Logical analysis and conclusion drawing
  - Synthesis Agent: Comprehensive response generation
- Support for both cloud-based (OpenAI) and local (Mistral-7B) language models
- Step-by-step reasoning visualization
- `Gradio` web interface for easy interaction with the RAG system

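If you want to make sure the `ollama` models are already downloaded before you start querying the system, a minimal sketch could look like the following. It assumes the `ollama` binary is available inside the `agentic-rag` container, and it uses the namespace and deployment created later in this article:

```bash
# Pre-pull the ollama models used by the solution
# (assumes ollama runs inside the agentic-rag container)
kubectl exec -n agentic-rag deployment/agentic-rag -- ollama pull llama3
kubectl exec -n agentic-rag deployment/agentic-rag -- ollama pull phi3
kubectl exec -n agentic-rag deployment/agentic-rag -- ollama pull qwen2
```
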
There are several benefits to using containerized LLMs over running the LLMs directly on cloud instances. For example:

- **Scalability**: you can easily scale the LLM workloads across Kubernetes clusters. In our case, we're deploying the solution with 4 agents in the same cluster, but you could deploy each agent in a different cluster if you wanted to accelerate the Chain-of-Thought reasoning processing time (horizontal scaling). You could also use vertical scaling by adding more resources to the same agent. See the commands right after this list for an example of both.
- **Resource Optimization**: you can efficiently allocate GPU and memory resources for each agent
- **Isolation**: each agent runs in its own container for better resource management
- **Version Control**: easily update and roll back LLM versions and configurations
- **Reproducibility**: have a consistent environment across development and production, which is crucial when you're working with complex LLM applications
- **Cost Efficiency**: you pay only for the resources you need, and when you're done with your work, you can simply stop the Kubernetes cluster so you won't be charged for those resources anymore
- **Integration**: you can easily integrate the RAG system with other programming languages or frameworks, as we also provide a REST-based API to interact with the system, in addition to the standard web interface

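As a sketch of what both options look like in practice, once the `agentic-rag` deployment from the Quick Start section below is running:

```bash
# Horizontal scaling: run more replicas of the application pod
kubectl scale deployment/agentic-rag -n agentic-rag --replicas=3

# Vertical scaling: give the same deployment more CPU and memory
kubectl set resources deployment/agentic-rag -n agentic-rag \
  --requests=cpu=2,memory=8Gi --limits=cpu=4,memory=16Gi
```

The exact resource figures above are illustrative; adjust them to your node shapes.
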
In conclusion, it's really easy to scale your system up and down with Kubernetes, without having to worry about the underlying infrastructure, installation, configuration, and so on.

Note that the way we've planned the infrastructure is important because it allows us to:

1. Scale the `chromadb` vector store independently
2. Share the LLM container across agents, so it only needs to be deployed once and is then used by all of them (see the sketch right after this list)
3. Scale the `Research Agent` separately for parallel document processing, if needed
4. Optimize memory and GPU resources, since there's only one LLM instance running

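As a minimal sketch of point 2, assuming the application locates the inference server through the standard `OLLAMA_HOST` environment variable (that variable name is an illustrative assumption, not something defined by this repository) and that a shared Service called `ollama` exists, every agent can be pointed at the same LLM endpoint:

```bash
# Point the application at a single, shared ollama Service
# (OLLAMA_HOST and the "ollama" Service name are illustrative assumptions)
kubectl set env deployment/agentic-rag -n agentic-rag \
  OLLAMA_HOST=http://ollama:11434
```
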
## Deployment in Kubernetes

We have devised two ways to deploy in Kubernetes: a local (single-pod) deployment and a distributed deployment, each offering its own advantages.

### Local Deployment

This method is the easiest way to implement and deploy. We call it local because every resource is deployed in the same pod. The advantages are the following:

- **Simplicity**: All components run in a single pod, making deployment and management straightforward
- **Easier debugging**: Troubleshooting is simpler when all logs and components are in one place (we're looking to expand the standard logging mechanism that we have right now with `fluentd`)
- **Quick setup**: Ideal for testing, development, or smaller-scale deployments
- **Lower complexity**: No need to configure inter-service communication or networking mechanisms like port forwarding

### Distributed System Deployment

By decoupling the `ollama` LLM inference system into its own pod, we can easily ready our system for **vertical scaling**: if we ever run out of resources, or we need to use a bigger model, we don't have to worry about the other solution components running short of resources for processing and logging. We can simply scale up our inference pod and connect it (via FastAPI or a similar system) so that the Gradio interface can make calls to the model, following a distributed system architecture. A minimal sketch of such a decoupled inference pod follows the advantages list below.

The advantages are:

- **Independent Scaling**: Each component can be scaled according to its specific resource needs
- **Resource Optimization**: Dedicated resources for compute-intensive LLM inference, separate from other components
- **High Availability**: The system remains operational even if individual components fail, and we can have multiple pods running failover LLMs to help us with disaster recovery
- **Flexible Model Deployment**: Easily swap or upgrade LLM models without affecting the rest of the system (with zero downtime)
- **Load Balancing**: Distribute inference requests across multiple LLM pods for better performance, thus allowing concurrent users in our Gradio interface
- **Isolation**: Performance issues on the LLM side won't impact the interface
- **Cost Efficiency**: Allocate expensive GPU resources only where needed (inference) while using cheaper CPU resources for other components (for example, we use a GPU for Chain of Thought reasoning while keeping a quantized CPU LLM for standard chatting)

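Here is a minimal sketch of what that decoupled inference pod could look like; the image tag, port and resource values below are illustrative assumptions rather than values taken from the repository's manifests:

```bash
cat <<EOF | kubectl apply -n agentic-rag -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest   # official ollama image; pin a version for production
        ports:
        - containerPort: 11434        # ollama's default API port
        resources:
          limits:
            nvidia.com/gpu: "1"       # remove this line for CPU-only inference
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
spec:
  selector:
    app: ollama
  ports:
  - port: 11434
    targetPort: 11434
EOF
```

The application pod can then reach the model at `http://ollama:11434`, as in the earlier `kubectl set env` sketch, and the inference pod can be scaled or upgraded independently of the Gradio interface.
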
## Quick Start

### Step by Step Deployment

0. Clone the repository containing the Kubernetes manifests:

```bash
git clone https://github.com/oracle-devrel/devrel-labs.git
cd devrel-labs/agentic_rag/k8s
```

1. Create a namespace:

```bash
kubectl create namespace agentic-rag
```

2. Create a ConfigMap:

This step will help our deployment for several reasons:

1. **Externalized Configuration**: It separates configuration from application code, following best practices for containerized applications
2. **Environment-specific Settings**: Allows us to maintain different configurations for development, testing, and production environments
3. **Credential Management**: Provides a way to inject API tokens (like Hugging Face) without hardcoding them in the image
4. **Runtime Configuration**: Enables changing configuration without rebuilding or redeploying the application container
5. **Consistency**: Ensures all pods use the same configuration when scaled horizontally

In our specific case, the ConfigMap stores:

- The Hugging Face Hub token for accessing (and downloading) the `mistral-7b` model (and its CPU-quantized variants)
- Optionally, OpenAI API keys, if using those models
- Any other environment-specific variables needed by the application, in case we want to develop further and extend the capabilities of the system with external API keys, authentication tokens, and so on

Let's run the following command to create the ConfigMap:

```bash
# With a Hugging Face token
cat <<EOF | kubectl apply -n agentic-rag -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: agentic-rag-config
data:
  config.yaml: |
    HUGGING_FACE_HUB_TOKEN: "your-huggingface-token"
EOF

# Or without a Hugging Face token
cat <<EOF | kubectl apply -n agentic-rag -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: agentic-rag-config
data:
  config.yaml: |
    # No Hugging Face token provided
    # You can still use Ollama models
EOF
```

This approach makes our deployment more flexible, secure, and maintainable compared to hardcoding configuration values.

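If you want to double-check the result before moving on, you can inspect the ConfigMap that was just created:

```bash
# Verify that the ConfigMap exists and inspect its contents
kubectl get configmap agentic-rag-config -n agentic-rag -o yaml
```
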
3. Apply the manifests:

```bash
kubectl apply -n agentic-rag -f local-deployment/deployment.yaml
kubectl apply -n agentic-rag -f local-deployment/service.yaml
```

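Kubernetes will now pull the images and start the pod. You can wait for the rollout to complete with:

```bash
# Block until the deployment has finished rolling out (or fails)
kubectl rollout status deployment/agentic-rag -n agentic-rag
```
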
4. Monitor the deployment:

With the following commands, we can check the status of our pods:

```bash
kubectl get pods -n agentic-rag
```

And view the internal logs of the pod:

```bash
kubectl logs -f deployment/agentic-rag -n agentic-rag
```

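If a pod stays in `Pending` or ends up in `CrashLoopBackOff`, the pod description and the namespace events usually explain why (for example, insufficient memory or no suitable node available):

```bash
# Inspect pod details and recent cluster events for scheduling or image errors
kubectl describe pods -n agentic-rag
kubectl get events -n agentic-rag --sort-by=.lastTimestamp
```
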
5. Access the application:

Get the external IP address of the service:

```bash
kubectl get service agentic-rag -n agentic-rag
```

Access the application in your browser at `http://<EXTERNAL-IP>`.

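If the external IP still shows as `<pending>` (the OCI load balancer can take a few minutes to provision), you can reach the interface through a port-forward in the meantime; the service port `80` below is an assumption based on the URL above, so adjust it if your `service.yaml` differs:

```bash
# Forward local port 8080 to the service while the external IP is provisioning
kubectl port-forward -n agentic-rag service/agentic-rag 8080:80
# Then open http://localhost:8080 in your browser
```
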
### Shell Script Deployment

For a quick start, use the deployment script. Just open the script and set your `HF_TOKEN` in line 17 (if you're planning on using `mistral-7b`; leave it as-is if you're planning on using `ollama`):

```bash
# Make the script executable
chmod +x deploy.sh

# Deploy with a Hugging Face token
./deploy.sh --hf-token "your-huggingface-token" --namespace agentic-rag

# Or deploy without a Hugging Face token (Ollama models only)
./deploy.sh --namespace agentic-rag
```

## Resource Requirements

The deployment of this solution requires the following minimum resources:

- **CPU**: 4+ cores
- **Memory**: 16 GB+ RAM
- **Storage**: 50 GB+
- **GPU**: recommended for faster inference. In theory, you can use the CPU-quantized `mistral-7b` models, but performance will be sub-optimal (see the sketch after this list for requesting a GPU).
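
If you add a GPU node pool to your OKE cluster, a minimal sketch for requesting a GPU for the application container could look like this (the container name `agentic-rag` is an assumption, so check your `deployment.yaml` for the real name, and the NVIDIA device plugin must be available on the GPU nodes):

```bash
# Request one NVIDIA GPU for the application container
kubectl patch deployment agentic-rag -n agentic-rag --patch \
  '{"spec":{"template":{"spec":{"containers":[{"name":"agentic-rag","resources":{"limits":{"nvidia.com/gpu":"1"}}}]}}}}'
```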
