
Commit d3261e3

feat: added final version of kubernetes rag article and readme for k8s/ dir
1 parent 7c8a03f commit d3261e3

2 files changed: +178 -33 lines changed

agentic_rag/articles/kubernetes_rag.md

Lines changed: 26 additions & 33 deletions
@@ -62,36 +62,40 @@ This method is the easiest way to implement and deploy. We call it local because

### Distributed System Deployment

By decoupling the `ollama` LLM inference system into another pod, we could easily ready our system for **vertical scaling**: if we ever run out of resources or need a bigger model, we don't have to worry about the other solution components lacking resources for processing and logging. We can simply scale up the inference pod and connect it via FastAPI or a similar system, so that the Gradio interface can make calls to the model, following a distributed system architecture.

The advantages are:

- **Independent Scaling**: Each component can be scaled according to its specific resource needs.
- **Resource Optimization**: Dedicated resources for compute-intensive LLM inference, separate from the other components.
- **High Availability**: The system remains operational even if individual components fail, and we can keep multiple pods running failover LLMs to help with disaster recovery.
- **Flexible Model Deployment**: Easily swap or upgrade LLM models without affecting the rest of the system (also, with virtually zero downtime!).
- **Load Balancing**: Distribute inference requests across multiple LLM pods for better performance, allowing concurrent users in our Gradio interface.
- **Isolation**: Performance issues on the LLM side won't impact the interface.
- **Cost Efficiency**: Allocate expensive GPU resources only where needed (inference) while using cheaper CPU resources for the other components (e.g. GPU for Chain of Thought reasoning, while keeping a quantized CPU LLM for standard chatting).

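Since the distributed option is not implemented in this repository yet, the following is only a minimal sketch of what decoupling `ollama` into its own pod could look like; the names (`ollama-inference`), image tag, port, and resource values are assumptions, not part of the current manifests:

```bash
# Hypothetical manifests for a dedicated Ollama inference pod plus a Service
# in front of it. Everything below is illustrative, not shipped with this repo.
cat <<'EOF' | kubectl apply -n agentic-rag -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama-inference
  template:
    metadata:
      labels:
        app: ollama-inference
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434   # Ollama's default API port
          resources:
            limits:
              nvidia.com/gpu: 1      # keep the GPU pinned to inference only
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-inference
spec:
  selector:
    app: ollama-inference
  ports:
    - port: 11434
      targetPort: 11434
EOF
```

The Gradio/FastAPI pod would then reach the model at `http://ollama-inference:11434` through cluster DNS instead of `localhost`, and the inference deployment could be resized or replicated independently.
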
## Quick Start

For this solution, we have currently implemented the local system deployment, which is what we'll cover in this section.

First, we need to create a GPU OKE cluster with `zx` and Terraform. For this, you can follow the steps in [this repository](https://github.com/vmleon/oci-oke-gpu), or reuse your own Kubernetes cluster if you happen to already have one.

Then, we can start setting up the solution in our cluster by following these steps.

1. Clone the repository containing the Kubernetes manifests:

```bash
git clone https://github.com/oracle-devrel/devrel-labs.git
cd devrel-labs/agentic_rag/k8s
```

2. Create a namespace:

```bash
kubectl create namespace agentic-rag
```

3. Create a ConfigMap:

This step will help our deployment for several reasons:

@@ -134,16 +138,17 @@ kubectl create namespace agentic-rag
This approach makes our deployment more flexible, secure, and maintainable compared to hardcoding configuration values.
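
As a quick sanity check (assuming the ConfigMap manifest shipped in `local-deployment/configmap.yaml` is the one applied in this step), you can confirm it exists and inspect its values before moving on:

```bash
# Apply the ConfigMap from the local deployment manifests (if not done already)
kubectl apply -n agentic-rag -f local-deployment/configmap.yaml

# List the ConfigMaps in the namespace and inspect their keys/values
kubectl get configmaps -n agentic-rag
kubectl describe configmaps -n agentic-rag
```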

4. Apply the manifests:

```bash
kubectl apply -n agentic-rag -f local-deployment/pvcs.yaml
kubectl apply -n agentic-rag -f local-deployment/deployment.yaml
kubectl apply -n agentic-rag -f local-deployment/service.yaml
```

5. Monitor the Deployment

With the following commands, we can check the status of our pod:

```bash
kubectl get pods -n agentic-rag
```
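
Image pulls and model downloads can take a while on first start, so it can also help to wait on the rollout and tail the logs; a small sketch, assuming the deployment is named `agentic-rag` as in the provided manifests:

```bash
# Block until the deployment reports its pod as available
kubectl rollout status deployment/agentic-rag -n agentic-rag

# Follow the application logs while models are being downloaded
kubectl logs -f deployment/agentic-rag -n agentic-rag
```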
@@ -165,27 +170,15 @@ kubectl get service agentic-rag -n agentic-rag
Access the application in your browser at `http://<EXTERNAL-IP>`.
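
If the load balancer's external IP is still pending, or you just want to test from your workstation first, a port-forward to the service works as a stopgap (assuming the service exposes port 80, as the URL above suggests):

```bash
# Forward local port 8080 to the service, then browse to http://localhost:8080
kubectl port-forward -n agentic-rag service/agentic-rag 8080:80
```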
## Resource Requirements

The deployment of this solution requires the following minimum resources:

- **CPU**: 4+ cores
- **Memory**: 16GB+ RAM
- **Storage**: 50GB+
- **GPU**: recommended for faster inference. In theory, you can use `mistral-7b` CPU-quantized models, but it will be sub-optimal.

## Conclusion

You can check out the full AI solution and the deployment options we mention in this article in [the official GitHub repository](https://github.com/oracle-devrel/devrel-labs/tree/main/agentic_rag).

agentic_rag/k8s/README.md

Lines changed: 152 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,152 @@
1+
# Kubernetes Deployment for Agentic RAG
2+
3+
This directory contains Kubernetes manifests and guides for deploying the Agentic RAG system on Kubernetes.
4+
5+
## Deployment Options
6+
7+
We currently provide a single deployment option with plans for a distributed deployment in the future:
8+
9+
### Local Deployment
10+
11+
This is a single-pod deployment where all components run in the same pod. It's simpler to deploy and manage, making it ideal for testing and development.
12+
13+
**Features:**
14+
- Includes both Hugging Face models and Ollama for inference
15+
- Uses GPU acceleration for faster inference
16+
- Simpler deployment and management
17+
- Easier debugging (all logs in one place)
18+
- Lower complexity (no inter-service communication)
19+
- Quicker setup
20+
21+
**Model Options:**
22+
- **Hugging Face Models**: Uses `Mistral-7B` models from Hugging Face (requires a token)
23+
- **Ollama Models**: Uses `ollama` for inference (llama3, phi3, qwen2)
24+
25+
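For the Ollama option, models can also be pre-pulled instead of waiting for the first request to trigger a download; a hedged example, assuming the `ollama` CLI is available inside the pod of the `agentic-rag` deployment used elsewhere in these guides:

```bash
# Pull one of the supported models inside the running pod
kubectl exec -it deployment/agentic-rag -n agentic-rag -- ollama pull llama3
```
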
### Future: Distributed System Deployment

A distributed system deployment that separates the LLM inference system into its own service is planned for future releases. This will allow for better resource allocation and scaling in production environments.

**Advantages:**

- Independent scaling of components
- Better resource optimization
- Higher availability
- Flexible model deployment
- Load balancing capabilities

## Deployment Guides

We provide several guides for different environments:

1. [**General Kubernetes Guide**](README_k8s.md): Basic instructions for any Kubernetes cluster
2. [**Oracle Kubernetes Engine (OKE) Guide**](OKE_DEPLOYMENT.md): Detailed instructions for deploying on OCI
3. [**Minikube Guide**](MINIKUBE.md): Quick start guide for local testing with Minikube

## Directory Structure

```bash
k8s/
├── README_MAIN.md        # This file
├── README.md             # General Kubernetes guide
├── OKE_DEPLOYMENT.md     # Oracle Kubernetes Engine guide
├── MINIKUBE.md           # Minikube guide
├── deploy.sh             # Deployment script
└── local-deployment/     # Manifests for local deployment
    ├── configmap.yaml
    ├── deployment.yaml
    └── service.yaml
```

## Quick Start

For a quick start, use the deployment script. Just go into the script and replace your `HF_TOKEN` in line 17:

```bash
# Make the script executable
chmod +x deploy.sh

# Deploy with a Hugging Face token
./deploy.sh --hf-token "your-huggingface-token" --namespace agentic-rag

# Or deploy without a Hugging Face token (Ollama models only)
./deploy.sh --namespace agentic-rag

# Deploy without GPU support (not recommended for production)
./deploy.sh --cpu-only --namespace agentic-rag
```
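
Once the script finishes, a quick way to confirm what it created is to list the resources in the target namespace:

```bash
# Workloads and services created by the deployment script
kubectl get all -n agentic-rag

# Configuration objects created alongside them
kubectl get configmaps -n agentic-rag
```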

## Resource Requirements

The deployment requires the following minimum resources:

- **CPU**: 4+ cores
- **Memory**: 16GB+ RAM
- **Storage**: 50GB+
- **GPU**: 1 NVIDIA GPU (required for optimal performance; see the check below)

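To confirm that the cluster actually exposes a schedulable NVIDIA GPU before deploying, a check along these lines helps (it assumes the standard `nvidia.com/gpu` resource name advertised by the NVIDIA device plugin):

```bash
# Nodes with GPUs list nvidia.com/gpu under their Capacity and Allocatable sections
kubectl describe nodes | grep -i "nvidia.com/gpu"
```
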
## Next Steps

After deployment, you can:

1. **Add Documents**: Upload PDFs, process web content, or add repositories to the knowledge base
2. **Configure Models**: Download and configure different models
3. **Customize**: Adjust the system to your specific needs
4. **Scale**: For production use, consider implementing the distributed deployment with persistent storage (coming soon)

## Troubleshooting

See the specific deployment guides for troubleshooting tips. Common issues include:

- Insufficient resources
- Network connectivity problems
- Model download failures
- Configuration errors
- GPU driver issues

### GPU-Related Issues

If you encounter GPU-related issues:

1. **Check GPU availability**: Ensure your Kubernetes cluster has GPU nodes available
2. **Verify NVIDIA drivers**: Make sure NVIDIA drivers are installed on the nodes
3. **Check NVIDIA device plugin**: Ensure the NVIDIA device plugin is installed in your cluster (example commands below)
4. **Inspect pod logs**: Check for GPU-related errors in the pod logs

```bash
kubectl logs -f deployment/agentic-rag -n <namespace>
```

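As a concrete starting point for checks 1-3 above, commands along these lines usually narrow the problem down (the device plugin's namespace and daemonset name vary depending on how it was installed):

```bash
# Is an NVIDIA device plugin daemonset deployed anywhere in the cluster?
kubectl get daemonsets --all-namespaces | grep -i nvidia

# Are its pods scheduled and healthy on the GPU nodes?
kubectl get pods --all-namespaces -o wide | grep -i nvidia
```
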
## GPU Configuration Summary

The deployment has been configured to use GPU acceleration by default for optimal performance:

### Key GPU Configuration Changes

1. **Resource Requests and Limits**:
   - Each pod requests and is limited to 1 NVIDIA GPU (see the verification commands after this list)
   - Memory and CPU resources have been increased to better support GPU workloads

2. **NVIDIA Container Support**:
   - The deployment installs NVIDIA drivers and CUDA in the container
   - Environment variables are set to enable GPU visibility and capabilities

3. **Ollama GPU Configuration**:
   - Ollama is configured to use GPU acceleration automatically
   - Models like llama3, phi3, and qwen2 will benefit from GPU acceleration

4. **Deployment Script Enhancements**:
   - Added GPU availability detection
   - Added `--cpu-only` flag for environments without GPUs
   - Provides guidance for GPU monitoring and troubleshooting

5. **Documentation Updates**:
   - Added GPU-specific instructions for different Kubernetes environments
   - Included troubleshooting steps for GPU-related issues
   - Updated resource requirements to reflect GPU needs

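To verify points 1-3 on a live cluster, two hedged checks can help; the deployment name is taken from the manifests, and `nvidia-smi` will only be present if the driver setup described above succeeded:

```bash
# Show the resource requests/limits actually rendered into the deployment
kubectl get deployment agentic-rag -n agentic-rag \
  -o jsonpath='{.spec.template.spec.containers[0].resources}'

# Confirm the container can see the GPU
kubectl exec -it deployment/agentic-rag -n agentic-rag -- nvidia-smi
```
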
### CPU Fallback

While the deployment is optimized for GPU usage, a CPU-only mode is available using the `--cpu-only` flag with the deployment script. However, this is not recommended for production use as inference performance will be significantly slower.

## Contributing

Contributions to improve the deployment manifests and guides are welcome. Please submit a pull request or open an issue.
