
Commit d3261e3

feat: added final version of kubernetes rag article and readme for k8s/ dir
1 parent 7c8a03f commit d3261e3

2 files changed: +178 -33 lines changed

agentic_rag/articles/kubernetes_rag.md

Lines changed: 26 additions & 33 deletions
@@ -62,36 +62,40 @@ This method is the easiest way to implement and deploy. We call it local because

### Distributed System Deployment

By decoupling the `ollama` LLM inference system into another pod, we could easily ready our system for **vertical scaling**: if we ever run out of resources or need a bigger model, we don't have to worry about the other solution components lacking resources for processing and logging. We can simply scale up the inference pod and connect it via FastAPI or a similar system, so that the Gradio interface can make calls to the model, following a distributed system architecture.

The advantages are:

- **Independent Scaling**: Each component can be scaled according to its specific resource needs.
- **Resource Optimization**: Dedicated resources for compute-intensive LLM inference, separate from the other components.
- **High Availability**: The system remains operational even if individual components fail, and we can keep multiple pods running failover LLMs to help with disaster recovery.
- **Flexible Model Deployment**: Easily swap or upgrade LLM models without affecting the rest of the system (also, with virtually zero downtime!).
- **Load Balancing**: Distribute inference requests across multiple LLM pods for better performance, allowing concurrent users in our Gradio interface.
- **Isolation**: Performance issues on the LLM side won't impact the interface.
- **Cost Efficiency**: Allocate expensive GPU resources only where needed (inference) while using cheaper CPU resources for the other components (e.g. GPU for Chain of Thought reasoning, while keeping a quantized CPU LLM for standard chatting).

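Since the distributed option is not implemented in this repository yet, the following is only a minimal sketch of what decoupling `ollama` into its own pod could look like; the names (`ollama-inference`), image tag, port, and resource values are assumptions, not part of the current manifests:

```bash
# Hypothetical manifests for a dedicated Ollama inference pod plus a Service
# in front of it. Everything below is illustrative, not shipped with this repo.
cat <<'EOF' | kubectl apply -n agentic-rag -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama-inference
  template:
    metadata:
      labels:
        app: ollama-inference
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434   # Ollama's default API port
          resources:
            limits:
              nvidia.com/gpu: 1      # keep the GPU pinned to inference only
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-inference
spec:
  selector:
    app: ollama-inference
  ports:
    - port: 11434
      targetPort: 11434
EOF
```

The Gradio/FastAPI pod would then reach the model at `http://ollama-inference:11434` through cluster DNS instead of `localhost`, and the inference deployment could be resized or replicated independently.
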
## Quick Start

For this solution, we have currently implemented the local system deployment, which is what we'll cover in this section.

First, we need to create a GPU OKE cluster with `zx` and Terraform. For this, you can follow the steps in [this repository](https://github.com/vmleon/oci-oke-gpu), or reuse your own Kubernetes cluster if you happen to already have one.

Then, we can start setting up the solution in our cluster by following these steps.

1. Clone the repository containing the Kubernetes manifests:

```bash
git clone https://github.com/oracle-devrel/devrel-labs.git
cd devrel-labs/agentic_rag/k8s
```

2. Create a namespace:

```bash
kubectl create namespace agentic-rag
```

3. Create a ConfigMap:

This step will help our deployment for several reasons:

@@ -134,16 +138,17 @@ kubectl create namespace agentic-rag
This approach makes our deployment more flexible, secure, and maintainable compared to hardcoding configuration values.
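
As a quick sanity check (assuming the ConfigMap manifest shipped in `local-deployment/configmap.yaml` is the one applied in this step), you can confirm it exists and inspect its values before moving on:

```bash
# Apply the ConfigMap from the local deployment manifests (if not done already)
kubectl apply -n agentic-rag -f local-deployment/configmap.yaml

# List the ConfigMaps in the namespace and inspect their keys/values
kubectl get configmaps -n agentic-rag
kubectl describe configmaps -n agentic-rag
```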

4. Apply the manifests:

```bash
kubectl apply -n agentic-rag -f local-deployment/pvcs.yaml
kubectl apply -n agentic-rag -f local-deployment/deployment.yaml
kubectl apply -n agentic-rag -f local-deployment/service.yaml
```

5. Monitor the Deployment

With the following commands, we can check the status of our pod:

```bash
kubectl get pods -n agentic-rag
```
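
Image pulls and model downloads can take a while on first start, so it can also help to wait on the rollout and tail the logs; a small sketch, assuming the deployment is named `agentic-rag` as in the provided manifests:

```bash
# Block until the deployment reports its pod as available
kubectl rollout status deployment/agentic-rag -n agentic-rag

# Follow the application logs while models are being downloaded
kubectl logs -f deployment/agentic-rag -n agentic-rag
```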
@@ -165,27 +170,15 @@ kubectl get service agentic-rag -n agentic-rag
Access the application in your browser at `http://<EXTERNAL-IP>`.
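
If the load balancer's external IP is still pending, or you just want to test from your workstation first, a port-forward to the service works as a stopgap (assuming the service exposes port 80, as the URL above suggests):

```bash
# Forward local port 8080 to the service, then browse to http://localhost:8080
kubectl port-forward -n agentic-rag service/agentic-rag 8080:80
```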
## Resource Requirements

The deployment of this solution requires the following minimum resources:

- **CPU**: 4+ cores
- **Memory**: 16GB+ RAM
- **Storage**: 50GB+
- **GPU**: recommended for faster inference. In theory, you can use `mistral-7b` CPU-quantized models, but it will be sub-optimal.

## Conclusion

You can check out the full AI solution and the deployment options we mention in this article in [the official GitHub repository](https://github.com/oracle-devrel/devrel-labs/tree/main/agentic_rag).

agentic_rag/k8s/README.md

Lines changed: 152 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,152 @@
1+
# Kubernetes Deployment for Agentic RAG
2+
3+
This directory contains Kubernetes manifests and guides for deploying the Agentic RAG system on Kubernetes.
4+
5+
## Deployment Options
6+
7+
We currently provide a single deployment option with plans for a distributed deployment in the future:
8+
9+
### Local Deployment
10+
11+
This is a single-pod deployment where all components run in the same pod. It's simpler to deploy and manage, making it ideal for testing and development.
12+
13+
**Features:**
14+
- Includes both Hugging Face models and Ollama for inference
15+
- Uses GPU acceleration for faster inference
16+
- Simpler deployment and management
17+
- Easier debugging (all logs in one place)
18+
- Lower complexity (no inter-service communication)
19+
- Quicker setup
20+
21+
**Model Options:**
22+
- **Hugging Face Models**: Uses `Mistral-7B` models from Hugging Face (requires a token)
23+
- **Ollama Models**: Uses `ollama` for inference (llama3, phi3, qwen2)
24+
25+
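For the Ollama option, models can also be pre-pulled instead of waiting for the first request to trigger a download; a hedged example, assuming the `ollama` CLI is available inside the pod of the `agentic-rag` deployment used elsewhere in these guides:

```bash
# Pull one of the supported models inside the running pod
kubectl exec -it deployment/agentic-rag -n agentic-rag -- ollama pull llama3
```
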
### Future: Distributed System Deployment

A distributed system deployment that separates the LLM inference system into its own service is planned for future releases. This will allow for better resource allocation and scaling in production environments.

**Advantages:**

- Independent scaling of components
- Better resource optimization
- Higher availability
- Flexible model deployment
- Load balancing capabilities

## Deployment Guides

We provide several guides for different environments:

1. [**General Kubernetes Guide**](README_k8s.md): Basic instructions for any Kubernetes cluster
2. [**Oracle Kubernetes Engine (OKE) Guide**](OKE_DEPLOYMENT.md): Detailed instructions for deploying on OCI
3. [**Minikube Guide**](MINIKUBE.md): Quick start guide for local testing with Minikube

## Directory Structure

```bash
k8s/
├── README_MAIN.md        # This file
├── README.md             # General Kubernetes guide
├── OKE_DEPLOYMENT.md     # Oracle Kubernetes Engine guide
├── MINIKUBE.md           # Minikube guide
├── deploy.sh             # Deployment script
└── local-deployment/     # Manifests for local deployment
    ├── configmap.yaml
    ├── deployment.yaml
    └── service.yaml
```

## Quick Start

For a quick start, use the deployment script. Just go into the script and replace your `HF_TOKEN` in line 17:

```bash
# Make the script executable
chmod +x deploy.sh

# Deploy with a Hugging Face token
./deploy.sh --hf-token "your-huggingface-token" --namespace agentic-rag

# Or deploy without a Hugging Face token (Ollama models only)
./deploy.sh --namespace agentic-rag

# Deploy without GPU support (not recommended for production)
./deploy.sh --cpu-only --namespace agentic-rag
```
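
Once the script finishes, a quick way to confirm what it created is to list the resources in the target namespace:

```bash
# Workloads and services created by the deployment script
kubectl get all -n agentic-rag

# Configuration objects created alongside them
kubectl get configmaps -n agentic-rag
```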

## Resource Requirements

The deployment requires the following minimum resources:

- **CPU**: 4+ cores
- **Memory**: 16GB+ RAM
- **Storage**: 50GB+
- **GPU**: 1 NVIDIA GPU (required for optimal performance; see the check below)

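To confirm that the cluster actually exposes a schedulable NVIDIA GPU before deploying, a check along these lines helps (it assumes the standard `nvidia.com/gpu` resource name advertised by the NVIDIA device plugin):

```bash
# Nodes with GPUs list nvidia.com/gpu under their Capacity and Allocatable sections
kubectl describe nodes | grep -i "nvidia.com/gpu"
```
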
## Next Steps

After deployment, you can:

1. **Add Documents**: Upload PDFs, process web content, or add repositories to the knowledge base
2. **Configure Models**: Download and configure different models
3. **Customize**: Adjust the system to your specific needs
4. **Scale**: For production use, consider implementing the distributed deployment with persistent storage (coming soon)

## Troubleshooting

See the specific deployment guides for troubleshooting tips. Common issues include:

- Insufficient resources
- Network connectivity problems
- Model download failures
- Configuration errors
- GPU driver issues

### GPU-Related Issues

If you encounter GPU-related issues:

1. **Check GPU availability**: Ensure your Kubernetes cluster has GPU nodes available
2. **Verify NVIDIA drivers**: Make sure NVIDIA drivers are installed on the nodes
3. **Check NVIDIA device plugin**: Ensure the NVIDIA device plugin is installed in your cluster (example commands below)
4. **Inspect pod logs**: Check for GPU-related errors in the pod logs

```bash
kubectl logs -f deployment/agentic-rag -n <namespace>
```

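As a concrete starting point for checks 1-3 above, commands along these lines usually narrow the problem down (the device plugin's namespace and daemonset name vary depending on how it was installed):

```bash
# Is an NVIDIA device plugin daemonset deployed anywhere in the cluster?
kubectl get daemonsets --all-namespaces | grep -i nvidia

# Are its pods scheduled and healthy on the GPU nodes?
kubectl get pods --all-namespaces -o wide | grep -i nvidia
```
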
## GPU Configuration Summary

The deployment has been configured to use GPU acceleration by default for optimal performance:

### Key GPU Configuration Changes

1. **Resource Requests and Limits**:
   - Each pod requests and is limited to 1 NVIDIA GPU (see the verification commands after this list)
   - Memory and CPU resources have been increased to better support GPU workloads

2. **NVIDIA Container Support**:
   - The deployment installs NVIDIA drivers and CUDA in the container
   - Environment variables are set to enable GPU visibility and capabilities

3. **Ollama GPU Configuration**:
   - Ollama is configured to use GPU acceleration automatically
   - Models like llama3, phi3, and qwen2 will benefit from GPU acceleration

4. **Deployment Script Enhancements**:
   - Added GPU availability detection
   - Added `--cpu-only` flag for environments without GPUs
   - Provides guidance for GPU monitoring and troubleshooting

5. **Documentation Updates**:
   - Added GPU-specific instructions for different Kubernetes environments
   - Included troubleshooting steps for GPU-related issues
   - Updated resource requirements to reflect GPU needs

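To verify points 1-3 on a live cluster, two hedged checks can help; the deployment name is taken from the manifests, and `nvidia-smi` will only be present if the driver setup described above succeeded:

```bash
# Show the resource requests/limits actually rendered into the deployment
kubectl get deployment agentic-rag -n agentic-rag \
  -o jsonpath='{.spec.template.spec.containers[0].resources}'

# Confirm the container can see the GPU
kubectl exec -it deployment/agentic-rag -n agentic-rag -- nvidia-smi
```
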
### CPU Fallback

While the deployment is optimized for GPU usage, a CPU-only mode is available using the `--cpu-only` flag with the deployment script. However, this is not recommended for production use as inference performance will be significantly slower.

## Contributing

Contributions to improve the deployment manifests and guides are welcome. Please submit a pull request or open an issue.
