# Kubernetes Deployment for Agentic RAG

This directory contains Kubernetes manifests and guides for deploying the Agentic RAG system on Kubernetes.

## Deployment Options

We currently provide a single deployment option, with a distributed deployment planned for the future:

### Local Deployment

This is a single-pod deployment where all components run in the same pod. It is simpler to deploy and manage, making it ideal for testing and development.

**Features:**
- Includes both Hugging Face models and Ollama for inference
- Uses GPU acceleration for faster inference
- Simpler deployment and management
- Easier debugging (all logs in one place)
- Lower complexity (no inter-service communication)
- Quicker setup

**Model Options:**
- **Hugging Face Models**: Uses `Mistral-7B` models from Hugging Face (requires a token)
- **Ollama Models**: Uses `ollama` for inference (llama3, phi3, qwen2)
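Additional Ollama models can be pulled into the running pod after deployment. A minimal sketch, assuming the deployment and namespace are both named `agentic-rag` (adjust to your setup):

```shell
# Assumed names -- match them to your actual deployment and namespace
NAMESPACE=agentic-rag
MODEL=llama3

if command -v kubectl >/dev/null 2>&1; then
  # Run "ollama pull" inside the pod managed by the deployment
  kubectl exec -n "$NAMESPACE" deploy/agentic-rag -- ollama pull "$MODEL"
else
  echo "kubectl not found; run this against your cluster"
fi
```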

### Future: Distributed System Deployment

A distributed system deployment that separates the LLM inference system into its own service is planned for future releases. This will allow for better resource allocation and scaling in production environments.

**Advantages:**
- Independent scaling of components
- Better resource optimization
- Higher availability
- Flexible model deployment
- Load balancing capabilities

## Deployment Guides

We provide several guides for different environments:

1. [**General Kubernetes Guide**](README_k8s.md): Basic instructions for any Kubernetes cluster
2. [**Oracle Kubernetes Engine (OKE) Guide**](OKE_DEPLOYMENT.md): Detailed instructions for deploying on OCI
3. [**Minikube Guide**](MINIKUBE.md): Quick start guide for local testing with Minikube

## Directory Structure

```bash
k8s/
├── README_MAIN.md       # This file
├── README.md            # General Kubernetes guide
├── OKE_DEPLOYMENT.md    # Oracle Kubernetes Engine guide
├── MINIKUBE.md          # Minikube guide
├── deploy.sh            # Deployment script
└── local-deployment/    # Manifests for local deployment
    ├── configmap.yaml
    ├── deployment.yaml
    └── service.yaml
```

## Quick Start

For a quick start, use the deployment script. Either pass your Hugging Face token with the `--hf-token` flag, or edit `HF_TOKEN` on line 17 of the script:

```bash
# Make the script executable
chmod +x deploy.sh

# Deploy with a Hugging Face token
./deploy.sh --hf-token "your-huggingface-token" --namespace agentic-rag

# Or deploy without a Hugging Face token (Ollama models only)
./deploy.sh --namespace agentic-rag

# Deploy without GPU support (not recommended for production)
./deploy.sh --cpu-only --namespace agentic-rag
```
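
Once the script finishes, you can check that everything came up. A sketch, assuming the deployment is named `agentic-rag` (verify against `deployment.yaml`):

```shell
# Assumed namespace and deployment name -- adjust to what you deployed
NAMESPACE=agentic-rag

if command -v kubectl >/dev/null 2>&1; then
  # Wait for the pod to become ready (image pull and model downloads can take a while)
  kubectl rollout status deploy/agentic-rag -n "$NAMESPACE" --timeout=10m
  # List the pod and the service created by service.yaml
  kubectl get pods,svc -n "$NAMESPACE"
else
  echo "kubectl not found; run this against your cluster"
fi
```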

## Resource Requirements

The deployment requires the following minimum resources:

- **CPU**: 4+ cores
- **Memory**: 16GB+ RAM
- **Storage**: 50GB+
- **GPU**: 1 NVIDIA GPU (strongly recommended; required for acceptable inference performance)
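
You can quickly check whether your nodes can satisfy these requirements. A sketch using `kubectl` custom columns (the `nvidia.com/gpu` column is only populated when the NVIDIA device plugin is installed):

```shell
# Allocatable CPU, memory, and GPU per node; dots in the GPU resource
# name must be escaped in the custom-columns expression
COLS='NAME:.metadata.name,CPU:.status.allocatable.cpu,MEM:.status.allocatable.memory,GPU:.status.allocatable.nvidia\.com/gpu'

if command -v kubectl >/dev/null 2>&1; then
  kubectl get nodes -o custom-columns="$COLS"
else
  echo "kubectl not found; run this against your cluster"
fi
```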

## Next Steps

After deployment, you can:

1. **Add Documents**: Upload PDFs, process web content, or add repositories to the knowledge base
2. **Configure Models**: Download and configure different models
3. **Customize**: Adjust the system to your specific needs
4. **Scale**: For production use, consider implementing the distributed deployment with persistent storage (coming soon)

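To reach the application from your workstation, you can port-forward the service. The service name and ports below are assumptions; check `service.yaml` for the real values:

```shell
# Assumed names and ports -- verify against service.yaml
NAMESPACE=agentic-rag
SERVICE=agentic-rag
LOCAL_PORT=8000

if command -v kubectl >/dev/null 2>&1; then
  # Forwards localhost:8000 to port 80 of the service
  kubectl port-forward "svc/$SERVICE" "$LOCAL_PORT:80" -n "$NAMESPACE"
else
  echo "kubectl not found; run this against your cluster"
fi
```
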
## Troubleshooting

See the specific deployment guides for troubleshooting tips. Common issues include:

- Insufficient resources
- Network connectivity problems
- Model download failures
- Configuration errors
- GPU driver issues

### GPU-Related Issues

If you encounter GPU-related issues:

1. **Check GPU availability**: Ensure your Kubernetes cluster has GPU nodes available
2. **Verify NVIDIA drivers**: Make sure NVIDIA drivers are installed on the nodes
3. **Check NVIDIA device plugin**: Ensure the NVIDIA device plugin is installed in your cluster
4. **Inspect pod logs**: Check for GPU-related errors in the pod logs

```bash
kubectl logs -f deployment/agentic-rag -n <namespace>
```
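
Steps 1-3 above can be checked from the command line. A sketch, assuming the device plugin follows the standard NVIDIA convention of running as a DaemonSet in `kube-system` (clusters using the GPU Operator differ):

```shell
# Assumed namespace for the NVIDIA device plugin DaemonSet
PLUGIN_NS=kube-system

if command -v kubectl >/dev/null 2>&1; then
  # A node only advertises nvidia.com/gpu when drivers and the plugin work
  kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
  # Look for the device plugin DaemonSet
  kubectl get daemonset -n "$PLUGIN_NS" | grep -i nvidia
else
  echo "kubectl not found; run this against your cluster"
fi
```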

## GPU Configuration Summary

The deployment is configured to use GPU acceleration by default for optimal performance:

### Key GPU Configuration Changes

1. **Resource Requests and Limits**:
   - Each pod requests and is limited to 1 NVIDIA GPU
   - Memory and CPU resources have been increased to better support GPU workloads

2. **NVIDIA Container Support**:
   - The deployment installs NVIDIA drivers and CUDA in the container
   - Environment variables are set to enable GPU visibility and capabilities

3. **Ollama GPU Configuration**:
   - Ollama is configured to use GPU acceleration automatically
   - Models like llama3, phi3, and qwen2 benefit from GPU acceleration

4. **Deployment Script Enhancements**:
   - Added GPU availability detection
   - Added a `--cpu-only` flag for environments without GPUs
   - Provides guidance for GPU monitoring and troubleshooting

5. **Documentation Updates**:
   - Added GPU-specific instructions for different Kubernetes environments
   - Included troubleshooting steps for GPU-related issues
   - Updated resource requirements to reflect GPU needs
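
In a pod spec, the GPU request/limit and visibility settings from points 1 and 2 look roughly like the fragment below. This is an illustrative sketch, not a copy of `local-deployment/deployment.yaml`:

```yaml
# Illustrative container fragment -- see local-deployment/deployment.yaml for the real manifest
resources:
  requests:
    nvidia.com/gpu: 1
  limits:
    nvidia.com/gpu: 1
env:
  - name: NVIDIA_VISIBLE_DEVICES
    value: "all"
  - name: NVIDIA_DRIVER_CAPABILITIES
    value: "compute,utility"
```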

### CPU Fallback

While the deployment is optimized for GPU usage, a CPU-only mode is available via the `--cpu-only` flag of the deployment script. This is not recommended for production use, as inference performance will be significantly slower.

## Contributing

Contributions to improve the deployment manifests and guides are welcome. Please submit a pull request or open an issue.