# Inference Gateway

#### Kong-powered API gateway for AI model inference routing and management

The Inference Gateway is a dedicated Kong-based API gateway that provides unified access, routing, and management capabilities for AI model inference endpoints deployed on the OCI AI Blueprints platform. This gateway serves as a centralized entry point for all inference requests, enabling advanced traffic management, load balancing, and API governance for your AI workloads.

## Pre-Filled Samples

| Feature Showcase | Title | Description | Blueprint File |
| ---------------- | ----- | ----------- | -------------- |
| Validates the inference gateway feature set: extended URL path, model-header-based routing, and a per-model API key | Serve OpenAI gpt-oss-120b on H100 GPUs behind the inference gateway | Serve the gpt-oss-120b model behind the inference gateway on 2 NVIDIA H100 GPUs | [example_vllm_gpt_oss_120b.json](./example_vllm_gpt_oss_120b.json) |
| Deployed alongside the Llama-4 Maverick model behind the inference gateway on the same MI300x node to validate header-based routing and unique API keys per model | Serve Llama4-Scout on MI300x GPUs behind the inference gateway | Serve Llama-4-Scout-17B-16E-Instruct behind the inference gateway on 4 AMD MI300x GPUs with an extended URL path, header-based routing, and a model API key | [example_vllm_llama4_scout.json](./example_vllm_llama4_scout.json) |
| Deployed alongside the Llama-4 Scout model behind the inference gateway on the same MI300x node to validate header-based routing and unique API keys per model | Serve Llama4-Maverick on MI300x GPUs behind the inference gateway | Serve Llama-4-Maverick-17B-128E-Instruct-FP8 behind the inference gateway on 4 AMD MI300x GPUs with an extended URL path, header-based routing, and a model API key | [example_vllm_llama4_maverick.json](./example_vllm_llama4_maverick.json) |

# What is the Inference Gateway?

The Inference Gateway leverages Kong Gateway to provide a robust, scalable API management layer specifically designed for AI model inference. It acts as a reverse proxy that sits between client applications and your deployed AI models, offering features like:

- **Unified API Endpoint**: Single point of access for all your deployed AI models
- **Load Balancing**: Intelligent request distribution across multiple model instances
- **Traffic Management**: Rate limiting, request routing, and performance optimization
- **Security**: Authentication, authorization, and API key management
- **Monitoring**: Request logging, metrics collection, and observability
- **Protocol Translation**: Support for various API protocols and formats

## Key Features

### Kong Gateway Integration
- **Version**: Kong Gateway 3.9 with Helm chart version 2.51.0
- **Database-less Mode**: Operates in DB-less mode for simplified deployment and management
- **Kubernetes Native**: Full integration with Kubernetes using Kong Ingress Controller
- **Auto-scaling**: Configured with horizontal pod autoscaling (2-3 replicas, 70% CPU threshold)
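
If you want to verify these settings on a running cluster, the commands below are a quick sketch; the `kong` namespace and the `kong-kong` deployment name are assumptions based on the default installation described in this document.

```bash
# Confirm the horizontal pod autoscaler (2-3 replicas, 70% CPU target) is in place.
kubectl get hpa -n kong

# Print the Kong Gateway version running inside a proxy pod.
kubectl exec -n kong deployment/kong-kong -- kong version
```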

### Network Configuration
- **Load Balancer**: OCI flexible load balancer with configurable shapes (10-100 Mbps)
- **Protocol Support**: Both HTTP (port 80) and HTTPS (port 443) endpoints
- **Private Network Support**: Optional private load balancer for secure internal access
- **External Access**: Automatic nip.io domain generation for easy external access
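
The generated nip.io hostname is plain wildcard DNS: `<dashed-ip>.nip.io` resolves back to the IP embedded in the name, so no DNS records need to be created. A quick way to see this (the IP below is only illustrative):

```bash
# 140-10-23-76.nip.io resolves straight back to 140.10.23.76, the load balancer IP.
nslookup 140-10-23-76.nip.io
```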

### Resource Management
- **CPU**: 500m requests, 1000m limits per pod
- **Memory**: 512Mi requests, 1Gi limits per pod
- **High Availability**: Multiple replicas with automatic failover
- **Performance Monitoring**: Built-in status endpoints for health checks
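
A short way to check these requests/limits and the live utilization of the Kong pods; the namespace is an assumption from the default install, and `kubectl top` requires a metrics server in the cluster.

```bash
# Show the CPU/memory requests and limits set on each Kong pod.
kubectl get pods -n kong \
  -o custom-columns='NAME:.metadata.name,REQUESTS:.spec.containers[0].resources.requests,LIMITS:.spec.containers[0].resources.limits'

# Live usage, useful for judging when the autoscaler will add a replica.
kubectl top pods -n kong
```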

## Deployment Options

### Automatic Deployment (Default)
When deploying OCI AI Blueprints, Kong is automatically installed and configured unless explicitly disabled:

```terraform
# Kong is deployed by default
bring_your_own_kong = false # Default value
```

The system will:
1. Deploy Kong Gateway in the `kong` namespace
2. Configure OCI Load Balancer with flexible shape
3. Set up automatic SSL/TLS termination
4. Generate a publicly accessible URL: `https://<kong-ip>.nip.io`
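
As a sketch of how that URL is derived (and one way to recover it yourself), the hostname is just the load balancer IP with dots replaced by dashes plus the `.nip.io` suffix; the `kong-kong-proxy` Service name is an assumption based on the default Helm release name.

```bash
# Read the external IP assigned to the Kong proxy Service and rebuild the URL.
KONG_IP=$(kubectl get svc kong-kong-proxy -n kong \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo "https://${KONG_IP//./-}.nip.io"
```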

### Bring Your Own Kong (BYOK)
For existing clusters with Kong already installed:

```terraform
# Use existing Kong installation
bring_your_own_kong = true
existent_kong_namespace = "your-kong-namespace"
```

When using BYOK:
- The platform will not deploy a new Kong instance
- You must configure your existing Kong to route to deployed AI models (a sketch of one approach follows below)
- The inference gateway URL will show as disabled in outputs
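
The snippet below is a minimal sketch of that routing step, assuming the Kong Ingress Controller is handling the `kong` ingress class; the route name, namespace, Service name, and port are hypothetical placeholders you would replace with the values from your own model deployment.

```bash
# Route /ai/models on an existing Kong installation to a deployed model Service.
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: gpt-oss-120b-route           # hypothetical route name
  namespace: default                 # namespace of the model Service
  annotations:
    konghq.com/strip-path: "true"    # strip /ai/models before proxying upstream
spec:
  ingressClassName: kong
  rules:
    - http:
        paths:
          - path: /ai/models
            pathType: Prefix
            backend:
              service:
                name: gpt-oss-120b-service   # hypothetical model Service name
                port:
                  number: 8000               # hypothetical model Service port
EOF
```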

## Configuration Details

### Service Configuration
The Kong proxy service is configured with:

```yaml
proxy:
  type: LoadBalancer
  annotations:
    service.beta.kubernetes.io/oci-load-balancer-shape: "flexible"
    service.beta.kubernetes.io/oci-load-balancer-shape-flex-min: "10"
    service.beta.kubernetes.io/oci-load-balancer-shape-flex-max: "100"
  http:
    servicePort: 80
    containerPort: 8000
  tls:
    servicePort: 443
    containerPort: 8443
```
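
To confirm the rendered Service actually carries these annotations on a running cluster (the `kong-kong-proxy` Service name is an assumption based on the default release name):

```bash
# Spot-check one of the OCI flexible-shape annotations on the proxy Service.
kubectl get svc kong-kong-proxy -n kong \
  -o jsonpath='{.metadata.annotations.service\.beta\.kubernetes\.io/oci-load-balancer-shape}{"\n"}'
```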

### Admin Interface
Kong's admin interface is available internally for configuration management:
- **HTTP Admin**: Port 8001 (ClusterIP)
- **HTTPS Admin**: Port 8442 (ClusterIP)
- **Status Endpoint**: Ports 8100/8101 for health monitoring
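
Since these are ClusterIP-only, one way to reach the admin API from a workstation is a port-forward. The `kong-kong-admin` Service name is an assumption (run `kubectl get svc -n kong` to confirm the actual name), and in DB-less mode the admin API is effectively read-only.

```bash
# Forward the admin port locally, then query the declaratively configured routes.
kubectl port-forward -n kong svc/kong-kong-admin 8001:8001 &
curl -s http://localhost:8001/routes
```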

### Private Network Deployment
For private clusters, the load balancer is automatically configured as internal:

```yaml
proxy:
  annotations:
    service.beta.kubernetes.io/oci-load-balancer-internal: "true"
```

## Usage Examples

### Blueprints API Spec

To deploy your model behind the unified inference URL with Blueprints, the following API specification is required:

- `recipe_inference_gateway` (object) - the key required to encapsulate inference gateway features.
  - `model_name` (string) **required** - Model name used as the header value that identifies the model behind the gateway. MUST BE UNIQUE per route; routing will fail if two models behind the same route share the same model name.
    - Example: `"model_name": "gpt-oss-120b"`
    - Usage: `curl -X POST <gateway url> -H "X-Model: gpt-oss-120b" ...`
  - `url_path` (string) **optional** - additional URL path appended to the gateway URL for this serving deployment.
    - Example: `"url_path": "/ai/models"`
    - Usage: `curl -X POST <gateway url>/ai/models ...`
  - `api_key` (string) **optional** - API key to use for this model in the inference gateway. MUST BE UNIQUE; keys cannot be reused across models.
    - Example: `"api_key": "123abc456ABC"`
    - Usage: `curl -X POST <gateway url> -H "apikey: 123abc456ABC" ...`

**Minimum Requirement**:

```json
  ...
  "recipe_inference_gateway": {
    "model_name": "gpt-oss-120b"
  }
  ...
```

**All options enabled**:

```json
  ...
  "recipe_inference_gateway": {
    "model_name": "gpt-oss-120b",
    "url_path": "/ai/models",
    "api_key": "123abc456ABC"
  }
  ...
```

## Example Blueprints Table

| Blueprint Name | Model | Shape | Description |
| :------------: | :---: | :---: | :---------: |
| [vLLM OpenAI/gpt-oss-120b H100](./example_vllm_gpt_oss_120b.json) | [openai/gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) | BM.GPU.H100.8 | Serves the open-source OpenAI gpt-oss-120b model with vLLM on an NVIDIA H100 bare metal host using 2 GPUs, behind the inference gateway |
| [vLLM meta-llama/Llama-4-Scout-17B-16E-Instruct MI300x](./example_vllm_llama4_scout.json) | [meta-llama/Llama-4-Scout-17B-16E-Instruct](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct) | BM.GPU.MI300X.8 | Serves the open-source Llama4-Scout model with vLLM on an AMD MI300x bare metal host using 4 GPUs, behind the inference gateway, with the model stored on local NVMe |
| [vLLM meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 MI300x](./example_vllm_llama4_maverick.json) | [meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8](https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8) | BM.GPU.MI300X.8 | Serves the open-source Llama4-Maverick-FP8 model with vLLM on an AMD MI300x bare metal host using 4 GPUs, behind the inference gateway, with the model stored on local NVMe |

### Accessing Deployed Models
Once the Inference Gateway is deployed, you can access your AI models through the unified endpoint. For example, if all three blueprints above were deployed and your endpoint was https://140-10-23-76.nip.io, you could access all three models like this:

```bash
curl -X POST https://140-10-23-76.nip.io/ai/models \
  -H "Content-Type: application/json" \
  -H "X-Model: gpt-oss-120b" \
  -H "apikey: <api-key-from-gpt-blueprint>" \
  -d '{"model": "openai/gpt-oss-120b", "messages": [{"role": "user", "content": "What is Kong Gateway?"}], "max_tokens": 200}'

curl -X POST https://140-10-23-76.nip.io/ai/models \
  -H "Content-Type: application/json" \
  -H "X-Model: scout" \
  -H "apikey: <api-key-from-scout-blueprint>" \
  -d '{"model": "Llama-4-Scout-17B-16E-Instruct", "messages": [{"role": "user", "content": "What is Kong Gateway?"}], "max_tokens": 200}'

curl -X POST https://140-10-23-76.nip.io/ai/models \
  -H "Content-Type: application/json" \
  -H "X-Model: maverick" \
  -H "apikey: <api-key-from-maverick-blueprint>" \
  -d '{"model": "Llama-4-Maverick-17B-128E-Instruct-FP8", "messages": [{"role": "user", "content": "What is Kong Gateway?"}], "max_tokens": 200}'
```
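
Because vLLM serves an OpenAI-compatible API, the responses to the requests above should follow the chat completions schema, so the reply text can be extracted directly. This sketch assumes `jq` is installed and reuses the illustrative endpoint and headers from the first example:

```bash
# Print only the assistant's reply from the gpt-oss-120b deployment.
curl -s -X POST https://140-10-23-76.nip.io/ai/models \
  -H "Content-Type: application/json" \
  -H "X-Model: gpt-oss-120b" \
  -H "apikey: <api-key-from-gpt-blueprint>" \
  -d '{"model": "openai/gpt-oss-120b", "messages": [{"role": "user", "content": "What is Kong Gateway?"}], "max_tokens": 200}' \
  | jq -r '.choices[0].message.content'
```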

### Health Check
Monitor gateway health using the status endpoint (you may need to open a network security rule for this port if you want external access):

```bash
curl https://<kong-ip>.nip.io:8100/status
```
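
If you would rather not expose the status port publicly, a port-forward to the Kong deployment reaches the same endpoint from inside the cluster; the `kong-kong` deployment name is an assumption based on the default release name.

```bash
# Check gateway health without opening a public port for 8100.
kubectl port-forward -n kong deployment/kong-kong 8100:8100 &
curl -s http://localhost:8100/status
```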

## Security Considerations

### Network Security
- Load balancer security groups restrict access to necessary ports
- Private deployment option for internal-only access
- TLS termination at the load balancer level

### API Security
Kong provides extensive security features that can be configured:
- API key authentication
- Rate limiting and throttling (not currently implemented; open an issue if desired)
- Single unified URL for all model endpoints

## Troubleshooting

### Common Issues

**Gateway URL shows as null in outputs**
- Verify `bring_your_own_kong` is set to `false`
- Check that the Kong pods are running in the `kong` namespace (visible in the Blueprints portal)
- Ensure the load balancer has been assigned an external IP

**Unable to access inference endpoints**
- Verify security group rules allow traffic on ports 80/443
- Check that the target AI model services are running and healthy
- Confirm Kong ingress rules are properly configured

**Performance issues**
- Monitor resource utilization of the Kong pods
- Consider scaling up the load balancer shape
- Review the auto-scaling configuration

### Debug Commands (Kubernetes)

```bash
# Check Kong deployment status
kubectl get pods -n kong

# View Kong service details
kubectl get svc kong-kong-proxy -n kong

# Check load balancer assignment
kubectl describe svc kong-kong-proxy -n kong

# View Kong logs
kubectl logs -n kong deployment/kong-kong
```

## Version Information

- **Kong Gateway**: 3.9
- **Helm Chart**: 2.51.0
- **Repository**: https://charts.konghq.com
- **OCI Integration**: Native OCI Load Balancer support

## Next Steps

After deploying the Inference Gateway:

1. **Configure Routes**: Set up Kong ingress resources for your AI models
2. **Implement Security**: Configure authentication and rate limiting policies
3. **Monitor Performance**: Set up alerting and monitoring dashboards
4. **Scale Resources**: Adjust Kong replicas and load balancer shapes based on traffic

The Inference Gateway provides a production-ready foundation for managing AI model inference at scale, offering the flexibility and reliability needed for enterprise AI deployments on Oracle Cloud Infrastructure.