
Commit d7179de

Inference Gateway Feature Addition (#107)
* FEAT: Inference Gateway integration with Kong
  - Utilizes Kong to allow routing via headers over a single URL for all APIs
  - Enables API keys for API endpoints for security
  - Provides a separate load balancer for APIs and the control plane
* Added thorough documentation and software versioning
1 parent 0bf705e commit d7179de


15 files changed, +731 -2 lines changed


docs/api_documentation.md

Lines changed: 2 additions & 0 deletions
@@ -42,6 +42,8 @@ For autoscaling parameters, visit [autoscaling](sample_blueprints/model_serving/
For multinode inference parameters, visit [multinode inference](sample_blueprints/model_serving/multi-node-inference/README.md)

For inference gateway parameters, visit [inference gateway](./sample_blueprints/platform_features/inference_gateway/README.md#blueprints-api-spec)

For MIG parameters, visit [MIG shared pool configurations](sample_blueprints/model_serving/mig_multi_instance_gpu/mig_inference_single_replica.json), [update MIG configuration](sample_blueprints/model_serving/mig_multi_instance_gpu/mig_inference_single_replica.json), and [MIG recipe configuration](sample_blueprints/model_serving/mig_multi_instance_gpu/mig_inference_single_replica.json).

### Blueprint Container Arguments

docs/custom_blueprints/blueprint_json_schema.json

Lines changed: 36 additions & 0 deletions
@@ -115,6 +115,42 @@
        ]
      }
    },
    "recipe_inference_gateway": {
      "type": "object",
      "additionalProperties": false,
      "required": ["model_name"],
      "properties": {
        "model_name": {
          "type": "string",
          "description": "Name of model to identify it as a header in the inference gateway",
          "examples": ["Llama-4-Maverick"]
        },
        "url_path": {
          "type": "string",
          "description": "Additional path to add to ingress for inference gateway",
          "examples": ["/ai/models"]
        },
        "api_key": {
          "type": "string",
          "description": "API key to use for this model in the inference gateway",
          "examples": ["1234567890"]
        }
      },
      "examples": [
        {
          "model_name": "Llama-4-Maverick"
        },
        {
          "model_name": "Llama-4-Maverick",
          "url_path": "/models"
        },
        {
          "model_name": "Llama-4-Maverick",
          "url_path": "/ai/models",
          "api_key": "1234567890"
        }
      ]
    },
    "recipe_container_command": {
      "type": "array",
      "items": {
Lines changed: 252 additions & 0 deletions
@@ -0,0 +1,252 @@
# Inference Gateway

#### Kong-powered API gateway for AI model inference routing and management

The Inference Gateway is a dedicated Kong-based API gateway that provides unified access, routing, and management capabilities for AI model inference endpoints deployed on the OCI AI Blueprints platform. This gateway serves as a centralized entry point for all inference requests, enabling advanced traffic management, load balancing, and API governance for your AI workloads.

## Pre-Filled Samples

| Feature Showcase | Title | Description | Blueprint File |
| ---------------- | ----- | ----------- | -------------- |
| Validates the inference gateway feature set: extended URL path, model-header-based routing, and a per-model API key | Serve OpenAI gpt-oss-120b on H100 GPUs behind inference gateway | Serve the gpt-oss-120b model behind the inference gateway on 2 NVIDIA H100 GPUs | [example_vllm_gpt_oss_120b.json](./example_vllm_gpt_oss_120b.json) |
| Serves alongside the Llama-4 Maverick model behind the inference gateway on the same MI300x to validate header-based routing and unique API keys per model | Serve Llama4-Scout on MI300x GPUs behind inference gateway | Serve Llama-4-Scout-17B-16E-Instruct behind the inference gateway on 4 AMD MI300x GPUs with an extended URL, header-based routing, and a model API key | [example_vllm_llama4_scout.json](./example_vllm_llama4_scout.json) |
| Serves alongside the Llama-4 Scout model behind the inference gateway on the same MI300x to validate header-based routing and unique API keys per model | Serve Llama4-Maverick on MI300x GPUs behind inference gateway | Serve Llama-4-Maverick-17B-128E-Instruct-FP8 behind the inference gateway on 4 AMD MI300x GPUs with an extended URL, header-based routing, and a model API key | [example_vllm_llama4_maverick.json](./example_vllm_llama4_maverick.json) |

# What is the Inference Gateway?

The Inference Gateway leverages Kong Gateway to provide a robust, scalable API management layer specifically designed for AI model inference. It acts as a reverse proxy that sits between client applications and your deployed AI models, offering features like:

- **Unified API Endpoint**: Single point of access for all your deployed AI models
- **Load Balancing**: Intelligent request distribution across multiple model instances
- **Traffic Management**: Rate limiting, request routing, and performance optimization
- **Security**: Authentication, authorization, and API key management
- **Monitoring**: Request logging, metrics collection, and observability
- **Protocol Translation**: Support for various API protocols and formats

## Key Features

### Kong Gateway Integration
- **Version**: Kong Gateway 3.9 with Helm chart version 2.51.0
- **Database-less Mode**: Operates in DB-less mode for simplified deployment and management
- **Kubernetes Native**: Full integration with Kubernetes using Kong Ingress Controller
- **Auto-scaling**: Configured with horizontal pod autoscaling (2-3 replicas, 70% CPU threshold); see the sketch at the end of this section

### Network Configuration
- **Load Balancer**: OCI flexible load balancer with configurable shapes (10-100 Mbps)
- **Protocol Support**: Both HTTP (port 80) and HTTPS (port 443) endpoints
- **Private Network Support**: Optional private load balancer for secure internal access
- **External Access**: Automatic nip.io domain generation for easy external access

### Resource Management
- **CPU**: 500m requests, 1000m limits per pod
- **Memory**: 512Mi requests, 1Gi limits per pod
- **High Availability**: Multiple replicas with automatic failover
- **Performance Monitoring**: Built-in status endpoints for health checks

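To confirm the replica count and autoscaling settings above on a running cluster, the standard Kubernetes views are enough. A minimal sketch, assuming the default `kong` namespace used by this deployment and that the Helm chart created an HPA object there:

```bash
# List the current Kong proxy pods (replica count should match the autoscaler range)
kubectl get pods -n kong

# Show the horizontal pod autoscaler for the Kong deployment (min/max replicas, CPU target)
kubectl get hpa -n kong
```
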
## Deployment Options

### Automatic Deployment (Default)

When deploying OCI AI Blueprints, Kong is automatically installed and configured unless explicitly disabled:

```terraform
# Kong is deployed by default
bring_your_own_kong = false # Default value
```

The system will:

1. Deploy Kong Gateway in the `kong` namespace
2. Configure an OCI Load Balancer with a flexible shape
3. Set up automatic SSL/TLS termination
4. Generate a publicly accessible URL: `https://<kong-ip>.nip.io` (see the sketch below for recovering it from the cluster)

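If you need that URL after the initial deployment, the external IP assigned to the Kong proxy service can be read straight from the cluster. A minimal sketch, assuming the `kong` namespace and the `kong-kong-proxy` service name used elsewhere in this document, and that the generated hostname uses the dash-separated IP form shown in the examples below:

```bash
# Read the external IP assigned to the Kong proxy load balancer
KONG_IP=$(kubectl get svc kong-kong-proxy -n kong \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

# Compose the nip.io URL the platform generates from that IP (dots become dashes)
echo "https://${KONG_IP//./-}.nip.io"
```
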
### Bring Your Own Kong (BYOK)

For existing clusters with Kong already installed:

```terraform
# Use existing Kong installation
bring_your_own_kong = true
existent_kong_namespace = "your-kong-namespace"
```

When using BYOK:

- The platform will not deploy a new Kong instance
- You must configure your existing Kong to route to deployed AI models (a quick verification sketch follows this list)
- The inference gateway URL will show as disabled in outputs

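Before pointing blueprints at an existing installation, it helps to confirm that Kong is actually running and exposed in the namespace you pass in. A minimal sketch, assuming your Kong release lives in `your-kong-namespace`; the release and service names on your cluster may differ:

```bash
# Confirm the Kong pods and proxy service exist in your namespace
kubectl get pods -n your-kong-namespace
kubectl get svc -n your-kong-namespace

# If Kong was installed with Helm, list the release to see its chart and app version
helm list -n your-kong-namespace
```
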
## Configuration Details

### Service Configuration

The Kong proxy service is configured with:

```yaml
proxy:
  type: LoadBalancer
  annotations:
    service.beta.kubernetes.io/oci-load-balancer-shape: "flexible"
    service.beta.kubernetes.io/oci-load-balancer-shape-flex-min: "10"
    service.beta.kubernetes.io/oci-load-balancer-shape-flex-max: "100"
  http:
    servicePort: 80
    containerPort: 8000
  tls:
    servicePort: 443
    containerPort: 8443
```

### Admin Interface

Kong's admin interface is available internally for configuration management (see the port-forward sketch below):

- **HTTP Admin**: Port 8001 (ClusterIP)
- **HTTPS Admin**: Port 8442 (ClusterIP)
- **Status Endpoint**: Ports 8100/8101 for health monitoring

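Because the admin listeners are ClusterIP-only, reaching them from a workstation requires a port-forward. A minimal sketch, assuming the `kong-kong` deployment name used in the debug section below and the admin HTTP port 8001 listed above; `/routes` is a standard Kong admin API endpoint:

```bash
# Forward the in-cluster admin port to your workstation
kubectl port-forward -n kong deployment/kong-kong 8001:8001 &

# Query the admin API for node information and configured routes
curl -s http://localhost:8001/ | head
curl -s http://localhost:8001/routes
```
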
### Private Network Deployment

For private clusters, the load balancer is automatically configured as internal:

```yaml
proxy:
  annotations:
    service.beta.kubernetes.io/oci-load-balancer-internal: "true"
```

## Usage Examples

### Blueprints API Spec

To deploy your model behind the unified inference URL with Blueprints, the following API specification is required:

- `recipe_inference_gateway` (object) - the key required to encapsulate inference gateway features.
  - `model_name` (string) **required** - Model name used as a header value to identify the model behind the gateway. MUST BE UNIQUE per route: routes will fail if two models behind the same route share the same model name.
    - Example: `"model_name": "gpt-oss-120b"`
    - Usage: `curl -X POST <gateway url> -H "X-Model: gpt-oss-120b" ...`
  - `url_path` (string) **optional** - Additional URL path to add to the gateway for this serving deployment.
    - Example: `"url_path": "/ai/models"`
    - Usage: `curl -X POST http://10-76-0-10/ai/models ...`
  - `api_key` (string) **optional** - API key to use for this model in the inference gateway. MUST BE UNIQUE: keys cannot be reused across models.
    - Example: `"api_key": "123abc456ABC"`
    - Usage: `curl -X POST <gateway url> -H "apikey: 123abc456ABC" ...`

**Minimum Requirement**:

```json
...
"recipe_inference_gateway": {
  "model_name": "gpt-oss-120b"
}
...
```

**All options enabled**:

```json
...
"recipe_inference_gateway": {
  "model_name": "gpt-oss-120b",
  "url_path": "/ai/models",
  "api_key": "123abc456ABC"
}
...
```

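If you want to check a blueprint's `recipe_inference_gateway` block against the published schema before submitting it, a local JSON Schema validator works. A minimal sketch, assuming the third-party `check-jsonschema` utility is installed and run from the repository root; the file paths are illustrative and should be adjusted to your checkout:

```bash
# Install a generic JSON Schema validator (external tool, not part of the platform)
pip install check-jsonschema

# Validate an example blueprint against the blueprint JSON schema
check-jsonschema \
  --schemafile docs/custom_blueprints/blueprint_json_schema.json \
  docs/sample_blueprints/platform_features/inference_gateway/example_vllm_gpt_oss_120b.json
```
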
## Example Blueprints Table

| Blueprint Name | Model | Shape | Description |
| :------------: | :---: | :---: | :---------: |
| [vLLM OpenAI/gpt-oss-120b H100](./example_vllm_gpt_oss_120b.json) | [openai/gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) | BM.GPU.H100.8 | Serves the open-source OpenAI model with vLLM on an NVIDIA H100 bare metal host using 2 GPUs via the inference gateway |
| [vLLM meta-llama/Llama-4-Scout-17B-16E-Instruct MI300x](./example_vllm_llama4_scout.json) | [meta-llama/Llama-4-Scout-17B-16E-Instruct](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct) | BM.GPU.MI300X.8 | Serves the open-source Llama4-Scout model with vLLM on an AMD MI300x bare metal host using 4 GPUs via the inference gateway, with the model stored on local NVMe |
| [vLLM meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 MI300x](./example_vllm_llama4_maverick.json) | [meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8](https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8) | BM.GPU.MI300X.8 | Serves the open-source Llama4-Maverick-fp8 model with vLLM on an AMD MI300x bare metal host using 4 GPUs via the inference gateway, with the model stored on local NVMe |

### Accessing Deployed Models

Once the Inference Gateway is deployed, you can access your AI models through the unified endpoint. For example, if all three blueprints above were deployed and your endpoint was `https://140-10-23-76.nip.io`, you could access all three models like this:

```bash
curl -X POST https://140-10-23-76.nip.io/ai/models \
  -H "Content-Type: application/json" \
  -H "X-Model: gpt-oss-120b" \
  -H "apikey: <api-key-from-gpt-blueprint>" \
  -d '{"model": "openai/gpt-oss-120b", "messages": [{"role": "user", "content": "What is Kong Gateway?"}], "max_tokens": 200}'

curl -X POST https://140-10-23-76.nip.io/ai/models \
  -H "Content-Type: application/json" \
  -H "X-Model: scout" \
  -H "apikey: <api-key-from-scout-blueprint>" \
  -d '{"model": "Llama-4-Scout-17B-16E-Instruct", "messages": [{"role": "user", "content": "What is Kong Gateway?"}], "max_tokens": 200}'

curl -X POST https://140-10-23-76.nip.io/ai/models \
  -H "Content-Type: application/json" \
  -H "X-Model: maverick" \
  -H "apikey: <api-key-from-maverick-blueprint>" \
  -d '{"model": "Llama-4-Maverick-17B-128E-Instruct-FP8", "messages": [{"role": "user", "content": "What is Kong Gateway?"}], "max_tokens": 200}'
```

### Health Check

Monitor gateway health using the status endpoint (you may need to open a network port for this first):

```bash
curl https://<kong-ip>.nip.io:8100/status
```

## Security Considerations

### Network Security

- Load balancer security groups restrict access to necessary ports
- Private deployment option for internal-only access
- TLS termination at the load balancer level

### API Security

Kong provides extensive security features that can be configured:

- API key authentication (a request sketch follows this list)
- Rate limiting and throttling (not implemented; post an issue if desired)
- Single unified URL for all model endpoints

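As a quick check of key enforcement, the same request can be sent with and without the `apikey` header. A minimal sketch, assuming the gpt-oss blueprint above was deployed with an `api_key` set and reusing the illustrative gateway hostname from the earlier examples; Kong typically rejects the unauthenticated call with a 401 and passes the authenticated one through:

```bash
GATEWAY=https://140-10-23-76.nip.io

# Without a key: expect an unauthorized status code from the gateway
curl -s -o /dev/null -w "%{http_code}\n" -X POST "$GATEWAY/ai/models" \
  -H "Content-Type: application/json" \
  -H "X-Model: gpt-oss-120b" \
  -d '{"model": "openai/gpt-oss-120b", "messages": [{"role": "user", "content": "ping"}]}'

# With the key from the blueprint: expect the model's normal response status
curl -s -o /dev/null -w "%{http_code}\n" -X POST "$GATEWAY/ai/models" \
  -H "Content-Type: application/json" \
  -H "X-Model: gpt-oss-120b" \
  -H "apikey: <api-key-from-gpt-blueprint>" \
  -d '{"model": "openai/gpt-oss-120b", "messages": [{"role": "user", "content": "ping"}]}'
```
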
## Troubleshooting

### Common Issues

**Gateway URL shows as null in outputs**

- Verify `bring_your_own_kong` is set to `false`
- Check that Kong pods are running in the `kong` namespace (visible in the Blueprints portal)
- Ensure the load balancer has been assigned an external IP

**Unable to access inference endpoints**

- Verify security group rules allow traffic on ports 80/443
- Check that target AI model services are running and healthy
- Confirm Kong ingress rules are properly configured

**Performance issues**

- Monitor resource utilization of Kong pods (see the sketch after this list)
- Consider scaling up the load balancer shape
- Review the auto-scaling configuration

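One quick way to compare live Kong pod utilization against the requests and limits listed under Resource Management, assuming the Kubernetes metrics-server is installed in your cluster:

```bash
# Show live CPU/memory usage for the Kong pods (requires metrics-server)
kubectl top pods -n kong
```
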
### Debug Commands on the Kubernetes Side

```bash
# Check Kong deployment status
kubectl get pods -n kong

# View Kong service details
kubectl get svc kong-kong-proxy -n kong

# Check load balancer assignment
kubectl describe svc kong-kong-proxy -n kong

# View Kong logs
kubectl logs -n kong deployment/kong-kong
```

## Version Information

- **Kong Gateway**: 3.9
- **Helm Chart**: 2.51.0
- **Repository**: https://charts.konghq.com
- **OCI Integration**: Native OCI Load Balancer support

## Next Steps

After deploying the Inference Gateway:

1. **Configure Routes**: Set up Kong ingress resources for your AI models
2. **Implement Security**: Configure authentication and rate limiting policies
3. **Monitor Performance**: Set up alerting and monitoring dashboards
4. **Scale Resources**: Adjust Kong replicas and load balancer shapes based on traffic

The Inference Gateway provides a production-ready foundation for managing AI model inference at scale, offering the flexibility and reliability needed for enterprise AI deployments on Oracle Cloud Infrastructure.

Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
{
  "recipe_id": "llm_inference_nvidia",
  "recipe_mode": "service",
  "deployment_name": "gpt-oss-tp2",
  "recipe_image_uri": "docker.io/vllm/vllm-openai:gptoss",
  "recipe_node_shape": "BM.GPU.H100.8",
  "recipe_replica_count": 1,
  "recipe_container_port": "8000",
  "recipe_nvidia_gpu_count": 2,
  "recipe_ephemeral_storage_size": 400,
  "recipe_shared_memory_volume_size_limit_in_mb": 32768,
  "recipe_use_shared_node_pool": true,
  "recipe_prometheus_enabled": true,
  "recipe_inference_gateway": {
    "model_name": "gpt-oss-120b",
    "url_path": "/ai/models",
    "api_key": "<any-api-key-here>"
  },
  "recipe_container_command_args": [
    "--model",
    "openai/gpt-oss-120b",
    "--tensor-parallel-size",
    "2",
    "--served-model-name",
    "openai/gpt-oss-120b"
  ],
  "recipe_readiness_probe_params": {
    "endpoint_path": "/health",
    "port": 8000,
    "scheme": "HTTP",
    "initial_delay_seconds": 20,
    "period_seconds": 30,
    "success_threshold": 1,
    "timeout_seconds": 10
  },
  "recipe_liveness_probe_params": {
    "failure_threshold": 3,
    "endpoint_path": "/health",
    "port": 8000,
    "scheme": "HTTP",
    "initial_delay_seconds": 1200,
    "period_seconds": 60,
    "success_threshold": 1,
    "timeout_seconds": 10
  }
}
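Once this blueprint is deployed, a request routed through the gateway would combine the `url_path`, `model_name`, and `api_key` values above. A minimal sketch; the gateway hostname is illustrative and comes from your own deployment's nip.io URL:

```bash
curl -X POST https://<your-kong-ip-dashed>.nip.io/ai/models \
  -H "Content-Type: application/json" \
  -H "X-Model: gpt-oss-120b" \
  -H "apikey: <any-api-key-here>" \
  -d '{"model": "openai/gpt-oss-120b", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 50}'
```
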
