[Design Proposal] AI Gateway Management for Agent Manager #285
menakaj started this conversation in Design Proposals
Gateway Management Implementation Proposal
Problem Statement
The Agent Manager platform needs a comprehensive gateway management system that enables organizations to:
Register and manage AI gateway instances - Organizations deploy their own gateway infrastructure and need to register these instances with the Agent Manager control plane for centralized management.
Environment-based gateway organization - Gateways must be organized by deployment environments (development, staging, production) to support environment-specific deployment workflows and isolation.
Secure gateway authentication - Gateways require secure, token-based authentication to communicate with the control plane and receive deployment events.
Real-time deployment orchestration - When LLM providers, proxies, or APIs are deployed, the control plane must push these configurations to the appropriate gateways in real-time via WebSocket connections.
Multi-gateway deployment support - A single LLM provider or API can be deployed to multiple gateways simultaneously, enabling horizontal scaling and multi-region deployments.
Gateway health and status tracking - The platform must track which gateways are online, their connection status, and deployed artifacts for operational visibility.
User Stories
Platform Administrator
As a platform administrator, I want to register gateway instances with their connection details so that Agent Manager can orchestrate deployments to them.
As a platform administrator, I want to organize gateways into environments (dev, staging, prod) so that I can deploy resources to all gateways in an environment with a single operation.
As a platform administrator, I want to generate and rotate authentication tokens for gateways so that I can maintain secure communication channels.
As a platform administrator, I want to monitor gateway connection status from a central location so that I can quickly identify connectivity issues.
As a platform administrator, I want to view which LLM providers and APIs are deployed to each gateway so that I can understand deployment topology.
Development Team
As a developer, I want to deploy LLM providers to specific environments (e.g., "deploy to all staging gateways") so that I can test integrations before production.
As a developer, I want the platform to automatically push configuration changes to connected gateways so that deployments are immediate and consistent.
As a developer, I want to deploy a single LLM provider to multiple gateways simultaneously so that I can support load balancing and high availability.
Gateway Operators
As a gateway operator, I want my gateway to establish a persistent WebSocket connection to the control plane so that it receives deployment events in real-time.
As a gateway operator, I want my gateway to authenticate using a secure token so that only authorized gateways can connect to the control plane.
As a gateway operator, I want my gateway to automatically receive LLM provider configurations when they're deployed so that I don't need manual intervention.
Architecture Overview
The implemented solution uses a centralized control plane architecture where Agent Manager serves as the single source of truth for gateway configurations and orchestrates all deployments via WebSocket-based event streaming.
Key Components
Design Principles
1. Control Plane / Data Plane Separation
Agent Manager acts as the control plane for policy decisions and orchestration. Gateways (running Envoy) act as the data plane for request routing and policy enforcement. Configuration flows from control plane → data plane via WebSocket events.
2. Event-Driven Architecture
All deployment operations trigger events that are pushed to connected gateways. This eliminates polling overhead and enables real-time configuration updates.
3. Gateway-First Security
Gateways authenticate to the control plane (not vice versa). Each gateway uses a unique, salted token hash for authentication. Tokens can be rotated without service interruption (max 2 active tokens).
4. Environment Abstraction
Environments (dev/staging/prod) are first-class entities. Gateway-environment mappings support many-to-many relationships (one gateway in multiple environments, or multiple gateways in one environment).
5. Functionality-Based Gateway Types
Gateways are typed by functionality:
Data Model
Core Tables
1. gateways
Represents a registered gateway instance within an organization.
Key Fields:
is_active - Set to true when the gateway establishes a WebSocket connection
is_critical - Flag for critical production gateways requiring special monitoring
properties - JSONB field for gateway-specific metadata and configuration
gateway_functionality_type - Determines what artifacts can be deployed
2. gateway_tokens
Authentication tokens for gateway-to-control-plane communication.
Security Model:
3. environments
Logical grouping of gateways by deployment stage.
Purpose:
Environments enable environment-level deployment operations. When deploying an LLM provider to an environment, it's automatically deployed to all gateways mapped to that environment.
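The environment-to-gateway resolution described above can be illustrated with a minimal sketch. It assumes an in-memory stand-in for the gateway_environment_mappings table; the class name `EnvironmentMappings` and its methods are hypothetical and not part of the proposal.

```python
from collections import defaultdict


class EnvironmentMappings:
    """Hypothetical in-memory stand-in for gateway_environment_mappings."""

    def __init__(self):
        self._gateways_by_env = defaultdict(set)
        self._envs_by_gateway = defaultdict(set)

    def map(self, gateway_id, environment_id):
        # Many-to-many: record the pair in both directions.
        self._gateways_by_env[environment_id].add(gateway_id)
        self._envs_by_gateway[gateway_id].add(environment_id)

    def unmap(self, gateway_id, environment_id):
        self._gateways_by_env[environment_id].discard(gateway_id)
        self._envs_by_gateway[gateway_id].discard(environment_id)

    def gateways_in(self, environment_id):
        # Deployment targets for an environment-level deploy.
        return sorted(self._gateways_by_env.get(environment_id, ()))

    def environments_of(self, gateway_id):
        return sorted(self._envs_by_gateway.get(gateway_id, ()))


mappings = EnvironmentMappings()
mappings.map("prod-us-east", "production")
mappings.map("prod-us-west", "production")
mappings.map("prod-us-east", "staging")  # one gateway in two environments

targets = mappings.gateways_in("production")
```

Deploying to "production" in this sketch would target every gateway ID in `targets`; a real implementation would presumably run the equivalent query against the mapping table.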
4. gateway_environment_mappings
Many-to-many relationship between gateways and environments.
Use Cases:
5. llm_provider_deployments
Tracks which LLM providers are deployed to which gateways.
Status Lifecycle:
pending - Deployment event sent to gateway, awaiting confirmation
deployed - Gateway confirmed successful deployment
failed - Gateway reported deployment failure
undeployed - Removed from gateway
API Design
Gateway Management Endpoints
All gateway endpoints are scoped to organizations and use the pattern:
/orgs/{orgName}/gateways/*
1. Register Gateway
Response:
{
  "uuid": "gateway-uuid",
  "organizationName": "acme",
  "name": "prod-gateway-1",
  "displayName": "Production Gateway 1",
  "gatewayType": "AI",
  "vhost": "gateway.example.com",
  "region": "us-east-1",
  "isCritical": true,
  "status": "INACTIVE",
  "createdAt": "2026-02-13T10:00:00Z",
  "updatedAt": "2026-02-13T10:00:00Z",
  "environments": [
    {
      "id": "env-uuid-1",
      "name": "production",
      "displayName": "Production"
    }
  ]
}
Behavior:
environmentIds provided
is_active = false until gateway connects via WebSocket
2. List Gateways
Query Parameters:
type - Filter by gateway type (AI, Regular, Event)
environment - Filter by environment name
status - Filter by status (ACTIVE, INACTIVE)
limit, offset - Pagination
3. Get Gateway Details
4. Update Gateway
Updatable Fields:
displayName - Human-readable name
isCritical - Critical flag
status - Gateway status (ACTIVE, INACTIVE)
Immutable Fields:
name - Gateway identifier
gatewayType - Functionality type
vhost - Virtual host
organizationUUID - Ownership
5. Delete Gateway
Behavior:
Soft delete (deleted_at timestamp)
Token Management Endpoints
1. Rotate Gateway Token
Security Rules:
2. Revoke Gateway Token
Behavior:
status = 'revoked' and revoked_at = NOW()
Environment Management Endpoints
1. Create Environment
2. Map Gateway to Environment
Behavior:
3. Remove Gateway from Environment
4. List Gateway Environments
5. List Environment Gateways
Gateway Internal API
These endpoints are called by gateways (not by users). They use gateway token authentication instead of JWT.
1. Get LLM Provider Details
Purpose:
When a gateway receives a deployment event, it calls this endpoint to fetch the full LLM provider configuration and OpenAPI spec for dynamic route creation.
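A gateway-side handler for this fetch step might look like the sketch below. The event fields and the two internal paths come from this proposal; the function name `handle_deploy_event`, the injected callables, and the `{"status": ..., "error": ...}` report body are assumptions made for illustration. HTTP calls are replaced by in-memory stand-ins so the example stays self-contained.

```python
import json


def handle_deploy_event(raw_message, fetch_provider, apply_config, report_status):
    """Process one llm.deployed event and report the outcome."""
    event = json.loads(raw_message)
    if event["type"] != "llm.deployed":
        return None  # other event types are out of scope for this sketch
    payload = event["payload"]
    provider_path = f"/internal/llm-providers/{payload['llmProviderId']}"
    status_path = f"/internal/deployments/{payload['deploymentId']}/status"
    try:
        apply_config(fetch_provider(provider_path))
        result = {"status": "deployed"}
    except Exception as exc:  # any fetch or apply error becomes a failed report
        result = {"status": "failed", "error": str(exc)}
    report_status(status_path, result)
    return result


# Usage with in-memory stand-ins instead of real HTTP calls.
reports = []
message = json.dumps({
    "type": "llm.deployed",
    "payload": {
        "llmProviderId": "provider-uuid",
        "deploymentId": "deployment-uuid",
        "environmentId": "env-uuid",
        "gatewayId": "gateway-uuid",
    },
    "timestamp": "2026-02-13T12:00:00Z",
    "correlationId": "correlation-uuid",
})
outcome = handle_deploy_event(
    message,
    fetch_provider=lambda path: {"fetched_from": path},
    apply_config=lambda config: None,
    report_status=lambda path, body: reports.append((path, body)),
)
```

Passing the fetch, apply, and report steps in as callables keeps the handler independent of any particular HTTP client or Envoy integration.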
2. Report Deployment Status
Status Values:
deployed - Successfully applied configuration
failed - Deployment failed (includes error message)
undeployed - Successfully removed configuration
WebSocket Architecture
Connection Lifecycle
1. Gateway Connection Establishment
Handshake Flow:
2. Heartbeat Mechanism
3. Connection Termination
Graceful Shutdown:
Unexpected Disconnect:
Event Types
1. LLM Provider Deployment Event
Trigger: User calls POST /orgs/{orgName}/llm-providers/{providerId}/deploy
Event Payload:
{
  "type": "llm.deployed",
  "payload": {
    "llmProviderId": "provider-uuid",
    "deploymentId": "deployment-uuid",
    "environmentId": "env-uuid",
    "gatewayId": "gateway-uuid"
  },
  "timestamp": "2026-02-13T12:00:00Z",
  "correlationId": "correlation-uuid"
}
Gateway Action:
GET /internal/llm-providers/{providerId} to fetch configuration
POST /internal/deployments/{deploymentId}/status with result
2. LLM Provider Undeployment Event
Trigger: User calls DELETE /orgs/{orgName}/llm-providers/{providerId}/deployments/{deploymentId}
Event Payload:
{
  "type": "llm.undeployed",
  "payload": {
    "llmProviderId": "provider-uuid",
    "deploymentId": "deployment-uuid",
    "gatewayId": "gateway-uuid"
  },
  "timestamp": "2026-02-13T12:05:00Z",
  "correlationId": "correlation-uuid"
}
Gateway Action:
POST /internal/deployments/{deploymentId}/status with status=undeployed
3. API Deployment Event (Future)
{
  "type": "api.deployed",
  "payload": {
    "apiId": "api-uuid",
    "deploymentId": "deployment-uuid",
    "gatewayId": "gateway-uuid"
  },
  ...
}
Multi-Gateway Deployment Flow
When deploying to an environment (not a specific gateway), the control plane broadcasts to all gateways in that environment.
Example:
Deploy LLM provider to "production" environment with 3 gateways.
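The fan-out for this example could be sketched as follows. The envelope fields mirror the llm.deployed payload in this proposal; the function name `build_deploy_events`, the choice of one deployment ID per gateway, and the shared correlation ID across the batch are assumptions, since the proposal does not spell them out.

```python
import uuid
from datetime import datetime, timezone


def build_deploy_events(provider_id, environment_id, gateway_ids):
    """Build one llm.deployed event per gateway in the environment."""
    # Assumption: all events from one user request share a correlation ID.
    correlation_id = str(uuid.uuid4())
    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    events = []
    for gateway_id in gateway_ids:
        events.append({
            "type": "llm.deployed",
            "payload": {
                "llmProviderId": provider_id,
                # Assumption: each gateway gets its own deployment record.
                "deploymentId": str(uuid.uuid4()),
                "environmentId": environment_id,
                "gatewayId": gateway_id,
            },
            "timestamp": timestamp,
            "correlationId": correlation_id,
        })
    return events


events = build_deploy_events(
    "provider-uuid", "env-uuid", ["gateway-1", "gateway-2", "gateway-3"]
)
```

Each event would then be written to the WebSocket connection of the gateway named in its `gatewayId`.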
Partial Failure Handling:
If deployment succeeds on gateway-1 and gateway-2 but fails on gateway-3, the deployment is marked as "PARTIALLY_DEPLOYED" with detailed per-gateway status.
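One way to derive the overall status from per-gateway results is sketched below. Only "PARTIALLY_DEPLOYED" is named in this proposal; the other overall labels ("PENDING", "DEPLOYED", "FAILED", "UNDEPLOYED") and the rule that any pending gateway keeps the whole deployment pending are assumptions for illustration.

```python
def summarize_deployment(per_gateway):
    """Collapse per-gateway statuses into one overall deployment status."""
    statuses = list(per_gateway.values())
    if not statuses:
        raise ValueError("no gateways in deployment")
    if "pending" in statuses:
        return "PENDING"  # assumption: wait until every gateway has reported
    deployed = statuses.count("deployed")
    failed = statuses.count("failed")
    if deployed and failed:
        return "PARTIALLY_DEPLOYED"
    if deployed:
        return "DEPLOYED"
    if failed:
        return "FAILED"
    return "UNDEPLOYED"


overall = summarize_deployment({
    "gateway-1": "deployed",
    "gateway-2": "deployed",
    "gateway-3": "failed",
})
```

The per-gateway dictionary stays available alongside the summary, which matches the requirement for detailed per-gateway status.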
Security Model
Authentication
User Requests: JWT-based authentication (existing Agent Manager auth)
Gateway Requests: Token-based authentication using gateway tokens
Token Security
Storage:
gateway_tokens.token_hash
gateway_tokens.salt
Verification:
Rotation:
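The salted-hash storage and the two-active-token limit can be sketched as below. The hash algorithm (SHA-256 over salt plus token), the record layout, the function names, and the choice to reject a rotation when two tokens are already active are all assumptions; the proposal states only that a salted hash is stored and that at most 2 tokens are active.

```python
import hashlib
import hmac
import secrets

MAX_ACTIVE_TOKENS = 2  # from the proposal: max 2 active tokens


def _hash_token(token, salt):
    # Assumption: SHA-256 over salt + token; the proposal names no algorithm.
    return hashlib.sha256(salt + token.encode("utf-8")).hexdigest()


def rotate_token(records):
    """Issue a new token, storing only its salt and hash."""
    active = [r for r in records if r["status"] == "active"]
    if len(active) >= MAX_ACTIVE_TOKENS:
        raise ValueError("revoke an existing token before rotating")
    token = secrets.token_urlsafe(32)
    salt = secrets.token_bytes(16)
    records.append(
        {"salt": salt, "token_hash": _hash_token(token, salt), "status": "active"}
    )
    return token  # the plaintext is returned once and never stored


def verify_token(token, records):
    """Return True if the token matches any active record."""
    for record in records:
        if record["status"] != "active":
            continue
        candidate = _hash_token(token, record["salt"])
        if hmac.compare_digest(candidate, record["token_hash"]):
            return True
    return False


records = []
old_token = rotate_token(records)
new_token = rotate_token(records)  # both tokens are valid during the overlap
records[0]["status"] = "revoked"   # retire the old token once the gateway has switched
```

Allowing two active tokens gives the gateway a window in which either token is accepted, which is what lets rotation happen without dropping the connection.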
Authorization
Gateway Scoping:
When a gateway connects via WebSocket, its identity is bound to the connection. The gateway can only:
Organization Isolation:
All gateway operations are scoped to organizations. Gateways cannot access resources from other organizations.
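A minimal sketch of the organization check follows, assuming the identity bound to the WebSocket connection carries the gateway's organization. The `GatewayIdentity` type and the `require_same_org` helper are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GatewayIdentity:
    """Hypothetical identity bound to a connection after token verification."""
    gateway_id: str
    organization: str


def require_same_org(identity, resource_organization):
    """Reject access to resources owned by another organization."""
    if identity.organization != resource_organization:
        raise PermissionError(
            f"gateway {identity.gateway_id} cannot access resources of "
            f"organization {resource_organization}"
        )


identity = GatewayIdentity(gateway_id="gateway-uuid", organization="acme")
require_same_org(identity, "acme")  # same organization: allowed
```

Calling `require_same_org(identity, "other-org")` in this sketch raises `PermissionError`.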
Deployment Orchestration
LLM Provider Deployment
User-Initiated Flow:
Offline Gateway Handling:
If a gateway is offline (is_active = false) when deployment is initiated:
status = pending
Multi-Gateway Deployment
Example Scenario:
Deploy openai-provider to environment "production", which has gateways prod-us-east, prod-us-west, and prod-eu-west.
Implementation Details
Technology Stack
Control Plane:
Data Plane (Gateway):
Key Services
1. Platform Gateway Service
Responsibilities:
Methods:
2. WebSocket Manager
Responsibilities:
Key Data Structures:
3. Gateway Events Service
Responsibilities:
Event Publishing:
4. LLM Deployment Service
Responsibilities:
Deployment Flow:
Database Indexes
Performance-Critical Indexes:
UX
Gateway listing view
Out of Scope
Not Included in This Implementation
Gateway Auto-Discovery - Gateways must be manually registered; automatic discovery via service mesh or K8s API is not included.
Gateway Load Balancing Configuration - Control plane does not manage load balancer configuration; this is handled externally (e.g., AWS ALB, NGINX).
Gateway Autoscaling - Dynamic scaling of gateway instances based on load is not managed by the control plane.
Advanced Deployment Strategies - Blue/green deployments, canary releases, and phased rollouts are not supported.
Configuration Rollback - Automatic rollback of failed deployments is not implemented; manual undeployment is required.
Gateway Monitoring/Observability - Metrics collection, logging aggregation, and distributed tracing are out of scope.
Gateway-to-Gateway Communication - Service mesh features like inter-gateway routing and traffic splitting are not included.
Custom Gateway Plugins - Dynamic loading of custom Envoy filters or plugins is not supported.
Multi-Cloud Gateway Orchestration - Managing gateways across multiple cloud providers (AWS, Azure, GCP) with cloud-specific features is not included.
Gateway Backup/Restore - Disaster recovery and gateway configuration backup/restore are not covered.
LLM Provider and LLM Proxy Management - Creating and managing the lifecycle of LLM providers and proxies is not included.
Key Decisions Made
Implement the gateway control plane in Agent Manager
The api-platform control plane is capable of handling AI gateway management, but several features and dependencies are missing when communicating directly with the control plane API.
Hence, the decision is to implement the gateway control plane functionality in Agent Manager, which offers: