[Design Proposal] AI Gateway Management for Agent Manager #285

menakaj · 2026-02-04T20:35:27Z

Gateway Management Implementation Proposal

Problem Statement

The Agent Manager platform needs a comprehensive gateway management system that enables organizations to:

Register and manage AI gateway instances - Organizations deploy their own gateway infrastructure and need to register these instances with the Agent Manager control plane for centralized management.
Environment-based gateway organization - Gateways must be organized by deployment environments (development, staging, production) to support environment-specific deployment workflows and isolation.
Secure gateway authentication - Gateways require secure, token-based authentication to communicate with the control plane and receive deployment events.
Real-time deployment orchestration - When LLM providers, proxies, or APIs are deployed, the control plane must push these configurations to the appropriate gateways in real-time via WebSocket connections.
Multi-gateway deployment support - A single LLM provider or API can be deployed to multiple gateways simultaneously, enabling horizontal scaling and multi-region deployments.
Gateway health and status tracking - The platform must track which gateways are online, their connection status, and deployed artifacts for operational visibility.

User Stories

Platform Administrator

As a platform administrator, I want to register gateway instances with their connection details so that Agent Manager can orchestrate deployments to them.
As a platform administrator, I want to organize gateways into environments (dev, staging, prod) so that I can deploy resources to all gateways in an environment with a single operation.
As a platform administrator, I want to generate and rotate authentication tokens for gateways so that I can maintain secure communication channels.
As a platform administrator, I want to monitor gateway connection status from a central location so that I can quickly identify connectivity issues.
As a platform administrator, I want to view which LLM providers and APIs are deployed to each gateway so that I can understand deployment topology.

Development Team

As a developer, I want to deploy LLM providers to specific environments (e.g., "deploy to all staging gateways") so that I can test integrations before production.
As a developer, I want the platform to automatically push configuration changes to connected gateways so that deployments are immediate and consistent.
As a developer, I want to deploy a single LLM provider to multiple gateways simultaneously so that I can support load balancing and high availability.

Gateway Operators

As a gateway operator, I want my gateway to establish a persistent WebSocket connection to the control plane so that it receives deployment events in real-time.
As a gateway operator, I want my gateway to authenticate using a secure token so that only authorized gateways can connect to the control plane.
As a gateway operator, I want my gateway to automatically receive LLM provider configurations when they're deployed so that I don't need manual intervention.

Architecture Overview

The implemented solution uses a centralized control plane architecture where Agent Manager serves as the single source of truth for gateway configurations and orchestrates all deployments via WebSocket-based event streaming.

Key Components

┌─────────────────────────────────────────────────────────────────┐
│                    AGENT MANAGER (Control Plane)                │
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐ │
│  │  Platform Gateway Service                                  │ │
│  │  - Gateway registration and lifecycle management          │ │
│  │  - Token generation and verification                      │ │
│  │  - Gateway-environment mappings                           │ │
│  │  - Active status tracking                                 │ │
│  └───────────────────────────────────────────────────────────┘ │
│                         ↓                                       │
│  ┌───────────────────────────────────────────────────────────┐ │
│  │  WebSocket Manager                                         │ │
│  │  - Persistent gateway connections (wss://)                │ │
│  │  - Connection lifecycle management                        │ │
│  │  - Event broadcasting to target gateways                  │ │
│  │  - Heartbeat monitoring                                   │ │
│  └───────────────────────────────────────────────────────────┘ │
│                         ↓                                       │
│  ┌───────────────────────────────────────────────────────────┐ │
│  │  Gateway Events Service                                    │ │
│  │  - API deployment events (api.deployed)                   │ │
│  │  - API undeployment events (api.undeployed)               │ │
│  │  - LLM provider deployment events (llm.deployed)          │ │
│  │  - LLM provider undeployment events (llm.undeployed)      │ │
│  │  - Event payload serialization and validation             │ │
│  └───────────────────────────────────────────────────────────┘ │
│                         ↓                                       │
│  ┌───────────────────────────────────────────────────────────┐ │
│  │  LLM Deployment Service                                    │ │
│  │  - Deploy LLM providers to environments/gateways          │ │
│  │  - Multi-gateway deployment orchestration                 │ │
│  │  - Deployment status tracking                             │ │
│  └───────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
                           ↓
        WebSocket Event Stream (wss://control-plane/gateways/ws)
                           ↓
┌─────────────────────────────────────────────────────────────────┐
│                    GATEWAY INSTANCES (Data Plane)               │
│                                                                 │
│  ┌────────────────┐  ┌────────────────┐  ┌────────────────┐  │
│  │ Gateway (Prod) │  │ Gateway (Stg)  │  │ Gateway (Dev)  │  │
│  │ - Envoy proxy  │  │ - Envoy proxy  │  │ - Envoy proxy  │  │
│  │ - WS client    │  │ - WS client    │  │ - WS client    │  │
│  │ - Config sync  │  │ - Config sync  │  │ - Config sync  │  │
│  └────────────────┘  └────────────────┘  └────────────────┘  │
└─────────────────────────────────────────────────────────────────┘

Design Principles

1. Control Plane / Data Plane Separation
Agent Manager acts as the control plane for policy decisions and orchestration. Gateways (running Envoy) act as the data plane for request routing and policy enforcement. Configuration flows from control plane → data plane via WebSocket events.

2. Event-Driven Architecture
All deployment operations trigger events that are pushed to connected gateways. This eliminates polling overhead and enables real-time configuration updates.

3. Gateway-First Security
Gateways authenticate to the control plane (not vice versa). Each gateway uses a unique, salted token hash for authentication. Tokens can be rotated without service interruption (max 2 active tokens).

4. Environment Abstraction
Environments (dev/staging/prod) are first-class entities. Gateway-environment mappings support many-to-many relationships (one gateway in multiple environments, or multiple gateways in one environment).

5. Functionality-Based Gateway Types
Gateways are typed by functionality:

AI - Handles outbound LLM provider requests and MCP integrations
Regular - Handles inbound REST API traffic (future use)
Event - Handles event streaming (future use)

Data Model

Core Tables

1. `gateways`

Represents a registered gateway instance within an organization.

CREATE TABLE gateways (
    uuid UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    organization_uuid UUID NOT NULL REFERENCES organizations(uuid) ON DELETE CASCADE,
    name VARCHAR(255) NOT NULL,
    display_name VARCHAR(255) NOT NULL,
    description TEXT,
    properties JSONB NOT NULL DEFAULT '{}'::jsonb,
    vhost VARCHAR(255) NOT NULL,
    is_critical BOOLEAN DEFAULT FALSE,
    gateway_functionality_type VARCHAR(20) DEFAULT 'regular' NOT NULL,
    is_active BOOLEAN DEFAULT FALSE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    deleted_at TIMESTAMP,

    CONSTRAINT uq_gateway_org_name UNIQUE(organization_uuid, name),
    CONSTRAINT chk_gateway_functionality_type
        CHECK (gateway_functionality_type IN ('regular', 'ai', 'event'))
);

Key Fields:

is_active - Set to true when gateway establishes WebSocket connection
is_critical - Flag for critical production gateways requiring special monitoring
properties - JSONB field for gateway-specific metadata and configuration
gateway_functionality_type - Determines what artifacts can be deployed

2. `gateway_tokens`

Authentication tokens for gateway-to-control-plane communication.

CREATE TABLE gateway_tokens (
    uuid UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    gateway_uuid UUID NOT NULL REFERENCES gateways(uuid) ON DELETE CASCADE,
    token_hash VARCHAR(255) NOT NULL,
    salt VARCHAR(255) NOT NULL,
    status VARCHAR(10) NOT NULL DEFAULT 'active',
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    revoked_at TIMESTAMP,

    CONSTRAINT chk_gateway_token_status
        CHECK (status IN ('active', 'revoked')),
    CONSTRAINT chk_gateway_token_revoked
        CHECK (revoked_at IS NULL OR status = 'revoked')
);

Security Model:

Tokens are never stored in plain text - only salted SHA-256 hashes are persisted
Each gateway can have max 2 active tokens (allows zero-downtime rotation)
Plain-text token is returned only once during creation/rotation
Verification uses constant-time comparison to prevent timing attacks

3. `environments`

Logical grouping of gateways by deployment stage.

CREATE TABLE environments (
    uuid UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    organization_name VARCHAR(100) NOT NULL,
    name VARCHAR(64) NOT NULL,
    display_name VARCHAR(128) NOT NULL,
    description TEXT,
    dataplane_ref VARCHAR(100) NOT NULL DEFAULT 'default',
    dns_prefix VARCHAR(100) NOT NULL DEFAULT 'default',
    is_production BOOLEAN DEFAULT FALSE,
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMP NOT NULL DEFAULT NOW(),
    deleted_at TIMESTAMP,

    UNIQUE(organization_name, name)
);

Purpose:
Environments enable environment-level deployment operations. When deploying an LLM provider to an environment, it's automatically deployed to all gateways mapped to that environment.

4. `gateway_environment_mappings`

Many-to-many relationship between gateways and environments.

CREATE TABLE gateway_environment_mappings (
    id SERIAL PRIMARY KEY,
    gateway_uuid UUID NOT NULL REFERENCES gateways(uuid) ON DELETE CASCADE,
    environment_uuid UUID NOT NULL REFERENCES environments(uuid) ON DELETE CASCADE,
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),

    UNIQUE(gateway_uuid, environment_uuid)
);

Use Cases:

Single gateway in multiple environments (e.g., shared dev/test gateway)
Multiple gateways in one environment (horizontal scaling for production)
Dynamic environment reassignment without recreating gateways

5. `llm_provider_deployments`

Tracks which LLM providers are deployed to which gateways.

CREATE TABLE llm_provider_deployments (
    uuid UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    llm_provider_uuid UUID NOT NULL REFERENCES llm_providers(uuid) ON DELETE CASCADE,
    gateway_uuid UUID NOT NULL REFERENCES gateways(uuid) ON DELETE CASCADE,
    environment_uuid UUID REFERENCES environments(uuid) ON DELETE SET NULL,
    deployment_status VARCHAR(20) DEFAULT 'pending',
    deployed_at TIMESTAMP,
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMP NOT NULL DEFAULT NOW(),

    CONSTRAINT chk_llm_deployment_status
        CHECK (deployment_status IN ('pending', 'deployed', 'failed', 'undeployed'))
);

Status Lifecycle:

pending - Deployment event sent to gateway, awaiting confirmation
deployed - Gateway confirmed successful deployment
failed - Gateway reported deployment failure
undeployed - Removed from gateway
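
The lifecycle above can be read as a small state machine. The sketch below is one hedged interpretation: the status values come from the table's CHECK constraint, but the transition table itself (for example, letting a failed deployment return to pending for a retry) is an assumption that the proposal does not spell out.

```go
package main

import "fmt"

// DeploymentStatus mirrors the values allowed by chk_llm_deployment_status.
type DeploymentStatus string

const (
	StatusPending    DeploymentStatus = "pending"
	StatusDeployed   DeploymentStatus = "deployed"
	StatusFailed     DeploymentStatus = "failed"
	StatusUndeployed DeploymentStatus = "undeployed"
)

// allowedTransitions is an assumed transition table; the proposal lists the
// states but does not define which moves between them are legal.
var allowedTransitions = map[DeploymentStatus][]DeploymentStatus{
	StatusPending:    {StatusDeployed, StatusFailed},
	StatusDeployed:   {StatusUndeployed},
	StatusFailed:     {StatusPending}, // assumed: a retry re-queues the deployment
	StatusUndeployed: {},
}

// CanTransition reports whether moving from one status to another is allowed.
func CanTransition(from, to DeploymentStatus) bool {
	for _, next := range allowedTransitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(CanTransition(StatusPending, StatusDeployed))
	fmt.Println(CanTransition(StatusUndeployed, StatusDeployed))
}
```

Rejecting illegal transitions in the status callback handler would keep a late or duplicate report from overwriting a newer state.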

API Design

Gateway Management Endpoints

All gateway endpoints are scoped to organizations and use the pattern:
/orgs/{orgName}/gateways/*

1. Register Gateway

POST /orgs/{orgName}/gateways
Content-Type: application/json

{
  "name": "prod-gateway-1",
  "displayName": "Production Gateway 1",
  "gatewayType": "AI",
  "vhost": "gateway.example.com",
  "region": "us-east-1",
  "isCritical": true,
  "environmentIds": ["env-uuid-1", "env-uuid-2"]
}

Response:

{
  "uuid": "gateway-uuid",
  "organizationName": "acme",
  "name": "prod-gateway-1",
  "displayName": "Production Gateway 1",
  "gatewayType": "AI",
  "vhost": "gateway.example.com",
  "region": "us-east-1",
  "isCritical": true,
  "status": "INACTIVE",
  "createdAt": "2026-02-13T10:00:00Z",
  "updatedAt": "2026-02-13T10:00:00Z",
  "environments": [
    {
      "id": "env-uuid-1",
      "name": "production",
      "displayName": "Production"
    }
  ]
}

Behavior:

Generates initial authentication token (returned separately via token endpoint)
Creates gateway-environment mappings if environmentIds provided
Sets is_active = false until gateway connects via WebSocket

2. List Gateways

GET /orgs/{orgName}/gateways?type=AI&environment=production&status=ACTIVE

Response:
{
  "gateways": [
    {
      "uuid": "...",
      "name": "prod-gateway-1",
      "displayName": "Production Gateway 1",
      "gatewayType": "AI",
      "status": "ACTIVE",
      ...
    }
  ],
  "total": 1,
  "limit": 50,
  "offset": 0
}

Query Parameters:

type - Filter by gateway type (AI, Regular, Event)
environment - Filter by environment name
status - Filter by status (ACTIVE, INACTIVE)
limit, offset - Pagination
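
As a rough illustration of how a handler might read these parameters, the sketch below parses a raw query string into a filter struct. `ListFilter`, `ParseListQuery`, and `intParam` are hypothetical names; the default limit of 50 is taken from the example response, while rejecting unknown status values and negative offsets is an assumption.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// ListFilter holds the query parameters documented for the list endpoint.
type ListFilter struct {
	Type        string
	Environment string
	Status      string
	Limit       int
	Offset      int
}

// intParam parses an optional integer parameter, enforcing a lower bound.
func intParam(q url.Values, name string, def, lowest int) (int, error) {
	v := q.Get(name)
	if v == "" {
		return def, nil
	}
	n, err := strconv.Atoi(v)
	if err != nil || n < lowest {
		return 0, fmt.Errorf("invalid %s %q", name, v)
	}
	return n, nil
}

// ParseListQuery converts a raw query string into a ListFilter.
func ParseListQuery(rawQuery string) (ListFilter, error) {
	q, err := url.ParseQuery(rawQuery)
	if err != nil {
		return ListFilter{}, err
	}
	f := ListFilter{
		Type:        q.Get("type"),
		Environment: q.Get("environment"),
		Status:      q.Get("status"),
	}
	// Assumed validation: only the two documented status values are accepted.
	if f.Status != "" && f.Status != "ACTIVE" && f.Status != "INACTIVE" {
		return ListFilter{}, fmt.Errorf("invalid status %q", f.Status)
	}
	if f.Limit, err = intParam(q, "limit", 50, 1); err != nil {
		return ListFilter{}, err
	}
	if f.Offset, err = intParam(q, "offset", 0, 0); err != nil {
		return ListFilter{}, err
	}
	return f, nil
}

func main() {
	f, err := ParseListQuery("type=AI&environment=production&status=ACTIVE")
	fmt.Printf("%+v %v\n", f, err)
}
```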

3. Get Gateway Details

GET /orgs/{orgName}/gateways/{gatewayID}

Response:
{
  "uuid": "gateway-uuid",
  "organizationName": "acme",
  "name": "prod-gateway-1",
  "displayName": "Production Gateway 1",
  "gatewayType": "AI",
  "vhost": "gateway.example.com",
  "isCritical": true,
  "status": "ACTIVE",
  "environments": [
    {
      "id": "env-uuid",
      "name": "production",
      "displayName": "Production",
      "dataplaneRef": "default",
      "dnsPrefix": "prod",
      "isProduction": true
    }
  ],
  "createdAt": "2026-02-13T10:00:00Z",
  "updatedAt": "2026-02-13T10:30:00Z"
}

4. Update Gateway

PUT /orgs/{orgName}/gateways/{gatewayID}
Content-Type: application/json

{
  "displayName": "Production Gateway 1 (Updated)",
  "isCritical": false
}

Updatable Fields:

displayName - Human-readable name
isCritical - Critical flag
status - Gateway status (ACTIVE, INACTIVE)

Immutable Fields:

name - Gateway identifier
gatewayType - Functionality type
vhost - Virtual host
organizationUUID - Ownership

5. Delete Gateway

DELETE /orgs/{orgName}/gateways/{gatewayID}

Response: 204 No Content

Behavior:

Soft delete (sets deleted_at timestamp)
Cascade deletes all gateway tokens (FK constraint)
Removes all gateway-environment mappings
Closes any active WebSocket connections
Does NOT delete deployment records (preserved for audit)

Token Management Endpoints

1. Rotate Gateway Token

POST /orgs/{orgName}/gateways/{gatewayID}/tokens/rotate

Response:
{
  "id": "token-uuid",
  "token": "eyJhbGc...",  // Plain-text token (ONLY returned here!)
  "createdAt": "2026-02-13T11:00:00Z",
  "message": "New token generated. Old token remains active until revoked."
}

Security Rules:

Maximum 2 active tokens per gateway
Returns HTTP 400 if limit reached (must revoke old token first)
Plain-text token is never stored and never retrievable after this response
Token uses cryptographically secure randomness (32 bytes, base64-encoded)
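
A minimal sketch of token issuance under these rules follows. It assumes hex encoding for the stored hash and salt and URL-safe unpadded base64 for the plain-text token; the proposal only says "base64-encoded", so the exact variant is a guess. `IssueToken`, `IssuedToken`, and `ErrTokenLimit` are hypothetical names.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/base64"
	"encoding/hex"
	"errors"
	"fmt"
)

const maxActiveTokens = 2

// ErrTokenLimit would be mapped to HTTP 400 by the rotate endpoint.
var ErrTokenLimit = errors.New("gateway already has the maximum number of active tokens")

// IssuedToken separates what is shown once from what is persisted.
type IssuedToken struct {
	Plain string // returned to the caller once, never stored
	Hash  string // hex(SHA-256(token || salt)), stored in token_hash
	Salt  string // hex-encoded 32-byte salt, stored in salt
}

// IssueToken creates a new token unless the gateway is already at the limit.
func IssueToken(activeCount int) (*IssuedToken, error) {
	if activeCount >= maxActiveTokens {
		return nil, ErrTokenLimit
	}
	raw := make([]byte, 32)
	salt := make([]byte, 32)
	if _, err := rand.Read(raw); err != nil {
		return nil, err
	}
	if _, err := rand.Read(salt); err != nil {
		return nil, err
	}
	// Assumed encoding: URL-safe base64 without padding.
	plain := base64.RawURLEncoding.EncodeToString(raw)
	sum := sha256.Sum256(append([]byte(plain), salt...))
	return &IssuedToken{
		Plain: plain,
		Hash:  hex.EncodeToString(sum[:]),
		Salt:  hex.EncodeToString(salt),
	}, nil
}

func main() {
	tok, err := IssueToken(1)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(tok.Plain), len(tok.Hash), len(tok.Salt))
}
```

Both the hex hash and the hex salt are 64 characters long, which fits the VARCHAR(255) columns in gateway_tokens.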

2. Revoke Gateway Token

DELETE /orgs/{orgName}/gateways/{gatewayID}/tokens/{tokenID}

Response: 204 No Content

Behavior:

Sets status = 'revoked' and revoked_at = NOW()
Gateway connections using revoked token are forcefully closed
Revoked tokens cannot be un-revoked (create new token instead)

Environment Management Endpoints

1. Create Environment

POST /orgs/{orgName}/environments
Content-Type: application/json

{
  "name": "production",
  "displayName": "Production Environment",
  "description": "Production deployment environment",
  "dataplaneRef": "us-east-dataplane",
  "dnsPrefix": "prod",
  "isProduction": true
}

2. Map Gateway to Environment

POST /orgs/{orgName}/gateways/{gatewayID}/environments/{envID}

Response: 201 Created

Behavior:

Creates gateway-environment mapping
Idempotent (returns 200 OK if mapping already exists)
Validates both gateway and environment belong to same organization
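
The idempotent behavior can be sketched with an in-memory stand-in for gateway_environment_mappings. `MappingStore` is hypothetical, and returning 400 for a cross-organization request is an assumption; the proposal does not name the status code for that case.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
)

type mappingKey struct{ gatewayID, envID string }

// MappingStore is an in-memory stand-in for gateway_environment_mappings.
type MappingStore struct {
	mu   sync.Mutex
	rows map[mappingKey]struct{}
}

// NewMappingStore returns an empty store.
func NewMappingStore() *MappingStore {
	return &MappingStore{rows: make(map[mappingKey]struct{})}
}

// Map records the mapping and returns the HTTP status the endpoint would send.
func (s *MappingStore) Map(gatewayOrg, envOrg, gatewayID, envID string) int {
	if gatewayOrg != envOrg {
		return http.StatusBadRequest // assumed status for a cross-organization request
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	k := mappingKey{gatewayID, envID}
	if _, exists := s.rows[k]; exists {
		return http.StatusOK // mapping already present: idempotent success
	}
	s.rows[k] = struct{}{}
	return http.StatusCreated
}

func main() {
	s := NewMappingStore()
	fmt.Println(s.Map("acme", "acme", "gw-1", "env-1"))
	fmt.Println(s.Map("acme", "acme", "gw-1", "env-1"))
}
```

In the database-backed version, the UNIQUE(gateway_uuid, environment_uuid) constraint plays the role of the map lookup here.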

3. Remove Gateway from Environment

DELETE /orgs/{orgName}/gateways/{gatewayID}/environments/{envID}

Response: 204 No Content

4. List Gateway Environments

GET /orgs/{orgName}/gateways/{gatewayID}/environments

Response:
{
  "environments": [
    {
      "id": "env-uuid",
      "name": "production",
      "displayName": "Production",
      "isProduction": true,
      ...
    }
  ]
}

5. List Environment Gateways

GET /orgs/{orgName}/environments/{envID}/gateways

Response:
{
  "gateways": [
    {
      "uuid": "gw-uuid",
      "name": "prod-gateway-1",
      "status": "ACTIVE",
      ...
    }
  ]
}

Gateway Internal API

These endpoints are called by gateways (not by users). They use gateway token authentication instead of JWT.

1. Get LLM Provider Details

GET /internal/llm-providers/{providerId}
Authorization: Bearer <gateway-token>

Response:
{
  "uuid": "provider-uuid",
  "configuration": {
    "name": "openai-provider",
    "version": "v1",
    "upstream": {
      "url": "https://api.openai.com/v1"
    },
    "accessControl": {
      "mode": "ALLOW_ALL"
    }
  },
  "openapi": "openapi: 3.0.0\n...",
  "modelProviders": [
    {
      "id": "openai",
      "models": [
        {"id": "gpt-4", "name": "GPT-4"}
      ]
    }
  ]
}

Purpose:
When a gateway receives a deployment event, it calls this endpoint to fetch the full LLM provider configuration and OpenAPI spec for dynamic route creation.

2. Report Deployment Status

POST /internal/deployments/{deploymentId}/status
Authorization: Bearer <gateway-token>
Content-Type: application/json

{
  "status": "deployed",
  "message": "Successfully deployed LLM provider"
}

Status Values:

deployed - Successfully applied configuration
failed - Deployment failed (includes error message)
undeployed - Successfully removed configuration

WebSocket Architecture

Connection Lifecycle

1. Gateway Connection Establishment

Gateway → Control Plane
WebSocket Upgrade: wss://control-plane/gateways/ws
Authorization: Bearer <gateway-token>

Handshake Flow:

1. Gateway sends WebSocket upgrade request with token in Authorization header
2. Control plane validates token:
   - Retrieves all active tokens from gateway_tokens table
   - Computes hash of provided token with each stored salt
   - Constant-time comparison against stored hashes
3. If valid:
   - Accept WebSocket connection
   - Add connection to WebSocket manager
   - Update gateway.is_active = true in database
   - Send connection_established event
4. If invalid:
   - Reject WebSocket upgrade (401 Unauthorized)
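
Step 2 of the handshake can be sketched as follows. `TokenRecord`, `BearerToken`, and `Authenticate` are hypothetical names, and the records are assumed to hold hex-encoded hashes and salts as in the verification snippet later in this proposal.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
	"strings"
)

// TokenRecord is a hypothetical in-memory view of an active gateway_tokens row.
type TokenRecord struct {
	GatewayID string
	HashHex   string
	SaltHex   string
}

// hashToken computes hex(SHA-256(token || salt)).
func hashToken(token string, salt []byte) string {
	sum := sha256.Sum256(append([]byte(token), salt...))
	return hex.EncodeToString(sum[:])
}

// NewTokenRecord builds a record for a known token; used here for the demo.
func NewTokenRecord(gatewayID, token string, salt []byte) TokenRecord {
	return TokenRecord{GatewayID: gatewayID, HashHex: hashToken(token, salt), SaltHex: hex.EncodeToString(salt)}
}

// BearerToken extracts the token from an "Authorization: Bearer <token>" value.
func BearerToken(header string) (string, bool) {
	const prefix = "Bearer "
	if !strings.HasPrefix(header, prefix) || len(header) == len(prefix) {
		return "", false
	}
	return header[len(prefix):], true
}

// Authenticate returns the ID of the gateway that owns the presented token.
func Authenticate(header string, active []TokenRecord) (string, bool) {
	token, ok := BearerToken(header)
	if !ok {
		return "", false
	}
	for _, rec := range active {
		salt, err := hex.DecodeString(rec.SaltHex)
		if err != nil {
			continue // skip malformed rows rather than failing the whole check
		}
		got := hashToken(token, salt)
		if subtle.ConstantTimeCompare([]byte(got), []byte(rec.HashHex)) == 1 {
			return rec.GatewayID, true
		}
	}
	return "", false
}

func main() {
	rec := NewTokenRecord("gw-1", "secret", []byte("example-salt"))
	fmt.Println(Authenticate("Bearer secret", []TokenRecord{rec}))
	fmt.Println(Authenticate("Bearer wrong", []TokenRecord{rec}))
}
```

Because each token has its own salt, the check has to hash the presented token once per active row, so the cost grows with the number of active tokens across all gateways.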

2. Heartbeat Mechanism

Every 30 seconds:
  Gateway → Control Plane: {"type": "ping", "timestamp": "..."}
  Control Plane → Gateway: {"type": "pong", "timestamp": "..."}

If no heartbeat received for 90 seconds:
  - Control plane marks gateway.is_active = false
  - Control plane closes WebSocket connection
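
The timeout rule above reduces to a comparison against the last ping time. This is a sketch under assumptions: `IsStale` and `StaleGateways` are hypothetical helpers, and treating exactly 90 seconds as stale is a choice the proposal leaves open.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

const (
	HeartbeatInterval = 30 * time.Second
	HeartbeatTimeout  = 90 * time.Second
)

// referenceTime is a fixed clock value so the example is deterministic.
var referenceTime = time.Date(2026, 2, 13, 12, 0, 0, 0, time.UTC)

// IsStale reports whether no heartbeat has arrived within the timeout.
func IsStale(lastPing, now time.Time) bool {
	return now.Sub(lastPing) >= HeartbeatTimeout
}

// StaleGateways returns the sorted IDs of gateways that should be marked
// inactive and disconnected.
func StaleGateways(lastPing map[string]time.Time, now time.Time) []string {
	var stale []string
	for id, t := range lastPing {
		if IsStale(t, now) {
			stale = append(stale, id)
		}
	}
	sort.Strings(stale)
	return stale
}

func main() {
	now := referenceTime
	pings := map[string]time.Time{
		"gw-fresh": now.Add(-2 * HeartbeatInterval), // last ping 60s ago
		"gw-stale": now.Add(-4 * HeartbeatInterval), // last ping 120s ago
	}
	fmt.Println(StaleGateways(pings, now))
}
```

With a 30-second interval, the 90-second timeout tolerates two consecutive missed pings before the gateway is marked inactive.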

3. Connection Termination

Graceful Shutdown:

Gateway → Control Plane: Close frame with code 1000
Control Plane:
  - Update gateway.is_active = false
  - Remove connection from WebSocket manager
  - Clean up resources

Unexpected Disconnect:

Control Plane detects broken connection:
  - Update gateway.is_active = false
  - Remove connection from WebSocket manager
  - Log disconnect event for monitoring

Event Types

1. LLM Provider Deployment Event

Trigger: User calls POST /orgs/{orgName}/llm-providers/{providerId}/deploy

Event Payload:

{
  "type": "llm.deployed",
  "payload": {
    "llmProviderId": "provider-uuid",
    "deploymentId": "deployment-uuid",
    "environmentId": "env-uuid",
    "gatewayId": "gateway-uuid"
  },
  "timestamp": "2026-02-13T12:00:00Z",
  "correlationId": "correlation-uuid"
}

Gateway Action:

Receive event via WebSocket
Call GET /internal/llm-providers/{providerId} to fetch configuration
Generate Envoy xDS configuration from OpenAPI spec
Apply configuration to Envoy control plane
Call POST /internal/deployments/{deploymentId}/status with result
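
On the gateway side, the first step is decoding the envelope shown above. The sketch below is illustrative only: `GatewayEvent`, `LLMDeployedPayload`, and `DecodeLLMDeployed` are hypothetical names, with JSON tags copied from the example payload.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// GatewayEvent is the envelope; Payload stays raw until the type is known.
type GatewayEvent struct {
	Type          string          `json:"type"`
	Payload       json.RawMessage `json:"payload"`
	Timestamp     string          `json:"timestamp"`
	CorrelationID string          `json:"correlationId"`
}

// LLMDeployedPayload matches the payload of an llm.deployed event.
type LLMDeployedPayload struct {
	LLMProviderID string `json:"llmProviderId"`
	DeploymentID  string `json:"deploymentId"`
	EnvironmentID string `json:"environmentId"`
	GatewayID     string `json:"gatewayId"`
}

// DecodeLLMDeployed parses a raw WebSocket message and returns the payload
// if the message is an llm.deployed event.
func DecodeLLMDeployed(raw []byte) (*LLMDeployedPayload, error) {
	var ev GatewayEvent
	if err := json.Unmarshal(raw, &ev); err != nil {
		return nil, err
	}
	if ev.Type != "llm.deployed" {
		return nil, fmt.Errorf("unexpected event type %q", ev.Type)
	}
	var p LLMDeployedPayload
	if err := json.Unmarshal(ev.Payload, &p); err != nil {
		return nil, err
	}
	return &p, nil
}

func main() {
	msg := []byte(`{"type":"llm.deployed","payload":{"llmProviderId":"provider-uuid","deploymentId":"deployment-uuid"}}`)
	p, err := DecodeLLMDeployed(msg)
	fmt.Println(p, err)
}
```

Keeping the payload as `json.RawMessage` lets one envelope type serve every event, with the concrete payload chosen after inspecting `type`.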

2. LLM Provider Undeployment Event

Trigger: User calls DELETE /orgs/{orgName}/llm-providers/{providerId}/deployments/{deploymentId}

Event Payload:

{
  "type": "llm.undeployed",
  "payload": {
    "llmProviderId": "provider-uuid",
    "deploymentId": "deployment-uuid",
    "gatewayId": "gateway-uuid"
  },
  "timestamp": "2026-02-13T12:05:00Z",
  "correlationId": "correlation-uuid"
}

Gateway Action:

Receive event via WebSocket
Remove LLM provider routes from Envoy configuration
Call POST /internal/deployments/{deploymentId}/status with status=undeployed

3. API Deployment Event (Future)

{
  "type": "api.deployed",
  "payload": {
    "apiId": "api-uuid",
    "deploymentId": "deployment-uuid",
    "gatewayId": "gateway-uuid"
  },
  ...
}

Multi-Gateway Deployment Flow

When deploying to an environment (not a specific gateway), the control plane broadcasts to all gateways in that environment.

Example:
Deploy LLM provider to "production" environment with 3 gateways.

1. User: POST /orgs/acme/llm-providers/provider-1/deploy
   Body: {"environmentId": "prod-env-uuid"}

2. Control plane:
   - Query gateway_environment_mappings for prod-env-uuid
   - Finds: gateway-1, gateway-2, gateway-3
   - Create 3 deployment records (one per gateway)

3. For each gateway:
   - Lookup active WebSocket connection in WebSocket manager
   - Send llm.deployed event via WebSocket
   - Wait for status callback from gateway

4. Gateway (each independently):
   - Receive event
   - Fetch LLM provider config
   - Apply to local Envoy
   - Report status (deployed/failed)

5. Control plane:
   - Update deployment_status for each gateway
   - Return overall deployment status to user

Partial Failure Handling:
If deployment succeeds on gateway-1 and gateway-2 but fails on gateway-3, the deployment is marked as "PARTIALLY_DEPLOYED" with detailed per-gateway status.
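
One possible way to derive the overall status from the per-gateway results is sketched below. DEPLOYED and PARTIALLY_DEPLOYED appear in this proposal; FAILED and PENDING as overall values, and reusing NOT_DEPLOYED for an empty gateway list, are assumptions made for the example.

```go
package main

import "fmt"

// Aggregate reduces per-gateway deployment statuses to one overall status.
func Aggregate(statuses []string) string {
	if len(statuses) == 0 {
		return "NOT_DEPLOYED" // assumed: nothing was targeted
	}
	var deployed, failed int
	for _, s := range statuses {
		switch s {
		case "deployed":
			deployed++
		case "failed":
			failed++
		}
	}
	switch {
	case deployed == len(statuses):
		return "DEPLOYED"
	case failed == len(statuses):
		return "FAILED" // assumed name for "every gateway failed"
	case deployed > 0 && failed > 0 && deployed+failed == len(statuses):
		return "PARTIALLY_DEPLOYED"
	default:
		return "PENDING" // assumed: at least one gateway has not reported yet
	}
}

func main() {
	fmt.Println(Aggregate([]string{"deployed", "deployed", "failed"}))
}
```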

Security Model

Authentication

User Requests: JWT-based authentication (existing Agent Manager auth)
Gateway Requests: Token-based authentication using gateway tokens

Token Security

Storage:

Plain-text tokens are never persisted
Only SHA-256(token || salt) is stored in gateway_tokens.token_hash
Unique 32-byte salt per token stored in gateway_tokens.salt

Verification:

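// Requires crypto/sha256, crypto/subtle, and encoding/hex.
func verifyToken(plainToken, storedHash, storedSalt string) bool {
    salt, saltErr := hex.DecodeString(storedSalt)
    want, hashErr := hex.DecodeString(storedHash)
    if saltErr != nil || hashErr != nil {
        return false
    }
    got := sha256.Sum256(append([]byte(plainToken), salt...)) // SHA-256(token || salt)
    return subtle.ConstantTimeCompare(got[:], want) == 1
}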

Rotation:

Maximum 2 active tokens per gateway (supports zero-downtime rotation)
Old token remains valid until explicitly revoked
Rotation workflow:
1. Generate new token (now 2 active tokens)
2. Update gateway to use new token
3. Revoke old token (back to 1 active token)

Authorization

Gateway Scoping:
When a gateway connects via WebSocket, its identity is bound to the connection. The gateway can only:

Receive events for its own gateway UUID
Fetch configurations for artifacts deployed to it
Report status for its own deployments

Organization Isolation:
All gateway operations are scoped to organizations. Gateways cannot access resources from other organizations.

Deployment Orchestration

LLM Provider Deployment

User-Initiated Flow:

1. User creates LLM provider:
   POST /orgs/{orgName}/llm-providers
   {
     "name": "openai-provider",
     "templateId": "openai-template-uuid",
     "configuration": {...}
   }

   → Creates llm_providers record
   → Status: NOT_DEPLOYED

2. User deploys to environment:
   POST /orgs/{orgName}/llm-providers/{providerId}/deploy
   {
     "environmentId": "prod-env-uuid"
   }

   → LLM Deployment Service:
     a. Query gateways in environment
     b. For each gateway:
        - Create llm_provider_deployments record (status=pending)
        - Check if gateway is connected (is_active=true)
        - If connected: send llm.deployed event
        - If disconnected: deployment remains pending

   → Gateway Events Service:
     - Serialize deployment payload
     - Validate payload size (< 1MB)
     - Broadcast to target gateway via WebSocket

   → WebSocket Manager:
     - Lookup connection by gateway UUID
     - Send JSON event over WebSocket

3. Gateway receives event:
   - Parse llm.deployed event
   - Fetch full config via GET /internal/llm-providers/{providerId}
   - Generate Envoy xDS config
   - Apply to Envoy control plane
   - Report status via POST /internal/deployments/{deploymentId}/status

4. Control plane receives status:
   - Update llm_provider_deployments.deployment_status
   - Update llm_provider_deployments.deployed_at
   - Return success/failure to user

Offline Gateway Handling:

If a gateway is offline (is_active = false) when deployment is initiated:

Deployment record is created with status = pending
Event is not sent (connection doesn't exist)
When gateway reconnects:
- Control plane queries pending deployments
- Sends all pending deployment events
- Gateway processes backlog and reports status
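
The reconnect path could look roughly like the sketch below, which assumes the pending records have already been loaded and that `send` wraps the WebSocket write. `Deployment`, `PendingFor`, and `ReplayPending` are hypothetical names.

```go
package main

import "fmt"

// Deployment is a trimmed, hypothetical view of an llm_provider_deployments row.
type Deployment struct {
	ID        string
	GatewayID string
	Status    string
}

// PendingFor returns the deployments still pending for one gateway, in
// their stored order.
func PendingFor(all []Deployment, gatewayID string) []Deployment {
	var out []Deployment
	for _, d := range all {
		if d.GatewayID == gatewayID && d.Status == "pending" {
			out = append(out, d)
		}
	}
	return out
}

// ReplayPending sends one event per pending deployment and returns how many
// were sent. It stops at the first send error, leaving the rest pending.
func ReplayPending(all []Deployment, gatewayID string, send func(Deployment) error) (int, error) {
	sent := 0
	for _, d := range PendingFor(all, gatewayID) {
		if err := send(d); err != nil {
			return sent, err
		}
		sent++
	}
	return sent, nil
}

func main() {
	all := []Deployment{
		{ID: "d1", GatewayID: "gw-1", Status: "pending"},
		{ID: "d2", GatewayID: "gw-1", Status: "deployed"},
		{ID: "d3", GatewayID: "gw-2", Status: "pending"},
	}
	n, err := ReplayPending(all, "gw-1", func(d Deployment) error {
		fmt.Println("sending llm.deployed for", d.ID)
		return nil
	})
	fmt.Println(n, err)
}
```

Records stay pending until the gateway reports back, so a replay interrupted by another disconnect can simply be repeated on the next reconnect.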

Multi-Gateway Deployment

Example Scenario:
Deploy openai-provider to environment "production" which has gateways: prod-us-east, prod-us-west, prod-eu-west.

POST /orgs/acme/llm-providers/openai-provider/deploy
{
  "environmentId": "production-uuid"
}

Control Plane Actions:
1. Query gateway_environment_mappings WHERE environment_uuid = 'production-uuid'
   → Returns: [prod-us-east, prod-us-west, prod-eu-west]

2. For each gateway, create deployment record:
   - llm_provider_deployments (provider=openai, gateway=prod-us-east, status=pending)
   - llm_provider_deployments (provider=openai, gateway=prod-us-west, status=pending)
   - llm_provider_deployments (provider=openai, gateway=prod-eu-west, status=pending)

3. Broadcast events in parallel:
   - Send llm.deployed to prod-us-east (if connected)
   - Send llm.deployed to prod-us-west (if connected)
   - Send llm.deployed to prod-eu-west (if connected)

4. Each gateway independently:
   - Fetches configuration
   - Applies to Envoy
   - Reports status

5. User receives aggregated response:
   {
     "deploymentId": "deployment-uuid",
     "status": "DEPLOYED",
     "gateways": [
       {"name": "prod-us-east", "status": "deployed"},
       {"name": "prod-us-west", "status": "deployed"},
       {"name": "prod-eu-west", "status": "deployed"}
     ]
   }

Implementation Details

Technology Stack

Control Plane:

Language: Go 1.24+
HTTP Server: net/http with http.ServeMux
Dependency Injection: Wire
Database: PostgreSQL with GORM
WebSocket: gorilla/websocket
Authentication: JWT for users, custom token auth for gateways

Data Plane (Gateway):

Proxy: Envoy (xDS API for dynamic configuration)
WebSocket Client: Custom Go client
Configuration Management: In-memory xDS snapshot cache

Key Services

1. Platform Gateway Service

Responsibilities:

Gateway CRUD operations
Token generation, verification, rotation
Gateway-environment mapping management
Active status tracking

Methods:

type PlatformGatewayService interface {
    RegisterGateway(orgID, name, displayName, vhost string, ...) (*GatewayResponse, error)
    ListGateways(orgID *string) (*GatewayListResponse, error)
    GetGateway(gatewayID, orgID string) (*GatewayResponse, error)
    UpdateGateway(gatewayID, orgID string, ...) (*GatewayResponse, error)
    DeleteGateway(gatewayID, orgID string) error
    RotateToken(gatewayID, orgID string) (*TokenRotationResponse, error)
    VerifyToken(plainToken string) (*models.Gateway, error)
    UpdateGatewayActiveStatus(gatewayID string, isActive bool) error
}

2. WebSocket Manager

Responsibilities:

Accept gateway WebSocket connections
Maintain active connection registry
Route events to specific gateways
Heartbeat monitoring
Connection lifecycle management

Key Data Structures:

type Manager struct {
    connections map[string]*Connection  // gatewayID → connection
    mu          sync.RWMutex
}

type Connection struct {
    gatewayID   string
    orgID       string
    conn        *websocket.Conn
    sendCh      chan []byte
    closeCh     chan struct{}
    lastPing    time.Time
}
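
A minimal sketch of how `SendToGateway` might use these structures is shown below. The `Connection` here is trimmed to the fields the example needs (the `*websocket.Conn` is omitted so the code stays stdlib-only), and `Register`, `ErrGatewayNotConnected`, and `ErrSendBufferFull` are hypothetical additions.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var (
	ErrGatewayNotConnected = errors.New("gateway not connected")
	ErrSendBufferFull      = errors.New("gateway send buffer full")
)

// Connection is a trimmed version of the struct above.
type Connection struct {
	gatewayID string
	sendCh    chan []byte
}

// Manager keeps the registry of active gateway connections.
type Manager struct {
	mu          sync.RWMutex
	connections map[string]*Connection // gatewayID -> connection
}

// NewManager returns a Manager with an empty registry.
func NewManager() *Manager {
	return &Manager{connections: make(map[string]*Connection)}
}

// Register adds a connection with a bounded outgoing buffer.
func (m *Manager) Register(gatewayID string, buffer int) *Connection {
	c := &Connection{gatewayID: gatewayID, sendCh: make(chan []byte, buffer)}
	m.mu.Lock()
	m.connections[gatewayID] = c
	m.mu.Unlock()
	return c
}

// SendToGateway queues a message for one gateway without blocking the caller.
func (m *Manager) SendToGateway(gatewayID string, msg []byte) error {
	m.mu.RLock()
	c, ok := m.connections[gatewayID]
	m.mu.RUnlock()
	if !ok {
		return ErrGatewayNotConnected
	}
	select {
	case c.sendCh <- msg:
		return nil
	default:
		return ErrSendBufferFull // a slow gateway must not stall deployments to others
	}
}

func main() {
	m := NewManager()
	m.Register("gw-1", 1)
	fmt.Println(m.SendToGateway("gw-1", []byte(`{"type":"llm.deployed"}`)))
	fmt.Println(m.SendToGateway("gw-2", nil))
}
```

The read lock is released before the channel send, so a full buffer on one connection never holds up lookups for other gateways.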

3. Gateway Events Service

Responsibilities:

Event serialization and validation
Event broadcasting to target gateways
Correlation ID tracking
Event payload size limits (1MB)

Event Publishing:

func (s *GatewayEventsService) BroadcastDeploymentEvent(
    gatewayID string,
    deployment *DeploymentEvent,
) error {
    event := GatewayEventDTO{
        Type:          "llm.deployed",
        Payload:       deployment,
        Timestamp:     time.Now().Format(time.RFC3339),
        CorrelationID: uuid.New().String(),
    }

    // Serialize first so the size check covers the whole event envelope.
    eventJSON, err := json.Marshal(event)
    if err != nil {
        return fmt.Errorf("marshal gateway event: %w", err)
    }
    if len(eventJSON) > MaxEventPayloadSize {
        return ErrPayloadTooLarge
    }

    return s.manager.SendToGateway(gatewayID, eventJSON)
}

4. LLM Deployment Service

Responsibilities:

Multi-gateway deployment orchestration
Deployment record creation and tracking
Environment-based gateway resolution
Deployment status aggregation

Deployment Flow:

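func (s *LLMDeploymentService) DeployToEnvironment(
    providerID, envID, orgID string,
) (*DeploymentResponse, error) {
    // uuid.Parse returns (UUID, error), so parse both IDs once up front.
    providerUUID, err := uuid.Parse(providerID)
    if err != nil {
        return nil, fmt.Errorf("invalid provider ID: %w", err)
    }
    envUUID, err := uuid.Parse(envID)
    if err != nil {
        return nil, fmt.Errorf("invalid environment ID: %w", err)
    }

    // 1. Get all gateways in environment
    gateways, err := s.gatewayRepo.GetByEnvironmentID(envID)
    if err != nil {
        return nil, err
    }

    // 2. Create deployment records for each gateway
    deployments := make([]*models.LLMProviderDeployment, 0, len(gateways))
    for _, gw := range gateways {
        deployment := &models.LLMProviderDeployment{
            UUID:             uuid.New(),
            LLMProviderUUID:  providerUUID,
            GatewayUUID:      gw.UUID,
            EnvironmentUUID:  envUUID,
            DeploymentStatus: "pending",
            CreatedAt:        time.Now(),
        }
        s.deploymentRepo.Create(deployment)
        deployments = append(deployments, deployment)
    }

    // 3. Send events to connected gateways. deployments[i] was created for
    //    gateways[i], so the two slices are walked in step.
    for i, deployment := range deployments {
        gw := gateways[i]
        if !gw.IsActive {
            continue // offline: the record stays pending until the gateway reconnects
        }
        // A failed send also leaves the record pending.
        _ = s.eventsService.BroadcastDeploymentEvent(
            gw.UUID.String(),
            &DeploymentEvent{
                LLMProviderID: providerID,
                DeploymentID:  deployment.UUID.String(),
                GatewayID:     gw.UUID.String(),
            },
        )
    }

    return BuildDeploymentResponse(deployments), nil
}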

Database Indexes

Performance-Critical Indexes:

-- Gateway lookups
CREATE INDEX idx_gateways_org ON gateways(organization_uuid);
CREATE INDEX idx_gateways_active ON gateways(is_active);
CREATE INDEX idx_gateways_functionality_type ON gateways(gateway_functionality_type);

-- Gateway token lookups
CREATE INDEX idx_gateway_tokens_gateway ON gateway_tokens(gateway_uuid);
CREATE INDEX idx_gateway_tokens_status ON gateway_tokens(gateway_uuid, status);
CREATE INDEX idx_gateway_tokens_active ON gateway_tokens(status) WHERE status = 'active';

-- Environment mappings
CREATE INDEX idx_gem_gateway ON gateway_environment_mappings(gateway_uuid);
CREATE INDEX idx_gem_environment ON gateway_environment_mappings(environment_uuid);

-- Deployment lookups
CREATE INDEX idx_llm_deployments_provider ON llm_provider_deployments(llm_provider_uuid);
CREATE INDEX idx_llm_deployments_gateway ON llm_provider_deployments(gateway_uuid);
CREATE INDEX idx_llm_deployments_status ON llm_provider_deployments(deployment_status);

UX

Gateway listing view

Out of Scope

Not Included in This Implementation

Gateway Auto-Discovery - Gateways must be manually registered; automatic discovery via service mesh or K8s API is not included.
Gateway Load Balancing Configuration - Control plane does not manage load balancer configuration; this is handled externally (e.g., AWS ALB, NGINX).
Gateway Autoscaling - Dynamic scaling of gateway instances based on load is not managed by the control plane.
Advanced Deployment Strategies - Blue/green deployments, canary releases, and phased rollouts are not supported.
Configuration Rollback - Automatic rollback of failed deployments is not implemented; manual undeployment is required.
Gateway Monitoring/Observability - Metrics collection, logging aggregation, and distributed tracing are out of scope.
Gateway-to-Gateway Communication - Service mesh features like inter-gateway routing and traffic splitting are not included.
Custom Gateway Plugins - Dynamic loading of custom Envoy filters or plugins is not supported.
Multi-Cloud Gateway Orchestration - Managing gateways across multiple cloud providers (AWS, Azure, GCP) with cloud-specific features is not included.
Gateway Backup/Restore - Disaster recovery and gateway configuration backup/restore are not covered.
LLM Provider and LLM Proxy Management - Creating and managing the lifecycle of LLM providers and proxies is not covered by this proposal.

Key Decisions Made

Implement the gateway control plane in Agent Manager

The api-platform control plane can already handle AI gateway management. However, communicating directly with the control plane API has several gaps and dependencies:

A special JWT is required for communication.
An extra workload must be deployed for the control plane.
Features such as policy management are missing, which would limit Agent Manager's capabilities.

It was therefore decided to implement the gateway control plane functionality in Agent Manager, which offers:

More flexibility to implement the necessary features without waiting for an api-platform release.
Direct communication without additional network hops, and a lighter, more cost-effective deployment.
