Add Kubernetes Support for LLM Katan #278

@Xunzhuo

Description

Is your feature request related to a problem? Please describe.

Currently, LLM Katan (the lightweight LLM server for testing) only supports Docker deployment and direct installation via PyPI. While the main Semantic Router project has comprehensive Kubernetes support with manifests in deploy/kubernetes/, LLM Katan lacks dedicated Kubernetes deployment configurations. This creates a gap for users who want to:

  1. Deploy LLM Katan in Kubernetes environments for testing and development workflows
  2. Scale LLM Katan instances horizontally for multi-tenant testing scenarios
  3. Integrate with existing Kubernetes-based CI/CD pipelines that require containerized LLM endpoints
  4. Leverage Kubernetes features like ConfigMaps, Secrets, and persistent volumes for model storage
  5. Use Kubernetes networking for service discovery and load balancing across multiple LLM Katan instances

The current Docker-only approach limits adoption in cloud-native environments where Kubernetes is the standard orchestration platform.

Describe the solution you'd like

Add comprehensive Kubernetes support for LLM Katan with the following components:

1. Kubernetes Manifests

  • Deployment: Configure LLM Katan pods with proper resource limits and health checks
  • Service: Expose LLM Katan API endpoints (HTTP port 8000) within the cluster
  • ConfigMap: Store LLM Katan configuration (model settings, served model names, etc.); sketched after this list
  • PersistentVolumeClaim: Optional storage for model caching and persistence; also sketched below
  • Namespace: Dedicated namespace for LLM Katan deployments
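
A minimal sketch of the ConfigMap and PersistentVolumeClaim proposed above, assuming the dedicated llm-katan namespace; the key names and cache size are illustrative, not an existing LLM Katan configuration format:

# Hypothetical ConfigMap holding model settings (key names are assumptions)
apiVersion: v1
kind: ConfigMap
metadata:
  name: llm-katan-config
  namespace: llm-katan
data:
  model: "Qwen/Qwen3-0.6B"
  served-model-name: "gpt-3.5-turbo"
---
# Optional PVC for the Hugging Face model cache (size is a placeholder)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-katan-model-cache
  namespace: llm-katan
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 5Gi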

2. Helm Chart (Optional)

  • Parameterized deployment with values.yaml for easy customization (a sketch follows this list)
  • Support for multiple model configurations
  • Resource scaling based on workload requirements
  • Integration with existing monitoring solutions
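
A hypothetical values.yaml sketch for the optional chart; every key below is a proposal, not an existing chart schema:

# Proposed values.yaml (all keys illustrative)
replicaCount: 2
image:
  repository: llm-katan
  tag: latest
models:
  - servedName: "gpt-3.5-turbo"
    backing: "Qwen/Qwen3-0.6B"
  - servedName: "claude-3-haiku"
    backing: "Qwen/Qwen3-0.6B"
resources:
  requests:
    memory: 1Gi
    cpu: 500m
  limits:
    memory: 2Gi
    cpu: 1000m
persistence:
  enabled: false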

3. Multi-Instance Support

  • Deploy multiple LLM Katan instances with different model configurations
  • Service discovery for different served model names (e.g., "gpt-3.5-turbo", "claude-3-haiku"); a per-model Service sketch follows this list
  • Load balancing across instances for high availability
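
One way to realize this discovery, as a sketch: one Service per served model name, selecting pods by a served-model label (the label key/value scheme is an assumption). Test clients could then target, e.g., llm-katan-gpt35.llm-katan.svc.cluster.local:8000.

# Hypothetical per-model Service; labels match the Deployment example below
apiVersion: v1
kind: Service
metadata:
  name: llm-katan-gpt35
  namespace: llm-katan
spec:
  selector:
    app: llm-katan
    served-model: "gpt-3.5-turbo"
  ports:
  - port: 8000
    targetPort: 8000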

4. Documentation

  • Kubernetes deployment guide in e2e-tests/llm-katan/docs/kubernetes.md
  • Examples for common use cases (CI/CD testing, multi-model setups)
  • Integration patterns with the main Semantic Router

5. Example Configurations

# Example: Multi-instance deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-katan-gpt35
spec:
  replicas: 2
  # selector/labels are required for apps/v1 Deployments and let the
  # per-model Service sketched above route to this instance
  selector:
    matchLabels:
      app: llm-katan
      served-model: "gpt-3.5-turbo"
  template:
    metadata:
      labels:
        app: llm-katan
        served-model: "gpt-3.5-turbo"
    spec:
      containers:
      - name: llm-katan
        image: llm-katan:latest
        args: ["--model", "Qwen/Qwen3-0.6B", "--served-model-name", "gpt-3.5-turbo"]
        ports:
        - containerPort: 8000
        resources:
          requests:
            memory: "1Gi"
            cpu: "500m"
          limits:
            memory: "2Gi"
            cpu: "1000m"

6. Integration with Existing Infrastructure

  • Leverage patterns from deploy/kubernetes/ in the main project
  • Compatibility with the existing Docker Compose setup
  • Support for the same configuration options as the CLI

Additional context

Current State:

  • LLM Katan has Docker support (e2e-tests/llm-katan/Dockerfile)
  • Docker Compose integration exists in the main project
  • CLI supports all necessary configuration options
  • FastAPI-based architecture is ready for cloud-native deployment

Benefits:

  • Testing at Scale: Enable large-scale testing scenarios with multiple model endpoints
  • Cloud-Native Integration: Seamless deployment in modern Kubernetes environments
  • Resource Management: Better resource allocation and monitoring through Kubernetes
  • Service Mesh Compatibility: Integration with Istio, Linkerd, and other service mesh solutions
  • GitOps Workflows: Support for ArgoCD, Flux, and other GitOps tools

Implementation Considerations:

  • Maintain compatibility with existing Docker and PyPI deployment methods
  • Follow Kubernetes best practices for security (non-root containers, resource limits); see the sketch after this list
  • Support both CPU and GPU workloads (though LLM Katan is designed for tiny models)
  • Provide examples for both development (kind/minikube) and production environments
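
As a sketch of the non-root guidance above (the user ID is illustrative and assumes the image supports running unprivileged):

# Hypothetical pod applying non-root, least-privilege settings
apiVersion: v1
kind: Pod
metadata:
  name: llm-katan-hardened-example
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
  containers:
  - name: llm-katan
    image: llm-katan:latest
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
    resources:
      limits:
        memory: "2Gi"
        cpu: "1000m"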

Related Work:

  • Main Semantic Router already has comprehensive Kubernetes support
  • Docker Compose setup provides a good reference for service configuration
  • Existing Dockerfile can be reused with minimal modifications

/area environment user-experience
/milestone v0.1
/priority P1
