SGLang Kubernetes Deployment Configurations

This directory contains Kubernetes Custom Resource Definition (CRD) templates for deploying SGLang inference graphs using the DynamoGraphDeployment resource.

Available Deployment Patterns

1. Aggregated Deployment (`agg.yaml`)

Basic deployment pattern with frontend and a single decode worker.

Architecture:

Frontend: OpenAI-compatible API server
SGLangDecodeWorker: Single worker handling both prefill and decode

2. Aggregated Router Deployment (`agg_router.yaml`)

Enhanced aggregated deployment with KV cache routing capabilities.

Architecture:

Frontend: OpenAI-compatible API server with router mode enabled (--router-mode kv)
SGLangDecodeWorker: Single worker handling both prefill and decode

3. Disaggregated Deployment (`disagg.yaml`)**

High-performance deployment with separated prefill and decode workers.

Architecture:

Frontend: HTTP API server coordinating between workers
SGLangDecodeWorker: Specialized decode-only worker (--disaggregation-mode decode)
SGLangPrefillWorker: Specialized prefill-only worker (--disaggregation-mode prefill)
Communication via NIXL transfer backend (--disaggregation-transfer-backend nixl)

CRD Structure

All templates use the DynamoGraphDeployment CRD:

apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: <deployment-name>
spec:
  services:
    <ServiceName>:
      # Service configuration

Key Configuration Options

Resource Management:

resources:
  requests:
    cpu: "10"
    memory: "20Gi"
    gpu: "1"
  limits:
    cpu: "10"
    memory: "20Gi"
    gpu: "1"

Container Configuration:

extraPodSpec:
  mainContainer:
    image: my-registry/sglang-runtime:my-tag
    workingDir: /workspace/examples/backends/sglang
    args:
      - "python3"
      - "-m"
      - "dynamo.sglang"
      # Model-specific arguments

Prerequisites

Before using these templates, ensure you have:

Dynamo Kubernetes Platform installed - See Installing Dynamo Kubernetes Platform
Kubernetes cluster with GPU support
Container registry access for SGLang runtime images
HuggingFace token secret (referenced as envFromSecret: hf-token-secret)

Usage

1. Choose Your Template

Select the deployment pattern that matches your requirements:

Use agg.yaml for development/testing
Use agg_router.yaml for production with load balancing
Use disagg.yaml for maximum performance

2. Customize Configuration

Edit the template to match your environment:

# Update image registry and tag
image: my-registry/sglang-runtime:my-tag

# Configure your model
args:
  - "--model-path"
  - "your-org/your-model"
  - "--served-model-name"
  - "your-org/your-model"

3. Deploy

Use the following command to deploy the deployment file.

First, create a secret for the HuggingFace token.

export HF_TOKEN=your_hf_token
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN=${HF_TOKEN} \
  -n ${NAMESPACE}

Then, deploy the model using the deployment file.

export DEPLOYMENT_FILE=agg.yaml
kubectl apply -f $DEPLOYMENT_FILE -n ${NAMESPACE}

4. Using Custom Dynamo Frameworks Image for SGLang

To use a custom dynamo frameworks image for SGLang, you can update the deployment file using yq:

export DEPLOYMENT_FILE=agg.yaml
export FRAMEWORK_RUNTIME_IMAGE=<sglang-image>

yq '.spec.services.[].extraPodSpec.mainContainer.image = env(FRAMEWORK_RUNTIME_IMAGE)' $DEPLOYMENT_FILE  > $DEPLOYMENT_FILE.generated
kubectl apply -f $DEPLOYMENT_FILE.generated -n $NAMESPACE

Model Configuration

All templates use DeepSeek-R1-Distill-Llama-8B as the default model. But you can use any sglang argument and configuration. Key parameters:

Monitoring and Health

Frontend health endpoint: http://<frontend-service>:8000/health
Liveness probes: Check process health every 60s

Troubleshooting

Common issues and solutions:

Pod fails to start: Check image registry access and HuggingFace token secret
GPU not allocated: Verify cluster has GPU nodes and proper resource limits
Health check failures: Review model loading logs and increase initialDelaySeconds
Out of memory: Increase memory limits or reduce model batch size

For additional support, refer to the deployment guide.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SGLang Kubernetes Deployment Configurations

Available Deployment Patterns

1. Aggregated Deployment (`agg.yaml`)

2. Aggregated Router Deployment (`agg_router.yaml`)

3. Disaggregated Deployment (`disagg.yaml`)**

CRD Structure

Key Configuration Options

Prerequisites

Usage

1. Choose Your Template

2. Customize Configuration

3. Deploy

4. Using Custom Dynamo Frameworks Image for SGLang

Model Configuration

Monitoring and Health

Further Reading

Troubleshooting

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

SGLang Kubernetes Deployment Configurations

Available Deployment Patterns

1. Aggregated Deployment (agg.yaml)

2. Aggregated Router Deployment (agg_router.yaml)

3. Disaggregated Deployment (disagg.yaml)**

CRD Structure

Key Configuration Options

Prerequisites

Usage

1. Choose Your Template

2. Customize Configuration

3. Deploy

4. Using Custom Dynamo Frameworks Image for SGLang

Model Configuration

Monitoring and Health

Further Reading

Troubleshooting

1. Aggregated Deployment (`agg.yaml`)

2. Aggregated Router Deployment (`agg_router.yaml`)

3. Disaggregated Deployment (`disagg.yaml`)**