
vLLM Production Stack helm chart

This Helm chart lets users deploy multiple serving engines and a router into a Kubernetes cluster.

Key features

  • Support running multiple serving engines with multiple different models
  • Load the model weights directly from the existing PersistentVolumes

Prerequisites

  1. A running Kubernetes cluster with GPUs. (You can set one up through minikube: https://minikube.sigs.k8s.io/docs/tutorials/nvidia/)
  2. Helm

Install the helm chart

helm dependency build
helm install llmstack . -f values-example.yaml

Uninstall the chart

helm uninstall llmstack

Configure the deployments

See helm/values.yaml for more details.

Production Stack Helm Chart Values Reference

This table documents all available configuration values for the Production Stack Helm chart.

Serving Engine Configuration

Field Type Default Description
servingEngineSpec.enableEngine boolean true Whether to enable the serving engine deployment
servingEngineSpec.labels map {environment: "test", release: "test"} Customized labels for the serving engine deployment
servingEngineSpec.vllmApiKey string/map null (Optional) API key for securing vLLM models. Can be a direct string or an object referencing an existing secret
servingEngineSpec.modelSpec list [] Array of specifications for configuring multiple serving engine deployments running different models
servingEngineSpec.containerPort integer 8000 Port the vLLM server container is listening on
servingEngineSpec.servicePort integer 80 Port the service will listen on
servingEngineSpec.configs map {} Set other environment variables from a config map
servingEngineSpec.strategy map {} Deployment strategy for the serving engine pods
servingEngineSpec.maxUnavailablePodDisruptionBudget string "" Configuration for the PodDisruptionBudget for the serving engine pods
servingEngineSpec.tolerations list [] Tolerations configuration for the serving engine pods (when there are taints on nodes)
servingEngineSpec.runtimeClassName string "nvidia" RuntimeClassName configuration (set to "nvidia" if using GPU)
servingEngineSpec.schedulerName string "" SchedulerName configuration for the serving engine pods
servingEngineSpec.securityContext map {} Pod-level security context configuration for the serving engine pods
servingEngineSpec.containerSecurityContext map {runAsNonRoot: false} Container-level security context configuration for the serving engine container
servingEngineSpec.extraPorts list [] List of additional ports to expose for the serving engine container
servingEngineSpec.startupProbe.initialDelaySeconds integer 15 Number of seconds after container starts before startup probe is initiated
servingEngineSpec.startupProbe.periodSeconds integer 10 How often (in seconds) to perform the startup probe
servingEngineSpec.startupProbe.failureThreshold integer 60 Number of failures before considering failed
servingEngineSpec.startupProbe.httpGet.path string "/health" Path to access on the HTTP server
servingEngineSpec.startupProbe.httpGet.port integer 8000 Port to access on the container
servingEngineSpec.livenessProbe.initialDelaySeconds integer 15 Number of seconds after container starts before liveness probe is initiated
servingEngineSpec.livenessProbe.periodSeconds integer 10 How often (in seconds) to perform the liveness probe
servingEngineSpec.livenessProbe.failureThreshold integer 3 Number of failures before considering failed
servingEngineSpec.livenessProbe.httpGet.path string "/health" Path to access on the HTTP server
servingEngineSpec.livenessProbe.httpGet.port integer 8000 Port to access on the container
servingEngineSpec.imagePullPolicy string "Always" Image pull policy for serving engine
servingEngineSpec.extraVolumes list [] Extra volumes for serving engine
servingEngineSpec.extraVolumeMounts list [] Extra volume mounts for serving engine
servingEngineSpec.env list [] (Optional) Global environment variables for all serving engine containers. If a variable is set in both servingEngineSpec.env and servingEngineSpec.modelSpec[].env, the value from modelSpec[].env will override the global value for that model.
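As a minimal sketch of the top-level serving engine settings, vllmApiKey can be given as a direct string or as a reference to an existing secret (the secret-reference key names below are an assumption; check values.yaml for the exact schema):

```yaml
servingEngineSpec:
  enableEngine: true
  containerPort: 8000
  servicePort: 80
  runtimeClassName: "nvidia"
  # Option 1: pass the API key as a direct string
  vllmApiKey: "sk-example-key"
  # Option 2 (assumed key names; see values.yaml): reference an existing secret
  # vllmApiKey:
  #   secretName: "vllm-api-key"
  #   secretKey: "key"
```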

Model Specification Fields

Field Type Default Description
servingEngineSpec.modelSpec[].annotations map {} (Optional) Annotations to add to the deployment, e.g., {model: "opt125m"}
servingEngineSpec.modelSpec[].podAnnotations map {} (Optional) Annotations to add to the pod, e.g., {model: "opt125m"}
servingEngineSpec.modelSpec[].name string "" The name of the model, e.g., "example-model"
servingEngineSpec.modelSpec[].repository string "" The repository of the model, e.g., "vllm/vllm-openai"
servingEngineSpec.modelSpec[].tag string "" The tag of the model, e.g., "latest"
servingEngineSpec.modelSpec[].imagePullSecret string "" (Optional) Name of secret with credentials to private container repository
servingEngineSpec.modelSpec[].modelURL string "" The URL of the model, e.g., "facebook/opt-125m"
servingEngineSpec.modelSpec[].chatTemplate string null (Optional) Chat template (Jinja2) specifying tokenizer configuration
servingEngineSpec.modelSpec[].replicaCount integer 1 The number of replicas for the model
servingEngineSpec.modelSpec[].pdb.enabled boolean false Whether to create a PodDisruptionBudget for the model
servingEngineSpec.modelSpec[].pdb.labels map {} Labels to add to the PodDisruptionBudget
servingEngineSpec.modelSpec[].pdb.annotations map {} Annotations to add to the PodDisruptionBudget
servingEngineSpec.modelSpec[].pdb.minAvailable string "" Number of pods that must remain available after eviction, as a number or percentage (e.g., 50%)
servingEngineSpec.modelSpec[].pdb.maxUnavailable string "" Number of pods that may be unavailable after eviction, as a number or percentage (e.g., 50%)
servingEngineSpec.modelSpec[].resources object {} Standard Kubernetes resources block (requests/limits). If specified, this takes priority over and ignores simplified resource fields (requestCPU, requestMemory, requestGPU, etc.)
servingEngineSpec.modelSpec[].requestCPU integer 0 The number of CPUs requested for the model
servingEngineSpec.modelSpec[].requestMemory string "" The amount of memory requested for the model, e.g., "16Gi"
servingEngineSpec.modelSpec[].requestGPU integer 0 The number of GPUs requested for the model
servingEngineSpec.modelSpec[].requestGPUType string "nvidia.com/gpu" (Optional) The type of GPU requested, e.g., "nvidia.com/mig-4g.71gb"
servingEngineSpec.modelSpec[].limitCPU string "" (Optional) The CPU limit for the model, e.g., "8"
servingEngineSpec.modelSpec[].limitMemory string "" (Optional) The memory limit for the model, e.g., "32Gi"
servingEngineSpec.modelSpec[].shmSize string "20Gi" Size of the shared memory for the serving engine container (applied when tensor parallelism is enabled)
servingEngineSpec.modelSpec[].enableLoRA boolean true (Optional) Whether to enable LoRA
servingEngineSpec.modelSpec[].pvcStorage string "" (Optional) The amount of storage requested for the model, e.g., "50Gi"
servingEngineSpec.modelSpec[].pvcAccessMode list [] (Optional) The access mode policy for the mounted volume, e.g., ["ReadWriteOnce"]
servingEngineSpec.modelSpec[].storageClass string "" (Optional) The storage class of the PVC
servingEngineSpec.modelSpec[].pvcMatchLabels map {} (Optional) The labels to match the PVC, e.g., {model: "opt125m"}
servingEngineSpec.modelSpec[].pvcLabels map {} (Optional) The labels to add to the PVC, e.g., {label_excluded_from_alerts: "true"}
servingEngineSpec.modelSpec[].pvcAnnotations map {} (Optional) The annotations to add to the PVC
servingEngineSpec.modelSpec[].extraVolumes list [] (Optional) Additional volumes to add to the pod, in Kubernetes volume format
servingEngineSpec.modelSpec[].extraVolumeMounts list [] (Optional) Additional volume mounts to add to the container, in Kubernetes volumeMount format
servingEngineSpec.modelSpec[].serviceAccountName string "" (Optional) The name of the service account to use for the deployment
servingEngineSpec.modelSpec[].priorityClassName string "" Priority class name for the deployment
servingEngineSpec.modelSpec[].hf_token string/map - (Optional) Hugging Face token configuration
servingEngineSpec.modelSpec[].env list - (Optional) Environment variables for the container
servingEngineSpec.modelSpec[].nodeName string - (Optional) Direct node assignment
servingEngineSpec.modelSpec[].nodeSelectorTerms list - (Optional) Node selector terms
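A minimal modelSpec entry combining the fields above might look like the following (model and resource sizes are illustrative):

```yaml
servingEngineSpec:
  modelSpec:
    - name: "opt125m"
      repository: "vllm/vllm-openai"
      tag: "latest"
      modelURL: "facebook/opt-125m"
      replicaCount: 1
      requestCPU: 6
      requestMemory: "16Gi"
      requestGPU: 1
      pvcStorage: "50Gi"
      pvcAccessMode:
        - ReadWriteOnce
```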

Init Container Configuration

Field Type Default Description
servingEngineSpec.modelSpec[].initContainer.name string "" The name of the init container, e.g., "init"
servingEngineSpec.modelSpec[].initContainer.image string "" The Docker image for the init container, e.g., "busybox:latest"
servingEngineSpec.modelSpec[].initContainer.command list [] (Optional) The command to run in the init container, e.g., ["sh", "-c"]
servingEngineSpec.modelSpec[].initContainer.args list [] (Optional) Additional arguments to pass to the command, e.g., ["ls"]
servingEngineSpec.modelSpec[].initContainer.env list [] (Optional) List of environment variables to set in the container
servingEngineSpec.modelSpec[].initContainer.resources map {} (Optional) The resource requests and limits for the container
servingEngineSpec.modelSpec[].initContainer.mountPvcStorage boolean false (Optional) Whether to mount the model's volume
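For example, an init container that inspects the mounted model volume before the engine starts could be sketched as follows (the image, command, and mount path are illustrative):

```yaml
servingEngineSpec:
  modelSpec:
    - name: "opt125m"
      # ... other modelSpec fields ...
      initContainer:
        name: "init"
        image: "busybox:latest"
        command: ["sh", "-c"]
        args: ["ls /data"]
        mountPvcStorage: true  # mount the model's PVC into the init container
```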

vLLM Configuration

Field Type Default Description
servingEngineSpec.modelSpec[].vllmConfig.v0 integer - Set to 1 to use vLLM v0; otherwise vLLM v1 is used
servingEngineSpec.modelSpec[].vllmConfig.enablePrefixCaching boolean false Enable prefix caching
servingEngineSpec.modelSpec[].vllmConfig.enableChunkedPrefill boolean false Enable chunked prefill
servingEngineSpec.modelSpec[].vllmConfig.maxModelLen integer 4096 The maximum model length, e.g., 16384
servingEngineSpec.modelSpec[].vllmConfig.dtype string "fp16" The data type, e.g., "bfloat16"
servingEngineSpec.modelSpec[].vllmConfig.tensorParallelSize integer 1 The degree of tensor parallelism, e.g., 2
servingEngineSpec.modelSpec[].vllmConfig.maxNumSeqs integer 256 Maximum number of sequences to be processed in a single iteration
servingEngineSpec.modelSpec[].vllmConfig.maxLoras integer 0 The maximum number of LoRA models to be loaded in a single batch
servingEngineSpec.modelSpec[].vllmConfig.gpuMemoryUtilization number 0.9 The fraction of GPU memory to be used for the model executor (0-1)
servingEngineSpec.modelSpec[].vllmConfig.runner string "" The runner type for the model, can be "auto" or "pooling"
servingEngineSpec.modelSpec[].vllmConfig.convert string "" The conversion type for the model, can be "token_embed", "embed", "token_classify", "classify", or "score"
servingEngineSpec.modelSpec[].vllmConfig.extraArgs list ["--trust-remote-code"] Extra command line arguments to pass to vLLM
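A sketch of a vllmConfig block for a model served with tensor parallelism (the values shown are illustrative, not recommendations):

```yaml
servingEngineSpec:
  modelSpec:
    - name: "llama3"
      # ... other modelSpec fields ...
      vllmConfig:
        enablePrefixCaching: true
        maxModelLen: 16384
        dtype: "bfloat16"
        tensorParallelSize: 2
        gpuMemoryUtilization: 0.9
        extraArgs: ["--trust-remote-code"]
```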

LMCache Configuration

Field Type Default Description
servingEngineSpec.modelSpec[].lmcacheConfig.enabled boolean false Enable LMCache
servingEngineSpec.modelSpec[].lmcacheConfig.cpuOffloadingBufferSize string "4" The CPU offloading buffer size, e.g., "30"
servingEngineSpec.modelSpec[].lmcacheConfig.diskOffloadingBufferSize string "" The disk offloading buffer size, e.g., "10Gi"
servingEngineSpec.modelSpec[].lmcacheConfig.enableController boolean true Enable LMCache controller for KV-aware routing
servingEngineSpec.modelSpec[].lmcacheConfig.instanceId string "default1" Unique instance identifier for controller
servingEngineSpec.modelSpec[].lmcacheConfig.controllerPort string "9000" Controller port for KV coordination
servingEngineSpec.modelSpec[].lmcacheConfig.workerPort integer 8001 Worker port for cache communication
servingEngineSpec.modelSpec[].lmcacheConfig.kvRole string - KV cache role (for disaggregated prefill) - "kv_producer" or "kv_consumer"
servingEngineSpec.modelSpec[].lmcacheConfig.enableNixl boolean true Enable NIXL protocol for KV transfer
servingEngineSpec.modelSpec[].lmcacheConfig.nixlRole string - NIXL role for distributed caching - "sender" or "receiver"
servingEngineSpec.modelSpec[].lmcacheConfig.nixlPeerHost string "decode-service" NIXL peer host for KV transfer
servingEngineSpec.modelSpec[].lmcacheConfig.nixlPeerPort string "55555" NIXL peer port for KV transfer
servingEngineSpec.modelSpec[].lmcacheConfig.nixlBufferSize string "1073741824" NIXL buffer size for KV transfer
servingEngineSpec.modelSpec[].lmcacheConfig.logLevel string "info" Log level for LMCache
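As an illustrative sketch, enabling LMCache with CPU and disk offloading on a model could look like:

```yaml
servingEngineSpec:
  modelSpec:
    - name: "llama3"
      # ... other modelSpec fields ...
      lmcacheConfig:
        enabled: true
        cpuOffloadingBufferSize: "30"
        diskOffloadingBufferSize: "10Gi"
        logLevel: "info"
```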

KEDA Autoscaling Configuration

Note: Unless explicitly set, KEDA's default values will apply. The defaults shown below are KEDA's defaults, not values enforced by this Helm chart.

Field Type KEDA Default Description
servingEngineSpec.modelSpec[].keda.enabled boolean false Enable KEDA autoscaling for this model deployment (requires KEDA installed in cluster)
servingEngineSpec.modelSpec[].keda.minReplicaCount integer - Minimum number of replicas (supports 0 for scale-to-zero); if not set, HPA minReplicas default applies
servingEngineSpec.modelSpec[].keda.maxReplicaCount integer - Maximum number of replicas; if not set, HPA maxReplicas default applies
servingEngineSpec.modelSpec[].keda.pollingInterval integer 30 How often KEDA checks metrics (in seconds)
servingEngineSpec.modelSpec[].keda.cooldownPeriod integer 300 Wait time before scaling down after scaling up (in seconds)
servingEngineSpec.modelSpec[].keda.idleReplicaCount integer - Number of replicas when no triggers are active
servingEngineSpec.modelSpec[].keda.initialCooldownPeriod integer - Initial cooldown period before scaling down after creation (in seconds)
servingEngineSpec.modelSpec[].keda.fallback map - Fallback configuration when scaler fails
servingEngineSpec.modelSpec[].keda.fallback.failureThreshold integer - Number of consecutive failures before fallback
servingEngineSpec.modelSpec[].keda.fallback.replicas integer - Number of replicas to scale to in fallback
servingEngineSpec.modelSpec[].keda.triggers list See below List of KEDA trigger configurations (Prometheus-based)
servingEngineSpec.modelSpec[].keda.triggers[].type string - Trigger type (e.g., "prometheus")
servingEngineSpec.modelSpec[].keda.triggers[].metadata.serverAddress string - Prometheus server URL (e.g., http://prometheus-operated.monitoring.svc:9090)
servingEngineSpec.modelSpec[].keda.triggers[].metadata.metricName string - Name of the metric to monitor
servingEngineSpec.modelSpec[].keda.triggers[].metadata.query string - PromQL query to fetch the metric
servingEngineSpec.modelSpec[].keda.triggers[].metadata.threshold string - Threshold value that triggers scaling
servingEngineSpec.modelSpec[].keda.advanced map - Advanced KEDA configuration options
servingEngineSpec.modelSpec[].keda.advanced.restoreToOriginalReplicaCount boolean false Restore original replica count when ScaledObject is deleted
servingEngineSpec.modelSpec[].keda.advanced.horizontalPodAutoscalerConfig map - HPA-specific configuration
servingEngineSpec.modelSpec[].keda.advanced.horizontalPodAutoscalerConfig.name string keda-hpa-{scaled-object-name} Custom name for HPA resource
servingEngineSpec.modelSpec[].keda.advanced.horizontalPodAutoscalerConfig.behavior map - HPA scaling behavior configuration (see K8s docs)
servingEngineSpec.modelSpec[].keda.advanced.scalingModifiers map - Scaling modifiers for composite metrics
servingEngineSpec.modelSpec[].keda.advanced.scalingModifiers.target string - Target value for the composed metric
servingEngineSpec.modelSpec[].keda.advanced.scalingModifiers.activationTarget string - Activation target for the composed metric
servingEngineSpec.modelSpec[].keda.advanced.scalingModifiers.metricType string "AverageValue" Metric type (AverageValue or Value)
servingEngineSpec.modelSpec[].keda.advanced.scalingModifiers.formula string - Formula to compose metrics together
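A sketch of a KEDA configuration that scales on the vllm_num_requests_waiting metric (the Prometheus address and threshold are illustrative; KEDA must already be installed in the cluster):

```yaml
servingEngineSpec:
  modelSpec:
    - name: "llama3"
      # ... other modelSpec fields ...
      keda:
        enabled: true
        minReplicaCount: 1
        maxReplicaCount: 4
        pollingInterval: 15
        cooldownPeriod: 300
        triggers:
          - type: prometheus
            metadata:
              serverAddress: http://prometheus-operated.monitoring.svc:9090
              metricName: vllm_num_requests_waiting
              query: sum(vllm_num_requests_waiting)
              threshold: "5"
```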

Serving Engine Monitoring Configuration

Field Type Default Description
servingEngineSpec.serviceMonitor.enabled boolean false Specifies whether to create a ServiceMonitor resource for collecting Prometheus metrics
servingEngineSpec.serviceMonitor.additionalLabels map {} Additional labels
servingEngineSpec.serviceMonitor.interval string 30s Interval to scrape metrics
servingEngineSpec.serviceMonitor.scrapeTimeout string 25s Timeout if metrics can't be retrieved in given time interval
servingEngineSpec.serviceMonitor.honorLabels boolean false Whether to keep scraped label values on conflict (when false, Prometheus renames conflicting scraped labels with an exported_ prefix)
servingEngineSpec.serviceMonitor.metricRelabelings list [] Metric relabel configs to apply to samples before ingestion (see the Prometheus metric relabeling documentation)
servingEngineSpec.serviceMonitor.relabelings list [] Relabel configs to apply to samples before ingestion (see the Prometheus relabeling documentation)

Router Configuration

Field Type Default Description
routerSpec.enableRouter boolean true Whether to enable the router service
routerSpec.repository string "lmcache/lmstack-router" Docker image repository for the router
routerSpec.tag string "latest" Docker image tag for the router
routerSpec.imagePullPolicy string "Always" Image pull policy for the router
routerSpec.imagePullSecrets list [] Image pull secrets for private container registries
routerSpec.replicaCount integer 1 Number of replicas for the router pod
routerSpec.pdb.enabled boolean false Whether to create a PodDisruptionBudget for the router
routerSpec.pdb.labels map {} Labels to add to the PodDisruptionBudget
routerSpec.pdb.annotations map {} Annotations to add to the PodDisruptionBudget
routerSpec.pdb.minAvailable string "" Number of pods that must remain available after eviction, as a number or percentage (e.g., 50%)
routerSpec.pdb.maxUnavailable string "" Number of pods that may be unavailable after eviction, as a number or percentage (e.g., 50%)
routerSpec.priorityClassName string "" Priority class for router
routerSpec.containerPort integer 8000 Port the router container is listening on
routerSpec.serviceType string "ClusterIP" Kubernetes service type for the router
routerSpec.serviceAnnotations map {} Service annotations for LoadBalancer/NodePort
routerSpec.servicePort integer 80 Port the router service will listen on
routerSpec.serviceDiscovery string "k8s" Service discovery mode ("k8s" or "static")
routerSpec.k8sServiceDiscoveryType string "pod-ip" Service discovery type ("pod-ip" or "service-name") if serviceDiscovery is "k8s"
routerSpec.staticBackends string "" Comma-separated list of backend addresses if serviceDiscovery is "static"
routerSpec.staticModels string "" Comma-separated list of model names if serviceDiscovery is "static"
routerSpec.routingLogic string "roundrobin" Routing logic: "roundrobin", "session", "prefixaware", or "kvaware"
routerSpec.sessionKey string "" Session key if using "session" routing logic
routerSpec.extraArgs list [] Extra command line arguments to pass to the router
routerSpec.engineScrapeInterval integer 15 Interval in seconds to scrape metrics from the serving engine
routerSpec.requestStatsWindow integer 60 Window size in seconds for calculating request statistics
routerSpec.strategy map {} Deployment strategy for the router pods
routerSpec.vllmApiKey string/map null (Optional) API key for securing vLLM models
routerSpec.resources.requests.cpu string "4" CPU requests for router
routerSpec.resources.requests.memory string "16G" Memory requests for router
routerSpec.resources.limits.cpu string "8" CPU limits for router
routerSpec.resources.limits.memory string "32G" Memory limits for router
routerSpec.labels map {environment: "router", release: "router"} Customized labels for the router deployment
routerSpec.podAnnotations map {} (Optional) Annotations to add to the pod, e.g., {model: "opt125m"}
routerSpec.affinity map {} (Optional) Affinity configuration. If specified, this takes precedence over nodeSelectorTerms.
routerSpec.nodeSelectorTerms list [] (Optional) Node selector terms. This is ignored if affinity is specified.
routerSpec.hf_token string "" Hugging Face token for router
routerSpec.lmcacheControllerPort integer - LMCache controller port, used when routingLogic is "kvaware" (e.g., 9000)
routerSpec.lmcacheConfig.logLevel string "INFO" Log level for LMCache in the router when routingLogic is kvaware
routerSpec.livenessProbe.initialDelaySeconds integer 30 Initial delay in seconds for router's liveness probe
routerSpec.livenessProbe.periodSeconds integer 5 Interval in seconds for router's liveness probe
routerSpec.livenessProbe.failureThreshold integer 3 Failure threshold for router's liveness probe
routerSpec.livenessProbe.httpGet.path string "/health" Endpoint that the router's liveness probe will be testing
routerSpec.startupProbe.initialDelaySeconds integer 5 Initial delay in seconds for router's startup probe
routerSpec.startupProbe.periodSeconds integer 5 Interval in seconds for router's startup probe
routerSpec.startupProbe.failureThreshold integer 3 Failure threshold for router's startup probe
routerSpec.startupProbe.httpGet.path string "/health" Endpoint that the router's startup probe will be testing
routerSpec.readinessProbe.initialDelaySeconds integer 30 Initial delay in seconds for router's readiness probe
routerSpec.readinessProbe.periodSeconds integer 5 Interval in seconds for router's readiness probe
routerSpec.readinessProbe.failureThreshold integer 3 Failure threshold for router's readiness probe
routerSpec.readinessProbe.httpGet.path string "/health" Endpoint that the router's readiness probe will be testing
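For example, a router using static service discovery with session-based routing might be configured as follows (backend addresses, model names, and the session header are illustrative):

```yaml
routerSpec:
  enableRouter: true
  serviceDiscovery: "static"
  staticBackends: "http://192.168.0.10:8000,http://192.168.0.11:8000"
  staticModels: "facebook/opt-125m,facebook/opt-125m"
  routingLogic: "session"
  sessionKey: "x-user-id"
```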

Router OpenTelemetry Configuration

Field Type Default Description
routerSpec.otel.endpoint string "" OTLP endpoint for tracing (e.g., "otel-collector:4317"). Tracing is enabled when this is set.
routerSpec.otel.serviceName string "vllm-router" Service name for OpenTelemetry traces
routerSpec.otel.secure boolean false Use secure (TLS) connection for OTLP exporter

Router Monitoring Configuration

Field Type Default Description
routerSpec.serviceMonitor.enabled boolean false Specifies whether to create a ServiceMonitor resource for collecting Prometheus metrics
routerSpec.serviceMonitor.additionalLabels map {} Additional labels
routerSpec.serviceMonitor.interval string 30s Interval to scrape metrics
routerSpec.serviceMonitor.scrapeTimeout string 25s Timeout if metrics can't be retrieved in given time interval
routerSpec.serviceMonitor.honorLabels boolean false Whether to keep scraped label values on conflict (when false, Prometheus renames conflicting scraped labels with an exported_ prefix)
routerSpec.serviceMonitor.metricRelabelings list [] Metric relabel configs to apply to samples before ingestion (see the Prometheus metric relabeling documentation)
routerSpec.serviceMonitor.relabelings list [] Relabel configs to apply to samples before ingestion (see the Prometheus relabeling documentation)

Router Ingress Configuration

Field Type Default Description
routerSpec.ingress.enabled boolean false Enable Ingress controller resource for the router
routerSpec.ingress.className string "" IngressClass to use for the router Ingress resource
routerSpec.ingress.annotations map {} Additional annotations for the router Ingress resource
routerSpec.ingress.hosts list [{host: "vllm-router.local", paths: [{path: /, pathType: Prefix}]}] List of hostnames covered by the router Ingress record
routerSpec.ingress.tls list [] TLS configuration for hostnames covered by the router Ingress record
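An illustrative ingress configuration exposing the router through an NGINX ingress class with TLS (the hostname and secret name are assumptions):

```yaml
routerSpec:
  ingress:
    enabled: true
    className: "nginx"
    hosts:
      - host: vllm-router.local
        paths:
          - path: /
            pathType: Prefix
    tls:
      - secretName: vllm-router-tls
        hosts:
          - vllm-router.local
```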

Cache Server Configuration

Field Type Default Description
cacheserverSpec.enableServer boolean false Whether to enable the cache server deployment
cacheserverSpec.image.repository string "lmcache/lmstack-cache-server" Docker image repository for the cache server
cacheserverSpec.image.tag string "latest" Docker image tag for the cache server
cacheserverSpec.image.pullPolicy string "Always" Image pull policy for the cache server
cacheserverSpec.imagePullSecrets list [] Image pull secrets for private container registries
cacheserverSpec.replicaCount integer 1 Number of replicas for the cache server pod
cacheserverSpec.containerPort integer 8000 Port the cache server container is listening on
cacheserverSpec.serviceType string "ClusterIP" Kubernetes service type for the cache server
cacheserverSpec.servicePort integer 80 Port the cache server service will listen on
cacheserverSpec.resources.requests.cpu string "1" CPU requests for cache server
cacheserverSpec.resources.requests.memory string "2G" Memory requests for cache server
cacheserverSpec.resources.limits.cpu string "2" CPU limits for cache server
cacheserverSpec.resources.limits.memory string "4G" Memory limits for cache server
cacheserverSpec.labels map {environment: "cache", release: "cache"} Customized labels for the cache server deployment
cacheserverSpec.strategy map {} Deployment strategy for the cache server pods
cacheserverSpec.startupProbe map {initialDelaySeconds: 15, periodSeconds: 10, failureThreshold: 60, httpGet: {path: /health, port: 8000}} Configuration for the startup probe
cacheserverSpec.livenessProbe map {initialDelaySeconds: 15, periodSeconds: 10, failureThreshold: 3, httpGet: {path: /health, port: 8000}} Configuration for the liveness probe
cacheserverSpec.maxUnavailablePodDisruptionBudget string "" Configuration for the PodDisruptionBudget
cacheserverSpec.tolerations list [] Tolerations configuration for the cache server pods
cacheserverSpec.runtimeClassName string "" RuntimeClassName configuration for the cache server pods
cacheserverSpec.schedulerName string "" SchedulerName configuration for the cache server pods
cacheserverSpec.securityContext map {} Pod-level security context configuration
cacheserverSpec.containerSecurityContext map {runAsNonRoot: false} Container-level security context configuration
cacheserverSpec.priorityClassName string - Priority class for cache server
cacheserverSpec.affinity map - (Optional) Affinity configuration. If specified, this takes precedence over nodeSelectorTerms.
cacheserverSpec.nodeSelectorTerms list - (Optional) Node selector terms. This is ignored if affinity is specified.
cacheserverSpec.serde string - Serialization/deserialization format

cacheserverSpec.resources is passed through directly to the pod container resources block, so you can use extended resource keys (for example rdma/ib) in addition to cpu/memory.
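For instance, a cache server requesting an extended RDMA resource alongside cpu/memory could be sketched as follows (the rdma/ib resource key and serde value are illustrative; check values.yaml):

```yaml
cacheserverSpec:
  enableServer: true
  serde: "naive"  # illustrative serialization format; see values.yaml
  resources:
    requests:
      cpu: "1"
      memory: "2G"
      rdma/ib: 1
    limits:
      cpu: "2"
      memory: "4G"
      rdma/ib: 1
```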

LoRA Adapters Configuration

Field Type Default Description
loraAdapters list [] Array of LoRA adapter instances to deploy
loraAdapters[].name string - Name of the LoRA adapter instance
loraAdapters[].baseModel string - Name of the base model this adapter is for
loraAdapters[].vllmApiKey.secretRef.secretName string - Name of the secret containing API key
loraAdapters[].vllmApiKey.secretRef.secretKey string - Key in the secret containing API key
loraAdapters[].vllmApiKey.value string - Direct API key value
loraAdapters[].adapterSource.type string - Type of adapter source (local, s3, http, huggingface)
loraAdapters[].adapterSource.adapterName string - Name of the adapter to apply
loraAdapters[].adapterSource.adapterPath string - Path to the LoRA adapter weights
loraAdapters[].adapterSource.repository string - Repository to get the LoRA adapter from
loraAdapters[].adapterSource.pattern string - Pattern to use for the adapter name
loraAdapters[].adapterSource.maxAdapters integer - Maximum number of adapters to load
loraAdapters[].adapterSource.credentials.secretName string - Name of secret with storage credentials
loraAdapters[].adapterSource.credentials.secretKey string - Key in secret containing credentials
loraAdapters[].loraAdapterDeploymentConfig.algorithm string - Placement algorithm (default, ordered, equalized)
loraAdapters[].loraAdapterDeploymentConfig.replicas integer - Number of replicas that should load this adapter
loraAdapters[].labels map - Additional labels for the LoRA adapter
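A hedged example of a LoRA adapter loaded from Hugging Face onto one replica of a base model (the adapter name, path, and base model name are illustrative):

```yaml
loraAdapters:
  - name: "llama3-sql-lora"
    baseModel: "llama3"
    adapterSource:
      type: "huggingface"
      adapterName: "sql-lora"
      adapterPath: "yard1/llama-2-7b-sql-lora-test"
    loraAdapterDeploymentConfig:
      algorithm: "default"
      replicas: 1
```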

LoRA Controller Configuration

Field Type Default Description
loraController.enableLoraController boolean false Whether to enable the LoRA controller
loraController.kubernetesClusterDomain string "cluster.local" Kubernetes cluster domain
loraController.replicaCount integer 1 Number of LoRA controller replicas
loraController.image.repository string "lmcache/lmstack-lora-controller" Docker image repository
loraController.image.tag string "latest" Docker image tag
loraController.image.pullPolicy string "IfNotPresent" Image pull policy
loraController.imagePullSecrets list [] Image pull secrets
loraController.annotations map {} Deployment annotations
loraController.labels map {} Deployment labels
loraController.podAnnotations map {} Pod annotations
loraController.podLabels map {} Pod labels
loraController.podSecurityContext.runAsNonRoot boolean true Run as non-root user
loraController.podSecurityContext.seccompProfile.type string RuntimeDefault Seccomp profile type
loraController.containerSecurityContext.allowPrivilegeEscalation boolean false Allow privilege escalation
loraController.containerSecurityContext.capabilities.drop list ["ALL"] Drop capabilities
loraController.resources map {} Resource requests and limits
loraController.nodeSelector map {} Node selector
loraController.affinity map {} Affinity configuration
loraController.tolerations list [] Tolerations configuration
loraController.env list [] Environment variables
loraController.extraArgs list [] Extra arguments for the controller
loraController.metrics.enabled boolean true Whether to expose lora controller metrics
loraController.pdb.enabled boolean false Whether to create a PodDisruptionBudget for the loraController
loraController.pdb.labels map {} Labels to add to the PodDisruptionBudget
loraController.pdb.annotations map {} Annotations to add to the PodDisruptionBudget
loraController.pdb.minAvailable string "" Number of pods that must remain available after eviction, as a number or percentage (e.g., 50%)
loraController.pdb.maxUnavailable string "" Number of pods that may be unavailable after eviction, as a number or percentage (e.g., 50%)

Shared Storage Configuration

Field Type Default Description
sharedStorage.enabled boolean false Whether to enable shared storage for the models
sharedStorage.size string "100Gi" Size of the shared storage volume
sharedStorage.accessModes list ["ReadWriteOnce"] Access modes for the shared storage volume
sharedStorage.storageClass string "standard" Storage class name for the shared storage volume
sharedStorage.hostPath string "" Host path for the shared storage volume (for local testing only)
sharedStorage.nfs.server string "" NFS server address for the shared storage volume
sharedStorage.nfs.path string "" NFS export path for the shared storage volume
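As an illustrative example, backing the shared model storage with an NFS export (the server address and export path are assumptions):

```yaml
sharedStorage:
  enabled: true
  size: "100Gi"
  accessModes:
    - ReadWriteMany
  storageClass: ""
  nfs:
    server: "nfs.example.internal"
    path: "/exports/models"
```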

Other Configuration

Field Type Default Description
grafanaDashboards.enabled boolean false Whether to deploy grafana dashboards as configmaps.
grafanaDashboards.annotations map {} Annotations to add to the configmaps.
grafanaDashboards.labels map {grafana_dashboard: "1"} Labels for the configmaps
extraObjects list [] Array of extra K8s manifests to deploy. Supports use of custom Helm templates
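For example, extraObjects can carry an additional manifest rendered through Helm templating (the ConfigMap below is illustrative):

```yaml
extraObjects:
  - apiVersion: v1
    kind: ConfigMap
    metadata:
      name: "{{ .Release.Name }}-extra-config"
    data:
      note: "rendered with Helm template expressions"
```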

Observability

Grafana dashboard to monitor the deployment

Deploy the observability stack

On a cluster with the Prometheus Operator installed

Install the chart with the following helm values to create the ServiceMonitor resources and Grafana dashboards.

servingEngineSpec:
  serviceMonitor:
    enabled: true
routerSpec:
  serviceMonitor:
    enabled: true
grafanaDashboards:
  enabled: true

On an empty cluster

The vllm-stack chart embeds kube-prometheus-stack as a subchart. Install the chart with the following helm values to deploy Prometheus and Grafana.

servingEngineSpec:
  serviceMonitor:
    enabled: true
routerSpec:
  serviceMonitor:
    enabled: true
grafanaDashboards:
  enabled: true

kube-prometheus-stack:
  enabled: true

Access the Grafana UI

Forward the Grafana dashboard port to the local node-port

kubectl port-forward svc/<release-name>-grafana 8080:80

Open http://<IP of your node>:8080 to access the Grafana web UI. The default username is admin, and the password can be configured in the values (by default it is generated by Helm and stored in the secret <release-name>-grafana).

Use Prometheus Adapter to export vLLM metrics

The Prometheus Adapter can expose the vLLM metrics collected by Prometheus through the Kubernetes custom metrics API.

Install the chart with the following helm values to deploy prometheus-adapter

prometheus-adapter:
  enabled: true

We provide a minimal example of how to use the Prometheus Adapter to export vLLM metrics. See values.yaml for more details.

The exported metrics can be used for different purposes, such as horizontal scaling of the vLLM deployments.

To verify that the metrics are being exported, use the following command:

kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq | grep vllm_num_requests_waiting -C 10

You should see something like the following:

    {
      "name": "namespaces/vllm_num_requests_waiting",
      "singularName": "",
      "namespaced": false,
      "kind": "MetricValueList",
      "verbs": [
        "get"
      ]
    }

The following command will show the current value of the metric:

kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/namespaces/default/metrics/vllm_num_requests_waiting | jq

The output should look like the following:

{
  "kind": "MetricValueList",
  "apiVersion": "custom.metrics.k8s.io/v1beta1",
  "metadata": {},
  "items": [
    {
      "describedObject": {
        "kind": "Namespace",
        "name": "default",
        "apiVersion": "/v1"
      },
      "metricName": "vllm_num_requests_waiting",
      "timestamp": "2025-03-02T01:56:01Z",
      "value": "0",
      "selector": null
    }
  ]
}