
vLLM Production Stack helm chart

This Helm chart lets users deploy multiple serving engines and a router into a Kubernetes cluster.

Key features

  • Support running multiple serving engines with multiple different models
  • Load the model weights directly from the existing PersistentVolumes

Prerequisites

  1. A running Kubernetes cluster with GPUs. (You can set one up through minikube: https://minikube.sigs.k8s.io/docs/tutorials/nvidia/)
  2. Helm

Install the helm chart

helm dependency build
helm install llmstack . -f values-example.yaml

Uninstall the chart

helm uninstall llmstack

Configure the deployments

See helm/values.yaml for more details.

Production Stack Helm Chart Values Reference

This table documents all available configuration values for the Production Stack Helm chart.

Serving Engine Configuration

Field Type Default Description
servingEngineSpec.enableEngine boolean true Whether to enable the serving engine deployment
servingEngineSpec.labels map {environment: "test", release: "test"} Customized labels for the serving engine deployment
servingEngineSpec.vllmApiKey string/map null (Optional) API key for securing vLLM models. Can be a direct string or an object referencing an existing secret
servingEngineSpec.modelSpec list [] Array of specifications for configuring multiple serving engine deployments running different models
servingEngineSpec.containerPort integer 8000 Port the vLLM server container is listening on
servingEngineSpec.servicePort integer 80 Port the service will listen on
servingEngineSpec.configs map {} Set other environment variables from a config map
servingEngineSpec.strategy map {} Deployment strategy for the serving engine pods
servingEngineSpec.maxUnavailablePodDisruptionBudget string "" Configuration for the PodDisruptionBudget for the serving engine pods
servingEngineSpec.tolerations list [] Tolerations configuration for the serving engine pods (when there are taints on nodes)
servingEngineSpec.runtimeClassName string "nvidia" RuntimeClassName configuration (set to "nvidia" if using GPU)
servingEngineSpec.schedulerName string "" SchedulerName configuration for the serving engine pods
servingEngineSpec.securityContext map {} Pod-level security context configuration for the serving engine pods
servingEngineSpec.containerSecurityContext map {runAsNonRoot: false} Container-level security context configuration for the serving engine container
servingEngineSpec.extraPorts list [] List of additional ports to expose for the serving engine container
servingEngineSpec.startupProbe.initialDelaySeconds integer 15 Number of seconds after container starts before startup probe is initiated
servingEngineSpec.startupProbe.periodSeconds integer 10 How often (in seconds) to perform the startup probe
servingEngineSpec.startupProbe.failureThreshold integer 60 Number of failures before considering failed
servingEngineSpec.startupProbe.httpGet.path string "/health" Path to access on the HTTP server
servingEngineSpec.startupProbe.httpGet.port integer 8000 Port to access on the container
servingEngineSpec.livenessProbe.initialDelaySeconds integer 15 Number of seconds after container starts before liveness probe is initiated
servingEngineSpec.livenessProbe.periodSeconds integer 10 How often (in seconds) to perform the liveness probe
servingEngineSpec.livenessProbe.failureThreshold integer 3 Number of failures before considering failed
servingEngineSpec.livenessProbe.httpGet.path string "/health" Path to access on the HTTP server
servingEngineSpec.livenessProbe.httpGet.port integer 8000 Port to access on the container
servingEngineSpec.imagePullPolicy string "Always" Image pull policy for serving engine
servingEngineSpec.extraVolumes list [] Extra volumes for serving engine
servingEngineSpec.extraVolumeMounts list [] Extra volume mounts for serving engine
servingEngineSpec.env list [] (Optional) Global environment variables for all serving engine containers. If a variable is set in both servingEngineSpec.env and servingEngineSpec.modelSpec[].env, the value from modelSpec[].env will override the global value for that model.
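As a minimal sketch of the top-level serving engine settings, vllmApiKey can be given as a direct string or as a reference to an existing secret (the secret-reference key names below are an assumption; check values.yaml for the exact schema):

```yaml
servingEngineSpec:
  enableEngine: true
  containerPort: 8000
  servicePort: 80
  runtimeClassName: "nvidia"
  # Option 1: pass the API key as a direct string
  vllmApiKey: "sk-example-key"
  # Option 2 (assumed key names; see values.yaml): reference an existing secret
  # vllmApiKey:
  #   secretName: "vllm-api-key"
  #   secretKey: "key"
```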

Model Specification Fields

Field Type Default Description
servingEngineSpec.modelSpec[].annotations map {} (Optional) Annotations to add to the deployment, e.g., {model: "opt125m"}
servingEngineSpec.modelSpec[].podAnnotations map {} (Optional) Annotations to add to the pod, e.g., {model: "opt125m"}
servingEngineSpec.modelSpec[].name string "" The name of the model, e.g., "example-model"
servingEngineSpec.modelSpec[].repository string "" The repository of the model, e.g., "vllm/vllm-openai"
servingEngineSpec.modelSpec[].tag string "" The tag of the model, e.g., "latest"
servingEngineSpec.modelSpec[].imagePullSecret string "" (Optional) Name of secret with credentials to private container repository
servingEngineSpec.modelSpec[].modelURL string "" The URL of the model, e.g., "facebook/opt-125m"
servingEngineSpec.modelSpec[].chatTemplate string null (Optional) Chat template (Jinja2) specifying tokenizer configuration
servingEngineSpec.modelSpec[].replicaCount integer 1 The number of replicas for the model
servingEngineSpec.modelSpec[].pdb.enabled boolean false Whether to create a PodDisruptionBudget for the model
servingEngineSpec.modelSpec[].pdb.labels map {} Labels to add to the PodDisruptionBudget
servingEngineSpec.modelSpec[].pdb.annotations map {} Annotations to add to the PodDisruptionBudget
servingEngineSpec.modelSpec[].pdb.minAvailable string "" Number of pods that must remain available after eviction, as a number or percentage (e.g., 50%)
servingEngineSpec.modelSpec[].pdb.maxUnavailable string "" Number of pods that may be unavailable after eviction, as a number or percentage (e.g., 50%)
servingEngineSpec.modelSpec[].resources object {} Standard Kubernetes resources block (requests/limits). If specified, this takes priority over and ignores simplified resource fields (requestCPU, requestMemory, requestGPU, etc.)
servingEngineSpec.modelSpec[].requestCPU integer 0 The number of CPUs requested for the model
servingEngineSpec.modelSpec[].requestMemory string "" The amount of memory requested for the model, e.g., "16Gi"
servingEngineSpec.modelSpec[].requestGPU integer 0 The number of GPUs requested for the model
servingEngineSpec.modelSpec[].requestGPUType string "nvidia.com/gpu" (Optional) The type of GPU requested, e.g., "nvidia.com/mig-4g.71gb"
servingEngineSpec.modelSpec[].limitCPU string "" (Optional) The CPU limit for the model, e.g., "8"
servingEngineSpec.modelSpec[].limitMemory string "" (Optional) The memory limit for the model, e.g., "32Gi"
servingEngineSpec.modelSpec[].shmSize string "20Gi" Size of the shared memory for the serving engine container (applied when tensor parallelism is enabled)
servingEngineSpec.modelSpec[].enableLoRA boolean true (Optional) Whether to enable LoRA
servingEngineSpec.modelSpec[].pvcStorage string "" (Optional) The amount of storage requested for the model, e.g., "50Gi"
servingEngineSpec.modelSpec[].pvcAccessMode list [] (Optional) The access mode policy for the mounted volume, e.g., ["ReadWriteOnce"]
servingEngineSpec.modelSpec[].storageClass string "" (Optional) The storage class of the PVC
servingEngineSpec.modelSpec[].pvcMatchLabels map {} (Optional) The labels to match the PVC, e.g., {model: "opt125m"}
servingEngineSpec.modelSpec[].pvcLabels map {} (Optional) The labels to add to the PVC, e.g., {label_excluded_from_alerts: "true"}
servingEngineSpec.modelSpec[].pvcAnnotations map {} (Optional) The annotations to add to the PVC
servingEngineSpec.modelSpec[].extraVolumes list [] (Optional) Additional volumes to add to the pod, in Kubernetes volume format
servingEngineSpec.modelSpec[].extraVolumeMounts list [] (Optional) Additional volume mounts to add to the container, in Kubernetes volumeMount format
servingEngineSpec.modelSpec[].serviceAccountName string "" (Optional) The name of the service account to use for the deployment
servingEngineSpec.modelSpec[].priorityClassName string "" Priority class name for the deployment
servingEngineSpec.modelSpec[].hf_token string/map - (Optional) Hugging Face token configuration
servingEngineSpec.modelSpec[].env list - (Optional) Environment variables for the container
servingEngineSpec.modelSpec[].nodeName string - (Optional) Direct node assignment
servingEngineSpec.modelSpec[].nodeSelectorTerms list - (Optional) Node selector terms
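A minimal modelSpec entry combining the fields above might look like the following (model and resource sizes are illustrative):

```yaml
servingEngineSpec:
  modelSpec:
    - name: "opt125m"
      repository: "vllm/vllm-openai"
      tag: "latest"
      modelURL: "facebook/opt-125m"
      replicaCount: 1
      requestCPU: 6
      requestMemory: "16Gi"
      requestGPU: 1
      pvcStorage: "50Gi"
      pvcAccessMode:
        - ReadWriteOnce
```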

Init Container Configuration

Field Type Default Description
servingEngineSpec.modelSpec[].initContainer.name string "" The name of the init container, e.g., "init"
servingEngineSpec.modelSpec[].initContainer.image string "" The Docker image for the init container, e.g., "busybox:latest"
servingEngineSpec.modelSpec[].initContainer.command list [] (Optional) The command to run in the init container, e.g., ["sh", "-c"]
servingEngineSpec.modelSpec[].initContainer.args list [] (Optional) Additional arguments to pass to the command, e.g., ["ls"]
servingEngineSpec.modelSpec[].initContainer.env list [] (Optional) List of environment variables to set in the container
servingEngineSpec.modelSpec[].initContainer.resources map {} (Optional) The resource requests and limits for the container
servingEngineSpec.modelSpec[].initContainer.mountPvcStorage boolean false (Optional) Whether to mount the model's volume
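For example, an init container that inspects the mounted model volume before the engine starts could be sketched as follows (the image, command, and mount path are illustrative):

```yaml
servingEngineSpec:
  modelSpec:
    - name: "opt125m"
      # ... other modelSpec fields ...
      initContainer:
        name: "init"
        image: "busybox:latest"
        command: ["sh", "-c"]
        args: ["ls /data"]
        mountPvcStorage: true  # mount the model's PVC into the init container
```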

vLLM Configuration

Field Type Default Description
servingEngineSpec.modelSpec[].vllmConfig.v0 integer - Set to 1 to use vLLM v0; otherwise vLLM v1 is used
servingEngineSpec.modelSpec[].vllmConfig.enablePrefixCaching boolean false Enable prefix caching
servingEngineSpec.modelSpec[].vllmConfig.enableChunkedPrefill boolean false Enable chunked prefill
servingEngineSpec.modelSpec[].vllmConfig.maxModelLen integer 4096 The maximum model length, e.g., 16384
servingEngineSpec.modelSpec[].vllmConfig.dtype string "fp16" The data type, e.g., "bfloat16"
servingEngineSpec.modelSpec[].vllmConfig.tensorParallelSize integer 1 The degree of tensor parallelism, e.g., 2
servingEngineSpec.modelSpec[].vllmConfig.maxNumSeqs integer 256 Maximum number of sequences to be processed in a single iteration
servingEngineSpec.modelSpec[].vllmConfig.maxLoras integer 0 The maximum number of LoRA models to be loaded in a single batch
servingEngineSpec.modelSpec[].vllmConfig.gpuMemoryUtilization number 0.9 The fraction of GPU memory to be used for the model executor (0-1)
servingEngineSpec.modelSpec[].vllmConfig.runner string "" The runner type for the model, can be "auto" or "pooling"
servingEngineSpec.modelSpec[].vllmConfig.convert string "" The conversion type for the model, can be "token_embed", "embed", "token_classify", "classify", or "score"
servingEngineSpec.modelSpec[].vllmConfig.extraArgs list ["--trust-remote-code"] Extra command line arguments to pass to vLLM
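A sketch of a vllmConfig block for a model served with tensor parallelism (the values shown are illustrative, not recommendations):

```yaml
servingEngineSpec:
  modelSpec:
    - name: "llama3"
      # ... other modelSpec fields ...
      vllmConfig:
        enablePrefixCaching: true
        maxModelLen: 16384
        dtype: "bfloat16"
        tensorParallelSize: 2
        gpuMemoryUtilization: 0.9
        extraArgs: ["--trust-remote-code"]
```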

LMCache Configuration

Field Type Default Description
servingEngineSpec.modelSpec[].lmcacheConfig.enabled boolean false Enable LMCache
servingEngineSpec.modelSpec[].lmcacheConfig.cpuOffloadingBufferSize string "4" The CPU offloading buffer size, e.g., "30"
servingEngineSpec.modelSpec[].lmcacheConfig.diskOffloadingBufferSize string "" The disk offloading buffer size, e.g., "10Gi"
servingEngineSpec.modelSpec[].lmcacheConfig.enableController boolean true Enable LMCache controller for KV-aware routing
servingEngineSpec.modelSpec[].lmcacheConfig.instanceId string "default1" Unique instance identifier for controller
servingEngineSpec.modelSpec[].lmcacheConfig.controllerPort string "9000" Controller port for KV coordination
servingEngineSpec.modelSpec[].lmcacheConfig.workerPort integer 8001 Worker port for cache communication
servingEngineSpec.modelSpec[].lmcacheConfig.kvRole string - KV cache role (for disaggregated prefill) - "kv_producer" or "kv_consumer"
servingEngineSpec.modelSpec[].lmcacheConfig.enableNixl boolean true Enable NIXL protocol for KV transfer
servingEngineSpec.modelSpec[].lmcacheConfig.nixlRole string - NIXL role for distributed caching - "sender" or "receiver"
servingEngineSpec.modelSpec[].lmcacheConfig.nixlPeerHost string "decode-service" NIXL peer host for KV transfer
servingEngineSpec.modelSpec[].lmcacheConfig.nixlPeerPort string "55555" NIXL peer port for KV transfer
servingEngineSpec.modelSpec[].lmcacheConfig.nixlBufferSize string "1073741824" NIXL buffer size for KV transfer
servingEngineSpec.modelSpec[].lmcacheConfig.logLevel string "info" Log level for LMCache
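As an illustrative sketch, enabling LMCache with CPU and disk offloading on a model could look like:

```yaml
servingEngineSpec:
  modelSpec:
    - name: "llama3"
      # ... other modelSpec fields ...
      lmcacheConfig:
        enabled: true
        cpuOffloadingBufferSize: "30"
        diskOffloadingBufferSize: "10Gi"
        logLevel: "info"
```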

KEDA Autoscaling Configuration

Note: Unless explicitly set, KEDA's default values will apply. The defaults shown below are KEDA's defaults, not values enforced by this Helm chart.

Field Type KEDA Default Description
servingEngineSpec.modelSpec[].keda.enabled boolean false Enable KEDA autoscaling for this model deployment (requires KEDA installed in cluster)
servingEngineSpec.modelSpec[].keda.minReplicaCount integer - Minimum number of replicas (supports 0 for scale-to-zero); if not set, HPA minReplicas default applies
servingEngineSpec.modelSpec[].keda.maxReplicaCount integer - Maximum number of replicas; if not set, HPA maxReplicas default applies
servingEngineSpec.modelSpec[].keda.pollingInterval integer 30 How often KEDA checks metrics (in seconds)
servingEngineSpec.modelSpec[].keda.cooldownPeriod integer 300 Wait time before scaling down after scaling up (in seconds)
servingEngineSpec.modelSpec[].keda.idleReplicaCount integer - Number of replicas when no triggers are active
servingEngineSpec.modelSpec[].keda.initialCooldownPeriod integer - Initial cooldown period before scaling down after creation (in seconds)
servingEngineSpec.modelSpec[].keda.fallback map - Fallback configuration when scaler fails
servingEngineSpec.modelSpec[].keda.fallback.failureThreshold integer - Number of consecutive failures before fallback
servingEngineSpec.modelSpec[].keda.fallback.replicas integer - Number of replicas to scale to in fallback
servingEngineSpec.modelSpec[].keda.triggers list See below List of KEDA trigger configurations (Prometheus-based)
servingEngineSpec.modelSpec[].keda.triggers[].type string - Trigger type (e.g., "prometheus")
servingEngineSpec.modelSpec[].keda.triggers[].metadata.serverAddress string - Prometheus server URL (e.g., http://prometheus-operated.monitoring.svc:9090)
servingEngineSpec.modelSpec[].keda.triggers[].metadata.metricName string - Name of the metric to monitor
servingEngineSpec.modelSpec[].keda.triggers[].metadata.query string - PromQL query to fetch the metric
servingEngineSpec.modelSpec[].keda.triggers[].metadata.threshold string - Threshold value that triggers scaling
servingEngineSpec.modelSpec[].keda.advanced map - Advanced KEDA configuration options
servingEngineSpec.modelSpec[].keda.advanced.restoreToOriginalReplicaCount boolean false Restore original replica count when ScaledObject is deleted
servingEngineSpec.modelSpec[].keda.advanced.horizontalPodAutoscalerConfig map - HPA-specific configuration
servingEngineSpec.modelSpec[].keda.advanced.horizontalPodAutoscalerConfig.name string keda-hpa-{scaled-object-name} Custom name for HPA resource
servingEngineSpec.modelSpec[].keda.advanced.horizontalPodAutoscalerConfig.behavior map - HPA scaling behavior configuration (see K8s docs)
servingEngineSpec.modelSpec[].keda.advanced.scalingModifiers map - Scaling modifiers for composite metrics
servingEngineSpec.modelSpec[].keda.advanced.scalingModifiers.target string - Target value for the composed metric
servingEngineSpec.modelSpec[].keda.advanced.scalingModifiers.activationTarget string - Activation target for the composed metric
servingEngineSpec.modelSpec[].keda.advanced.scalingModifiers.metricType string "AverageValue" Metric type (AverageValue or Value)
servingEngineSpec.modelSpec[].keda.advanced.scalingModifiers.formula string - Formula to compose metrics together
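A sketch of a KEDA configuration that scales on the vllm_num_requests_waiting metric (the Prometheus address and threshold are illustrative; KEDA must already be installed in the cluster):

```yaml
servingEngineSpec:
  modelSpec:
    - name: "llama3"
      # ... other modelSpec fields ...
      keda:
        enabled: true
        minReplicaCount: 1
        maxReplicaCount: 4
        pollingInterval: 15
        cooldownPeriod: 300
        triggers:
          - type: prometheus
            metadata:
              serverAddress: http://prometheus-operated.monitoring.svc:9090
              metricName: vllm_num_requests_waiting
              query: sum(vllm_num_requests_waiting)
              threshold: "5"
```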

Serving Engine Monitoring Configuration

Field Type Default Description
servingEngineSpec.serviceMonitor.enabled boolean false Specifies whether to create a ServiceMonitor resource for collecting Prometheus metrics
servingEngineSpec.serviceMonitor.additionalLabels map {} Additional labels
servingEngineSpec.serviceMonitor.interval string 30s Interval to scrape metrics
servingEngineSpec.serviceMonitor.scrapeTimeout string 25s Timeout if metrics can't be retrieved in given time interval
servingEngineSpec.serviceMonitor.honorLabels boolean false Whether to keep scraped label values on conflict (when false, Prometheus renames conflicting scraped labels with an exported_ prefix)
servingEngineSpec.serviceMonitor.metricRelabelings list [] Metric relabel configs to apply to samples before ingestion (see the Prometheus metric relabeling documentation)
servingEngineSpec.serviceMonitor.relabelings list [] Relabel configs to apply to samples before ingestion (see the Prometheus relabeling documentation)

Router Configuration

Field Type Default Description
routerSpec.enableRouter boolean true Whether to enable the router service
routerSpec.repository string "lmcache/lmstack-router" Docker image repository for the router
routerSpec.tag string "latest" Docker image tag for the router
routerSpec.imagePullPolicy string "Always" Image pull policy for the router
routerSpec.imagePullSecrets list [] Image pull secrets for private container registries
routerSpec.replicaCount integer 1 Number of replicas for the router pod
routerSpec.pdb.enabled boolean false Whether to create a PodDisruptionBudget for the router
routerSpec.pdb.labels map {} Labels to add to the PodDisruptionBudget
routerSpec.pdb.annotations map {} Annotations to add to the PodDisruptionBudget
routerSpec.pdb.minAvailable string "" Number of pods that must remain available after eviction, as a number or percentage (e.g., 50%)
routerSpec.pdb.maxUnavailable string "" Number of pods that may be unavailable after eviction, as a number or percentage (e.g., 50%)
routerSpec.priorityClassName string "" Priority class for router
routerSpec.containerPort integer 8000 Port the router container is listening on
routerSpec.serviceType string "ClusterIP" Kubernetes service type for the router
routerSpec.serviceAnnotations map {} Service annotations for LoadBalancer/NodePort
routerSpec.servicePort integer 80 Port the router service will listen on
routerSpec.serviceDiscovery string "k8s" Service discovery mode ("k8s" or "static")
routerSpec.k8sServiceDiscoveryType string "pod-ip" Service discovery type ("pod-ip" or "service-name") if serviceDiscovery is "k8s"
routerSpec.staticBackends string "" Comma-separated list of backend addresses if serviceDiscovery is "static"
routerSpec.staticModels string "" Comma-separated list of model names if serviceDiscovery is "static"
routerSpec.routingLogic string "roundrobin" Routing logic: "roundrobin", "session", "prefixaware", or "kvaware"
routerSpec.sessionKey string "" Session key if using "session" routing logic
routerSpec.extraArgs list [] Extra command line arguments to pass to the router
routerSpec.engineScrapeInterval integer 15 Interval in seconds to scrape metrics from the serving engine
routerSpec.requestStatsWindow integer 60 Window size in seconds for calculating request statistics
routerSpec.strategy map {} Deployment strategy for the router pods
routerSpec.vllmApiKey string/map null (Optional) API key for securing vLLM models
routerSpec.resources.requests.cpu string "4" CPU requests for router
routerSpec.resources.requests.memory string "16G" Memory requests for router
routerSpec.resources.limits.cpu string "8" CPU limits for router
routerSpec.resources.limits.memory string "32G" Memory limits for router
routerSpec.labels map {environment: "router", release: "router"} Customized labels for the router deployment
routerSpec.podAnnotations map {} (Optional) Annotations to add to the pod, e.g., {model: "opt125m"}
routerSpec.affinity map {} (Optional) Affinity configuration. If specified, this takes precedence over nodeSelectorTerms.
routerSpec.nodeSelectorTerms list [] (Optional) Node selector terms. This is ignored if affinity is specified.
routerSpec.hf_token string "" Hugging Face token for router
routerSpec.lmcacheControllerPort integer - LMCache controller port, used when routingLogic is "kvaware" (e.g., 9000)
routerSpec.lmcacheConfig.logLevel string "INFO" Log level for LMCache in the router when routingLogic is kvaware
routerSpec.livenessProbe.initialDelaySeconds integer 30 Initial delay in seconds for router's liveness probe
routerSpec.livenessProbe.periodSeconds integer 5 Interval in seconds for router's liveness probe
routerSpec.livenessProbe.failureThreshold integer 3 Failure threshold for router's liveness probe
routerSpec.livenessProbe.httpGet.path string "/health" Endpoint that the router's liveness probe will be testing
routerSpec.startupProbe.initialDelaySeconds integer 5 Initial delay in seconds for router's startup probe
routerSpec.startupProbe.periodSeconds integer 5 Interval in seconds for router's startup probe
routerSpec.startupProbe.failureThreshold integer 3 Failure threshold for router's startup probe
routerSpec.startupProbe.httpGet.path string "/health" Endpoint that the router's startup probe will be testing
routerSpec.readinessProbe.initialDelaySeconds integer 30 Initial delay in seconds for router's readiness probe
routerSpec.readinessProbe.periodSeconds integer 5 Interval in seconds for router's readiness probe
routerSpec.readinessProbe.failureThreshold integer 3 Failure threshold for router's readiness probe
routerSpec.readinessProbe.httpGet.path string "/health" Endpoint that the router's readiness probe will be testing
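For example, a router using static service discovery with session-based routing might be configured as follows (backend addresses, model names, and the session header are illustrative):

```yaml
routerSpec:
  enableRouter: true
  serviceDiscovery: "static"
  staticBackends: "http://192.168.0.10:8000,http://192.168.0.11:8000"
  staticModels: "facebook/opt-125m,facebook/opt-125m"
  routingLogic: "session"
  sessionKey: "x-user-id"
```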

Router OpenTelemetry Configuration

Field Type Default Description
routerSpec.otel.endpoint string "" OTLP endpoint for tracing (e.g., "otel-collector:4317"). Tracing is enabled when this is set.
routerSpec.otel.serviceName string "vllm-router" Service name for OpenTelemetry traces
routerSpec.otel.secure boolean false Use secure (TLS) connection for OTLP exporter

Router Monitoring Configuration

Field Type Default Description
routerSpec.serviceMonitor.enabled boolean false Specifies whether to create a ServiceMonitor resource for collecting Prometheus metrics
routerSpec.serviceMonitor.additionalLabels map {} Additional labels
routerSpec.serviceMonitor.interval string 30s Interval to scrape metrics
routerSpec.serviceMonitor.scrapeTimeout string 25s Timeout if metrics can't be retrieved in given time interval
routerSpec.serviceMonitor.honorLabels boolean false Whether to keep scraped label values on conflict (when false, Prometheus renames conflicting scraped labels with an exported_ prefix)
routerSpec.serviceMonitor.metricRelabelings list [] Metric relabel configs to apply to samples before ingestion (see the Prometheus metric relabeling documentation)
routerSpec.serviceMonitor.relabelings list [] Relabel configs to apply to samples before ingestion (see the Prometheus relabeling documentation)

Router Ingress Configuration

Field Type Default Description
routerSpec.ingress.enabled boolean false Enable Ingress controller resource for the router
routerSpec.ingress.className string "" IngressClass to use for the router Ingress resource
routerSpec.ingress.annotations map {} Additional annotations for the router Ingress resource
routerSpec.ingress.hosts list [{host: "vllm-router.local", paths: [{path: /, pathType: Prefix}]}] List of hostnames covered by the router Ingress record
routerSpec.ingress.tls list [] TLS configuration for hostnames covered by the router Ingress record
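An illustrative ingress configuration exposing the router through an NGINX ingress class with TLS (the hostname and secret name are assumptions):

```yaml
routerSpec:
  ingress:
    enabled: true
    className: "nginx"
    hosts:
      - host: vllm-router.local
        paths:
          - path: /
            pathType: Prefix
    tls:
      - secretName: vllm-router-tls
        hosts:
          - vllm-router.local
```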

Cache Server Configuration

Field Type Default Description
cacheserverSpec.enableServer boolean false Whether to enable the cache server deployment
cacheserverSpec.image.repository string "lmcache/lmstack-cache-server" Docker image repository for the cache server
cacheserverSpec.image.tag string "latest" Docker image tag for the cache server
cacheserverSpec.image.pullPolicy string "Always" Image pull policy for the cache server
cacheserverSpec.imagePullSecrets list [] Image pull secrets for private container registries
cacheserverSpec.replicaCount integer 1 Number of replicas for the cache server pod
cacheserverSpec.containerPort integer 8000 Port the cache server container is listening on
cacheserverSpec.serviceType string "ClusterIP" Kubernetes service type for the cache server
cacheserverSpec.servicePort integer 80 Port the cache server service will listen on
cacheserverSpec.resources.requests.cpu string "1" CPU requests for cache server
cacheserverSpec.resources.requests.memory string "2G" Memory requests for cache server
cacheserverSpec.resources.limits.cpu string "2" CPU limits for cache server
cacheserverSpec.resources.limits.memory string "4G" Memory limits for cache server
cacheserverSpec.labels map {environment: "cache", release: "cache"} Customized labels for the cache server deployment
cacheserverSpec.strategy map {} Deployment strategy for the cache server pods
cacheserverSpec.startupProbe map {initialDelaySeconds: 15, periodSeconds: 10, failureThreshold: 60, httpGet: {path: /health, port: 8000}} Configuration for the startup probe
cacheserverSpec.livenessProbe map {initialDelaySeconds: 15, periodSeconds: 10, failureThreshold: 3, httpGet: {path: /health, port: 8000}} Configuration for the liveness probe
cacheserverSpec.maxUnavailablePodDisruptionBudget string "" Configuration for the PodDisruptionBudget
cacheserverSpec.tolerations list [] Tolerations configuration for the cache server pods
cacheserverSpec.runtimeClassName string "" RuntimeClassName configuration for the cache server pods
cacheserverSpec.schedulerName string "" SchedulerName configuration for the cache server pods
cacheserverSpec.securityContext map {} Pod-level security context configuration
cacheserverSpec.containerSecurityContext map {runAsNonRoot: false} Container-level security context configuration
cacheserverSpec.priorityClassName string - Priority class for cache server
cacheserverSpec.affinity map - (Optional) Affinity configuration. If specified, this takes precedence over nodeSelectorTerms.
cacheserverSpec.nodeSelectorTerms list - (Optional) Node selector terms. This is ignored if affinity is specified.
cacheserverSpec.serde string - Serialization/deserialization format

cacheserverSpec.resources is passed through directly to the pod container resources block, so you can use extended resource keys (for example rdma/ib) in addition to cpu/memory.
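For instance, a cache server requesting an extended RDMA resource alongside cpu/memory could be sketched as follows (the rdma/ib resource key and serde value are illustrative; check values.yaml):

```yaml
cacheserverSpec:
  enableServer: true
  serde: "naive"  # illustrative serialization format; see values.yaml
  resources:
    requests:
      cpu: "1"
      memory: "2G"
      rdma/ib: 1
    limits:
      cpu: "2"
      memory: "4G"
      rdma/ib: 1
```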

LoRA Adapters Configuration

Field Type Default Description
loraAdapters list [] Array of LoRA adapter instances to deploy
loraAdapters[].name string - Name of the LoRA adapter instance
loraAdapters[].baseModel string - Name of the base model this adapter is for
loraAdapters[].vllmApiKey.secretRef.secretName string - Name of the secret containing API key
loraAdapters[].vllmApiKey.secretRef.secretKey string - Key in the secret containing API key
loraAdapters[].vllmApiKey.value string - Direct API key value
loraAdapters[].adapterSource.type string - Type of adapter source (local, s3, http, huggingface)
loraAdapters[].adapterSource.adapterName string - Name of the adapter to apply
loraAdapters[].adapterSource.adapterPath string - Path to the LoRA adapter weights
loraAdapters[].adapterSource.repository string - Repository to get the LoRA adapter from
loraAdapters[].adapterSource.pattern string - Pattern to use for the adapter name
loraAdapters[].adapterSource.maxAdapters integer - Maximum number of adapters to load
loraAdapters[].adapterSource.credentials.secretName string - Name of secret with storage credentials
loraAdapters[].adapterSource.credentials.secretKey string - Key in secret containing credentials
loraAdapters[].loraAdapterDeploymentConfig.algorithm string - Placement algorithm (default, ordered, equalized)
loraAdapters[].loraAdapterDeploymentConfig.replicas integer - Number of replicas that should load this adapter
loraAdapters[].labels map - Additional labels for the LoRA adapter
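A hedged example of a LoRA adapter loaded from Hugging Face onto one replica of a base model (the adapter name, path, and base model name are illustrative):

```yaml
loraAdapters:
  - name: "llama3-sql-lora"
    baseModel: "llama3"
    adapterSource:
      type: "huggingface"
      adapterName: "sql-lora"
      adapterPath: "yard1/llama-2-7b-sql-lora-test"
    loraAdapterDeploymentConfig:
      algorithm: "default"
      replicas: 1
```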

LoRA Controller Configuration

Field Type Default Description
loraController.enableLoraController boolean false Whether to enable the LoRA controller
loraController.kubernetesClusterDomain string "cluster.local" Kubernetes cluster domain
loraController.replicaCount integer 1 Number of LoRA controller replicas
loraController.image.repository string "lmcache/lmstack-lora-controller" Docker image repository
loraController.image.tag string "latest" Docker image tag
loraController.image.pullPolicy string "IfNotPresent" Image pull policy
loraController.imagePullSecrets list [] Image pull secrets
loraController.annotations map {} Deployment annotations
loraController.labels map {} Deployment labels
loraController.podAnnotations map {} Pod annotations
loraController.podLabels map {} Pod labels
loraController.podSecurityContext.runAsNonRoot boolean true Run as non-root user
loraController.podSecurityContext.seccompProfile.type string RuntimeDefault Seccomp profile type
loraController.containerSecurityContext.allowPrivilegeEscalation boolean false Allow privilege escalation
loraController.containerSecurityContext.capabilities.drop list ["ALL"] Drop capabilities
loraController.resources map {} Resource requests and limits
loraController.nodeSelector map {} Node selector
loraController.affinity map {} Affinity configuration
loraController.tolerations list [] Tolerations configuration
loraController.env list [] Environment variables
loraController.extraArgs list [] Extra arguments for the controller
loraController.metrics.enabled boolean true Whether to expose lora controller metrics
loraController.pdb.enabled boolean false Whether to create a PodDisruptionBudget for the loraController
loraController.pdb.labels map {} Labels to add to the PodDisruptionBudget
loraController.pdb.annotations map {} Annotations to add to the PodDisruptionBudget
loraController.pdb.minAvailable string "" Number of pods that must remain available after eviction, as a number or percentage (e.g., 50%)
loraController.pdb.maxUnavailable string "" Number of pods that may be unavailable after eviction, as a number or percentage (e.g., 50%)

Shared Storage Configuration

Field Type Default Description
sharedStorage.enabled boolean false Whether to enable shared storage for the models
sharedStorage.size string "100Gi" Size of the shared storage volume
sharedStorage.accessModes list ["ReadWriteOnce"] Access modes for the shared storage volume
sharedStorage.storageClass string "standard" Storage class name for the shared storage volume
sharedStorage.hostPath string "" Host path for the shared storage volume (for local testing only)
sharedStorage.nfs.server string "" NFS server address for the shared storage volume
sharedStorage.nfs.path string "" NFS export path for the shared storage volume
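As an illustrative example, backing the shared model storage with an NFS export (the server address and export path are assumptions):

```yaml
sharedStorage:
  enabled: true
  size: "100Gi"
  accessModes:
    - ReadWriteMany
  storageClass: ""
  nfs:
    server: "nfs.example.internal"
    path: "/exports/models"
```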

Other Configuration

Field Type Default Description
grafanaDashboards.enabled boolean false Whether to deploy grafana dashboards as configmaps.
grafanaDashboards.annotations map {} Annotations to add to the configmaps.
grafanaDashboards.labels map {grafana_dashboard: "1"} Labels for the configmaps
extraObjects list [] Array of extra K8s manifests to deploy. Supports use of custom Helm templates
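For example, extraObjects can carry an additional manifest rendered through Helm templating (the ConfigMap below is illustrative):

```yaml
extraObjects:
  - apiVersion: v1
    kind: ConfigMap
    metadata:
      name: "{{ .Release.Name }}-extra-config"
    data:
      note: "rendered with Helm template expressions"
```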

Observability

Grafana dashboard to monitor the deployment

Deploy the observability stack

On a cluster with the Prometheus Operator installed

Install the chart with the following helm values to create the ServiceMonitor resources and Grafana dashboards.

servingEngineSpec:
  serviceMonitor:
    enabled: true
routerSpec:
  serviceMonitor:
    enabled: true
grafanaDashboards:
  enabled: true

On an empty cluster

The vllm-stack chart embeds kube-prometheus-stack as a subchart. Install the chart with the following helm values to deploy Prometheus and Grafana.

servingEngineSpec:
  serviceMonitor:
    enabled: true
routerSpec:
  serviceMonitor:
    enabled: true
grafanaDashboards:
  enabled: true

kube-prometheus-stack:
  enabled: true

Access the Grafana UI

Forward the Grafana dashboard port to the local node-port

kubectl port-forward svc/<release-name>-grafana 8080:80

Open http://<IP of your node>:8080 to access the Grafana web UI. The default username is admin, and the password can be configured in the values (by default it is generated by Helm and stored in the secret <release-name>-grafana).

Use Prometheus Adapter to export vLLM metrics

The Prometheus Adapter can expose the vLLM metrics collected by Prometheus through the Kubernetes custom metrics API.

Install the chart with the following helm values to deploy prometheus-adapter

prometheus-adapter:
  enabled: true

We provide a minimal example of how to use the Prometheus Adapter to export vLLM metrics. See values.yaml for more details.

The exported metrics can be used for different purposes, such as horizontal scaling of the vLLM deployments.

To verify that the metrics are being exported, use the following command:

kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq | grep vllm_num_requests_waiting -C 10

You should see something like the following:

    {
      "name": "namespaces/vllm_num_requests_waiting",
      "singularName": "",
      "namespaced": false,
      "kind": "MetricValueList",
      "verbs": [
        "get"
      ]
    }

The following command will show the current value of the metric:

kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/namespaces/default/metrics/vllm_num_requests_waiting | jq

The output should look like the following:

{
  "kind": "MetricValueList",
  "apiVersion": "custom.metrics.k8s.io/v1beta1",
  "metadata": {},
  "items": [
    {
      "describedObject": {
        "kind": "Namespace",
        "name": "default",
        "apiVersion": "/v1"
      },
      "metricName": "vllm_num_requests_waiting",
      "timestamp": "2025-03-02T01:56:01Z",
      "value": "0",
      "selector": null
    }
  ]
}