diff --git a/docs/source/use_cases/autoscaling-keda.rst b/docs/source/use_cases/autoscaling-keda.rst
index 1515acf17..624327cc5 100644
--- a/docs/source/use_cases/autoscaling-keda.rst
+++ b/docs/source/use_cases/autoscaling-keda.rst
@@ -1,7 +1,7 @@
 Autoscaling with KEDA
 =====================

-This tutorial shows you how to automatically scale a vLLM deployment using `KEDA <https://keda.sh/>`_ and Prometheus-based metrics. You'll configure KEDA to monitor queue length and dynamically adjust the number of replicas based on load.
+This tutorial shows you how to automatically scale a vLLM deployment using `KEDA <https://keda.sh/>`_ and Prometheus-based metrics. With the vLLM Production Stack Helm chart (v0.1.9+), KEDA autoscaling is integrated directly into the chart, allowing you to enable it through simple ``values.yaml`` configuration.

 Table of Contents
 -----------------
@@ -9,56 +9,81 @@ Table of Contents

 - Prerequisites_
 - Steps_

-  - `1. Install the vLLM Production Stack`_
-  - `2. Deploy the Observability Stack`_
+  - `1. Deploy the Observability Stack`_
+  - `2. Configure and Deploy vLLM`_
   - `3. Install KEDA`_
-  - `4. Verify Metric Export`_
-  - `5. Configure the ScaledObject`_
+  - `4. Enable KEDA Autoscaling for vLLM`_
+  - `5. Verify KEDA ScaledObject Creation`_
   - `6. Test Autoscaling`_
-  - `7. Cleanup`_
+  - `7. Advanced Configuration`_
+  - `8. Cleanup`_

 - `Additional Resources`_

 Prerequisites
 -------------

-- A working vLLM deployment on Kubernetes (see :doc:`../getting_started/quickstart`)
 - Access to a Kubernetes cluster with at least 2 GPUs
-- ``kubectl`` and ``helm`` installed
+- ``kubectl`` and ``helm`` installed (v3.0+)
 - Basic understanding of Kubernetes and Prometheus metrics

 Steps
 -----

-1. Install the vLLM Production Stack
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Install the production stack using a single pod by following the instructions in :doc:`../deployment/helm`.
-
-2. Deploy the Observability Stack
+1. Deploy the Observability Stack
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-This stack includes Prometheus, Grafana, and necessary exporters.
+The observability stack (Prometheus, Grafana) is required for KEDA to query metrics.

 .. code-block:: bash

    cd observability
    bash install.sh

-3. Install KEDA
-~~~~~~~~~~~~~~~
+Verify Prometheus is scraping the queue length metric ``vllm:num_requests_waiting``:

 .. code-block:: bash

-   kubectl create namespace keda
-   helm repo add kedacore https://kedacore.github.io/charts
-   helm repo update
-   helm install keda kedacore/keda --namespace keda
+   kubectl port-forward svc/prometheus-operated -n monitoring 9090:9090
+
+In a separate terminal:
+
+.. code-block:: bash
+
+   curl -G 'http://localhost:9090/api/v1/query' --data-urlencode 'query=vllm:num_requests_waiting'
+
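+If Prometheus is scraping the metric, the response should look similar to the following (the timestamp and pod name will differ):
+
+.. code-block:: json
+
+   {
+     "status": "success",
+     "data": {
+       "result": [
+         {
+           "metric": {
+             "__name__": "vllm:num_requests_waiting",
+             "pod": "vllm-llama3-deployment-vllm-xxxxx"
+           },
+           "value": [ 1749077215.034, "0" ]
+         }
+       ]
+     }
+   }
+
+Here ``"0"`` means there were no pending requests in the queue at the given timestamp.
+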
+2. Configure and Deploy vLLM
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Create a ``values.yaml`` file to deploy vLLM. Note that we'll enable KEDA autoscaling in a later step after KEDA is installed:
+
+.. code-block:: yaml
+
+   servingEngineSpec:
+     enableEngine: true
+     modelSpec:
+       - name: "llama3"
+         repository: "lmcache/vllm-openai"
+         tag: "latest"
+         modelURL: "meta-llama/Llama-3.1-8B-Instruct"
+         replicaCount: 1
+         requestCPU: 10
+         requestMemory: "64Gi"
+         requestGPU: 1
+
+Deploy the chart:
+
+.. code-block:: bash
+
+   helm install vllm vllm/vllm-stack -f values.yaml

-4. Verify Metric Export
-~~~~~~~~~~~~~~~~~~~~~~~
+Wait for the vLLM deployment to be ready and verify that metrics are being exposed:

-Check that Prometheus is scraping the queue length metric ``vllm:num_requests_waiting``.

 .. code-block:: bash
+
+   kubectl wait --for=condition=ready pod -l model=llama3 --timeout=300s
+
+Verify Prometheus is scraping the vLLM metrics:

 .. code-block:: bash
@@ -70,115 +95,216 @@ In a separate terminal:

    curl -G 'http://localhost:9090/api/v1/query' --data-urlencode 'query=vllm:num_requests_waiting'

-Example output:
+3. Install KEDA
+~~~~~~~~~~~~~~~

-.. code-block:: json
+Now that vLLM is running and exposing metrics, install KEDA to enable autoscaling:

-   {
-     "status": "success",
-     "data": {
-       "result": [
-         {
-           "metric": {
-             "__name__": "vllm:num_requests_waiting",
-             "pod": "vllm-llama3-deployment-vllm-xxxxx"
-           },
-           "value": [ 1749077215.034, "0" ]
-         }
-       ]
-     }
-   }
+.. code-block:: bash

-This means that at the given timestamp, there were 0 pending requests in the queue.
+   kubectl create namespace keda
+   helm repo add kedacore https://kedacore.github.io/charts
+   helm repo update
+   helm install keda kedacore/keda --namespace keda

-5. Configure the ScaledObject
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Verify KEDA is running:
+
+.. code-block:: bash

-The following ``ScaledObject`` configuration is provided in ``tutorials/assets/values-19-keda.yaml``. Review its contents:
+   kubectl get pods -n keda
+
+4. Enable KEDA Autoscaling for vLLM
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Update your ``values.yaml`` file to enable KEDA autoscaling:

 .. code-block:: yaml

-   apiVersion: keda.sh/v1alpha1
-   kind: ScaledObject
-   metadata:
-     name: vllm-scaledobject
-     namespace: default
-   spec:
-     scaleTargetRef:
-       name: vllm-llama3-deployment-vllm
-     minReplicaCount: 1
-     maxReplicaCount: 2
-     pollingInterval: 15
-     cooldownPeriod: 30
-     triggers:
-       - type: prometheus
-         metadata:
-           serverAddress: http://prometheus-operated.monitoring.svc:9090
-           metricName: vllm:num_requests_waiting
-           query: vllm:num_requests_waiting
-           threshold: '5'
+   servingEngineSpec:
+     enableEngine: true
+     modelSpec:
+       - name: "llama3"
+         repository: "lmcache/vllm-openai"
+         tag: "latest"
+         modelURL: "meta-llama/Llama-3.1-8B-Instruct"
+         replicaCount: 1
+         requestCPU: 10
+         requestMemory: "64Gi"
+         requestGPU: 1
+
+         # Enable KEDA autoscaling
+         keda:
+           enabled: true
+           minReplicaCount: 1
+           maxReplicaCount: 3
+           pollingInterval: 15
+           cooldownPeriod: 360
+           triggers:
+             - type: prometheus
+               metadata:
+                 serverAddress: http://prometheus-operated.monitoring.svc:9090
+                 metricName: vllm:num_requests_waiting
+                 query: vllm:num_requests_waiting
+                 threshold: '5'
+
+Upgrade the chart to enable KEDA autoscaling:
+
+.. code-block:: bash

-Apply the ScaledObject:
+   helm upgrade vllm vllm/vllm-stack -f values.yaml
+
+This configuration tells KEDA to:
+
+- Monitor the ``vllm:num_requests_waiting`` metric from Prometheus
+- Maintain between 1 and 3 replicas
+- Scale up when the queue exceeds 5 pending requests
+- Check metrics every 15 seconds
+- Wait 360 seconds after the last trigger reports inactive before scaling back down (the ``cooldownPeriod``)
+
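+A quick way to sanity-check these values is to render the ScaledObject manifest locally before (or after) the upgrade. This sketch assumes the release name ``vllm`` used throughout this tutorial; ``-s`` (short for ``--show-only``) restricts the output to the chart's ScaledObject template:
+
+.. code-block:: bash
+
+   helm template vllm vllm/vllm-stack -f values.yaml -s templates/scaledobject-vllm.yaml
+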
+5. Verify KEDA ScaledObject Creation
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Check that the Helm chart created the ScaledObject resource:

 .. code-block:: bash

-   cd ../tutorials
-   kubectl apply -f assets/values-19-keda.yaml
+   kubectl get scaledobjects

-This tells KEDA to:
+You should see:

-- Monitor ``vllm:num_requests_waiting``
-- Scale between 1 and 2 replicas
-- Scale up when the queue exceeds 5 requests
+.. code-block:: text

-6. Test Autoscaling
-~~~~~~~~~~~~~~~~~~~
+   NAME                       SCALETARGETKIND      SCALETARGETNAME               MIN   MAX   TRIGGERS     AUTHENTICATION   READY   ACTIVE   FALLBACK   PAUSED    AGE
+   vllm-llama3-scaledobject   apps/v1.Deployment   vllm-llama3-deployment-vllm   1     3     prometheus                    True    False    Unknown    Unknown   30s

-Watch the deployment:
+View the created HPA:

 .. code-block:: bash

-   kubectl get hpa -n default -w
+   kubectl get hpa

-You should initially see:
+Expected output:

 .. code-block:: text

-   NAME                         REFERENCE                                 TARGETS     MINPODS   MAXPODS   REPLICAS
-   keda-hpa-vllm-scaledobject   Deployment/vllm-llama3-deployment-vllm   0/5 (avg)   1         2         1
+   NAME                                REFERENCE                                 TARGETS     MINPODS   MAXPODS   REPLICAS
+   keda-hpa-vllm-llama3-scaledobject   Deployment/vllm-llama3-deployment-vllm   0/5 (avg)   1         3         1
+
+6. Test Autoscaling
+~~~~~~~~~~~~~~~~~~~
+
+Watch the HPA in real-time:

-``TARGETS`` shows the current metric value vs. the target threshold.
-``0/5 (avg)`` means the current value of ``vllm:num_requests_waiting`` is 0, and the threshold is 5.
+.. code-block:: bash
+
+   kubectl get hpa -n default -w

-Generate load:
+Generate load to trigger autoscaling. Port-forward to the router service:

 .. code-block:: bash

    kubectl port-forward svc/vllm-router-service 30080:80

-In a separate terminal:
+In a separate terminal, run a load generator:

 .. code-block:: bash

-   python3 assets/example-10-load-generator.py --num-requests 100 --prompt-len 3000
+   python3 tutorials/assets/example-10-load-generator.py --num-requests 100 --prompt-len 3000
+
+Within a few minutes, you should see the ``REPLICAS`` value increase as KEDA scales up to handle the load.
+
+7. Advanced Configuration
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Scale-to-Zero
+^^^^^^^^^^^^^

-Within a few minutes, the ``REPLICAS`` value should increase to 2.
+Enable scale-to-zero by setting ``minReplicaCount: 0`` and adding a traffic-based keepalive trigger:

-7. Cleanup
+.. code-block:: yaml
+
+   keda:
+     enabled: true
+     minReplicaCount: 0  # Allow scaling to zero
+     maxReplicaCount: 5
+     triggers:
+       # Queue-based scaling
+       - type: prometheus
+         metadata:
+           serverAddress: http://prometheus-operated.monitoring.svc:9090
+           metricName: vllm:num_requests_waiting
+           query: vllm:num_requests_waiting
+           threshold: '5'
+       # Traffic-based keepalive (prevents scale-to-zero when traffic exists)
+       - type: prometheus
+         metadata:
+           serverAddress: http://prometheus-operated.monitoring.svc:9090
+           metricName: vllm:incoming_keepalive
+           query: sum(rate(vllm:num_incoming_requests_total[1m]) > bool 0)
+           threshold: "1"
+
+Custom HPA Behavior
+^^^^^^^^^^^^^^^^^^^
+
+Control scaling behavior with custom HPA policies:
+
+.. code-block:: yaml
+
+   keda:
+     enabled: true
+     minReplicaCount: 1
+     maxReplicaCount: 5
+     advanced:
+       horizontalPodAutoscalerConfig:
+         behavior:
+           scaleDown:
+             stabilizationWindowSeconds: 300
+             policies:
+               - type: Percent
+                 value: 50
+                 periodSeconds: 60
+
+Fallback Configuration
+^^^^^^^^^^^^^^^^^^^^^^
+
+Configure fallback behavior when metrics are unavailable:
+
+.. code-block:: yaml
+
+   keda:
+     enabled: true
+     fallback:
+       failureThreshold: 3
+       replicas: 2
+
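+With these values, if the Prometheus query fails 3 consecutive times, KEDA falls back to 2 replicas until the metric source recovers. To check whether fallback is currently active, inspect the ScaledObject status; the resource name below assumes the ``llama3`` model spec used in this tutorial:
+
+.. code-block:: bash
+
+   kubectl describe scaledobject vllm-llama3-scaledobject
+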
+For more configuration options, see the `Helm chart README <https://github.com/vllm-project/production-stack/blob/main/helm/README.md>`_.
+
+8. Cleanup
 ~~~~~~~~~~

-To remove KEDA configuration and observability components:
+To disable KEDA autoscaling, update your ``values.yaml`` to set ``keda.enabled: false`` and upgrade:
+
+.. code-block:: bash
+
+   helm upgrade vllm vllm/vllm-stack -f values.yaml
+
+To completely remove KEDA from the cluster:

 .. code-block:: bash

-   kubectl delete -f assets/values-19-keda.yaml
    helm uninstall keda -n keda
    kubectl delete namespace keda

-   cd ../observability
+To remove the observability stack:
+
+.. code-block:: bash
+
+   cd observability
    bash uninstall.sh

 Additional Resources
 --------------------

 - `KEDA Documentation <https://keda.sh/docs/>`_
+- `KEDA ScaledObject Specification <https://keda.sh/docs/latest/reference/scaledobject-spec/>`_
+- `Helm Chart KEDA Configuration <https://github.com/vllm-project/production-stack/blob/main/helm/README.md#keda-autoscaling-configuration>`_
diff --git a/helm/Chart.yaml b/helm/Chart.yaml
index 937ab7782..9592df114 100644
--- a/helm/Chart.yaml
+++ b/helm/Chart.yaml
@@ -15,7 +15,7 @@ type: application
 # This is the chart version. This version number should be incremented each time you make changes
 # to the chart and its templates, including the app version.
 # Versions are expected to follow Semantic Versioning (https://semver.org/)
-version: 0.1.8
+version: 0.1.9

 maintainers:
   - name: apostac
diff --git a/helm/README.md b/helm/README.md
index 20188ba66..798c61ee3 100644
--- a/helm/README.md
+++ b/helm/README.md
@@ -152,6 +152,39 @@ This table documents all available configuration values for the Production Stack
 | `servingEngineSpec.modelSpec[].lmcacheConfig.nixlPeerPort` | string | `"55555"` | NIXL peer port for KV transfer |
 | `servingEngineSpec.modelSpec[].lmcacheConfig.nixlBufferSize` | string | `"1073741824"` | NIXL buffer size for KV transfer |

+#### KEDA Autoscaling Configuration
+
+> **Note**: Unless explicitly set, KEDA's default values will apply. The defaults shown below are KEDA's defaults, not values enforced by this Helm chart.
+
+| Field | Type | KEDA Default | Description |
+|-------|------|--------------|-------------|
+| `servingEngineSpec.modelSpec[].keda.enabled` | boolean | `false` | Enable KEDA autoscaling for this model deployment (requires KEDA installed in cluster) |
+| `servingEngineSpec.modelSpec[].keda.minReplicaCount` | integer | - | Minimum number of replicas (supports 0 for scale-to-zero); if not set, HPA minReplicas default applies |
+| `servingEngineSpec.modelSpec[].keda.maxReplicaCount` | integer | - | Maximum number of replicas; if not set, HPA maxReplicas default applies |
+| `servingEngineSpec.modelSpec[].keda.pollingInterval` | integer | `30` | How often KEDA checks metrics (in seconds) |
+| `servingEngineSpec.modelSpec[].keda.cooldownPeriod` | integer | `300` | How long to wait after the last trigger reports inactive before scaling down (in seconds) |
+| `servingEngineSpec.modelSpec[].keda.idleReplicaCount` | integer | - | Number of replicas when no triggers are active |
+| `servingEngineSpec.modelSpec[].keda.initialCooldownPeriod` | integer | - | Initial cooldown period before scaling down after creation (in seconds) |
+| `servingEngineSpec.modelSpec[].keda.fallback` | map | - | Fallback configuration when scaler fails |
+| `servingEngineSpec.modelSpec[].keda.fallback.failureThreshold` | integer | - | Number of consecutive failures before fallback |
+| `servingEngineSpec.modelSpec[].keda.fallback.replicas` | integer | - | Number of replicas to scale to in fallback |
+| `servingEngineSpec.modelSpec[].keda.triggers` | list | See below | List of KEDA trigger configurations (Prometheus-based) |
+| `servingEngineSpec.modelSpec[].keda.triggers[].type` | string | - | Trigger type (e.g., `"prometheus"`) |
+| `servingEngineSpec.modelSpec[].keda.triggers[].metadata.serverAddress` | string | - | Prometheus server URL (e.g., `http://prometheus-operated.monitoring.svc:9090`) |
+| `servingEngineSpec.modelSpec[].keda.triggers[].metadata.metricName` | string | - | Name of the metric to monitor |
+| `servingEngineSpec.modelSpec[].keda.triggers[].metadata.query` | string | - | PromQL query to fetch the 
metric | +| `servingEngineSpec.modelSpec[].keda.triggers[].metadata.threshold` | string | - | Threshold value that triggers scaling | +| `servingEngineSpec.modelSpec[].keda.advanced` | map | - | Advanced KEDA configuration options | +| `servingEngineSpec.modelSpec[].keda.advanced.restoreToOriginalReplicaCount` | boolean | `false` | Restore original replica count when ScaledObject is deleted | +| `servingEngineSpec.modelSpec[].keda.advanced.horizontalPodAutoscalerConfig` | map | - | HPA-specific configuration | +| `servingEngineSpec.modelSpec[].keda.advanced.horizontalPodAutoscalerConfig.name` | string | `keda-hpa-{scaled-object-name}` | Custom name for HPA resource | +| `servingEngineSpec.modelSpec[].keda.advanced.horizontalPodAutoscalerConfig.behavior` | map | - | HPA scaling behavior configuration (see [K8s docs](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/)) | +| `servingEngineSpec.modelSpec[].keda.advanced.scalingModifiers` | map | - | Scaling modifiers for composite metrics | +| `servingEngineSpec.modelSpec[].keda.advanced.scalingModifiers.target` | string | - | Target value for the composed metric | +| `servingEngineSpec.modelSpec[].keda.advanced.scalingModifiers.activationTarget` | string | - | Activation target for the composed metric | +| `servingEngineSpec.modelSpec[].keda.advanced.scalingModifiers.metricType` | string | `"AverageValue"` | Metric type (AverageValue or Value) | +| `servingEngineSpec.modelSpec[].keda.advanced.scalingModifiers.formula` | string | - | Formula to compose metrics together | + ### Router Configuration | Field | Type | Default | Description | diff --git a/helm/templates/scaledobject-vllm.yaml b/helm/templates/scaledobject-vllm.yaml new file mode 100644 index 000000000..be813c03c --- /dev/null +++ b/helm/templates/scaledobject-vllm.yaml @@ -0,0 +1,74 @@ +{{- if .Values.servingEngineSpec.enableEngine -}} +{{- range $modelSpec := .Values.servingEngineSpec.modelSpec }} +{{- if and (hasKey $modelSpec "keda") $modelSpec.keda.enabled }} +{{- if not (hasKey $modelSpec "raySpec") }} +{{- with $ -}} +apiVersion: keda.sh/v1alpha1 +kind: ScaledObject +metadata: + name: "{{ .Release.Name }}-{{$modelSpec.name}}-scaledobject" + namespace: {{ .Release.Namespace }} + labels: + model: {{ $modelSpec.name }} + helm-release-name: {{ .Release.Name }} + {{- include "chart.engineLabels" . 
| nindent 4 }} +spec: + scaleTargetRef: + apiVersion: apps/v1 + kind: Deployment + name: "{{ .Release.Name }}-{{$modelSpec.name}}-deployment-vllm" + {{- if hasKey $modelSpec.keda "minReplicaCount" }} + minReplicaCount: {{ $modelSpec.keda.minReplicaCount }} + {{- end }} + {{- if hasKey $modelSpec.keda "maxReplicaCount" }} + maxReplicaCount: {{ $modelSpec.keda.maxReplicaCount }} + {{- end }} + {{- if hasKey $modelSpec.keda "pollingInterval" }} + pollingInterval: {{ $modelSpec.keda.pollingInterval }} + {{- end }} + {{- if hasKey $modelSpec.keda "cooldownPeriod" }} + cooldownPeriod: {{ $modelSpec.keda.cooldownPeriod }} + {{- end }} + {{- if hasKey $modelSpec.keda "idleReplicaCount" }} + idleReplicaCount: {{ $modelSpec.keda.idleReplicaCount }} + {{- end }} + {{- if hasKey $modelSpec.keda "initialCooldownPeriod" }} + initialCooldownPeriod: {{ $modelSpec.keda.initialCooldownPeriod }} + {{- end }} + {{- if hasKey $modelSpec.keda "fallback" }} + fallback: + {{- toYaml $modelSpec.keda.fallback | nindent 4 }} + {{- end }} + {{- if hasKey $modelSpec.keda "advanced" }} + advanced: + {{- if hasKey $modelSpec.keda.advanced "restoreToOriginalReplicaCount" }} + restoreToOriginalReplicaCount: {{ $modelSpec.keda.advanced.restoreToOriginalReplicaCount }} + {{- end }} + {{- if hasKey $modelSpec.keda.advanced "horizontalPodAutoscalerConfig" }} + horizontalPodAutoscalerConfig: + {{- toYaml $modelSpec.keda.advanced.horizontalPodAutoscalerConfig | nindent 6 }} + {{- end }} + {{- if hasKey $modelSpec.keda.advanced "scalingModifiers" }} + scalingModifiers: + {{- toYaml $modelSpec.keda.advanced.scalingModifiers | nindent 6 }} + {{- end }} + {{- end }} + {{- if hasKey $modelSpec.keda "triggers" }} + triggers: + {{- toYaml $modelSpec.keda.triggers | nindent 4 }} + {{- else }} + # Default trigger configuration - monitors vllm queue length + triggers: + - type: prometheus + metadata: + serverAddress: http://prometheus-operated.monitoring.svc:9090 + metricName: vllm:num_requests_waiting + query: vllm:num_requests_waiting{model="{{ $modelSpec.name }}"} + threshold: '5' + {{- end }} +--- +{{- end }} +{{- end }} +{{- end }} +{{- end }} +{{- end }} diff --git a/helm/tests/keda_test.yaml b/helm/tests/keda_test.yaml new file mode 100644 index 000000000..402e1809d --- /dev/null +++ b/helm/tests/keda_test.yaml @@ -0,0 +1,160 @@ +suite: test KEDA autoscaling configuration +templates: + - scaledobject-vllm.yaml + - deployment-vllm-multi.yaml +tests: + - it: should not create ScaledObject when keda is not enabled + set: + servingEngineSpec: + enableEngine: true + modelSpec: + - name: "test-model" + repository: "vllm/vllm-openai" + tag: "latest" + modelURL: "facebook/opt-125m" + replicaCount: 1 + requestCPU: 1 + requestMemory: "1Gi" + requestGPU: 1 + keda: + enabled: false + asserts: + - template: scaledobject-vllm.yaml + hasDocuments: + count: 0 + + - it: should create ScaledObject when keda is enabled + set: + servingEngineSpec: + enableEngine: true + modelSpec: + - name: "test-model" + repository: "vllm/vllm-openai" + tag: "latest" + modelURL: "facebook/opt-125m" + replicaCount: 1 + requestCPU: 1 + requestMemory: "1Gi" + requestGPU: 1 + keda: + enabled: true + minReplicaCount: 1 + maxReplicaCount: 3 + triggers: + - type: prometheus + metadata: + serverAddress: http://prometheus-operated.monitoring.svc:9090 + metricName: vllm:num_requests_waiting + query: vllm:num_requests_waiting + threshold: '5' + asserts: + - template: scaledobject-vllm.yaml + hasDocuments: + count: 1 + - template: scaledobject-vllm.yaml + equal: + path: 
metadata.name
+          value: RELEASE-NAME-test-model-scaledobject
+      - template: scaledobject-vllm.yaml
+        equal:
+          path: spec.minReplicaCount
+          value: 1
+      - template: scaledobject-vllm.yaml
+        equal:
+          path: spec.maxReplicaCount
+          value: 3
+
+  - it: should use default trigger when no triggers are provided
+    set:
+      servingEngineSpec:
+        enableEngine: true
+        modelSpec:
+          - name: "test-model-default"
+            repository: "vllm/vllm-openai"
+            tag: "latest"
+            modelURL: "facebook/opt-125m"
+            replicaCount: 1
+            requestCPU: 1
+            requestMemory: "1Gi"
+            requestGPU: 1
+            keda:
+              enabled: true
+              minReplicaCount: 1
+              maxReplicaCount: 5
+    asserts:
+      - template: scaledobject-vllm.yaml
+        hasDocuments:
+          count: 1
+      - template: scaledobject-vllm.yaml
+        equal:
+          path: spec.triggers[0].type
+          value: prometheus
+      - template: scaledobject-vllm.yaml
+        equal:
+          path: spec.triggers[0].metadata.query
+          value: vllm:num_requests_waiting{model="test-model-default"}
+
+  - it: should support advanced KEDA configuration
+    set:
+      servingEngineSpec:
+        enableEngine: true
+        modelSpec:
+          - name: "test-model-advanced"
+            repository: "vllm/vllm-openai"
+            tag: "latest"
+            modelURL: "facebook/opt-125m"
+            replicaCount: 1
+            requestCPU: 1
+            requestMemory: "1Gi"
+            requestGPU: 1
+            keda:
+              enabled: true
+              minReplicaCount: 0
+              maxReplicaCount: 10
+              pollingInterval: 10
+              cooldownPeriod: 60
+              idleReplicaCount: 0
+              initialCooldownPeriod: 30
+              fallback:
+                failureThreshold: 5
+                replicas: 2
+              advanced:
+                restoreToOriginalReplicaCount: true
+                horizontalPodAutoscalerConfig:
+                  name: my-hpa
+                  behavior:
+                    scaleDown:
+                      stabilizationWindowSeconds: 300
+    asserts:
+      - template: scaledobject-vllm.yaml
+        equal:
+          path: spec.pollingInterval
+          value: 10
+      - template: scaledobject-vllm.yaml
+        equal:
+          path: spec.cooldownPeriod
+          value: 60
+      - template: scaledobject-vllm.yaml
+        equal:
+          path: spec.idleReplicaCount
+          value: 0
+      - template: scaledobject-vllm.yaml
+        equal:
+          path: spec.initialCooldownPeriod
+          value: 30
+      - template: scaledobject-vllm.yaml
+        equal:
+          path: spec.fallback.failureThreshold
+          value: 5
+      - template: scaledobject-vllm.yaml
+        equal:
+          path: spec.fallback.replicas
+          value: 2
+      - template: scaledobject-vllm.yaml
+        equal:
+          path: spec.advanced.restoreToOriginalReplicaCount
+          value: true
+      - template: scaledobject-vllm.yaml
+        equal:
+          path: spec.advanced.horizontalPodAutoscalerConfig.name
+          value: my-hpa
diff --git a/helm/values.schema.json b/helm/values.schema.json
index ef618fa5a..954d90e7c 100644
--- a/helm/values.schema.json
+++ b/helm/values.schema.json
@@ -306,6 +306,111 @@
           }
         }
       }
+    },
+    "keda": {
+      "type": "object",
+      "description": "KEDA autoscaling configuration for this model deployment",
+      "properties": {
+        "enabled": {
+          "type": "boolean",
+          "description": "Whether to enable KEDA autoscaling for this model"
+        },
+        "minReplicaCount": {
+          "type": "integer",
+          "description": "Minimum number of replicas (supports 0 for scale-to-zero)",
+          "minimum": 0
+        },
+        "maxReplicaCount": {
+          "type": "integer",
+          "description": "Maximum number of replicas",
+          "minimum": 1
+        },
+        "pollingInterval": {
+          "type": "integer",
+          "description": "How often KEDA should check the metrics (in seconds)",
+          "minimum": 1
+        },
+        "cooldownPeriod": {
+          "type": "integer",
+          "description": "How long to wait after the last trigger reports inactive before scaling down (in seconds)",
+          "minimum": 0
+        },
+        "idleReplicaCount": {
+          "type": "integer",
+          "description": "Number of replicas to scale to when no triggers are active",
+          "minimum": 0
+        },
+        "initialCooldownPeriod": {
+          "type": "integer",
+          "description": 
"Initial cooldown period in seconds before scaling down after creation", + "minimum": 0 + }, + "fallback": { + "type": "object", + "description": "Fallback configuration when scaler fails", + "properties": { + "failureThreshold": { + "type": "integer", + "description": "Number of consecutive failures before fallback", + "minimum": 1 + }, + "replicas": { + "type": "integer", + "description": "Number of replicas to scale to in fallback", + "minimum": 0 + } + }, + "required": [ + "failureThreshold", + "replicas" + ] + }, + "triggers": { + "type": "array", + "description": "List of KEDA trigger configurations", + "items": { + "type": "object", + "properties": { + "type": { + "type": "string", + "description": "Trigger type (e.g., prometheus, cpu, memory)" + }, + "metadata": { + "type": "object", + "description": "Trigger-specific metadata", + "additionalProperties": true + } + }, + "required": [ + "type", + "metadata" + ] + } + }, + "advanced": { + "type": "object", + "description": "Advanced KEDA configuration", + "properties": { + "restoreToOriginalReplicaCount": { + "type": "boolean", + "description": "Restore original replica count when ScaledObject is deleted" + }, + "horizontalPodAutoscalerConfig": { + "type": "object", + "description": "HPA-specific configuration", + "additionalProperties": true + }, + "scalingModifiers": { + "type": "object", + "description": "Scaling modifiers for composite metrics", + "additionalProperties": true + } + } + } + }, + "required": [ + "enabled" + ] } }, "required": [ @@ -492,16 +597,26 @@ "type": "string" } }, - "required": ["name"] + "required": [ + "name" + ] } }, "autoscaling": { "type": "object", "properties": { - "enabled": {"type": "boolean"}, - "minReplicas": {"type": "integer"}, - "maxReplicas": {"type": "integer"}, - "targetCPUUtilizationPercentage": {"type": "integer"} + "enabled": { + "type": "boolean" + }, + "minReplicas": { + "type": "integer" + }, + "maxReplicas": { + "type": "integer" + }, + "targetCPUUtilizationPercentage": { + "type": "integer" + } } }, "containerPort": { @@ -517,19 +632,24 @@ "description": "Container-level security context configuration", "additionalProperties": true }, - "servicePort": { "type": "integer" }, "serviceDiscovery": { "type": "string", "description": "Service discovery mode. Available values: k8s or static.", - "enum": ["k8s", "static"] + "enum": [ + "k8s", + "static" + ] }, "k8sServiceDiscoveryType": { "type": "string", "description": "Kubernetes service discovery type. Available values: pod-ip and service-name.", - "enum": ["pod-ip", "service-name"] + "enum": [ + "pod-ip", + "service-name" + ] }, "routingLogic": { "type": "string" diff --git a/helm/values.yaml b/helm/values.yaml index f011ac552..ecd2c0607 100644 --- a/helm/values.yaml +++ b/helm/values.yaml @@ -119,6 +119,35 @@ servingEngineSpec: # - shmSize: (optional, string) The size of the shared memory, e.g., "20Gi" # - enableLoRA: (optional, bool) Whether to enable LoRA, e.g., true # + # - keda: (optional, map) KEDA autoscaling configuration for this model deployment. Requires KEDA to be installed in the cluster. 
+  #   - enabled: (optional, bool) Whether to enable KEDA autoscaling for this model, e.g., true
+  #   - minReplicaCount: (optional, int) Minimum number of replicas (supports 0 for scale-to-zero), e.g., 1
+  #   - maxReplicaCount: (optional, int) Maximum number of replicas, e.g., 5
+  #   - pollingInterval: (optional, int) How often KEDA should check the metrics (in seconds), e.g., 15
+  #   - cooldownPeriod: (optional, int) How long to wait after the last trigger reports inactive before scaling down (in seconds), e.g., 360
+  #   - idleReplicaCount: (optional, int) Number of replicas to scale to when no triggers are active, e.g., 0
+  #   - initialCooldownPeriod: (optional, int) Initial cooldown period in seconds before scaling down after creation, e.g., 60
+  #   - fallback: (optional, map) Fallback configuration when scaler fails
+  #     - failureThreshold: (int) Number of consecutive failures before fallback, e.g., 3
+  #     - replicas: (int) Number of replicas to scale to in fallback, e.g., 2
+  #   - triggers: (optional, list) List of KEDA trigger configurations
+  #     - type: (string) Trigger type, e.g., "prometheus"
+  #     - metadata: (map) Trigger metadata
+  #       - serverAddress: (string) Prometheus server address, e.g., "http://prometheus-operated.monitoring.svc:9090"
+  #       - metricName: (string) Name of the metric to monitor, e.g., "vllm:num_requests_waiting"
+  #       - query: (string) Prometheus query to fetch the metric, e.g., "vllm:num_requests_waiting"
+  #       - threshold: (string) Threshold value that triggers scaling, e.g., "5"
+  #   - advanced: (optional, map) Advanced KEDA configuration
+  #     - restoreToOriginalReplicaCount: (optional, bool) Restore original replica count when ScaledObject is deleted, e.g., false
+  #     - horizontalPodAutoscalerConfig: (optional, map) HPA-specific configuration
+  #       - name: (optional, string) Custom name for the HPA resource, default: "keda-hpa-{scaled-object-name}"
+  #       - behavior: (optional, map) HPA scaling behavior configuration, see https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/
+  #     - scalingModifiers: (optional, map) Scaling modifiers for composite metrics
+  #       - target: (string) Target value for the composed metric
+  #       - activationTarget: (optional, string) Activation target for the composed metric
+  #       - metricType: (optional, string) Metric type (AverageValue or Value), default: "AverageValue"
+  #       - formula: (string) Formula to compose metrics together
+  #
   # Example:
   #   vllmApiKey: "vllm_xxxxxxxxxxxxx"
   #   modelSpec:
@@ -177,6 +206,21 @@ servingEngineSpec:
   #           operator: "In"
   #           values:
   #             - "NVIDIA-RTX-A6000"
+  #
+  #     keda:
+  #       enabled: true
+  #       minReplicaCount: 1
+  #       maxReplicaCount: 3
+  #       pollingInterval: 15
+  #       cooldownPeriod: 360
+  #       triggers:
+  #         - type: prometheus
+  #           metadata:
+  #             serverAddress: http://prometheus-operated.monitoring.svc:9090
+  #             metricName: vllm:num_requests_waiting
+  #             query: vllm:num_requests_waiting
+  #             threshold: '5'
+
   #   extraVolumes:
   #     - name: dev-fuse