
Conversation

@seans3 (Contributor) commented Aug 26, 2025

This PR extends the existing vLLM server example by introducing a complete Horizontal Pod Autoscaling (HPA) solution, contained within a new hpa/ directory. This provides a production-ready pattern for automatically scaling the AI inference server based on real-time demand.

Two distinct autoscaling methods are provided:

  • By vLLM Server Metrics: Scales based on the number of concurrent inference requests.
  • By NVIDIA GPU Utilization: Scales based on hardware-level GPU utilization.

How It Works

The solution uses a standard Prometheus-based monitoring pipeline. The Prometheus Operator scrapes metrics from either the vLLM server or the NVIDIA DCGM exporter. For GPU metrics, a PrometheusRule relabels the raw data, making it compatible with the HPA. The Prometheus Adapter then serves these metrics through the Kubernetes Custom Metrics API, which the HPA controller consumes to drive scaling decisions.
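
As a rough illustration of that last step, an HPA that consumes such a custom metric could look like the sketch below (the deployment, namespace, and metric names are placeholders rather than the exact ones used in this PR):

```yaml
# Hypothetical sketch: an HPA driven by a per-pod custom metric served by the
# Prometheus Adapter through the Custom Metrics API.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
  namespace: vllm
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-server          # placeholder deployment name
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm_num_requests_running   # placeholder processed metric name from the adapter
        target:
          type: AverageValue
          averageValue: "20"                # scale out when average concurrent requests per pod exceeds 20
```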

What's New

  • hpa/ directory: Contains all new manifests and documentation.
  • Two HPA Examples: Includes horizontal-pod-autoscaler.yaml for vLLM metrics and gpu-horizontal-pod-autoscaler.yaml for GPU metrics.
  • Step-by-Step Guides: vllm-hpa.md and gpu-hpa.md provide detailed instructions for each scaling method, including multi-cloud support for the GPU example.
  • Load Testing Script: request-looper.sh is included to easily generate load and test the autoscaling functionality (a minimal sketch follows below).
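
The script in the PR is the authoritative version; a minimal sketch of this kind of load loop, with a placeholder endpoint and model name, might be:

```bash
#!/usr/bin/env bash
# Illustrative load loop; endpoint URL and model name are placeholders, not
# the values used by the PR's request-looper.sh.
ENDPOINT="${ENDPOINT:-http://localhost:8000/v1/completions}"
while true; do
  curl -s -o /dev/null -X POST "$ENDPOINT" \
    -H "Content-Type: application/json" \
    -d '{"model": "my-model", "prompt": "Hello", "max_tokens": 64}'
  sleep 0.2   # pause briefly between requests to control the request rate
done
```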

How to Test

Detailed instructions and verification steps are available in the new guides:

  • For vLLM metrics: hpa/vllm-hpa.md
  • For GPU metrics: hpa/gpu-hpa.md

@k8s-ci-robot added the cncf-cla: yes label (indicates the PR's author has signed the CNCF CLA) on Aug 26, 2025
@k8s-ci-robot requested review from kow3ns and soltysh on August 26, 2025 at 23:28
@k8s-ci-robot added the size/XL label (denotes a PR that changes 500-999 lines, ignoring generated files) on Aug 26, 2025
@seans3 (Contributor, Author) commented Aug 26, 2025

/assign @janetkuo

@k8s-ci-robot commented:
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: seans3
Once this PR has been reviewed and has the lgtm label, please ask for approval from janetkuo. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@janetkuo (Member) left a comment:


Thanks for adding this practical example for autoscaling!


## Prerequisites

This guide assumes you have a running Kubernetes cluster and `kubectl` installed. The vLLM server will be deployed in the `default` namespace, and the Prometheus and HPA resources will be in the `monitoring` namespace.
Member:

Just noticed that the default namespace is used for vLLM. Kubernetes best practice is to avoid deploying applications in the default namespace. Using it for actual workloads can lead to significant operational and security challenges as cluster usage grows.
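
For illustration, moving the workload out of default could start with a dedicated namespace along these lines (the vllm namespace name is a placeholder, not necessarily what the PR ends up using):

```yaml
# Hypothetical dedicated namespace for the inference workload.
apiVersion: v1
kind: Namespace
metadata:
  name: vllm
```

The Deployment, Service, and HPA manifests would then set `metadata.namespace: vllm` (or be applied with `kubectl -n vllm`).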

Author (@seans3):

This PR is already pretty big; should I change the vLLM deployment in a separate PR to use a namespace (then return to this)? Or should I fix it in this PR?

Author (@seans3) commented Sep 18, 2025:

I have updated the parent directory vLLM deployment to install to a non-default namespace. I have updated all the configuration and instructions in this PR to reflect that change. Please have a look.

┌────────────────┐
│ PrometheusRule │
└────────────────┘
Member:

Some suggestions for the diagram to make it more clear:

  • Use numbered steps and arrow directions to guide the user through the precise data flow (scrape, evaluate, record, query, scale) from start to finish.
  • The flow hides the crucial transformation step where a raw metric is converted into a processed metric. Recommend clearly labeling the initial scrape with the raw DCGM metric name and the query from the adapter with the new, processed metric name.
  • The PrometheusRule is shown as a final step in the "GPU Path Only". However, the PrometheusRule is not a destination for data; it's a configuration that tells the Prometheus Server how to perform an internal calculation.
  • Include the Kubernetes API Server between the adapter and HPA.

Member:

Just noticed that GitHub supports mermaid diagrams in markdown: https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/creating-diagrams#creating-mermaid-diagrams
Might be easier to edit than ASCII diagrams.

Author (@seans3):

Great call. I've added a mermaid diagram, and I believe it now addresses the issues you raised. Please let me know what you think.
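
For reference, a mermaid diagram along the lines suggested above might look roughly like this sketch (component labels and step numbering are illustrative; the diagram actually added to the PR may differ):

```mermaid
flowchart LR
    EXP["DCGM exporter<br/>raw metric DCGM_FI_DEV_GPU_UTIL"] -->|1 scrape| PROM["Prometheus Server"]
    RULE["PrometheusRule<br/>recording rule"] -.->|2 records processed metric| PROM
    ADAPTER["Prometheus Adapter"] -->|3 queries processed metric| PROM
    ADAPTER -->|4 serves Custom Metrics API| APISRV["Kubernetes API Server"]
    HPA["HPA controller"] -->|5 reads metric| APISRV
    HPA -->|6 scales| DEPLOY["vLLM Deployment"]
```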


## II. HPA for vLLM AI Inference Server using NVIDIA GPU metrics

[vLLM AI Inference Server HPA with GPU metrics](./gpu-hpa.md)
Member:

We could discuss the trade-offs between these 2 metrics options here, and how to combine multiple metrics for robustness (e.g., scale up if either the number of running requests exceeds a certain threshold or GPU utilization spikes.)

Author (@seans3):

I've added significant documentation now addressing the metric trade-offs as well as the combination of multiple metrics. Please let me know what you think.
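
As a hedged sketch of what combining the two signals could look like, an HPA can list both metrics; the controller computes a desired replica count for each metric and acts on the largest, so the deployment scales up when either threshold is crossed (metric and object names below are placeholders, except gpu_utilization_percent, which matches the processed name used elsewhere in this PR):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa-combined
  namespace: vllm
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-server
  minReplicas: 1
  maxReplicas: 4
  metrics:
    # Scale on concurrent inference requests reported by the vLLM server.
    - type: Pods
      pods:
        metric:
          name: vllm_num_requests_running   # placeholder metric name
        target:
          type: AverageValue
          averageValue: "20"
    # Also scale on hardware-level GPU utilization from the DCGM exporter.
    - type: Pods
      pods:
        metric:
          name: gpu_utilization_percent
        target:
          type: AverageValue
          averageValue: "80"
```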

          averageValue: 20
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
Member:

We can discuss the trade-offs here, e.g. the risk of over-scaling versus highly volatile workloads where immediate scale-up is critical to maintain performance and responsiveness.

Author (@seans3):

Added comments in the YAML describing the trade-offs for the scale-up and scale-down behavior (and also for the scale-down behavior of the vLLM HPA). Please let me know what you think.
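
A sketch of the kind of commented behavior block being described, with placeholder values rather than the PR's exact settings:

```yaml
behavior:
  scaleUp:
    # A window of 0 reacts immediately to load spikes, which keeps latency low
    # for volatile workloads but risks over-scaling on short bursts.
    stabilizationWindowSeconds: 0
    policies:
      - type: Pods
        value: 2           # add at most 2 pods per period
        periodSeconds: 60
  scaleDown:
    # A longer window avoids flapping and repeated cold starts, at the cost of
    # holding extra (expensive GPU) capacity for a few minutes after load drops.
    stabilizationWindowSeconds: 300
    policies:
      - type: Percent
        value: 50          # remove at most half of the current replicas per period
        periodSeconds: 60
```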

  # the labels on the 'gke-managed-dcgm-exporter' Service.
  selector:
    matchLabels:
      app.kubernetes.io/name: gke-managed-dcgm-exporter
Member:

Does the label value need to be GKE specific? Can this be more generic?

Author (@seans3):

GKE gives the user this DCGM exporter for free, since it's always present on NVIDIA GPU nodes. But for the other two major cloud providers, the user has to install it. Trying to install it manually in GKE, however, causes conflicts with the exporter already on the nodes. I've added comments about how other cloud providers need to install it, and I've called out the GKE-specific namespace/labels. Please let me know what you think.
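
For a self-installed (non-GKE) exporter, the corresponding ServiceMonitor might look roughly like this; the label value and port name depend on how the exporter chart was installed and are assumptions here:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: dcgm-exporter   # adjust to match your exporter Service's labels
  namespaceSelector:
    matchNames:
      - monitoring                            # namespace where the exporter Service lives
  endpoints:
    - port: metrics                           # adjust to the Service's metrics port name
      interval: 30s
```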


#### Google Kubernetes Engine (GKE)

On GKE, the DCGM exporter is a managed add-on that is automatically deployed and managed by the system. It runs in the `gke-managed-system` namespace.
Member:

How is the managed DCGM installed? IIUC, it's part of the managed metrics feature, which means that it won't require installation of Prometheus Operator.

I suggest separating platform-specific setup and managed vs. custom components more clearly.

Also suggest adding gke to the file names of the files that are GKE-specific, to make it easier to see what's platform-specific at a glance.

@janetkuo (Member) commented Sep 29, 2025:

I'd imagine something like:


> **Note on Cluster Permissions:** This exercise requires permissions to install components that run on the cluster nodes themselves. The Prometheus Operator and the NVIDIA DCGM Exporter both deploy DaemonSets that require privileged access to the nodes to collect metrics. For GKE users, this means a **GKE Standard** cluster is required, as GKE Autopilot's security model restricts this level of node access.

### Prometheus Operator Installation
Member:

This step is already covered in the sub sections. Do we need to repeat it here?

# It takes 'gke_dcgm_fi_dev_gpu_util_relabelled' (which now has the correct
# pod and namespace labels) and renames it to the final, clean name
# 'gpu_utilization_percent' for the HPA to use.
- seriesQuery: 'gke_dcgm_fi_dev_gpu_util_relabelled'
Member:

Given that you're sharing the same adapter for two scenarios, would you make it more clear in the markdown?
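
A sketch of how a single Prometheus Adapter config might carry both scenarios as two rules; the raw series names and queries are assumptions here, while gpu_utilization_percent matches the processed name shown above:

```yaml
rules:
  # vLLM path: expose the request-concurrency series as a per-pod custom metric.
  - seriesQuery: 'vllm:num_requests_running'   # assumed raw series name; check the vLLM /metrics output
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "^(.*)$"
      as: "vllm_num_requests_running"
    metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
  # GPU path: expose the relabelled DCGM series under the clean name the HPA uses.
  - seriesQuery: 'gke_dcgm_fi_dev_gpu_util_relabelled'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "^(.*)$"
      as: "gpu_utilization_percent"
    metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```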

```bash
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm install dcgm-exporter nvdp/dcgm-exporter --namespace monitoring
```
Member:

This is slightly different from the https://github.com/NVIDIA/dcgm-exporter README.
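
For comparison, the upstream README installs the exporter from the dcgm-exporter project's own chart repository rather than the device-plugin repository; roughly like the following (the repository URL and chart name should be checked against the current README):

```bash
# Approximate upstream install steps from the NVIDIA/dcgm-exporter README.
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install --generate-name gpu-helm-charts/dcgm-exporter
```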

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts/
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack --namespace monitoring --create-namespace
```
Member:

Is this step still needed if using managed metrics in GKE?

    - name: dcgm.rules
      rules:
        # 'record' specifies the name of the new metric to be created.
        - record: gke_dcgm_fi_dev_gpu_util_relabelled
Member:

Does the name need to have gke in it? This config seems generic.

# GKE-SPECIFIC: This label matches the Service for GKE's managed DCGM
# exporter. If you are using a different DCGM deployment, you must
# update this label to match the label of the corresponding Service.
app.kubernetes.io/name: gke-managed-dcgm-exporter
Member:

Given that this guide includes installing DCGM manually, would you clarify what label should be used here?

# GKE-SPECIFIC: This is the namespace for GKE's managed DCGM exporter.
# For other environments, this should be the namespace where you have
# deployed the DCGM exporter Service.
- gke-managed-system
Member:

Given that we use the monitoring namespace in the example for other platforms, suggest adding that here.

  selector:
    matchLabels:
      # GKE-SPECIFIC: This label matches the Service for GKE's managed DCGM
      # exporter. If you are using a different DCGM deployment, you must
Member:

nit: DCGM is a daemon

@seans3 (Contributor, Author) commented Oct 1, 2025

/assign @justinsb

@seans3 (Contributor, Author) commented Oct 1, 2025

/cc @justinsb

@k8s-ci-robot requested a review from justinsb on October 1, 2025 at 18:23