
Commit fa77f06

committed
hpa recipe for ai inference using gpu custom metrics
1 parent fbb2da7 commit fa77f06

12 files changed: +994 -0 lines changed

AI/vllm-deployment/hpa/.gitignore

Lines changed: 1 addition & 0 deletions
GEMINI.md

AI/vllm-deployment/hpa/README.md

Lines changed: 97 additions & 0 deletions
# Horizontal Pod Autoscaling for an AI Inference Server

This exercise shows how to set up the infrastructure to automatically
scale an AI inference server using custom metrics (either server
or GPU metrics). It requires a running Prometheus instance,
preferably managed by the Prometheus Operator. We assume you already
have the vLLM AI inference server from the parent directory's
[exercise](../README.md) up and running.

## Architecture

The autoscaling solution works as follows:

1. The **vLLM Server** or the **NVIDIA DCGM Exporter** exposes raw metrics on a `/metrics` endpoint.
2. A **ServiceMonitor** resource declaratively specifies how Prometheus should discover and scrape these metrics.
3. The **Prometheus Operator** detects the `ServiceMonitor` and configures its managed **Prometheus Server** instance to begin scraping the metrics.
4. For GPU metrics, a **PrometheusRule** is used to relabel the raw DCGM metrics, creating a new, HPA-compatible metric.
5. The **Prometheus Adapter** queries the Prometheus Server for the processed metrics and exposes them through the Kubernetes custom metrics API.
6. The **Horizontal Pod Autoscaler (HPA)** controller queries the custom metrics API for the metrics and compares them to the target values defined in the `HorizontalPodAutoscaler` resource.
7. If the metrics exceed the target, the HPA scales up the `vllm-gemma-deployment`.

```mermaid
flowchart TD
    D("PrometheusRule (GPU Metric Only)")
    B("Prometheus Server")
    C("ServiceMonitor")
    subgraph subGraph0["Metrics Collection"]
        A["vLLM Server"]
        H["GPU DCGM Exporter"]
    end
    subgraph subGraph1["HPA Scaling Logic"]
        E("Prometheus Adapter")
        F("API Server (Custom Metrics)")
        G("HPA Controller")
    end
    B -- Scrapes Raw Metrics --> A
    B -- Scrapes Raw Metrics --> H
    C -- Configures Scrape <--> B
    B -- Processes Raw Metrics via --> D
    D -- Creates Clean Metric in --> B
    F -- Custom Metrics API <--> E
    E -- Queries Processed Metric <--> B
    G -- Queries Custom Metric --> F
```
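
To make step 5 concrete: the Prometheus Adapter is configured with rules that map a Prometheus series onto the custom metrics API. The sketch below shows what such a rule could look like for the `gpu_utilization_percent` metric used by the GPU HPA in this exercise. It is an assumption-laden sketch, not the adapter configuration actually shipped with this recipe; the query and label handling are illustrative.

```yaml
# Illustrative Prometheus Adapter rule (not this recipe's actual config).
rules:
  - seriesQuery: 'gpu_utilization_percent{namespace!="",pod!=""}'
    resources:
      # Map Prometheus labels back onto Kubernetes objects so the HPA
      # can request the metric per pod.
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "^gpu_utilization_percent$"
      as: "gpu_utilization_percent"
    # How the adapter computes the value returned for each pod.
    metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```

Once a rule like this is loaded, the metric becomes visible through the `custom.metrics.k8s.io` API, which is where the HPA controller reads it.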

## Prerequisites

This guide assumes you have a running Kubernetes cluster and `kubectl` installed. The vLLM server will be deployed in the `vllm-example` namespace, and the Prometheus resources will be in the `monitoring` namespace. The HPA resources will be deployed to the `vllm-example` namespace by specifying the namespace on the command line.

> **Note on Cluster Permissions:** This exercise requires permissions to install components that run on the cluster nodes themselves. The Prometheus Operator and the NVIDIA DCGM Exporter both deploy DaemonSets that require privileged access to the nodes to collect metrics. For GKE users, this means a **GKE Standard** cluster is required, as GKE Autopilot's security model restricts this level of node access.

### Prometheus Operator Installation

The following commands will install the Prometheus Operator. It is recommended to install it in its own `monitoring` namespace.

```bash
# Add the Prometheus community Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts/
helm repo update

# Install the Prometheus Operator into the "monitoring" namespace
helm install prometheus prometheus-community/kube-prometheus-stack --namespace monitoring --create-namespace
```

**Note:** The default configuration of the Prometheus Operator only watches for `ServiceMonitor` resources within its own namespace. The `vllm-service-monitor.yaml` is configured to be in the `monitoring` namespace and watch for services in the `vllm-example` namespace, so no extra configuration is needed.
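
For orientation, a `ServiceMonitor` following that layout might look roughly like the sketch below. The label selector, port name, and scrape interval are placeholders, and the actual `vllm-service-monitor.yaml` shipped with this exercise may differ.

```yaml
# Hypothetical sketch only -- not the manifest included in this commit.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm-service-monitor
  namespace: monitoring              # lives alongside the Prometheus Operator
  labels:
    release: prometheus              # assumed label so the operator selects it
spec:
  namespaceSelector:
    matchNames:
      - vllm-example                 # namespace where the vLLM Service runs
  selector:
    matchLabels:
      app: gemma-server              # placeholder; must match the vLLM Service's labels
  endpoints:
    - port: http                     # placeholder; the named Service port exposing /metrics
      path: /metrics
      interval: 15s
```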

## I. HPA for vLLM AI Inference Server using vLLM metrics

[vLLM AI Inference Server HPA](./vllm-hpa.md)

## II. HPA for vLLM AI Inference Server using NVIDIA GPU metrics

[vLLM AI Inference Server HPA with GPU metrics](./gpu-hpa.md)

### Choosing the Right Metric: Trade-offs and Combining Metrics

This project provides two methods for autoscaling: one based on the number of running requests (`vllm:num_requests_running`) and the other on GPU utilization (`dcgm_fi_dev_gpu_util`). Each has its own advantages, and they can be combined for a more robust scaling strategy.

#### **Trade-offs**

* **Number of Running Requests (Application-Level Metric):**
    * **Pros:** This is a direct measure of the application's current workload. It is highly responsive to sudden changes in traffic, making it ideal for latency-sensitive applications. Scaling decisions are based on the actual number of requests being processed, which can be a more accurate predictor of future load than hardware utilization alone.
    * **Cons:** This metric may not always correlate directly with resource consumption. For example, a few computationally expensive requests could saturate the GPU, while a large number of simple requests might not. If the application has issues reporting this metric, the HPA will not be able to scale the deployment correctly.

* **GPU Utilization (Hardware-Level Metric):**
    * **Pros:** This provides a direct measurement of how busy the underlying hardware is. It is a reliable indicator of resource saturation and is useful for optimizing costs by scaling down when the GPU is underutilized.
    * **Cons:** GPU utilization can be a lagging indicator. By the time utilization is high, the application's latency may have already increased. It also does not distinguish between a single, intensive request and multiple, less demanding ones.

#### **Combining Metrics for Robustness**

For the most robust autoscaling, you can configure the HPA to use multiple metrics (a combined example is sketched at the end of this section). For example, you could scale up if *either* the number of running requests exceeds a certain threshold *or* GPU utilization spikes. The HPA will scale the deployment up if any of the metrics cross their defined thresholds, but it will only scale down when *all* metrics are below their target values (respecting the scale-down stabilization window).

This combined approach provides several benefits:

- **Proactive Scaling:** The HPA can scale up quickly in response to an increase in running requests, preventing latency spikes.
- **Resource Protection:** It can also scale up if a small number of requests are consuming a large amount of GPU resources, preventing the server from becoming overloaded.
- **Cost-Effective Scale-Down:** The deployment will only scale down when both the request load and GPU utilization are low, ensuring that resources are not removed prematurely.
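
A combined HPA could declare both signals side by side, as in the sketch below. The metric names follow the two recipes in this README, but the target values are illustrative, and the request-count metric name as it appears through the custom metrics API may differ depending on how the Prometheus Adapter renames it.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gemma-server-combined-hpa            # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-gemma-deployment
  minReplicas: 1
  maxReplicas: 5
  metrics:
    # Application-level signal: in-flight requests reported by vLLM.
    - type: Pods
      pods:
        metric:
          name: "vllm:num_requests_running"  # name as exposed by the adapter; may differ
        target:
          type: AverageValue
          averageValue: "4"                  # illustrative threshold
    # Hardware-level signal: per-pod GPU utilization from DCGM.
    - type: Pods
      pods:
        metric:
          name: gpu_utilization_percent
        target:
          type: AverageValue
          averageValue: "20"
```

With more than one entry under `metrics`, the HPA computes a desired replica count for each metric and acts on the largest, which is what produces the either/or scale-up behavior described above.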
Lines changed: 37 additions & 0 deletions
# This Service provides a stable network endpoint for the NVIDIA DCGM Exporter
# pods. The Prometheus Operator's ServiceMonitor will target this Service
# to discover and scrape the GPU metrics. This is especially important
# because the exporter pods are part of a DaemonSet, and their IPs can change.
#
# NOTE: This configuration is specific to GKE, which automatically deploys the
# DCGM exporter in the 'gke-managed-system' namespace. For other cloud
# providers or on-premise clusters, you would need to deploy your own DCGM
# exporter (e.g., via a Helm chart) and update this Service's 'namespace'
# and 'labels' to match your deployment.

apiVersion: v1
kind: Service
metadata:
  name: gke-managed-dcgm-exporter
  # GKE-SPECIFIC: GKE deploys its managed DCGM exporter in this namespace.
  # On other platforms, this would be the namespace where you deploy the exporter.
  namespace: gke-managed-system
  labels:
    # This label is critical. The ServiceMonitor uses this label to find this
    # specific Service. If the labels don't match, Prometheus will not be
    # able to discover the metrics endpoint.
    # GKE-SPECIFIC: This label is used by GKE's managed service. For a custom
    # deployment, you would use a more generic label like 'nvidia-dcgm-exporter'.
    app.kubernetes.io/name: gke-managed-dcgm-exporter
spec:
  selector:
    # This selector tells the Service which pods to route traffic to.
    # It must match the labels on the DCGM exporter pods.
    # GKE-SPECIFIC: This selector matches the labels on GKE's managed DCGM pods.
    app.kubernetes.io/name: gke-managed-dcgm-exporter
  ports:
    # The 'name' of this port is important. The ServiceMonitor will specifically
    # look for a port with this name to scrape metrics from.
    - name: metrics
      port: 9400
      targetPort: 9400
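
The `ServiceMonitor` that scrapes this Service is not included in this excerpt. As a rough sketch (the resource name, `release` label, and scrape interval are assumptions), it would select the Service by the label and named port called out in the comments above:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter-monitor          # illustrative name
  namespace: monitoring
  labels:
    release: prometheus                # assumed label so the operator selects it
spec:
  namespaceSelector:
    matchNames:
      - gke-managed-system             # namespace of the Service defined above
  selector:
    matchLabels:
      app.kubernetes.io/name: gke-managed-dcgm-exporter
  endpoints:
    - port: metrics                    # matches the named port on the Service
      interval: 15s
```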
Lines changed: 65 additions & 0 deletions
# This HorizontalPodAutoscaler (HPA) targets the vLLM deployment and scales
# it based on the average GPU utilization across all pods. It uses the
# custom metric 'gpu_utilization_percent', which is provided by the
# Prometheus Adapter.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gemma-server-gpu-hpa
spec:
  # scaleTargetRef points the HPA to the deployment it needs to scale.
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-gemma-deployment
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Pods
      pods:
        metric:
          # This is the custom metric that the HPA will query.
          # IMPORTANT: This name ('gpu_utilization_percent') is not the raw metric
          # from the DCGM exporter. It is the clean, renamed metric that is
          # exposed by the Prometheus Adapter. The names must match exactly.
          name: gpu_utilization_percent
        target:
          type: AverageValue
          # This is the target value for the metric. The HPA will add or remove
          # pods to keep the average GPU utilization across all pods at 20%.
          averageValue: 20
  behavior:
    scaleUp:
      # The stabilizationWindowSeconds is set to 0 to allow for immediate
      # scaling up. This is a trade-off:
      # - For highly volatile workloads, immediate scaling is critical to
      #   maintain performance and responsiveness.
      # - However, this also introduces a risk of over-scaling if the workload
      #   spikes are very brief. A non-zero value would make the scaling
      #   less sensitive to short-lived spikes, but could introduce latency
      #   if the load persists.
      stabilizationWindowSeconds: 0
      policies:
        - type: Pods
          value: 4
          periodSeconds: 15
        - type: Percent
          value: 100
          periodSeconds: 15
      selectPolicy: Max
    scaleDown:
      # The stabilizationWindowSeconds is set to 30 to prevent the HPA from
      # scaling down too aggressively. The controller applies the highest
      # scaling recommendation computed over the previous 30 seconds, so
      # replicas are only removed once a lower recommendation has persisted
      # for the whole window. This helps to smooth out the scaling behavior
      # and prevent "flapping" (rapidly scaling up and down). A larger value
      # makes the scaling more conservative, which can be useful for workloads
      # with fluctuating metrics, but it may also result in higher costs if
      # resources are not released quickly after a load decrease.
      stabilizationWindowSeconds: 30
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
      selectPolicy: Max
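
The `gpu_utilization_percent` metric consumed above has to exist in Prometheus first. Per the README's architecture section, a `PrometheusRule` derives it from the raw DCGM series; that manifest is not shown in this excerpt, but a minimal sketch could look like the following (the PromQL, labels, and rule names are illustrative and depend on how your DCGM exporter labels its series):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: dcgm-gpu-utilization           # illustrative name
  namespace: monitoring
  labels:
    release: prometheus                # assumed label so the operator loads the rule
spec:
  groups:
    - name: gpu-hpa.rules
      rules:
        # Record the raw DCGM utilization under an HPA-friendly name,
        # averaged per pod. Depending on the exporter, you may need
        # label_replace() to map exporter labels onto the workload pod.
        - record: gpu_utilization_percent
          expr: avg by (namespace, pod) (DCGM_FI_DEV_GPU_UTIL)
```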
