# Horizontal Pod Autoscaling for an AI Inference Server

This exercise shows how to set up the infrastructure to automatically
scale an AI inference server using custom metrics (either server
or GPU metrics). It requires a running Prometheus instance,
preferably one managed by the Prometheus Operator. We assume
you already have the vLLM AI inference server from the
[exercise](../README.md) in the parent directory up and running.

## Architecture

The autoscaling solution works as follows:

1. The **vLLM Server** or the **NVIDIA DCGM Exporter** exposes raw metrics on a `/metrics` endpoint.
2. A **ServiceMonitor** resource declaratively specifies how Prometheus should discover and scrape these metrics.
3. The **Prometheus Operator** detects the `ServiceMonitor` and configures its managed **Prometheus Server** instance to begin scraping the metrics.
4. For GPU metrics, a **PrometheusRule** relabels the raw DCGM metrics, creating a new, HPA-compatible metric (see the sketch after the diagram).
5. The **Prometheus Adapter** queries the Prometheus Server for the processed metrics and exposes them through the Kubernetes custom metrics API.
6. The **Horizontal Pod Autoscaler (HPA)** controller queries the custom metrics API and compares the reported values to the targets defined in the `HorizontalPodAutoscaler` resource.
7. If a metric exceeds its target, the HPA scales up the `vllm-gemma-deployment`.

```mermaid
flowchart TD
    D("PrometheusRule (GPU Metric Only)")
    B("Prometheus Server")
    C("ServiceMonitor")
    subgraph subGraph0["Metrics Collection"]
        A["vLLM Server"]
        H["GPU DCGM Exporter"]
    end
    subgraph subGraph1["HPA Scaling Logic"]
        E("Prometheus Adapter")
        F("API Server (Custom Metrics)")
        G("HPA Controller")
    end
    B -- Scrapes Raw Metrics --> A
    B -- Scrapes Raw Metrics --> H
    C -- Configures Scrape --> B
    B -- Processes Raw Metrics via --> D
    D -- Creates Clean Metric in --> B
    E -- Exposes Custom Metrics API --> F
    E -- Queries Processed Metric --> B
    G -- Queries Custom Metric --> F
```

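
To make step 4 concrete, here is a rough sketch of what such a `PrometheusRule` could look like. The rule actually used in this exercise is covered in [gpu-hpa.md](./gpu-hpa.md); the resource name, the `release` label, and the DCGM source label names (`exported_pod`, `exported_namespace`) are assumptions that depend on how the DCGM exporter and the Prometheus stack are configured.

```yaml
# Sketch only -- the rule used in this exercise is described in gpu-hpa.md.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: dcgm-gpu-util-rules        # assumed name
  namespace: monitoring
  labels:
    release: prometheus            # assumed; must match the Operator's rule selector
spec:
  groups:
    - name: gpu.rules
      rules:
        # Record the raw DCGM metric under an HPA-friendly name, copying the
        # workload labels so the metric can be matched to the vLLM pods.
        # The source label names are assumptions and depend on the exporter setup.
        - record: dcgm_fi_dev_gpu_util
          expr: |
            label_replace(
              label_replace(DCGM_FI_DEV_GPU_UTIL, "pod", "$1", "exported_pod", "(.+)"),
              "namespace", "$1", "exported_namespace", "(.+)"
            )
```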

## Prerequisites

This guide assumes you have a running Kubernetes cluster and `kubectl` installed. The vLLM server will be deployed in the `vllm-example` namespace, and the Prometheus resources will be in the `monitoring` namespace. The HPA resources will be deployed to the `vllm-example` namespace by specifying the namespace on the command line.

> **Note on Cluster Permissions:** This exercise requires permissions to install components that run on the cluster nodes themselves. The Prometheus Operator and the NVIDIA DCGM Exporter both deploy DaemonSets that require privileged access to the nodes to collect metrics. For GKE users, this means a **GKE Standard** cluster is required, as GKE Autopilot's security model restricts this level of node access.

### Prometheus Operator Installation

The following commands will install the Prometheus Operator. It is recommended to install it in its own `monitoring` namespace.

```bash
# Add the Prometheus community Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts/
helm repo update

# Install the Prometheus Operator into the "monitoring" namespace
helm install prometheus prometheus-community/kube-prometheus-stack --namespace monitoring --create-namespace
```

**Note:** The default configuration of the Prometheus Operator only watches for `ServiceMonitor` resources within its own namespace. The `vllm-service-monitor.yaml` is configured to be in the `monitoring` namespace and watch for services in the `vllm-example` namespace, so no extra configuration is needed.

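
For orientation, a `ServiceMonitor` following that layout looks roughly like the sketch below; it is not the verbatim content of `vllm-service-monitor.yaml`. The Service label `app: vllm-gemma`, the port name `metrics`, and the scrape interval are assumptions to adjust to your deployment, while the `release: prometheus` label is what the kube-prometheus-stack's default selector typically expects when the chart is installed with the release name `prometheus`.

```yaml
# Sketch only -- the real manifest is vllm-service-monitor.yaml in this exercise.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm-service-monitor
  namespace: monitoring          # lives alongside the Prometheus Operator
  labels:
    release: prometheus          # assumed; lets the chart's default selector find it
spec:
  namespaceSelector:
    matchNames:
      - vllm-example             # discover Services in the vLLM namespace
  selector:
    matchLabels:
      app: vllm-gemma            # assumed label on the vLLM Service
  endpoints:
    - port: metrics              # assumed name of the port serving /metrics
      path: /metrics
      interval: 15s
```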

## I. HPA for vLLM AI Inference Server using vLLM metrics

[vLLM AI Inference Server HPA](./vllm-hpa.md)

## II. HPA for vLLM AI Inference Server using NVIDIA GPU metrics

[vLLM AI Inference Server HPA with GPU metrics](./gpu-hpa.md)

### Choosing the Right Metric: Trade-offs and Combining Metrics

This project provides two methods for autoscaling: one based on the number of running requests (`vllm:num_requests_running`) and the other on GPU utilization (`dcgm_fi_dev_gpu_util`). Each has its own advantages, and they can be combined for a more robust scaling strategy.

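
Whichever metric you scale on, the Prometheus Adapter needs a rule that associates the Prometheus series with Kubernetes pods and exposes it through the custom metrics API. The snippet below is a hedged sketch of such a rule for the vLLM metric, written in the adapter's rules configuration format; the exposed name `vllm_num_requests_running` (dropping the colon from the raw series name is a common convention) and the label-to-resource mapping are assumptions and may differ from the configuration used in the linked exercises.

```yaml
# Sketch of a Prometheus Adapter rule (adapter config format); names and
# label mappings are assumptions.
rules:
  - seriesQuery: 'vllm:num_requests_running{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "^vllm:num_requests_running$"
      as: "vllm_num_requests_running"   # assumed exposed name, without the colon
    metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```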

#### **Trade-offs**

* **Number of Running Requests (Application-Level Metric):**
  * **Pros:** This is a direct measure of the application's current workload. It is highly responsive to sudden changes in traffic, making it ideal for latency-sensitive applications. Scaling decisions are based on the actual number of requests being processed, which can be a more accurate predictor of future load than hardware utilization alone.
  * **Cons:** This metric may not always correlate directly with resource consumption. For example, a few computationally expensive requests could saturate the GPU, while a large number of simple requests might not. If the application has issues reporting this metric, the HPA will not be able to scale the deployment correctly.

* **GPU Utilization (Hardware-Level Metric):**
  * **Pros:** This provides a direct measurement of how busy the underlying hardware is. It is a reliable indicator of resource saturation and is useful for optimizing costs by scaling down when the GPU is underutilized.
  * **Cons:** GPU utilization can be a lagging indicator. By the time utilization is high, the application's latency may have already increased. It also does not distinguish between a single, intensive request and multiple, less demanding ones.

#### **Combining Metrics for Robustness**

For the most robust autoscaling, you can configure the HPA to use multiple metrics. For example, you could scale up if *either* the number of running requests exceeds a certain threshold *or* GPU utilization spikes. The HPA scales the deployment up if any metric crosses its defined threshold, but only scales down when *all* metrics are below their target values (respecting the scale-down stabilization window).

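
As an illustrative sketch (assuming both metrics are already exposed per pod through the custom metrics API under the names shown), a combined `HorizontalPodAutoscaler` could look like this; the resource name, thresholds, replica bounds, and stabilization window are assumptions rather than the values used in the linked exercises.

```yaml
# Sketch only -- names, thresholds, and replica bounds are illustrative assumptions.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-gemma-hpa                    # assumed name
  namespace: vllm-example
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-gemma-deployment
  minReplicas: 1
  maxReplicas: 4
  metrics:
    # Application-level signal: average running requests per pod.
    - type: Pods
      pods:
        metric:
          name: vllm_num_requests_running # assumed adapter-exposed name
        target:
          type: AverageValue
          averageValue: "5"
    # Hardware-level signal: average GPU utilization per pod (percent).
    - type: Pods
      pods:
        metric:
          name: dcgm_fi_dev_gpu_util
        target:
          type: AverageValue
          averageValue: "80"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300     # scale down only after sustained low load
```

The HPA computes a desired replica count for each metric independently and acts on the largest one, which is what produces the scale-up-on-either, scale-down-on-both behavior described above.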

This combined approach provides several benefits:
- **Proactive Scaling:** The HPA can scale up quickly in response to an increase in running requests, preventing latency spikes.
- **Resource Protection:** It can also scale up if a small number of requests are consuming a large amount of GPU resources, preventing the server from becoming overloaded.
- **Cost-Effective Scale-Down:** The deployment will only scale down when both the request load and GPU utilization are low, ensuring that resources are not removed prematurely.