## II. HPA for vLLM AI Inference Server using NVidia GPU metrics

[vLLM AI Inference Server HPA with GPU metrics](./gpu-hpa.md)

### Choosing the Right Metric: Trade-offs and Combining Metrics

This project provides two methods for autoscaling: one based on the number of running requests (`vllm:num_requests_running`) and the other on GPU utilization (`dcgm_fi_dev_gpu_util`). Each has its own advantages, and they can be combined for a more robust scaling strategy.
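For concreteness, here is a rough sketch of how each method typically shows up as an entry in the `metrics` list of an `autoscaling/v2` HorizontalPodAutoscaler. The metric names, metric type, and target values below are illustrative assumptions (they depend on how prometheus-adapter exposes the underlying Prometheus series), not this project's exact manifests:

```yaml
# Sketch only: metric names and targets depend on your prometheus-adapter rules.

# Method 1: scale on the application-level request count reported by vLLM.
metrics:
  - type: Pods
    pods:
      metric:
        name: vllm:num_requests_running
      target:
        type: AverageValue
        averageValue: "8"     # e.g. aim for ~8 in-flight requests per replica
---
# Method 2: scale on the hardware-level GPU utilization reported by DCGM.
metrics:
  - type: Pods
    pods:
      metric:
        name: dcgm_fi_dev_gpu_util
      target:
        type: AverageValue
        averageValue: "80"    # e.g. aim for ~80% average GPU utilization
```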
#### **Trade-offs**

* **Number of Running Requests (Application-Level Metric):**
    * **Pros:** This is a direct measure of the application's current workload. It is highly responsive to sudden changes in traffic, making it ideal for latency-sensitive applications. Scaling decisions are based on the actual number of requests being processed, which can be a more accurate predictor of future load than hardware utilization alone.
    * **Cons:** This metric may not always correlate directly with resource consumption. For example, a few computationally expensive requests could saturate the GPU, while a large number of simple requests might not. If the application has issues reporting this metric, the HPA will not be able to scale the deployment correctly.

* **GPU Utilization (Hardware-Level Metric):**
    * **Pros:** This provides a direct measurement of how busy the underlying hardware is. It is a reliable indicator of resource saturation and is useful for optimizing costs by scaling down when the GPU is underutilized.
    * **Cons:** GPU utilization can be a lagging indicator. By the time utilization is high, the application's latency may have already increased. It also does not distinguish between a single, intensive request and multiple, less demanding ones.
#### **Combining Metrics for Robustness**

For the most robust autoscaling, you can configure the HPA to use multiple metrics. For example, you could scale up if *either* the number of running requests exceeds a certain threshold *or* GPU utilization spikes. The HPA will scale the deployment up if any of the metrics cross their defined thresholds, but it will only scale down when *all* metrics are below their target values (respecting the scale-down stabilization window).
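A minimal sketch of such a combined HPA is shown below. It assumes a Deployment named `vllm-server` in the `vllm` namespace and that both metrics are exposed to the HPA as `Pods` metrics via prometheus-adapter; the names and target values are illustrative, not this project's exact configuration:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-combined-hpa        # hypothetical name
  namespace: vllm                # assumed namespace
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-server            # assumed Deployment name
  minReplicas: 1
  maxReplicas: 4
  metrics:
    # Application-level: average in-flight requests per replica.
    - type: Pods
      pods:
        metric:
          name: vllm:num_requests_running
        target:
          type: AverageValue
          averageValue: "8"
    # Hardware-level: average GPU utilization (percent) per replica.
    - type: Pods
      pods:
        metric:
          name: dcgm_fi_dev_gpu_util
        target:
          type: AverageValue
          averageValue: "80"
  behavior:
    scaleDown:
      # Only scale down after the desired replica count (computed from the
      # higher of the two metrics) has stayed low for the entire window.
      stabilizationWindowSeconds: 300
```

Because the HPA always acts on the highest replica count suggested by any configured metric, either metric can trigger a scale-up on its own, while a scale-down requires both to be below their targets.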
This combined approach provides several benefits:

- **Proactive Scaling:** The HPA can scale up quickly in response to an increase in running requests, preventing latency spikes.
- **Resource Protection:** It can also scale up if a small number of requests are consuming a large amount of GPU resources, preventing the server from becoming overloaded.
- **Cost-Effective Scale-Down:** The deployment will only scale down when both the request load and GPU utilization are low, ensuring that resources are not removed prematurely.