
Commit 66a2615

Added documentation on metrics trade-off and combinations
1 parent 8659bbb commit 66a2615


ai/vllm-deployment/hpa/README.md

Lines changed: 23 additions & 0 deletions
@@ -72,3 +72,26 @@ helm install prometheus prometheus-community/kube-prometheus-stack --namespace m
## II. HPA for vLLM AI Inference Server using NVidia GPU metrics

[vLLM AI Inference Server HPA with GPU metrics](./gpu-hpa.md)

### Choosing the Right Metric: Trade-offs and Combining Metrics

This project provides two methods for autoscaling: one based on the number of running requests (`vllm:num_requests_running`) and the other on GPU utilization (`dcgm_fi_dev_gpu_util`). Each has its own advantages, and they can be combined for a more robust scaling strategy.

#### **Trade-offs**

* **Number of Running Requests (Application-Level Metric):**
    * **Pros:** This is a direct measure of the application's current workload. It is highly responsive to sudden changes in traffic, making it ideal for latency-sensitive applications. Scaling decisions are based on the actual number of requests being processed, which can be a more accurate predictor of future load than hardware utilization alone.
    * **Cons:** This metric may not always correlate directly with resource consumption. For example, a few computationally expensive requests could saturate the GPU, while a large number of simple requests might not. If the application has issues reporting this metric, the HPA will not be able to scale the deployment correctly.

* **GPU Utilization (Hardware-Level Metric):**
    * **Pros:** This provides a direct measurement of how busy the underlying hardware is. It is a reliable indicator of resource saturation and is useful for optimizing costs by scaling down when the GPU is underutilized.
    * **Cons:** GPU utilization can be a lagging indicator. By the time utilization is high, the application's latency may have already increased. It also does not distinguish between a single, intensive request and multiple, less demanding ones.
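As a concrete reference for the application-level option, the following is a minimal sketch of an HPA driven only by `vllm:num_requests_running`. It assumes the metric is exposed to the HPA through the custom metrics API (for example via Prometheus Adapter); the resource names, replica bounds, and target value are illustrative assumptions, not this project's actual manifests.

```yaml
# Minimal sketch, not this project's actual manifest: assumes
# vllm:num_requests_running is exposed through the custom metrics API
# (e.g. by Prometheus Adapter) under this name; adapter rules may rename it.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-requests-hpa        # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-server            # hypothetical vLLM Deployment name
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm:num_requests_running
        target:
          type: AverageValue
          averageValue: "5"      # target average running requests per pod
```

Swapping the `metrics` entry for `dcgm_fi_dev_gpu_util` with a percentage target would give the hardware-level variant; the GPU-metrics guide linked above covers that approach.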
#### **Combining Metrics for Robustness**

For the most robust autoscaling, you can configure the HPA to use multiple metrics. For example, you could scale up when *either* the number of running requests exceeds a threshold *or* GPU utilization spikes. The HPA scales the deployment up if any metric crosses its defined threshold, but it scales down only when *all* metrics are below their target values (respecting the scale-down stabilization window); an example manifest sketch follows the list below.

This combined approach provides several benefits:

- **Proactive Scaling:** The HPA can scale up quickly in response to an increase in running requests, preventing latency spikes.
- **Resource Protection:** It can also scale up if a small number of requests are consuming a large amount of GPU resources, preventing the server from becoming overloaded.
- **Cost-Effective Scale-Down:** The deployment only scales down when both the request load and GPU utilization are low, ensuring that resources are not removed prematurely.
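Under the same assumptions as the single-metric sketch above (both metrics exposed to the HPA as Pods metrics through the Prometheus Adapter; names, bounds, and targets are illustrative placeholders), a combined-metrics HPA might look like this, with a scale-down stabilization window so replicas are not removed prematurely:

```yaml
# Illustrative sketch only: adjust metric names, types, and targets to match
# the adapter rules and manifests actually used in this project.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-combined-hpa            # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-server                # hypothetical vLLM Deployment name
  minReplicas: 1
  maxReplicas: 4
  metrics:
    # Application-level metric: average running requests per pod.
    - type: Pods
      pods:
        metric:
          name: vllm:num_requests_running
        target:
          type: AverageValue
          averageValue: "5"
    # Hardware-level metric: average GPU utilization (percent) per pod.
    - type: Pods
      pods:
        metric:
          name: dcgm_fi_dev_gpu_util
        target:
          type: AverageValue
          averageValue: "80"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # wait 5 minutes before scaling down
```

The HPA computes a desired replica count for each metric independently and uses the highest of them, which is what produces the "scale up on any metric, scale down only when all agree" behavior described above.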
