
Commit 9f9fe74

metrics: Add sample alert rules (#912)
1 parent c300d26 commit 9f9fe74

File tree: site-src/guides/metrics.md, tools/alerts/alert.yaml

2 files changed: +104, -1 lines


site-src/guides/metrics.md

Lines changed: 66 additions & 1 deletion
@@ -93,4 +93,69 @@ TOKEN=$(kubectl -n default get secret inference-gateway-sa-metrics-reader-secret
kubectl -n default port-forward inference-gateway-ext-proc-pod-name 9090
curl -H "Authorization: Bearer $TOKEN" localhost:9090/metrics
```

## Prometheus Alerts

This section describes how to configure Prometheus alerts using the collected metrics.

### Configure alerts

You can follow this [blog post](https://grafana.com/blog/2020/02/25/step-by-step-guide-to-setting-up-prometheus-alertmanager-with-slack-pagerduty-and-gmail/) for instructions on setting up alerts in your monitoring stack with Prometheus.

A template alert rule file is available at [alert.yaml](../../tools/alerts/alert.yaml). You can modify these rules and append them to your existing Prometheus deployment.
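
For example, here is a minimal sketch of a Prometheus configuration fragment that loads these rules and points at an Alertmanager instance. The rule file path and the Alertmanager target below are assumptions; adjust them to match your deployment:

```
# Load the sample alert rules; the path is an assumed install location.
rule_files:
  - /etc/prometheus/alert.yaml

# Send firing alerts to an Alertmanager; the target address is an assumption.
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
```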

#### High Inference Request Latency P99

```
alert: HighInferenceRequestLatencyP99
expr: histogram_quantile(0.99, rate(inference_model_request_duration_seconds_bucket[5m])) > 10.0 # Adjust threshold as needed (e.g., 10.0 seconds)
for: 5m
annotations:
  title: 'High latency (P99) for model {{ $labels.model_name }}'
  description: 'The 99th percentile request duration for model {{ $labels.model_name }} and target model {{ $labels.target_model_name }} has been consistently above 10.0 seconds for 5 minutes.'
labels:
  severity: 'warning'
```

#### High Inference Error Rate

```
alert: HighInferenceErrorRate
expr: sum by (model_name) (rate(inference_model_request_error_total[5m])) / sum by (model_name) (rate(inference_model_request_total[5m])) > 0.05 # Adjust threshold as needed (e.g., 5% error rate)
for: 5m
annotations:
  title: 'High error rate for model {{ $labels.model_name }}'
  description: 'The error rate for model {{ $labels.model_name }} and target model {{ $labels.target_model_name }} has been consistently above 5% for 5 minutes.'
labels:
  severity: 'critical'
  impact: 'availability'
```

#### High Inference Pool Average Queue Size

```
alert: HighInferencePoolAvgQueueSize
expr: inference_pool_average_queue_size > 50 # Adjust threshold based on expected queue size
for: 5m
annotations:
  title: 'High average queue size for inference pool {{ $labels.name }}'
  description: 'The average number of requests pending in the queue for inference pool {{ $labels.name }} has been consistently above 50 for 5 minutes.'
labels:
  severity: 'critical'
  impact: 'performance'
```

#### High Inference Pool Average KV Cache Utilization

```
alert: HighInferencePoolAvgKVCacheUtilization
expr: inference_pool_average_kv_cache_utilization > 0.9 # 90% utilization
for: 5m
annotations:
  title: 'High KV cache utilization for inference pool {{ $labels.name }}'
  description: 'The average KV cache utilization for inference pool {{ $labels.name }} has been consistently above 90% for 5 minutes, indicating potential resource exhaustion.'
labels:
  severity: 'critical'
  impact: 'resource_exhaustion'
```
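
Because each rule attaches a `severity` label (and, where relevant, an `impact` label), you can route notifications on them in Alertmanager. Below is a minimal routing sketch; the receiver names are hypothetical placeholders for whatever Slack, PagerDuty, or email integrations you configure:

```
# Route alerts by the severity label set in the rules above.
# Receiver names here are hypothetical placeholders.
route:
  receiver: 'default-notifications'
  routes:
    - matchers:
        - severity="critical"
      receiver: 'oncall-pager'

receivers:
  - name: 'default-notifications'
  - name: 'oncall-pager'
```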

tools/alerts/alert.yaml

Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
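# Sample Prometheus alert rules for the gateway-api-inference-extension metrics.
# Thresholds below are illustrative defaults; adjust them for your workload.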
groups:
- name: gateway-api-inference-extension
  rules:
  - alert: HighInferenceRequestLatencyP99
    annotations:
      title: 'High latency (P99) for model {{ $labels.model_name }}'
      description: 'The 99th percentile request duration for model {{ $labels.model_name }} and target model {{ $labels.target_model_name }} has been consistently above 10.0 seconds for 5 minutes.'
    expr: histogram_quantile(0.99, rate(inference_model_request_duration_seconds_bucket[5m])) > 10.0
    for: 5m
    labels:
      severity: 'warning'
  - alert: HighInferenceErrorRate
    annotations:
      title: 'High error rate for model {{ $labels.model_name }}'
      description: 'The error rate for model {{ $labels.model_name }} and target model {{ $labels.target_model_name }} has been consistently above 5% for 5 minutes.'
    expr: sum by (model_name) (rate(inference_model_request_error_total[5m])) / sum by (model_name) (rate(inference_model_request_total[5m])) > 0.05
    for: 5m
    labels:
      severity: 'critical'
      impact: 'availability'
  - alert: HighInferencePoolAvgQueueSize
    annotations:
      title: 'High average queue size for inference pool {{ $labels.name }}'
      description: 'The average number of requests pending in the queue for inference pool {{ $labels.name }} has been consistently above 50 for 5 minutes.'
    expr: inference_pool_average_queue_size > 50
    for: 5m
    labels:
      severity: 'critical'
      impact: 'performance'
  - alert: HighInferencePoolAvgKVCacheUtilization
    annotations:
      title: 'High KV cache utilization for inference pool {{ $labels.name }}'
      description: 'The average KV cache utilization for inference pool {{ $labels.name }} has been consistently above 90% for 5 minutes, indicating potential resource exhaustion.'
    expr: inference_pool_average_kv_cache_utilization > 0.9
    for: 5m
    labels:
      severity: 'critical'
      impact: 'resource_exhaustion'
