### Configure alerts

This section describes how to configure Prometheus alerts using the collected metrics.
You can follow this [blog post](https://grafana.com/blog/2020/02/25/step-by-step-guide-to-setting-up-prometheus-alertmanager-with-slack-pagerduty-and-gmail/) for instructions on setting up alerts in your monitoring stack with Prometheus.
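Alerts only reach a destination such as Slack or PagerDuty once Prometheus is pointed at an Alertmanager instance with a matching receiver. The snippet below is a minimal sketch rather than a configuration from this repository; the webhook URL and channel name are placeholders you would replace with your own.

```
# alertmanager.yml -- minimal routing sketch; webhook URL and channel are placeholders
route:
  receiver: slack-notifications   # send everything to one receiver by default
receivers:
  - name: slack-notifications
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE/ME  # placeholder Slack webhook
        channel: '#inference-alerts'                          # placeholder channel
        send_resolved: true
```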
A template alert rule file is available at [alert.yaml](../../tools/alerts/alert.yaml). You can modify these rules and append them to your existing Prometheus deployment.
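Note that the `alert:` blocks in the sections below are rule fragments: Prometheus only loads rules that sit under a `groups:` list in a rule file referenced from `rule_files:` in its main configuration. The sketch below shows that wiring, assuming the rule file sits next to `prometheus.yml` and Alertmanager is reachable at `alertmanager:9093` (both placeholder assumptions); `promtool check rules alert.yaml` can validate the file before you reload Prometheus.

```
# alert.yaml -- each rule fragment below goes under a group, e.g.:
groups:
  - name: inference-extension-alerts
    rules:
      - alert: HighInferencePoolAvgQueueSize
        expr: inference_pool_average_queue_size > 50
        for: 5m
        annotations:
          title: 'High average queue size for inference pool {{ $labels.name }}'
        labels:
          severity: 'critical'

# prometheus.yml -- load the rule file and point at Alertmanager
rule_files:
  - alert.yaml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']  # placeholder address
```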
#### High Inference Request Latency P99

```
alert: HighInferenceRequestLatencyP99
expr: histogram_quantile(0.99, rate(inference_model_request_duration_seconds_bucket[5m])) > 10.0 # Adjust threshold as needed (e.g., 10.0 seconds)
for: 5m
annotations:
  title: 'High latency (P99) for model {{ $labels.model_name }}'
  description: 'The 99th percentile request duration for model {{ $labels.model_name }} and target model {{ $labels.target_model_name }} has been consistently above 10.0 seconds for 5 minutes.'
labels:
  severity: 'warning'
```
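As written, `histogram_quantile` is evaluated per time series, so a pool with several model server replicas will raise one alert per replica. If a single pool-wide latency signal is preferable, a variant like the following (a suggested adjustment, assuming the duration metric is a standard Prometheus histogram) aggregates the buckets across replicas first:

```
expr: histogram_quantile(0.99, sum by (le, model_name) (rate(inference_model_request_duration_seconds_bucket[5m]))) > 10.0
```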
#### High Inference Error Rate
```
alert: HighInferenceErrorRate
expr: sum by (model_name) (rate(inference_model_request_error_total[5m])) / sum by (model_name) (rate(inference_model_request_total[5m])) > 0.05 # Adjust threshold as needed (e.g., 5% error rate)
for: 5m
annotations:
  title: 'High error rate for model {{ $labels.model_name }}'
  description: 'The error rate for model {{ $labels.model_name }} and target model {{ $labels.target_model_name }} has been consistently above 5% for 5 minutes.'
labels:
  severity: 'critical'
  impact: 'availability'
```
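You can check that a rule fires when expected before deploying it by writing a unit test for `promtool test rules`. The sketch below is hypothetical (the file names and series values are invented for illustration, and it assumes the rule has been wrapped in a group inside `alert.yaml` as shown earlier): it feeds ten minutes of counters with a steady 10% error rate and asserts that `HighInferenceErrorRate` fires.

```
# alert_test.yaml -- run with: promtool test rules alert_test.yaml
rule_files:
  - alert.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # counters grow by 100 requests/min, 10 of them errors (a 10% error rate)
      - series: 'inference_model_request_total{model_name="m1"}'
        values: '0+100x10'
      - series: 'inference_model_request_error_total{model_name="m1"}'
        values: '0+10x10'
    alert_rule_test:
      - eval_time: 10m
        alertname: HighInferenceErrorRate
        exp_alerts:
          - exp_labels:
              model_name: m1
              severity: critical
              impact: availability
```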
#### High Inference Pool Queue Average Size
```
alert: HighInferencePoolAvgQueueSize
expr: inference_pool_average_queue_size > 50 # Adjust threshold based on expected queue size
for: 5m
annotations:
  title: 'High average queue size for inference pool {{ $labels.name }}'
  description: 'The average number of requests pending in the queue for inference pool {{ $labels.name }} has been consistently above 50 for 5 minutes.'
labels:
  severity: 'critical'
  impact: 'performance'
```
#### High Inference Pool Average KV Cache Utilization

```
alert: HighInferencePoolAvgKVCacheUtilization
expr: inference_pool_average_kv_cache_utilization > 0.9 # Adjust threshold based on expected utilization
for: 5m
annotations:
  title: 'High KV cache utilization for inference pool {{ $labels.name }}'
  description: 'The average KV cache utilization for inference pool {{ $labels.name }} has been consistently above 90% for 5 minutes, indicating potential resource exhaustion.'
labels:
  severity: 'critical'
  impact: 'performance'
```
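The thresholds above (a queue size of 50 and, for the KV cache, 90% utilization) are starting points rather than universal values. One way to pick thresholds that match your traffic, offered here as a suggestion, is to chart each metric's recent peak in the Prometheus UI and set the alert just above it:

```
# Peak average queue size per inference pool over the last day
max_over_time(inference_pool_average_queue_size[1d])

# Peak average KV cache utilization per inference pool over the last day
max_over_time(inference_pool_average_kv_cache_utilization[1d])
```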