@@ -253,9 +253,9 @@ sum by (model) (increase(llm_model_cost_total{currency="USD"}[1h]))
 sum by (reason_code) (increase(llm_routing_reason_codes_total[15m]))
 ```
 
-### Error Metrics
+### Request Error Metrics
 
-The router tracks request errors categorized by failure reason for monitoring and debugging.
+The router tracks request-level failures by model and reason so you can monitor both absolute error throughput and the share of requests that fail.
 
 - `llm_request_errors_total{model, reason}`
   - Description: Total number of request errors categorized by failure reason
@@ -269,9 +269,14 @@ Example PromQL queries:
 # Total errors by reason over the last hour
 sum by (reason) (increase(llm_request_errors_total[1h]))
 
-# Error rate by model over the last 15 minutes
-sum by (model) (increase(llm_request_errors_total[15m])) /
-sum by (model) (increase(llm_model_requests_total[15m]))
+# Error throughput (errors/sec) by model over the last 15 minutes.
+# Helpful for incident response because it shows how many failing requests are impacting users.
+sum by (model) (rate(llm_request_errors_total[15m]))
+
+# Error ratio (% of requests failing) by model over the last 15 minutes.
+# Use increase() to align numerator and denominator with the same lookback window.
+100 * sum by (model) (increase(llm_request_errors_total[15m])) /
+sum by (model) (increase(llm_model_requests_total[15m]))
 
 # PII policy blocks over the last 24 hours
 sum(increase(llm_request_errors_total{reason="pii_policy_denied"}[24h]))