@@ -173,6 +173,12 @@ semantic_router_cache_size 1247
 # Security metrics
 semantic_router_pii_detections_total{action="block"} 23
 semantic_router_jailbreak_attempts_total{action="block"} 5
+
+# Error metrics
+llm_request_errors_total{model="gpt-4",reason="timeout"} 12
+llm_request_errors_total{model="claude-3",reason="upstream_5xx"} 3
+llm_request_errors_total{model="phi4",reason="upstream_4xx"} 5
+llm_request_errors_total{model="phi4",reason="pii_policy_denied"} 8
 ```
 
 ### Reasoning Mode Metrics
@@ -247,6 +253,30 @@ sum by (model) (increase(llm_model_cost_total{currency="USD"}[1h]))
 sum by (reason_code) (increase(llm_routing_reason_codes_total[15m]))
 ```
 
+### Error Metrics
+
+The router tracks request errors categorized by failure reason for monitoring and debugging.
+
+- `llm_request_errors_total{model, reason}`
+  - Description: Total number of request errors categorized by failure reason
+  - Labels:
+    - model: target model name for the failed request
+    - reason: error category (timeout, upstream_4xx, upstream_5xx, pii_policy_denied, jailbreak_block, parse_error, serialization_error, cancellation, classification_failed, unknown)
+
+Example PromQL queries:
+
+```prometheus
+# Total errors by reason over the last hour
+sum by (reason) (increase(llm_request_errors_total[1h]))
+
+# Error rate by model over the last 15 minutes
+sum by (model) (increase(llm_request_errors_total[15m])) /
+sum by (model) (increase(llm_model_requests_total[15m]))
+
+# PII policy blocks over the last 24 hours
+sum(increase(llm_request_errors_total{reason="pii_policy_denied"}[24h]))
+```
+
 ### Pricing Configuration
 
 Provide per-1M pricing for your models so the router can compute request cost and emit metrics/logs.
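
Review note on the Error Metrics hunk: the grouping done by the `sum by (reason)` and `sum by (model)` queries can be checked offline against the sample exposition lines from the first hunk. The following is a minimal sketch, not part of the router. The helper name `sum_by` and the simplified line regex are assumptions made for illustration, and the code sums raw counter values rather than computing `increase()` over a time window as the PromQL queries do.

```python
import re
from collections import defaultdict

# Sample lines copied from the "Error metrics" example added in this diff.
SAMPLE = """\
llm_request_errors_total{model="gpt-4",reason="timeout"} 12
llm_request_errors_total{model="claude-3",reason="upstream_5xx"} 3
llm_request_errors_total{model="phi4",reason="upstream_4xx"} 5
llm_request_errors_total{model="phi4",reason="pii_policy_denied"} 8
"""

# Simplified matcher for `name{labels} value`. It deliberately ignores
# timestamps, escaped quotes, and label-less samples, all of which the
# full Prometheus text format allows.
LINE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="([^"]*)"')


def sum_by(text, metric, label):
    """Sum sample values of `metric`, grouped by one label (hypothetical helper)."""
    totals = defaultdict(float)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE/comment lines
        match = LINE_RE.match(line)
        if match is None or match.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels")))
        totals[labels.get(label, "")] += float(match.group("value"))
    return dict(totals)


by_reason = sum_by(SAMPLE, "llm_request_errors_total", "reason")
by_model = sum_by(SAMPLE, "llm_request_errors_total", "model")
```

With the sample values, `phi4` is the only model that appears under two reasons, so it is the one entry where grouping by `model` actually merges series.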