@@ -69,8 +69,8 @@ These metrics come from the decode worker pods' system endpoints (port 9090). In
|**Component Latency - Prefill vs Decode**|`dynamo_component_request_duration_seconds_{sum,count}{dynamo_component="prefill",dynamo_endpoint="generate"}` & `{dynamo_component="backend",dynamo_endpoint="generate"}`|`rate(sum[5m]) / rate(count[5m])`| Average request duration for prefill workers (includes NIXL transfer) vs decode workers (entire decode session for all output tokens) over the last 5 minutes. **Note**: Decode worker latency measures the FULL decode session duration, not just time to first token. Only shows `generate` endpoint (filters out `clear_kv_blocks` maintenance operations) |
|**Decode Worker - Request Throughput**|`dynamo_component_requests_total{dynamo_component="backend"}`|`rate(...[5m])`| Rate of requests processed by decode workers in requests/second |
|**Decode Worker - Avg Request Duration**|`dynamo_component_request_duration_seconds_{sum,count}{dynamo_component="backend"}`|`rate(sum[5m]) / rate(count[5m])`| Average time decode workers spend processing requests (decode phase only) over the last 5 minutes |
-|**KV Cache Utilization**|`dynamo_component_kvstats_gpu_cache_usage_percent`| Raw value (0-100%) | GPU memory utilization for KV cache storage of active requests. High values (>90%) indicate workers are at capacity and requests are queueing. **Note**: Only available for decode workers - prefill workers in disaggregated mode don't expose this metric. Monitor Prefill Worker Processing Time instead for prefill capacity |
-|**KV Cache Blocks (Active/Total)**⭐ |`dynamo_component_kvstats_active_blocks` & `dynamo_component_kvstats_total_blocks`| Raw values|Number of KV cache blocks in use vs total available for decode workers. When active approaches total, decode workers are at capacity. Shows numeric values (e.g., 2048/5297). **Note**: Only for decode workers |
+|**KV Cache Utilization**|`dynamo_component_gpu_cache_usage_percent`| Raw value (0-100%) | GPU memory utilization for KV cache storage of active requests. High values (>90%) indicate workers are at capacity and requests are queueing. **Note**: Only available for decode workers - prefill workers in disaggregated mode don't expose this metric. Monitor Prefill Worker Processing Time instead for prefill capacity |
+|**KV Cache Blocks (Total)**|`dynamo_component_total_blocks`| Raw value|Total number of KV cache blocks available on decode workers. **Note**: Only for decode workers |
### CPU Metrics (from cAdvisor and Node Exporter)
These metrics come from Kubernetes cAdvisor (container metrics) and Node Exporter (node-level metrics). CPU bottlenecks can impact prefill/decode performance.
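
The `rate(sum[5m]) / rate(count[5m])` expression used in the latency rows of this hunk divides the growth of the `_sum` series by the growth of the `_count` series over the same window. The Python sketch below illustrates that arithmetic on two scrapes. It is a simplification, not Prometheus's implementation: it ignores counter resets and window extrapolation, and the `HistogramSample` type and `average_duration` helper are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HistogramSample:
    """One scrape of a duration metric's _sum and _count series (hypothetical type)."""
    timestamp: float       # scrape time in seconds
    duration_sum: float    # value of ..._request_duration_seconds_sum
    request_count: float   # value of ..._request_duration_seconds_count


def average_duration(earlier: HistogramSample, later: HistogramSample) -> Optional[float]:
    """Approximate rate(sum[w]) / rate(count[w]) from two scrapes.

    Returns None when no requests completed in the window, which corresponds
    to the division by zero that leaves the dashboard panel empty.
    """
    elapsed = later.timestamp - earlier.timestamp
    if elapsed <= 0:
        raise ValueError("later sample must be newer than earlier sample")

    # Per-second growth of each counter over the window.
    sum_rate = (later.duration_sum - earlier.duration_sum) / elapsed
    count_rate = (later.request_count - earlier.request_count) / elapsed

    if count_rate <= 0:
        return None
    return sum_rate / count_rate


# Example: 300 requests completed in 5 minutes and took 60 seconds in total.
start = HistogramSample(timestamp=0.0, duration_sum=10.0, request_count=100.0)
end = HistogramSample(timestamp=300.0, duration_sum=70.0, request_count=400.0)
avg = average_duration(start, end)
```

Because both rates share the same elapsed time, the result is simply the added duration divided by the added request count, which is why the panel reads as an average per-request duration rather than a throughput.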
@@ -194,7 +194,7 @@ The DCGM ServiceMonitor must be manually created (see `dcgm-servicemonitor.yaml`
- Check deployment mode and request routing configuration
### KV Cache metrics only showing decode workers:
-**Important Limitation**: In disaggregated mode, prefill workers (`--disaggregation-mode prefill`) do NOT expose `dynamo_component_kvstats_*` metrics. Only decode workers expose these.
+**Important Limitation**: In disaggregated mode, prefill workers (`--disaggregation-mode prefill`) do NOT expose `dynamo_component_total_blocks` or `dynamo_component_gpu_cache_usage_percent` metrics. Only decode workers expose these.
**Why this happens:**
- Prefill workers transfer KV cache to decode workers via NIXL
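
One way to confirm the limitation described in this hunk is to check which metric names a worker's metrics endpoint exposes. The sketch below parses Prometheus text-format output and reports which of the two renamed KV cache metrics are absent. The helper names and the sample payloads (including their label sets and values) are hypothetical and exist only for illustration. Fetching the text from a worker's system endpoint is left out.

```python
from __future__ import annotations

# Metric names taken from the "+" lines of this diff.
KV_METRICS = (
    "dynamo_component_total_blocks",
    "dynamo_component_gpu_cache_usage_percent",
)


def exposed_metric_names(exposition_text: str) -> set[str]:
    """Collect metric names from Prometheus text-format output."""
    names: set[str] = set()
    for raw_line in exposition_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample line is "name{labels} value" or "name value".
        name = line.split("{", 1)[0].split(None, 1)[0]
        names.add(name)
    return names


def missing_kv_metrics(exposition_text: str) -> list[str]:
    """Return the KV cache metrics that the given output does not expose."""
    exposed = exposed_metric_names(exposition_text)
    return [metric for metric in KV_METRICS if metric not in exposed]


# Hypothetical payloads; real label sets and values will differ.
decode_output = """\
# TYPE dynamo_component_total_blocks gauge
dynamo_component_total_blocks{dynamo_component="backend"} 5297
dynamo_component_gpu_cache_usage_percent{dynamo_component="backend"} 38.5
"""
prefill_output = """\
# TYPE dynamo_component_requests_total counter
dynamo_component_requests_total{dynamo_component="prefill"} 120
"""

decode_missing = missing_kv_metrics(decode_output)
prefill_missing = missing_kv_metrics(prefill_output)
```

With these sample payloads, the decode output reports nothing missing while the prefill output reports both KV cache metrics as missing, which matches the behavior the limitation note describes.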
"description": "Active KV cache blocks vs total available blocks for decode workers. Shows numeric capacity utilization. When active approaches total, workers are at capacity.",
+"description": "Total KV cache blocks available on decode workers. Shows numeric capacity.",
"description": "Active KV cache blocks vs total available blocks for decode workers. Shows numeric capacity utilization. When active approaches total, workers are at capacity.",
+"description": "Total KV cache blocks available on decode workers. Shows numeric capacity.",