@@ -150,14 +152,6 @@ The behavior of the SLO-aware router can be fine-tuned using the following envir
 | `HEADROOM_TTFT_WEIGHT` | The weight to give to the TTFT when a pod has positive headroom. | `0.8` |
 | `HEADROOM_TPOT_WEIGHT` | The weight to give to the TPOT when a pod has positive headroom. | `0.2` |
 | `HEADROOM_SELECTION_STRATEGY` | The strategy to use for selecting a pod based on headroom. Options: `least`, `most`, `composite-least`, `composite-most`, `composite-only`. | `least` |
-| `COMPOSITE_KV_WEIGHT` | The weight to give to the KV cache utilization in the composite score. | `1` |
-| `COMPOSITE_QUEUE_WEIGHT` | The weight to give to the queue size in the composite score. | `1` |
-| `COMPOSITE_PREFIX_WEIGHT` | The weight to give to the prefix cache score in the composite score. | `1` |
-| `STICKY_EPSILON` | The probability of exploring a non-sticky pod. | `0.01` |
-| `NEG_HEADROOM_EPSILON` | The probability of exploring a pod with negative headroom. | `0.01` |
-| `AFFINITY_GATE_TAU` | The stickiness threshold for the affinity gate. | `0.80` |
-| `AFFINITY_GATE_TAU_GLOBAL` | The global stickiness threshold for the affinity gate. | `0.99` |
-| `POD_SELECTION_MODE` | The mode for selecting a pod from the weighted list. Options: `linear`(weighted random), `max` (argmax). | `linear` |
 
 **Note:** Enabling SLO-aware routing also exposes a number of Prometheus metrics for monitoring the feature, including actual vs. predicted latency, SLO violations, and more.
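
To make the three settings that remain after this hunk concrete, here is a minimal sketch of reading them with the defaults listed in the table. The `HeadroomConfig` class and `load_headroom_config` function are hypothetical names for illustration, not part of the router, and rejecting an unknown strategy is an assumption about how such a loader might behave.

```python
import os
from dataclasses import dataclass

# Strategy options copied from the HEADROOM_SELECTION_STRATEGY table row.
STRATEGIES = ("least", "most", "composite-least", "composite-most", "composite-only")


@dataclass(frozen=True)
class HeadroomConfig:
    """Hypothetical container; field defaults mirror the documented defaults."""
    ttft_weight: float = 0.8
    tpot_weight: float = 0.2
    strategy: str = "least"


def load_headroom_config(env=None) -> HeadroomConfig:
    """Read the headroom settings from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    defaults = HeadroomConfig()
    cfg = HeadroomConfig(
        ttft_weight=float(env.get("HEADROOM_TTFT_WEIGHT", defaults.ttft_weight)),
        tpot_weight=float(env.get("HEADROOM_TPOT_WEIGHT", defaults.tpot_weight)),
        strategy=env.get("HEADROOM_SELECTION_STRATEGY", defaults.strategy),
    )
    # Assumption: fail loudly on an unknown option instead of falling back.
    if cfg.strategy not in STRATEGIES:
        raise ValueError(f"unknown HEADROOM_SELECTION_STRATEGY: {cfg.strategy!r}")
    return cfg
```

Calling `load_headroom_config({})` yields the documented defaults, which is a quick way to check a deployment manifest against the table.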
site-src/guides/slo-aware-routing.md: 2 additions & 2 deletions
@@ -61,6 +61,8 @@ Key categories of metrics include:
 - **SLO Violations**: Counters and gauges are available to track when SLOs are violated. This can be used to alert on SLO breaches.
 - **SLO Thresholds**: The current SLO thresholds for TTFT and TPOT are also exposed as metrics.
 
+NOTE: TPOT is equivalent to vLLM's **ITL** (Inter Token Latency), as vLLM defines TPOT as the average time per output token *including the TTFT*. This is commonly known as NTPOT in other contexts, and we don't capture that metric here.
+
 The following is a comprehensive list of the Prometheus metrics exposed:
 
 | Metric Name | Description |
@@ -81,5 +83,3 @@ The following is a comprehensive list of the Prometheus metrics exposed:
 |`inference_objective_request_ttft_slo_violation_total`| Counter of TTFT SLO violations for each model and target model. |
 |`inference_objective_request_tpot_slo_violation`| Boolean indicator (0 or 1) of whether the last TPOT measurement violated the SLO threshold for each model and target model. |
 |`inference_objective_request_tpot_slo_violation_total`| Counter of TPOT SLO violations for each model and target model. |
-|`inference_objective_request_ttft_slo_threshold_seconds`| Current TTFT SLO threshold in seconds for each model and target model. |
-|`inference_objective_request_tpot_slo_threshold_seconds`| Current TPOT SLO threshold in seconds for each model and target model. |
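
The NOTE added in this change separates two per-token averages: one that excludes the time to first token (the TPOT/ITL sense used here) and one that includes it (NTPOT). A small sketch, following the definitions as the NOTE states them, may help show the difference. The function name `latency_metrics` and the timestamp inputs are hypothetical and do not reflect how the router or vLLM collects these values.

```python
def latency_metrics(request_start, token_times):
    """Compute TTFT, inter-token TPOT, and NTPOT from token emit timestamps.

    request_start: time the request was received (seconds).
    token_times:   ascending emit times of each output token (seconds).
    """
    if not token_times:
        raise ValueError("at least one output token is required")
    n = len(token_times)
    ttft = token_times[0] - request_start
    # TPOT in the inter-token sense: mean gap between consecutive tokens,
    # so the wait for the first token is excluded. Undefined for one token.
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else None
    # NTPOT: total request latency, TTFT included, spread over all tokens.
    ntpot = (token_times[-1] - request_start) / n
    return {"ttft": ttft, "tpot": tpot, "ntpot": ntpot}
```

For a request starting at `0.0` with tokens emitted at `[0.5, 0.75, 1.0, 1.25, 1.5]`, TTFT is 0.5 s, the inter-token TPOT is 0.25 s, and NTPOT is 0.3 s. A slow first token raises NTPOT but leaves the inter-token figure unchanged.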