Commit c271df2

Fix bullet points
1 parent 92a312a commit c271df2

1 file changed, +6 -5 lines changed

site-src/guides/troubleshooting.md

Lines changed: 6 additions & 5 deletions
@@ -19,12 +19,13 @@ Solution: Ensure you have an HTTPRoute resource deployed that specifies the corr
 This error indicates that the entire request pool has exceeded its saturation thresholds. This means the system is under heavy load and is shedding non-critical requests. To address this, check the following:
 
 * gateway-api-inference-extension version:
-* **v0.5.1 and earlier**: Verify you're using an `InferenceModel` and that its `criticality` is set to `Critical`. This ensures requests are queued on the model servers instead of being dropped.
-* **v1.0.0 and later**: Ensure the `InferenceObjective` you're using has a `priority` greater than or equal to 0. A negative priority can cause requests to be dropped.
+* **v0.5.1 and earlier**: Verify you're using an `InferenceModel` and that its `criticality` is set to `Critical`. This ensures requests are queued on the model servers instead of being dropped.
+* **v1.0.0 and later**: Ensure the `InferenceObjective` you're using has a `priority` greater than or equal to 0. A negative priority can cause requests to be dropped.
+
 * Pool Thresholds: Check the defined pool [thresholds](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/f36111cab0ed5a309d1eafade896d4f37ab623a6/pkg/epp/saturationdetector/config.go#L41) to understand the saturation limits. Currently, we use three main metrics to assess the system's load:
-* `DefaultQueueDepthThreshold`: This is the maximum number of requests waiting in the queue for a backend. The default value is 5. If the queue for a model server exceeds this number, the saturation detector may consider the system under pressure. To override this, set the `SD_QUEUE_DEPTH_THRESHOLD` environment variable.
-* `DefaultKVCacheUtilThreshold`: This is the maximum utilization of the Key-Value (KV) cache on the model server, expressed as a decimal from 0.0 to 1.0. The default is 0.8, or 80%. The KV cache stores attention keys and values to speed up inference for subsequent tokens. When its utilization exceeds this threshold, it's an indication that the model server is nearing its memory capacity and may be becoming saturated. To override this, set the `SD_KV_CACHE_UTIL_THRESHOLD` environment variable.
-* `DefaultMetricsStalenessThreshold`: This defines the maximum age of metrics data before it's considered outdated. The default is 200 milliseconds. The saturation detector needs up-to-date metrics to make accurate decisions about system load. If the metrics are older than this threshold, the detector won't use them. This value is tied to how often metrics are refreshed, and setting it slightly higher ensures that there's always fresh data available. To override this, set the `SD_METRICS_STALENESS_THRESHOLD` environment variable.
+* `DefaultQueueDepthThreshold`: This is the maximum number of requests waiting in the queue for a backend. The default value is 5. If the queue for a model server exceeds this number, the saturation detector may consider the system under pressure. To override this, set the `SD_QUEUE_DEPTH_THRESHOLD` environment variable.
+* `DefaultKVCacheUtilThreshold`: This is the maximum utilization of the Key-Value (KV) cache on the model server, expressed as a decimal from 0.0 to 1.0. The default is 0.8, or 80%. The KV cache stores attention keys and values to speed up inference for subsequent tokens. When its utilization exceeds this threshold, it's an indication that the model server is nearing its memory capacity and may be becoming saturated. To override this, set the `SD_KV_CACHE_UTIL_THRESHOLD` environment variable.
+* `DefaultMetricsStalenessThreshold`: This defines the maximum age of metrics data before it's considered outdated. The default is 200 milliseconds. The saturation detector needs up-to-date metrics to make accurate decisions about system load. If the metrics are older than this threshold, the detector won't use them. This value is tied to how often metrics are refreshed, and setting it slightly higher ensures that there's always fresh data available. To override this, set the `SD_METRICS_STALENESS_THRESHOLD` environment variable.
 
 ## 500 Internal Server Error
 ### `fault filter abort`
