Description
What happened:
Today the SaturationDetector checks three conditions per pod (sketched below):
- Metrics are fresh (not stale).
- WaitingQueueSize <= QueueDepthThreshold.
- KVCacheUsagePercent <= KVCacheUtilThreshold.
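A minimal sketch (in Go) of the per-pod check described above, assuming hypothetical `podMetrics` and `detectorConfig` types; the extension's actual SaturationDetector types, field names, and thresholds may differ:

```go
package main

import (
	"fmt"
	"time"
)

// podMetrics and detectorConfig are hypothetical stand-ins for the
// extension's real metrics and configuration types.
type podMetrics struct {
	UpdateTime          time.Time // when metrics were last scraped
	WaitingQueueSize    int
	KVCacheUsagePercent float64
}

type detectorConfig struct {
	MetricsStalenessThreshold time.Duration
	QueueDepthThreshold       int
	KVCacheUtilThreshold      float64
}

// hasCapacity mirrors the three conditions listed above: fresh metrics,
// waiting queue depth at or below the threshold, and KV-cache utilization
// at or below the threshold. Stale metrics make the pod count as having
// no capacity, which is the behavior this issue questions.
func hasCapacity(now time.Time, m podMetrics, cfg detectorConfig) bool {
	if now.Sub(m.UpdateTime) > cfg.MetricsStalenessThreshold {
		return false
	}
	return m.WaitingQueueSize <= cfg.QueueDepthThreshold &&
		m.KVCacheUsagePercent <= cfg.KVCacheUtilThreshold
}

func main() {
	cfg := detectorConfig{
		MetricsStalenessThreshold: 200 * time.Millisecond,
		QueueDepthThreshold:       5,
		KVCacheUtilThreshold:      0.8,
	}
	// An idle-looking pod with stale metrics is treated as saturated.
	stale := podMetrics{UpdateTime: time.Now().Add(-time.Second)}
	fmt.Println(hasCapacity(time.Now(), stale, cfg)) // false
}
```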
Consider the case where one pod came up and hasn't reported metrics for a while. Assume that the last time it did report, it was in great shape (e.g., on startup: waiting queue size = 0, KV cache = 0).
So we have:
pod1 - able to serve requests and reported as healthy, but it hasn't reported metrics for a while, and the last metrics we collected from it show it as idle.
pod2 - partially busy.
pod3 - partially busy.
Every incoming request that goes through saturation detection identifies that pod1 should not be considered a final target. However, pod1 can still be selected to serve requests, and because its stale metrics show it as idle, pod1 has a high chance of being selected.
This can happen continuously, so multiple requests may be sent to pod1, leading to an unpredictable state (pod1 may not be able to handle the load).
What you expected to happen:
Freshness of a pod's metrics should not be part of the saturation detector. Instead, such pods should be filtered out when getting the candidate pods from the datastore (see the sketch below). Stale metrics are not a sign of saturation; they are a sign of an invalid pod that should not be considered for serving the request.
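A hedged sketch of the suggested alternative, reusing the hypothetical `podMetrics` type from the sketch above: drop pods with stale metrics while building the candidate list from the datastore, so saturation detection only reasons about queue depth and KV-cache utilization of pods with fresh data. `freshCandidates` is an illustrative helper, not an existing datastore function:

```go
// freshCandidates keeps only pods whose metrics were updated within the
// staleness window; stale pods are excluded from candidate selection
// entirely instead of being reported as "saturated".
func freshCandidates(now time.Time, pods []podMetrics, staleness time.Duration) []podMetrics {
	out := make([]podMetrics, 0, len(pods))
	for _, p := range pods {
		if now.Sub(p.UpdateTime) <= staleness {
			out = append(out, p)
		}
	}
	return out
}
```

With this split, pod1 from the scenario above would never appear in the candidate set in the first place, so its stale "idle" metrics could not attract traffic.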
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Environment:
- Kubernetes version (use `kubectl version`):
- Inference extension version (use `git describe --tags --dirty --always`):
- Cloud provider or hardware configuration:
- Install tools:
- Others:
cc: @LukeAVanDrie