Description
@markurtz This issue is dependent on #238.
When running a benchmark, there are many reasons the server can Over-Saturate, meaning that the rate of guidellm-generated requests exceeds the rate of server responses.
Usually this badly skews the measured metrics, making them misleading.
The proposed feature is to integrate an online Over-Saturation detection algorithm into guidellm.
The algorithm would be used in three ways:
- Early stopping of a benchmark in which Over-Saturation is detected
- Early stopping of a sweep in which Over-Saturation is detected
- An indication in the output report that Over-Saturation was detected
Over-Saturation Detection Algorithm
I have evaluated an algorithm (see internal Red Hat Slack for links and documents) which achieves near-perfect detection, both in accuracy and in minimizing wasted time.
The algorithm basically goes like this:
- Wait at least 30 seconds (hyper-parameter, tunable)
- Wait for the median TTFT to exceed 2.5 seconds (hyper-parameter, tunable)
- Check the slope of concurrent requests over time
- Check the slope of TTFT over time
- If both are consistently positive (within a reasonable margin of error), stop the benchmark.
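The steps above can be sketched as a small stateful detector. This is only an illustrative sketch, not guidellm code: the class name, method names, and default thresholds are assumptions, and the slope check here is a plain least-squares fit over a sliding window.

```python
import time
from statistics import median


def slope(points):
    """Ordinary least-squares slope of a list of (timestamp, value) points."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    num = sum((t - mean_t) * (y - mean_y) for t, y in points)
    den = sum((t - mean_t) ** 2 for t, _ in points)
    return num / den if den else 0.0


class OverSaturationDetector:
    """Illustrative sketch of the windowed slope check described above.

    All names and default thresholds are assumptions, not guidellm API.
    """

    def __init__(self, min_seconds=30.0, ttft_threshold=2.5,
                 window_seconds=60.0, slope_margin=0.0):
        self.min_seconds = min_seconds        # minimum warm-up before deciding
        self.ttft_threshold = ttft_threshold  # median TTFT gate, in seconds
        self.window_seconds = window_seconds  # sliding window, e.g. 1 minute
        self.slope_margin = slope_margin      # margin of error for "positive"
        self.start = None
        self.ttft_points = []         # (timestamp, TTFT)
        self.concurrency_points = []  # (timestamp, concurrent requests)

    def add_sample(self, now, ttft, concurrency):
        """Record one observation and drop points outside the window."""
        if self.start is None:
            self.start = now
        self.ttft_points.append((now, ttft))
        self.concurrency_points.append((now, concurrency))
        cutoff = now - self.window_seconds
        self.ttft_points = [p for p in self.ttft_points if p[0] >= cutoff]
        self.concurrency_points = [p for p in self.concurrency_points
                                   if p[0] >= cutoff]

    def is_over_saturated(self, now):
        """Apply the algorithm: warm-up, TTFT gate, then both slopes positive."""
        if self.start is None or now - self.start < self.min_seconds:
            return False  # wait at least 30 seconds
        if len(self.ttft_points) < 2:
            return False
        if median(y for _, y in self.ttft_points) <= self.ttft_threshold:
            return False  # median TTFT not yet above 2.5 seconds
        ttft_rising = slope(self.ttft_points) > self.slope_margin
        concurrency_rising = slope(self.concurrency_points) > self.slope_margin
        return ttft_rising and concurrency_rising
```

The benchmark loop would feed each completed request into `add_sample` and poll `is_over_saturated`; a healthy server keeps at least one of the two slopes near zero, so both conditions holding over a full window is a strong saturation signal.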
Is this issue addressed by an existing feature?
Currently, Over-Saturation is partially addressed by:
- Throughput/sweep modes, which measure the maximum load achievable by a server. In practice, the RPS detected by the current throughput mode is very noisy and usually a large over-estimate, so a sweep typically Over-Saturates the server in its last few constant-rate benchmarks.
- Max-error stopping (Feat/max error rate - continued #238), which will also help when an Over-Saturated server stops responding and errors accumulate; however, it takes a while for timeouts to start arriving, and that time is wasted.
Implementation Discussion
It seems very natural to me to put this logic somewhere between benchmarker.py, aggregator.py, and potentially a new class hooked into aggregator.add_result(result).
The aggregator.py, or the new class, could accumulate the information needed to calculate the slopes and margins of error (TTFT and concurrent requests over some window of time, e.g. 1 minute). If Over-Saturation is detected, benchmarker.py would send a stop signal to the scheduler, set the termination reason to "over-saturated-server", complete the benchmark gracefully, and, if running a sweep, break out of the profile strategy loop.
@markurtz, what do you think?