Description
@markurtz This issue is dependent on #238.
When running a benchmark, there are many reasons the server can Over-Saturate, meaning that the rate of guidellm-generated requests exceeds the rate of server responses.
Usually this badly skews the measured metrics, making them misleading.
The proposed feature is to integrate an online Over-Saturation detection algorithm into guidellm.
The algorithm would be used in three ways:
- Early stopping of a benchmark in which Over-Saturation is detected
- Early stopping of a sweep in which Over-Saturation is detected
- An indication in the output report that Over-Saturation was detected
Over-Saturation Detection Algorithm
I have evaluated an algorithm (see internal Red Hat Slack for links and documents) which achieves near-perfect detection, both in accuracy and in minimizing wasted time.
The algorithm basically goes like this:
- Wait at least 30 seconds (hyper-parameter, tunable)
- Wait for the median TTFT to exceed 2.5 seconds (hyper-parameter, tunable)
- Check the slope of concurrent requests over time
- Check the slope of TTFT over time
- If both are consistently positive (within a reasonable margin of error), stop the benchmark.
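The steps above can be sketched as a small stateful detector. This is only an illustrative sketch, not guidellm code: the class name, method names, and default thresholds are assumptions, and the slope check here is a plain least-squares fit over a sliding window.

```python
import time
from statistics import median


def slope(points):
    """Ordinary least-squares slope of a list of (timestamp, value) points."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    num = sum((t - mean_t) * (y - mean_y) for t, y in points)
    den = sum((t - mean_t) ** 2 for t, _ in points)
    return num / den if den else 0.0


class OverSaturationDetector:
    """Illustrative sketch of the windowed slope check described above.

    All names and default thresholds are assumptions, not guidellm API.
    """

    def __init__(self, min_seconds=30.0, ttft_threshold=2.5,
                 window_seconds=60.0, slope_margin=0.0):
        self.min_seconds = min_seconds        # minimum warm-up before deciding
        self.ttft_threshold = ttft_threshold  # median TTFT gate, in seconds
        self.window_seconds = window_seconds  # sliding window, e.g. 1 minute
        self.slope_margin = slope_margin      # margin of error for "positive"
        self.start = None
        self.ttft_points = []         # (timestamp, TTFT)
        self.concurrency_points = []  # (timestamp, concurrent requests)

    def add_sample(self, now, ttft, concurrency):
        """Record one observation and drop points outside the window."""
        if self.start is None:
            self.start = now
        self.ttft_points.append((now, ttft))
        self.concurrency_points.append((now, concurrency))
        cutoff = now - self.window_seconds
        self.ttft_points = [p for p in self.ttft_points if p[0] >= cutoff]
        self.concurrency_points = [p for p in self.concurrency_points
                                   if p[0] >= cutoff]

    def is_over_saturated(self, now):
        """Apply the algorithm: warm-up, TTFT gate, then both slopes positive."""
        if self.start is None or now - self.start < self.min_seconds:
            return False  # wait at least 30 seconds
        if len(self.ttft_points) < 2:
            return False
        if median(y for _, y in self.ttft_points) <= self.ttft_threshold:
            return False  # median TTFT not yet above 2.5 seconds
        ttft_rising = slope(self.ttft_points) > self.slope_margin
        concurrency_rising = slope(self.concurrency_points) > self.slope_margin
        return ttft_rising and concurrency_rising
```

The benchmark loop would feed each completed request into `add_sample` and poll `is_over_saturated`; a healthy server keeps at least one of the two slopes near zero, so both conditions holding over a full window is a strong saturation signal.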
Is this issue addressed by an existing feature?
Currently, Over-Saturation is partially addressed by:
- Throughput/sweep modes, which measure the maximum load achievable by a server. In practice, the RPS detected by the current throughput mode is very noisy and usually a large over-estimate, so a sweep typically Over-Saturates the server in its last few constant-rate benchmarks.
- Max-error stopping (Feat/max error rate - continued #238), which will also help when an Over-Saturated server stops responding and errors accumulate; however, it takes a while for timeouts to start arriving, and that time is wasted.
Implementation Discussion
It seems very natural to me to put this logic somewhere between benchmarker.py, aggregator.py, and potentially a new class hooked into aggregator.add_result(result).
The aggregator.py, or the new class, could accumulate the information needed to calculate the slopes and margins of error (TTFT and concurrent requests over some window of time, e.g. 1 minute). If Over-Saturation is detected, benchmarker.py would send a stop signal to the scheduler, set the termination reason to "over-saturated-server", complete the benchmark gracefully, and, if running a sweep, break out of the profile strategy loop.
@markurtz, what do you think?