Description
Problem Statement
When using --detect-saturation with large models and long input contexts (e.g., 10,000 tokens), the benchmark stops prematurely before any request completes.
Root Cause
The over-saturation constraint uses both concurrent request slope and TTFT slope to detect saturation. TTFT data is only communicated to the constraint when a request fully completes.
With large models and long contexts, no request completes within the minimum_duration window (default: 30s). The constraint therefore has no TTFT data and falls back to the concurrent request slope alone, which rises naturally during ramp-up, so the benchmark is stopped as if the server were saturated.
The TTFT value is available much earlier: the backend knows it as soon as the first token arrives during streaming. However, it is not communicated back to the main process until the request completes.
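The failure mode described above can be illustrated with a minimal sketch. This is not the project's actual constraint code: the names `slope` and `is_saturated`, the least-squares fit, and the zero threshold are all assumptions chosen only to show how a two-signal check degrades to a single signal when no TTFT samples exist.

```python
from typing import Optional, Sequence, Tuple

# (timestamp_seconds, value); a hypothetical sample shape for this sketch.
Sample = Tuple[float, float]


def slope(samples: Sequence[Sample]) -> Optional[float]:
    """Least-squares slope of value over time, or None if it cannot be fit."""
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    return cov / var_t


def is_saturated(
    concurrent: Sequence[Sample],
    ttft: Sequence[Sample],
    threshold: float = 0.0,
) -> bool:
    """Hypothetical two-signal check with the fallback the issue describes."""
    concurrent_slope = slope(concurrent)
    ttft_slope = slope(ttft)
    if concurrent_slope is None:
        return False
    if ttft_slope is None:
        # No TTFT data yet: only the concurrency signal is left.
        return concurrent_slope > threshold
    return concurrent_slope > threshold and ttft_slope > threshold


# Ramp-up: concurrency grows steadily over 30 seconds.
ramp_up = [(float(t), float(t)) for t in range(30)]

# No request has completed, so there are no TTFT samples.
premature_stop = is_saturated(ramp_up, [])

# With a few early TTFT samples that stay flat, the second signal disagrees.
flat_ttft = [(5.0, 1.2), (10.0, 1.2), (15.0, 1.2)]
with_early_ttft = is_saturated(ramp_up, flat_ttft)
```

In this sketch `premature_stop` is `True` because ramp-up alone looks like saturation, while `with_early_ttft` is `False` once a handful of stable TTFT samples are available.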
Proposed Solution
When over-saturation detection is enabled, send a bounded number of early TTFT notifications from the worker process when the first token arrives during streaming, before the request completes. This gives the constraint real TTFT data to make a two-signal decision instead of falling back to concurrent slope alone.
The count of early notifications should be configurable and small (e.g., 5 per worker) to limit multi-process queue overhead. When over-saturation detection is not enabled, no extra notifications are sent and behavior is unchanged.
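One possible shape for the worker-side change is sketched below. Everything here is hypothetical: `EarlyTTFT`, `EarlyTTFTNotifier`, `on_first_token`, and the `max_notifications` parameter are illustrative names, not existing APIs, and the queue is assumed to be any object with a `put` method (for example a `multiprocessing.Queue` shared with the main process).

```python
import queue
from dataclasses import dataclass


@dataclass
class EarlyTTFT:
    """Hypothetical message carrying a TTFT value before request completion."""

    request_id: str
    ttft_seconds: float


class EarlyTTFTNotifier:
    """Sends at most `max_notifications` early TTFT messages per worker."""

    def __init__(self, out_queue, enabled: bool, max_notifications: int = 5):
        self._queue = out_queue
        # With detection disabled the budget is zero, so nothing is ever sent.
        self._remaining = max_notifications if enabled else 0

    def on_first_token(
        self, request_id: str, start_time: float, first_token_time: float
    ) -> bool:
        """Call when the first streamed token arrives; returns True if sent."""
        if self._remaining <= 0:
            return False
        self._remaining -= 1
        self._queue.put(EarlyTTFT(request_id, first_token_time - start_time))
        return True


# Usage: a budget of 2 means only the first two requests notify early.
updates = queue.Queue()
notifier = EarlyTTFTNotifier(updates, enabled=True, max_notifications=2)
sent = [notifier.on_first_token(f"req-{i}", 0.0, 1.5) for i in range(4)]

# With over-saturation detection disabled, the queue stays empty.
silent_updates = queue.Queue()
silent = EarlyTTFTNotifier(silent_updates, enabled=False)
silent_sent = silent.on_first_token("req-0", 0.0, 1.5)
```

Keeping the budget inside the worker means the cap is enforced before anything touches the shared queue, which is what bounds the multi-process overhead. Whether the budget should be per worker or global is a design choice this sketch does not settle.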
Alternatives Considered
No response
Usage Examples
Additional Context
No response