
Over-saturation detection stops prematurely with long contexts #606

@ushaket

Description


Problem Statement

When using --detect-saturation with large models and long input contexts (e.g., 10,000 tokens), the benchmark stops prematurely, before any request has completed.

Root Cause

The over-saturation constraint uses both the concurrent-request slope and the TTFT (time to first token) slope to detect saturation. TTFT data is only communicated to the constraint when a request fully completes.

With large models and long contexts, no request completes within the minimum_duration window (default: 30s), so the constraint has zero TTFT data and falls back to the concurrent-request slope alone, which naturally rises during ramp-up.
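To make the failure mode concrete, here is a minimal sketch of the two-signal decision and its fallback. All names are illustrative, not guidellm's actual API; the point is that with zero TTFT samples, the concurrency slope alone decides, and it is always positive during ramp-up.

```python
from dataclasses import dataclass, field


def slope(xs, ys) -> float:
    """Least-squares slope of ys over xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    denom = sum((x - mx) ** 2 for x in xs)
    if denom == 0:
        return 0.0
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / denom


@dataclass
class SaturationDetector:
    # (timestamp, in-flight request count) samples
    concurrency: list = field(default_factory=list)
    # (timestamp, observed TTFT) samples, only added on request completion
    ttfts: list = field(default_factory=list)

    def is_saturated(self) -> bool:
        conc_rising = (
            slope(*zip(*self.concurrency)) > 0 if len(self.concurrency) > 1 else False
        )
        if len(self.ttfts) > 1:
            ttft_rising = slope(*zip(*self.ttfts)) > 0
            return conc_rising and ttft_rising  # intended two-signal decision
        # No completed requests yet: only the concurrency signal remains,
        # and it rises during ramp-up, triggering a premature stop.
        return conc_rising
```

With long contexts, `ttfts` stays empty for the whole minimum_duration window, so the fallback branch fires on every check.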

The TTFT value is available much earlier — the backend knows it as soon as the first token arrives during streaming — but it is not communicated back to the main process until the request completes.

Proposed Solution

When over-saturation detection is enabled, send a bounded number of early TTFT notifications from the worker process when the first token arrives during streaming, before the request completes. This gives the constraint real TTFT data to make a two-signal decision instead of falling back to concurrent slope alone.

The count of early notifications should be configurable and small (e.g., 5 per worker) to limit overhead on the multi-process results queue. When over-saturation detection is not enabled, no extra notifications are sent and behavior is unchanged.
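A rough sketch of the worker-side change follows. The function and queue names are hypothetical stand-ins for the real worker loop and inter-process results queue; the shape of the fix is what matters: emit TTFT at first-token time, capped per worker, and still report the full result on completion.

```python
import time


def stream_request(tokens, results, detect_saturation: bool,
                   early_budget: list, max_early_ttft: int = 5):
    """Stream one request's tokens, emitting a bounded early TTFT notice.

    `results` stands in for the multi-process results queue (anything with
    a ``put`` method); `early_budget` is a single-element counter shared
    across the worker's requests so the cap applies per worker.
    """
    start = time.monotonic()
    first_token_seen = False
    for _token in tokens:  # tokens arrive incrementally while streaming
        if not first_token_seen:
            first_token_seen = True
            ttft = time.monotonic() - start
            # Send TTFT as soon as the first token arrives, but cap the
            # number of early notices to limit multi-process queue overhead.
            if detect_saturation and early_budget[0] < max_early_ttft:
                early_budget[0] += 1
                results.put(("early_ttft", ttft))
    # The full result is still reported on completion, as it is today.
    results.put(("completed", time.monotonic() - start))
```

With this in place, the constraint receives real TTFT samples well within the minimum_duration window and can make the intended two-signal decision.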

Alternatives Considered

No response

Usage Examples

Additional Context

No response
