System-aware P/D decision making to adapt to workload shifts #611

@RishabhSaini

What would you like to be added:
Enhance the prefill/decode (P/D) decision logic to consider system state (queue depths, worker load, predicted latencies, etc.) rather than only the non-cached token count. The scheduler should be able to incorporate these dynamic metrics when deciding whether to use disaggregated prefill/decode for a request.
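To make the request concrete, here is a minimal sketch of what such a decision could look like. It is not the scheduler's actual API: `SystemState`, `should_disaggregate`, every field name, and all default limits are hypothetical and chosen only for illustration. The existing token threshold is kept as a floor, and the dynamic metrics act as additional gates.

```python
from dataclasses import dataclass


@dataclass
class SystemState:
    """Hypothetical snapshot of metrics the scheduler could observe."""
    prefill_queue_depth: int          # requests waiting on prefill workers
    prefill_utilization: float        # 0.0 to 1.0, load on prefill workers
    predicted_pd_latency_s: float     # predicted latency via disaggregated P/D
    predicted_local_latency_s: float  # predicted latency if the decode worker prefills


def should_disaggregate(
    non_cached_tokens: int,
    state: SystemState,
    token_threshold: int = 4096,           # assumed value, stands in for today's static threshold
    max_prefill_queue: int = 8,            # assumed limit
    max_prefill_utilization: float = 0.9,  # assumed limit
) -> bool:
    """Return True if the request should use disaggregated prefill/decode."""
    # Short prompts are not worth the extra hop (current behavior).
    if non_cached_tokens < token_threshold:
        return False
    # Back off when prefill workers are already saturated.
    if state.prefill_queue_depth > max_prefill_queue:
        return False
    if state.prefill_utilization > max_prefill_utilization:
        return False
    # Otherwise pick the path with the lower predicted latency.
    return state.predicted_pd_latency_s < state.predicted_local_latency_s


healthy = SystemState(2, 0.5, 1.2, 2.0)
overloaded = SystemState(20, 0.97, 1.2, 2.0)
print(should_disaggregate(10_000, healthy))     # True
print(should_disaggregate(10_000, overloaded))  # False
```

The same 10k-token request is routed differently depending on the state of the prefill pool, which is the behavior a token threshold alone cannot express.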

Why is this needed:
The current token-threshold-based decision cannot adapt to changes in provisioning or workload. For example, if the workload shifts from 10k to 20k prompt tokens and the prefill workers become overloaded, the scheduler should automatically reduce P/D usage to avoid a bottleneck. System-aware decisions would enable this kind of dynamic adaptation without manual threshold tuning.
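The 10k to 20k example can be worked through with a rough capacity calculation. The sketch below is only an illustration under assumed numbers: the request rate, worker count, per-worker prefill throughput, and target utilization are all made up, and `pd_fraction` is a hypothetical helper, not part of the scheduler. It estimates what share of requests the prefill pool can absorb before exceeding a utilization target.

```python
def pd_fraction(
    request_rate: float,               # requests per second (assumed)
    prompt_tokens: int,                # non-cached prompt tokens per request
    prefill_workers: int,              # number of prefill workers (assumed)
    tokens_per_sec_per_worker: float,  # prefill throughput per worker (assumed)
    target_utilization: float = 0.8,   # assumed headroom target
) -> float:
    """Fraction of requests that can use P/D without overloading prefill workers."""
    demand = request_rate * prompt_tokens  # prefill tokens/s requested
    if demand <= 0:
        return 1.0  # no load, nothing to throttle
    budget = target_utilization * prefill_workers * tokens_per_sec_per_worker
    return min(1.0, budget / demand)


# Same provisioning, workload shifts from 10k to 20k prompt tokens.
before = pd_fraction(4, 10_000, prefill_workers=2, tokens_per_sec_per_worker=25_000)
after = pd_fraction(4, 20_000, prefill_workers=2, tokens_per_sec_per_worker=25_000)
print(f"P/D share at 10k tokens: {before:.2f}")
print(f"P/D share at 20k tokens: {after:.2f}")
```

With these assumed numbers, the prefill pool can serve essentially all requests at 10k tokens but only about half of them at 20k. A static token threshold would keep sending every request to P/D in both cases, so any adaptation would have to come from retuning the threshold by hand.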

Metadata

Assignees

No one assigned

    Labels

    needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.

    Type

    No type

    Projects

    Status

    Backlog

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests