Description
Problem Statement
The llm-d inference scheduler relies on multiple scoring plugins to select a target pod for each inference request. These scorers generally fall into two conceptual families:
- Distributive scorers
  - Aim to evenly distribute load across pods
  - Reduce hotspotting and compute underutilization
  - Commonly based on queue depth, in-flight requests, or utilization signals
- Sticky scorers
  - Aim to preserve execution locality
  - Increase scheduling stickiness to reduce unnecessary request movement
  - Often leverage pod-level affinity or reuse signals
The scheduler must balance these two families to achieve optimal throughput and latency.
Today, this balance is controlled through statically configured scorer weights. However:
- Static weights are difficult to tune because scorer outputs differ in variance even after normalization
- Optimal weighting depends on runtime cluster conditions
- Static configuration does not adapt to heterogeneous or time-varying workloads
Proposal
Introduce automatic scorer plugin weighting as an optional scheduler capability.
When enabled, the scheduler dynamically adjusts the relative influence of distributive and sticky scorer families at runtime based on observed cluster load conditions.
Automatic Weighting Behavior
- Scorers are grouped into distributive and sticky families
- The scheduler observes load imbalance signals, such as:
  - Queue depth variance
  - In-flight request skew
  - Utilization imbalance
- When load imbalance increases:
  - Distributive scorers are up-weighted to spread traffic
- When load is balanced:
  - Weighting converges toward an equilibrium between the two families
- Adjustments are gradual and bounded to prevent oscillation
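The behavior above could be sketched roughly as follows. This is an illustration only, not the scheduler's implementation: the function names, the use of the coefficient of variation of queue depth as the imbalance signal, and the `max_boost` and `rate` parameters are all assumptions made for this example.

```python
from statistics import mean, pstdev


def imbalance(queue_depths):
    """Hypothetical imbalance signal: coefficient of variation of
    per-pod queue depth, clamped to [0, 1]."""
    if not queue_depths:
        return 0.0
    avg = mean(queue_depths)
    if avg == 0:
        return 0.0
    return min(pstdev(queue_depths) / avg, 1.0)


def step_distributive_share(current, baseline, signal, max_boost=0.3, rate=0.2):
    """Move the distributive family's share one smoothed step toward its target.

    The target is bounded (baseline + max_boost at most, capped at 1.0) and the
    step is gradual (a fraction `rate` of the remaining gap). Both values are
    assumed here, not taken from the proposal.
    """
    target = min(baseline + max_boost * signal, 1.0)
    return current + rate * (target - current)


# Balanced load: the share stays at the baseline equilibrium.
balanced = step_distributive_share(0.5, 0.5, imbalance([4, 4, 4, 4]))

# Skewed load: the share moves part of the way toward the bounded target.
skewed = step_distributive_share(0.5, 0.5, imbalance([0, 0, 0, 16]))
```

The sticky family's share would be `1 - share` in this sketch. Because each step closes only a fraction of the gap to a bounded target, the share approaches the target without overshooting it, and it drifts back to the baseline once the signal returns to zero.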
Configuration
- A new configuration field enables or disables automatic weighting
- When enabled:
  - An equal aggregation of the two families sets the starting weights
    - E.g., a configuration with prefix-cache-scorer (sticky), queue-scorer (distributive), and kv-cache-utilization-scorer (distributive) would start at 2:1:1.
  - Runtime adjustments are applied relative to these baselines
- Optional: users may bias the equilibrium point by choosing non-equal baseline weights (in practice, the user-configured weights become the baseline)
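As a sketch of how the equal-aggregation baseline might be derived, the snippet below splits the total weight evenly across families and then evenly across the scorers within each family. The function name `baseline_weights` and the optional `family_share` override are hypothetical; the proposal does not specify how weights are divided among scorers in the same family.

```python
def baseline_weights(scorer_families, family_share=None):
    """Compute starting weights from a scorer -> family mapping.

    By default each family receives an equal share of the total weight,
    divided evenly among its scorers (an assumption of this sketch).
    """
    families = {}
    for scorer, family in scorer_families.items():
        families.setdefault(family, []).append(scorer)

    if family_share is None:
        family_share = {family: 1.0 / len(families) for family in families}

    weights = {}
    for family, scorers in families.items():
        for scorer in scorers:
            weights[scorer] = family_share[family] / len(scorers)
    return weights


weights = baseline_weights({
    "prefix-cache-scorer": "sticky",
    "queue-scorer": "distributive",
    "kv-cache-utilization-scorer": "distributive",
})
# weights are 0.5, 0.25, 0.25, i.e. the 2:1:1 ratio from the example above
```

Passing a non-uniform `family_share` (for example `{"sticky": 0.6, "distributive": 0.4}`) would correspond to biasing the equilibrium point.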