[RFC] Automatic Scorer-Plugin Weighting #547

@vMaroon

Problem Statement

The llm-d inference scheduler relies on multiple scoring plugins to select a target pod for each inference request. These scorers generally fall into two conceptual families:

  • Distributive scorers

    • Aim to evenly distribute load across pods
    • Reduce hotspotting and compute underutilization
    • Commonly based on queue depth, in-flight requests, or utilization signals
  • Sticky scorers

    • Aim to preserve execution locality
    • Increase scheduling stickiness to reduce unnecessary request movement
    • Often leverage pod-level affinity or reuse signals

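The scheduler must balance these two families to achieve good throughput and latency: leaning too far toward distribution gives up locality, while leaning too far toward stickiness creates hotspots.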

Today, this balance is controlled through statically configured scorer weights. However:

  • Static weights are difficult to tune: even after normalization, scorers differ in how widely their scores vary, so equal weights do not imply equal influence
  • Optimal weighting depends on runtime cluster conditions
  • Static configuration does not adapt to heterogeneous or time-varying workloads
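To illustrate the variance point, here is a minimal sketch that assumes the final pod score is a weighted sum of per-scorer scores in [0, 1]. The pod names, score values, and the `pick` helper are hypothetical and exist only for this example.

```python
def pick(scores_by_scorer, weights):
    """Return (best_pod, totals) under a weighted sum of scorer outputs.

    Assumption: scores are already normalized to [0, 1] per scorer.
    """
    pods = next(iter(scores_by_scorer.values())).keys()
    totals = {
        pod: sum(weights[name] * scores_by_scorer[name][pod] for name in weights)
        for pod in pods
    }
    return max(totals, key=totals.get), totals


# Hypothetical snapshot: the sticky scorer is nearly binary (wide spread),
# while the distributive scorer separates the pods only slightly (narrow spread).
scores = {
    "sticky": {"pod-a": 1.0, "pod-b": 0.0},
    "distributive": {"pod-a": 0.45, "pod-b": 0.55},
}

equal_choice, _ = pick(scores, {"sticky": 1, "distributive": 1})
heavy_choice, _ = pick(scores, {"sticky": 1, "distributive": 9})
flipped_choice, _ = pick(scores, {"sticky": 1, "distributive": 11})
```

In this snapshot the distributive scorer needs more than ten times the sticky weight before it changes the outcome, even though both scorers are normalized to the same range.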

Proposal

Introduce automatic scorer plugin weighting as an optional scheduler capability.

When enabled, the scheduler dynamically adjusts the relative influence of distributive and sticky scorer families at runtime based on observed cluster load conditions.

Automatic Weighting Behavior

  • Scorers are grouped into distributive and sticky families
  • The scheduler observes load imbalance signals, such as:
    • Queue depth variance
    • In-flight request skew
    • Utilization imbalance
  • When load imbalance increases:
    • Distributive scorers are up-weighted to spread traffic
  • When load is balanced:
    • Weighting converges toward an equilibrium between the two families
  • Adjustments are gradual and bounded to prevent oscillation
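The behavior above can be sketched as a small control loop. This is an illustration under stated assumptions, not the proposed implementation: the imbalance signal is taken to be the coefficient of variation of per-pod queue depth, and `equilibrium`, `max_share`, and `gain` are hypothetical parameters that the RFC does not define.

```python
from statistics import mean, pstdev


def imbalance(queue_depths):
    """Coefficient of variation of per-pod queue depth, clamped to [0, 1].

    Assumes a non-empty list. Returns 0.0 when every queue is empty.
    """
    avg = mean(queue_depths)
    if avg == 0:
        return 0.0
    return min(pstdev(queue_depths) / avg, 1.0)


def step_distributive_share(current, signal, equilibrium=0.5, max_share=0.9, gain=0.2):
    """Move the distributive family's share part of the way toward its target.

    The target slides from `equilibrium` (balanced load) to `max_share`
    (maximal imbalance), which bounds the share. Each step covers only a
    `gain` fraction of the remaining gap, which keeps changes gradual.
    The sticky family's share is 1 - result.
    """
    target = equilibrium + (max_share - equilibrium) * signal
    return current + gain * (target - current)


# Hypothetical run: one skewed observation, then the load evens out.
share = 0.5
share = step_distributive_share(share, imbalance([0, 0, 0, 16]))
after_skew = share
for _ in range(50):
    share = step_distributive_share(share, imbalance([4, 4, 4, 4]))
after_recovery = share
```

Exponential smoothing toward a clamped target is only one way to meet the "gradual and bounded" requirement; a per-step cap on the change or a dead band around the equilibrium would serve the same purpose.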

Configuration

  • A new configuration field enables or disables automatic weighting
  • When enabled:
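    • Starting (baseline) weights are chosen so that the two families carry equal total weight
      • E.g., a configuration with prefix-cache-scorer (sticky), queue-scorer (distributive) and kv-cache-utilization-scorer (distributive) would start at 2:1:1, so the single sticky scorer (2) matches the two distributive scorers combined (1 + 1).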
    • Runtime adjustments are applied relative to these baselines
  • Optional: users may bias the equilibrium point by choosing non-equal baseline weights; in practice, the user-configured weights then serve as the baseline
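The equal-aggregation baseline can be computed as follows. This is a sketch: the scorer names come from the example above, but the `baseline_weights` helper and the way each scorer is tagged with a family are assumptions, since the RFC does not specify how family membership is declared.

```python
from collections import Counter
from math import lcm  # accepts multiple arguments on Python 3.9+


def baseline_weights(scorer_families):
    """Return the smallest integer weights that give every family the same total.

    scorer_families maps scorer name -> family label (hypothetical tagging).
    Each scorer gets lcm(family sizes) / size of its own family.
    """
    sizes = Counter(scorer_families.values())
    total_per_family = lcm(*sizes.values())
    return {
        name: total_per_family // sizes[family]
        for name, family in scorer_families.items()
    }


weights = baseline_weights({
    "prefix-cache-scorer": "sticky",
    "queue-scorer": "distributive",
    "kv-cache-utilization-scorer": "distributive",
})
# weights == {"prefix-cache-scorer": 2, "queue-scorer": 1, "kv-cache-utilization-scorer": 1}
```

When the user supplies non-equal weights, this step would be skipped and the configured values used as the baseline directly.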
