[RFC] Automatic Scorer-Plugin Weighting

## Problem Statement

The llm-d inference scheduler relies on multiple scoring plugins to select a target pod for each inference request. These scorers generally fall into two conceptual families:

- **Distributive scorers**
  - Aim to evenly distribute load across pods
  - Reduce hotspotting and compute underutilization
  - Commonly based on queue depth, in-flight requests, or utilization signals

- **Sticky scorers**
  - Aim to preserve execution locality
  - Increase scheduling stickiness to reduce unnecessary request movement
  - Often leverage pod-level affinity or reuse signals

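The scheduler must balance these two families to achieve good throughput and latency: over-weighting distributive scorers sacrifices locality, while over-weighting sticky scorers risks hotspots.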

Today, this balance is controlled through statically configured scorer weights. However:

- Static weights are difficult to tune: even after normalization, scorers differ in variance, so equal weights do not translate into equal influence
- Optimal weighting depends on runtime cluster conditions
- Static configuration does not adapt to heterogeneous or time-varying workloads
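
To illustrate the first point, here is a minimal Go sketch with made-up scores. The `pickPod` helper is hypothetical and is not the scheduler's actual API; it simply picks the pod with the highest weighted sum. Both scorers emit values in [0, 1], but one has a much wider spread than the other.

```go
package main

import "fmt"

// pickPod returns the index of the pod with the highest weighted sum of scores.
// scores[s][p] is the normalized score that scorer s assigns to pod p.
// Hypothetical helper for illustration only; not the llm-d scheduler API.
func pickPod(scores [][]float64, weights []float64) int {
	best, bestTotal := -1, 0.0
	for p := range scores[0] {
		total := 0.0
		for s := range scores {
			total += weights[s] * scores[s][p]
		}
		if best == -1 || total > bestTotal {
			best, bestTotal = p, total
		}
	}
	return best
}

func main() {
	// Made-up, normalized scores for three pods.
	queue := []float64{0.52, 0.50, 0.48}  // narrow spread: pods look alike
	prefix := []float64{0.00, 0.10, 0.90} // wide spread: one pod has a cache hit
	scores := [][]float64{queue, prefix}

	fmt.Println(pickPod(scores, []float64{1, 1}))  // equal weights
	fmt.Println(pickPod(scores, []float64{30, 1})) // queue scorer heavily up-weighted
}
```

With equal weights the wide-spread scorer decides the outcome (pod 2). In this made-up data the narrow-spread scorer only wins once its weight exceeds roughly 22 times the other's, which is the kind of non-obvious ratio that makes static tuning hard.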

## Proposal

Introduce **automatic scorer plugin weighting** as an optional scheduler capability.

When enabled, the scheduler dynamically adjusts the relative influence of distributive and sticky scorer families at runtime based on observed cluster load conditions.

### Automatic Weighting Behavior

- Scorers are grouped into distributive and sticky families
- The scheduler observes load imbalance signals, such as:
  - Queue depth variance
  - In-flight request skew
  - Utilization imbalance
- When load imbalance increases:
  - Distributive scorers are up-weighted to spread traffic
- When load is balanced:
  - Weighting converges toward an equilibrium between the two families
- Adjustments are gradual and bounded to prevent oscillation
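
The behavior above could be sketched as follows. This is one possible shape, not a proposed implementation: the `Tuner` type, its fields, the use of the coefficient of variation of queue depth as the imbalance signal, and all numeric values are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// Tuner holds hypothetical tuning parameters; none of these names or
// values come from the RFC.
type Tuner struct {
	Equilibrium float64 // distributive share of total weight when load is balanced
	Gain        float64 // how strongly imbalance raises the target share
	Min, Max    float64 // hard bounds on the distributive share
	MaxStep     float64 // largest change allowed per update
	Share       float64 // current distributive share (sticky share is 1 - Share)
}

func clamp(v, lo, hi float64) float64 {
	return math.Max(lo, math.Min(hi, v))
}

// imbalance returns the coefficient of variation (stddev / mean) of the
// per-pod queue depths. It returns 0 when there are no pods or no queued work.
func imbalance(queueDepths []float64) float64 {
	n := float64(len(queueDepths))
	if n == 0 {
		return 0
	}
	mean := 0.0
	for _, d := range queueDepths {
		mean += d
	}
	mean /= n
	if mean == 0 {
		return 0
	}
	variance := 0.0
	for _, d := range queueDepths {
		variance += (d - mean) * (d - mean)
	}
	return math.Sqrt(variance/n) / mean
}

// Update moves Share one bounded step toward the target implied by the
// current imbalance and returns the new share.
func (t *Tuner) Update(queueDepths []float64) float64 {
	target := clamp(t.Equilibrium+t.Gain*imbalance(queueDepths), t.Min, t.Max)
	delta := clamp(target-t.Share, -t.MaxStep, t.MaxStep)
	t.Share = clamp(t.Share+delta, t.Min, t.Max)
	return t.Share
}

func main() {
	tuner := &Tuner{Equilibrium: 0.5, Gain: 0.5, Min: 0.2, Max: 0.8, MaxStep: 0.05, Share: 0.5}
	fmt.Printf("%.2f\n", tuner.Update([]float64{10, 0, 0, 2})) // skewed load: share rises by one step
	fmt.Printf("%.2f\n", tuner.Update([]float64{3, 3, 3, 3}))  // balanced load: share steps back
}
```

Capping the per-update step and clamping the share are what keep adjustments gradual and bounded in this sketch; a real design might instead use smoothing or hysteresis.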

### Configuration

- A new configuration field enables or disables automatic weighting
- When enabled:
  - Starting weights are derived by giving each family an equal aggregate weight, split evenly among that family's scorers
    - E.g., a configuration with `prefix-cache-scorer` (sticky), `queue-scorer` (distributive), and `kv-cache-utilization-scorer` (distributive) would start at 2:1:1, respectively.
  - Runtime adjustments are applied relative to these baselines
- Optional: users may bias the equilibrium point by configuring non-equal weights; in that case, the user-configured weights serve as the baseline
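
The default baseline can be reproduced with a short Go sketch. It assumes that "equal aggregation" means each family receives the same total weight, divided evenly among its scorers, which is consistent with the 2:1:1 example. The `Family` type and `baselineWeights` function are hypothetical names, not part of the scheduler.

```go
package main

import "fmt"

// Family labels a scorer as sticky or distributive (hypothetical type).
type Family string

const (
	Sticky       Family = "sticky"
	Distributive Family = "distributive"
)

// baselineWeights gives every family a total weight of 1 and splits it
// evenly among the scorers in that family. This is an assumed reading of
// the RFC's "equal aggregation" rule.
func baselineWeights(scorers map[string]Family) map[string]float64 {
	// Count how many scorers belong to each family.
	counts := make(map[Family]int)
	for _, fam := range scorers {
		counts[fam]++
	}
	// Each scorer receives an equal slice of its family's total.
	weights := make(map[string]float64, len(scorers))
	for name, fam := range scorers {
		weights[name] = 1.0 / float64(counts[fam])
	}
	return weights
}

func main() {
	w := baselineWeights(map[string]Family{
		"prefix-cache-scorer":         Sticky,
		"queue-scorer":                Distributive,
		"kv-cache-utilization-scorer": Distributive,
	})
	fmt.Println(w["prefix-cache-scorer"], w["queue-scorer"], w["kv-cache-utilization-scorer"])
}
```

The resulting weights of 1, 0.5, and 0.5 are the 2:1:1 ratio from the example; only the ratio matters if scores are combined as a weighted sum.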


[RFC] Automatic Scorer-Plugin Weighting #547
