
Extended Toleration Operators for Threshold-Based Placement #5471

@helayoty

Enhancement Description

Many production Kubernetes clusters blend on-demand (higher-SLA) and spot/preemptible (lower-SLA) nodes to reduce cost while keeping critical workloads reliable. Platform teams need a safe default that keeps most workloads away from risky capacity, while allowing specific workloads to opt in with explicit thresholds such as "SLA ≥ 95%".

Currently, NodeAffinity supports numeric comparisons (Gt and Lt) but lacks the operational benefits that taints/tolerations provide:

  • Policy orientation: NodeAffinity is per-pod; keeping most pods away from low-SLA nodes requires editing every workload. Taints invert control: nodes declare risk, and only pods with matching tolerations may land on them.
  • Eviction semantics: Affinity has no eviction capability. Taints support NoExecute with tolerationSeconds, enabling operators to drain/evict pods when a node's SLA degrades or spot instances are reclaimed.
  • Operational ergonomics: Centralized, node-side policy is consistent with other safety taints (e.g., disk-pressure, memory-pressure).

This enhancement extends core/v1 Toleration to support numeric comparison operators (Lt, Le, Ge, Gt) when matching Node Taints. This preserves the well-understood safety model of taints/tolerations while enabling threshold-based placement for SLA-aware scheduling.
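To make the proposed matching concrete, here is a minimal sketch in Go. It is not the actual core/v1 or scheduler code: the `Taint` and `Toleration` structs are simplified local stand-ins, the taint key `example.com/sla` is hypothetical, and the comparison direction (the taint value is the left operand, the toleration value is the threshold) and integer-only values are assumptions this issue does not specify.

```go
package main

import (
	"fmt"
	"strconv"
)

// Taint and Toleration are simplified stand-ins for the core/v1 types.
type Taint struct {
	Key, Value, Effect string
}

type Toleration struct {
	Key, Operator, Value, Effect string
}

// compare applies a numeric operator with the taint value on the left.
// Assumption: "Ge 95" means "tolerate taints whose value is >= 95".
func compare(op string, taintVal, tolVal int64) bool {
	switch op {
	case "Lt":
		return taintVal < tolVal
	case "Le":
		return taintVal <= tolVal
	case "Ge":
		return taintVal >= tolVal
	case "Gt":
		return taintVal > tolVal
	}
	return false
}

// tolerates reports whether tol tolerates taint under the sketched rules.
func tolerates(tol Toleration, taint Taint) bool {
	// An empty effect or key on the toleration acts as a wildcard.
	if tol.Effect != "" && tol.Effect != taint.Effect {
		return false
	}
	if tol.Key != "" && tol.Key != taint.Key {
		return false
	}
	switch tol.Operator {
	case "", "Equal":
		return tol.Value == taint.Value
	case "Exists":
		return true
	case "Lt", "Le", "Ge", "Gt":
		taintVal, errTaint := strconv.ParseInt(taint.Value, 10, 64)
		tolVal, errTol := strconv.ParseInt(tol.Value, 10, 64)
		if errTaint != nil || errTol != nil {
			// Assumption: non-numeric values never match a numeric operator.
			return false
		}
		return compare(tol.Operator, taintVal, tolVal)
	}
	return false
}

func main() {
	taint := Taint{Key: "example.com/sla", Value: "97", Effect: "NoSchedule"}
	optIn := Toleration{Key: "example.com/sla", Operator: "Ge", Value: "95", Effect: "NoSchedule"}
	strict := Toleration{Key: "example.com/sla", Operator: "Ge", Value: "99", Effect: "NoSchedule"}
	fmt.Println(tolerates(optIn, taint))  // true: 97 >= 95
	fmt.Println(tolerates(strict, taint)) // false: 97 < 99
}
```

Treating unparseable values as a non-match keeps the sketch fail-closed, which fits the safety-first intent of taints; a real implementation might instead reject such values at API validation time.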

Benefits for DRA and AI Workloads

  • Cost-reliability optimization: Bind resource claims to reliability tiers via taints with opt-in tolerations
  • Stage-aware placement: Different pipeline stages can tolerate different risk levels explicitly
  • Resilience after preemption: Use NoExecute/tolerationSeconds for graceful drain and controlled failover
  • Multi-tenant fairness: Prevent monopolization of high-SLA resources by requiring explicit tolerations
  • Smooth burst handling: Allow bursts to land on low-SLA pools with clear safety boundaries

The scheduler impact is limited to the existing TaintToleration Filter; no new scheduling stages or algorithms are required.
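The filter-side change can be pictured roughly as follows. This is a simplified Go sketch, not the scheduler plugin itself: the types, the `firstUntolerated` helper, the `example.com/sla` key, and the node names are all hypothetical, and only the assumed `Ge` operator is shown alongside the existing `Exists` and `Equal` behavior. The point is that the per-node loop keeps its shape and only the per-taint match gains a numeric branch.

```go
package main

import (
	"fmt"
	"strconv"
)

type Taint struct{ Key, Value, Effect string }
type Toleration struct{ Key, Operator, Value string }

// matches is a narrow matcher: Exists, the assumed numeric Ge, else Equal.
func matches(tol Toleration, taint Taint) bool {
	if tol.Key != taint.Key {
		return false
	}
	switch tol.Operator {
	case "Exists":
		return true
	case "Ge":
		taintVal, errTaint := strconv.ParseInt(taint.Value, 10, 64)
		tolVal, errTol := strconv.ParseInt(tol.Value, 10, 64)
		return errTaint == nil && errTol == nil && taintVal >= tolVal
	default:
		return tol.Value == taint.Value
	}
}

// firstUntolerated returns the first NoSchedule/NoExecute taint that no
// toleration matches; PreferNoSchedule taints do not block placement.
func firstUntolerated(taints []Taint, tols []Toleration) (Taint, bool) {
	for _, taint := range taints {
		if taint.Effect != "NoSchedule" && taint.Effect != "NoExecute" {
			continue
		}
		tolerated := false
		for _, tol := range tols {
			if matches(tol, taint) {
				tolerated = true
				break
			}
		}
		if !tolerated {
			return taint, true
		}
	}
	return Taint{}, false
}

func main() {
	// A pod that opts in to spot capacity with SLA >= 95.
	tols := []Toleration{{Key: "example.com/sla", Operator: "Ge", Value: "95"}}
	nodes := []struct {
		name   string
		taints []Taint
	}{
		{"spot-low", []Taint{{Key: "example.com/sla", Value: "90", Effect: "NoSchedule"}}},
		{"spot-high", []Taint{{Key: "example.com/sla", Value: "97", Effect: "NoSchedule"}}},
		{"on-demand", nil},
	}
	for _, n := range nodes {
		if taint, blocked := firstUntolerated(n.taints, tols); blocked {
			fmt.Printf("%s: rejected by taint %s=%s\n", n.name, taint.Key, taint.Value)
		} else {
			fmt.Printf("%s: feasible\n", n.name)
		}
	}
}
```

Under these assumptions the pod is rejected only on `spot-low`, while a pod with no tolerations would be rejected on both spot pools and remain schedulable on the untainted `on-demand` node.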

/sig/scheduling
/sig/node
/stage/alpha

/cc @ahg-g @alculquicondor @johnbelamaric @sanposhiho @kubernetes/sig-scheduling-misc
