
Add Native Support for Kubernetes Workload API to Enable Gang Scheduling #275

@jasonliu747

What you would like to be added?

We would like Grove to add first-class support for Kubernetes’ new Workload API (kubernetes/enhancements#4671) as a gang scheduling backend.

Concretely:

  • Convert PodCliqueSet into a native Workload
  • Attach workloadRef to generated Pods
  • Use upstream kube-scheduler’s native gang admission logic
  • Provide a configuration option to select the backend (e.g., schedulerMode: workload or schedulerMode: podgang)
  • Allow Grove workflows to operate fully without depending on KAI Scheduler
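
The first two bullets can be sketched as follows. This is only an illustration under stated assumptions: the Clique, PodGroup, and Workload classes are hypothetical stand-ins, not Grove's or Kubernetes' real types, the one-group-per-clique mapping is an assumption, and the field names (min_count, workloadRef, podGroup) are loosely modeled on the Workload proposal and should be checked against the upstream API.

```python
from dataclasses import dataclass, field


@dataclass
class Clique:
    """Hypothetical stand-in for one role inside a PodCliqueSet."""
    name: str
    replicas: int
    min_available: int  # assumed gang size for this role


@dataclass
class PodGroup:
    """Hypothetical gang unit inside a Workload."""
    name: str
    min_count: int  # pods that must be schedulable together


@dataclass
class Workload:
    """Hypothetical, simplified Workload object."""
    name: str
    pod_groups: list[PodGroup] = field(default_factory=list)


def to_workload(set_name: str, cliques: list[Clique]) -> Workload:
    """Map each clique to one gang pod group (an assumed 1:1 mapping)."""
    groups = [PodGroup(name=c.name, min_count=c.min_available) for c in cliques]
    return Workload(name=set_name, pod_groups=groups)


def build_pods(workload: Workload, cliques: list[Clique]) -> list[dict]:
    """Generate pod stubs that carry a workloadRef back to their pod group."""
    pods = []
    for c in cliques:
        for i in range(c.replicas):
            pods.append({
                "name": f"{workload.name}-{c.name}-{i}",
                "workloadRef": {"name": workload.name, "podGroup": c.name},
            })
    return pods


cliques = [
    Clique("prefill", replicas=2, min_available=2),
    Clique("decode", replicas=4, min_available=3),
]
workload = to_workload("llm-serving", cliques)
pods = build_pods(workload, cliques)
```

Keeping the conversion a pure function of the PodCliqueSet spec would let the same controller emit either a Workload or a PodGang from one source of truth.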

This would allow Grove to orchestrate multi-role workloads using kube-scheduler + Workload as the underlying gang scheduler.
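
One possible shape for the backend switch is sketched below. The schedulerMode key and its two values come from this proposal; the parsing helper, the SchedulerMode enum, and the choice of podgang as the default (to keep existing deployments unchanged) are assumptions made for illustration.

```python
from enum import Enum


class SchedulerMode(Enum):
    WORKLOAD = "workload"  # kube-scheduler + Workload
    PODGANG = "podgang"    # existing PodGang-based path


def parse_scheduler_mode(config: dict) -> SchedulerMode:
    """Read schedulerMode from a config mapping, defaulting to podgang (assumed)."""
    raw = config.get("schedulerMode", SchedulerMode.PODGANG.value)
    try:
        return SchedulerMode(raw)
    except ValueError:
        allowed = ", ".join(m.value for m in SchedulerMode)
        raise ValueError(
            f"schedulerMode must be one of: {allowed}; got {raw!r}"
        ) from None


mode = parse_scheduler_mode({"schedulerMode": "workload"})
```

Rejecting unknown values at startup, rather than silently falling back, makes a mistyped mode visible before any Pods are created.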

Why is this needed?

  1. Upstream Kubernetes is standardizing gang scheduling via the Workload API, and Grove should evolve alongside the Kubernetes ecosystem.
  2. Running additional schedulers like KAI is operationally expensive for large production clusters. Relying on the default kube-scheduler greatly simplifies adoption.
  3. Workload already provides the gang semantics Grove needs, including atomic admission and minMember guarantees.
  4. Training workloads (VCJob, PyTorchJob, TFJob, etc.) increasingly rely on Workload, especially when combined with Kueue. Native Workload support makes Grove far more practical for training-focused environments.
  5. Reducing scheduling stack fragmentation improves maintainability and encourages broader adoption of Grove in enterprise clusters.
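
The atomic admission and minMember semantics mentioned in point 3 can be illustrated with a toy all-or-nothing check. This is a conceptual sketch only, not kube-scheduler's actual algorithm; the admit_gang function and its slot-counting model are hypothetical.

```python
def admit_gang(free_slots: int, pending: int, min_member: int) -> int:
    """Return how many pods to place: at least min_member, or none at all."""
    placeable = min(free_slots, pending)
    # Gang semantics: a partial placement below min_member is rejected whole,
    # so no pod from the group holds resources while the rest stay pending.
    return placeable if placeable >= min_member else 0


placed_tight = admit_gang(free_slots=3, pending=4, min_member=4)
placed_roomy = admit_gang(free_slots=5, pending=4, min_member=4)
```

With three free slots and a gang of four, nothing is placed; once five slots are free, all four pods are admitted together.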

Metadata

Labels: enhancement (New feature or request)
