Enhance the rolling update approach for PodClique #291

@xulinfei1996

Description

What you would like to be added?

Allow a rolling update to dynamically change the nodeAffinity rules in a PodClique's podTemplate without triggering pod deletions or restarts (a non-disruptive update).
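For illustration, here is a minimal sketch of the kind of nodeAffinity rule such an update would carry. The structs are local stand-ins that mirror the JSON field names of the Kubernetes node-affinity API; they are not Grove types, and `excludeNodes` and the node name are hypothetical. It assumes faulty nodes are identified by the `kubernetes.io/hostname` label.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Local mirrors of the Kubernetes node-affinity fields (illustrative only,
// not imported from k8s.io/api and not part of Grove).
type NodeSelectorRequirement struct {
	Key      string   `json:"key"`
	Operator string   `json:"operator"`
	Values   []string `json:"values"`
}

type NodeSelectorTerm struct {
	MatchExpressions []NodeSelectorRequirement `json:"matchExpressions"`
}

type NodeSelector struct {
	NodeSelectorTerms []NodeSelectorTerm `json:"nodeSelectorTerms"`
}

type NodeAffinity struct {
	Required *NodeSelector `json:"requiredDuringSchedulingIgnoredDuringExecution,omitempty"`
}

// excludeNodes (hypothetical helper) builds a strict rule that keeps new
// pods off the given faulty nodes, matched by hostname label.
func excludeNodes(failed []string) NodeAffinity {
	return NodeAffinity{
		Required: &NodeSelector{
			NodeSelectorTerms: []NodeSelectorTerm{{
				MatchExpressions: []NodeSelectorRequirement{{
					Key:      "kubernetes.io/hostname",
					Operator: "NotIn",
					Values:   failed,
				}},
			}},
		},
	}
}

func main() {
	aff := excludeNodes([]string{"node-a"})
	out, err := json.MarshalIndent(aff, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

The request is that applying a rule like this to the podTemplate should not, by itself, restart pods that are already running on healthy nodes.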

Why is this needed?

Training workloads may encounter machine failures. The mitigation strategy includes:

  1. Automated/Manual Action: Adjust Job affinity to prevent scheduling new pods on failed nodes.
  2. Pod Recovery: Recreate affected pods with strict nodeAffinity rules against faulty nodes.
  3. Preservation: Unaffected pods continue running to minimize disruption.

Currently, in step 1, Grove deletes all pods during the rolling update, including unaffected ones.
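The desired behavior in the steps above can be sketched as a simple partition: only pods on failed nodes are recreated, and everything else keeps running. This is a rough sketch, not Grove's implementation; the `pod` type, `partitionPods`, and the pod and node names are all hypothetical.

```go
package main

import "fmt"

// pod is a hypothetical, simplified view of a running pod.
type pod struct {
	Name string
	Node string
}

// partitionPods splits pods into those that must be recreated (they run on
// a failed node) and those that should be left untouched.
func partitionPods(pods []pod, failedNodes map[string]bool) (recreate, keep []pod) {
	for _, p := range pods {
		if failedNodes[p.Node] {
			recreate = append(recreate, p)
		} else {
			keep = append(keep, p)
		}
	}
	return recreate, keep
}

func main() {
	pods := []pod{
		{Name: "worker-0", Node: "node-a"},
		{Name: "worker-1", Node: "node-b"},
		{Name: "worker-2", Node: "node-c"},
	}
	failed := map[string]bool{"node-b": true}

	recreate, keep := partitionPods(pods, failed)
	fmt.Println("recreate:", recreate)
	fmt.Println("keep:", keep)
}
```

With this split, the affinity change applies only to the recreated pods, while the pods in `keep` are never deleted.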

Labels: enhancement (New feature or request)

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions