What you would like to be added?
Allow a rolling update to dynamically update a PodClique's podTemplate nodeAffinity rules without triggering pod deletions or restarts (a non-disruptive update).
Why is this needed?
Training workloads may encounter machine failures. The mitigation strategy includes:
- Automated/Manual Action: Adjust Job affinity to prevent scheduling new pods on failed nodes.
- Pod Recovery: Recreate affected pods with strict nodeAffinity rules against faulty nodes.
- Preservation: Unaffected pods continue running to minimize disruption.
Currently, in the first step (adjusting affinity), Grove deletes all pods during the rolling update, including unaffected ones.
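For illustration, the kind of affinity rule involved in the first step might look like the sketch below. It builds a standard Kubernetes `nodeAffinity` fragment that excludes a set of nodes by hostname. The helper name `exclude_nodes_affinity` and the node names are hypothetical, and where exactly this fragment sits inside a PodClique spec is an assumption not shown here.

```python
import json


def exclude_nodes_affinity(faulty_nodes):
    """Build a pod-spec `affinity` fragment that keeps new pods off the given nodes.

    Hypothetical helper for illustration; only the Kubernetes field names
    (nodeAffinity, nodeSelectorTerms, matchExpressions, NotIn) are standard.
    """
    return {
        "nodeAffinity": {
            # "required" makes this a hard scheduling constraint.
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {
                                # Well-known node label carrying the hostname.
                                "key": "kubernetes.io/hostname",
                                "operator": "NotIn",
                                # Sorted so the output is deterministic.
                                "values": sorted(faulty_nodes),
                            }
                        ]
                    }
                ]
            }
        }
    }


if __name__ == "__main__":
    # Example node names are placeholders.
    print(json.dumps(exclude_nodes_affinity(["node-b", "node-a"]), indent=2))
```

With the requested behavior, applying such a rule would only need to affect pods that are recreated; pods already running on healthy nodes already satisfy it.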