[GREP] Enhance Gang Termination with Stuck Terminating Policy by kangclzjc · Pull Request #466 · ai-dynamo/grove

kangclzjc · 2026-03-02T11:32:28Z

What type of PR is this?

/kind feature

What this PR does / why we need it:

Grove's gang termination currently deletes and recreates PodCliques (and their pods) when MinAvailable is breached for longer than TerminationDelay. In environments where pods are constrained to a topology (e.g. the same rack), node or kubelet failures can leave pods stuck in a terminating state: the API server has set deletionTimestamp, but the kubelet never completes termination. Those pods are excluded from ready/scheduled counts, so MinAvailable is breached and gang termination runs. The stuck pods are still present, however, and can block or complicate cleanup and rescheduling. This GREP proposes a configurable enhancement: pods stuck in termination for longer than a user-configurable duration are either force-deleted (grace period zero) or orphaned, i.e. left in the cluster for the admin to handle. In both cases Grove treats them as gone for availability and reconciliation, allowing the gang to recover.

Which issue(s) this PR fixes:

Fixes #401

Special notes for your reviewer:

Does this PR introduce an API change?

Additional documentation e.g., enhancement proposals, usage docs, etc.:

Signed-off-by: kangclzjc <kangz@nvidia.com>

enhance gang termination by force delete or orphan pods

01a6e48

Signed-off-by: kangclzjc <kangz@nvidia.com>

kangclzjc requested review from Ronkahn21, gflarity, sanjaychatterjee, shayasoolin and unmarshall as code owners March 2, 2026 11:32

kangclzjc marked this pull request as draft March 2, 2026 11:32

kangclzjc added 2 commits March 3, 2026 13:15

clarify the issue

5ed6df4

Signed-off-by: kangclzjc <kangz@nvidia.com>

modify solution

8db085e

Signed-off-by: kangclzjc <kangz@nvidia.com>

kangclzjc marked this pull request as ready for review March 16, 2026 06:11

kangclzjc wants to merge 3 commits into ai-dynamo:main from kangclzjc:enhance_gang_termination
