[GREP] Enhance Gang Termination with Stuck Terminating Policy #466

Open
kangclzjc wants to merge 3 commits into ai-dynamo:main from kangclzjc:enhance_gang_termination

@kangclzjc
Contributor

What type of PR is this?

/kind feature

What this PR does / why we need it:

Today, Grove’s gang termination deletes and recreates PodCliques (and their pods) when MinAvailable is breached for longer than TerminationDelay. In environments where pods are constrained to a topology (e.g. the same rack), node or kubelet failures can leave pods stuck in a terminating state: the API server has set deletionTimestamp, but the kubelet never completes termination. Those pods are excluded from the ready/scheduled counts, so MinAvailable is breached and gang termination runs. However, the stuck pods still exist in the cluster and can block or complicate cleanup and rescheduling. This GREP proposes a configurable policy for pods that remain in termination for longer than a user-configurable duration: they are either force-deleted (grace period zero) or orphaned, meaning they are left in the cluster for the admin to handle. In both cases Grove treats them as gone for availability and reconciliation, which allows the gang to recover.
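To make the proposed behavior concrete, here is a minimal Go sketch of the per-pod decision described above. It is illustrative only and is not taken from the GREP or the Grove codebase: the `Pod` struct is a pared-down stand-in for `corev1.Pod`, and the names `StuckTerminatingAction`, `Decision`, and `decide` are hypothetical. It also assumes that "stuck" is measured as time elapsed since `deletionTimestamp`, which the GREP may define differently.

```go
package main

import (
	"fmt"
	"time"
)

// StuckTerminatingAction is a hypothetical policy enum; the GREP may use other names.
type StuckTerminatingAction string

const (
	ActionForceDelete StuckTerminatingAction = "ForceDelete"
	ActionOrphan      StuckTerminatingAction = "Orphan"
)

// Decision is what the controller would do with a single pod (hypothetical).
type Decision string

const (
	DecisionNone        Decision = "None"        // not terminating, or not yet past the threshold
	DecisionForceDelete Decision = "ForceDelete" // delete again with grace period zero
	DecisionOrphan      Decision = "Orphan"      // leave in the cluster, treat as gone
)

// Pod is a simplified stand-in for corev1.Pod with only the fields used here.
type Pod struct {
	Name              string
	DeletionTimestamp *time.Time
}

// decide classifies one pod. Assumption: a pod counts as stuck once the time
// since its DeletionTimestamp exceeds the user-configured threshold.
func decide(pod Pod, action StuckTerminatingAction, threshold time.Duration, now time.Time) Decision {
	if pod.DeletionTimestamp == nil {
		return DecisionNone
	}
	if now.Sub(*pod.DeletionTimestamp) <= threshold {
		return DecisionNone
	}
	if action == ActionForceDelete {
		return DecisionForceDelete
	}
	return DecisionOrphan
}

// exampleDecisions runs decide over three sample pods with a 5-minute threshold.
func exampleDecisions(action StuckTerminatingAction) []Decision {
	now := time.Date(2026, 1, 1, 12, 0, 0, 0, time.UTC)
	recent := now.Add(-1 * time.Minute)
	old := now.Add(-10 * time.Minute)
	pods := []Pod{
		{Name: "running"},
		{Name: "terminating", DeletionTimestamp: &recent},
		{Name: "stuck", DeletionTimestamp: &old},
	}
	out := make([]Decision, 0, len(pods))
	for _, p := range pods {
		out = append(out, decide(p, action, 5*time.Minute, now))
	}
	return out
}

func main() {
	fmt.Println(exampleDecisions(ActionForceDelete))
	fmt.Println(exampleDecisions(ActionOrphan))
}
```

In this sketch, any pod that receives `DecisionForceDelete` or `DecisionOrphan` would be excluded when Grove computes availability, so the gang can be recreated even if the pod object lingers.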

Which issue(s) this PR fixes:

Fixes #401

Special notes for your reviewer:

Does this PR introduce an API change?

Additional documentation e.g., enhancement proposals, usage docs, etc.:


Signed-off-by: kangclzjc <kangz@nvidia.com>
kangclzjc marked this pull request as ready for review on March 16, 2026 at 06:11
Development

Successfully merging this pull request may close these issues.

Enhance PCLQ gang termination