Conversation

@gflarity
Contributor

@gflarity gflarity commented Jan 19, 2026

What type of PR is this?

/kind feature
/kind bug
/kind api

What this PR does / why we need it:

This PR fixes and improves gang termination behavior in Grove with several key changes:

  1. Gang termination is now opt-in - terminationDelay no longer defaults to 4 hours. When nil, gang termination is disabled entirely for the PodCliqueSet.

  2. Only ready pods count for breach detection - Previously breach calculation considered scheduled pods; now it only counts ready pods, which more accurately reflects actual workload availability.

  3. PCSG-level terminationDelay override - Individual PodCliqueScalingGroups can now override the PCS-level termination delay, allowing finer-grained control over gang termination timing.

  4. Bug fix: nil pointer when base gang doesn't exist - Fixed a crash when the base gang no longer exists after gang termination.

Which issue(s) this PR fixes:

Fixes #277

Special notes for your reviewer:

  • Start with the design doc at docs/designs/gang-termination.md - it provides a high-level overview of the gang termination feature after these changes. Happy to update the code alongside any design doc feedback.
  • Comprehensive E2E tests have been added covering multiple gang termination scenarios (GT1-GT5)
  • Validation ensures PCSG-level terminationDelay can only be set when PCS-level terminationDelay is set

Does this PR introduce an API change?

Gang termination is now opt-in. The `terminationDelay` field no longer defaults to 4 hours; when it is not set, gang termination is disabled. To enable gang termination, explicitly set `spec.template.terminationDelay` on the PodCliqueSet. Administrators who were relying on gang termination at the PodCliqueSet or PodCliqueScalingGroup level after 4 hours should update their PodCliqueSets accordingly.

Added `WasAvailable` status field to PodClique to track whether a workload has ever reached its MinAvailable threshold. Gang termination only triggers for workloads that were previously available, preventing termination during initial creation and startup.
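A minimal sketch of the gating described above, assuming a simplified `cliqueState` struct and a `shouldTerminate` helper (both hypothetical, not the operator's actual types). It also assumes termination fires once the breach has lasted at least the delay; the exact comparison used in the PR is not shown here.

```go
package main

import (
	"fmt"
	"time"
)

// cliqueState is a hypothetical, simplified view of a PodClique's status.
type cliqueState struct {
	MinAvailable int
	ReadyPods    int           // only ready pods count toward availability
	WasAvailable bool          // true once MinAvailable has been reached at least once
	BreachedFor  time.Duration // how long ReadyPods has been below MinAvailable
}

// durationPtr is a small helper for building optional delays.
func durationPtr(d time.Duration) *time.Duration { return &d }

// shouldTerminate reports whether gang termination would trigger.
// A nil delay models an unset terminationDelay (gang termination disabled).
func shouldTerminate(s cliqueState, delay *time.Duration) bool {
	if delay == nil {
		return false // opt-in: no delay configured, never terminate
	}
	if !s.WasAvailable {
		return false // still starting up, not a breach
	}
	if s.ReadyPods >= s.MinAvailable {
		return false // enough ready pods, no breach
	}
	// Assumption: terminate once the breach has lasted at least the delay.
	return s.BreachedFor >= *delay
}

func main() {
	delay := durationPtr(10 * time.Minute)
	starting := cliqueState{MinAvailable: 3, ReadyPods: 1, WasAvailable: false, BreachedFor: time.Hour}
	degraded := cliqueState{MinAvailable: 3, ReadyPods: 1, WasAvailable: true, BreachedFor: time.Hour}

	fmt.Println(shouldTerminate(starting, delay)) // false: never became available
	fmt.Println(shouldTerminate(degraded, delay)) // true: was available, breached past the delay
	fmt.Println(shouldTerminate(degraded, nil))   // false: gang termination disabled
}
```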

Added `terminationDelay` field to PodCliqueScalingGroupConfig allowing per-PCSG override of the PCS-level termination delay.
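As a rough illustration of the override rule, the sketch below resolves the delay for one scaling group from the two optional values. The function name `effectiveDelay` and the use of plain `*time.Duration` (instead of the API's duration type) are assumptions made for this example.

```go
package main

import (
	"fmt"
	"time"
)

// durationPtr is a small helper for building optional delays.
func durationPtr(d time.Duration) *time.Duration { return &d }

// effectiveDelay returns the delay that applies to a scaling group and
// whether gang termination is enabled at all.
func effectiveDelay(pcsDelay, pcsgDelay *time.Duration) (time.Duration, bool) {
	if pcsDelay == nil {
		// PCS-level delay unset: gang termination is disabled for the whole PCS.
		// (Validation is described as rejecting a PCSG-level value in this state.)
		return 0, false
	}
	if pcsgDelay != nil {
		return *pcsgDelay, true // PCSG-level override wins
	}
	return *pcsDelay, true // fall back to the PCS-level value
}

func main() {
	pcs := durationPtr(4 * time.Hour)
	pcsg := durationPtr(30 * time.Minute)

	d, ok := effectiveDelay(nil, nil)
	fmt.Println(d, ok) // 0s false
	d, ok = effectiveDelay(pcs, nil)
	fmt.Println(d, ok) // 4h0m0s true
	d, ok = effectiveDelay(pcs, pcsg)
	fmt.Println(d, ok) // 30m0s true
}
```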

Additional documentation e.g., enhancement proposals, usage docs, etc.:

docs/designs/gang-termination.md - Comprehensive design document explaining gang termination behavior, configuration, and debugging

@gflarity gflarity added the kind/bug Categorizes issue or PR as related to a bug. label Jan 19, 2026
@gflarity gflarity added kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/enhancement Categorizes issue or PR as related to a new feature, enhancement or improvement labels Jan 19, 2026

@shayasoolin shayasoolin left a comment

Great job!

allErrs = append(allErrs, field.Forbidden(fldPath.Index(i).Child("terminationDelay"), "terminationDelay can only be set on PodCliqueScalingGroupConfig when PodCliqueSetTemplateSpec.terminationDelay is set (gang termination is enabled)"))
} else if scalingGroupConfig.TerminationDelay.Duration <= 0 {
allErrs = append(allErrs, field.Invalid(fldPath.Index(i).Child("terminationDelay"), scalingGroupConfig.TerminationDelay, "terminationDelay must be greater than 0"))
}

and another else-if case, to validate that the PCSG-specific termination delay is not greater than the PCS one.
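For illustration only, the extra case suggested here might look like the sketch below. It uses plain `error` values and `*time.Duration` rather than the `field.ErrorList` and API types in the hunk above, and `validatePCSGDelay` is a hypothetical name; whether this upper-bound check should exist at all is what the rest of this thread debates.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var (
	errForbidden   = errors.New("terminationDelay can only be set on a scaling group when the PodCliqueSet-level terminationDelay is set")
	errNotPositive = errors.New("terminationDelay must be greater than 0")
	errExceedsPCS  = errors.New("scaling group terminationDelay must not exceed the PodCliqueSet-level terminationDelay")
)

// durationPtr is a small helper for building optional delays.
func durationPtr(d time.Duration) *time.Duration { return &d }

// validatePCSGDelay mirrors the two existing checks and adds the proposed third one.
func validatePCSGDelay(pcsDelay, pcsgDelay *time.Duration) error {
	switch {
	case pcsgDelay == nil:
		return nil // nothing to validate
	case pcsDelay == nil:
		return errForbidden
	case *pcsgDelay <= 0:
		return errNotPositive
	case *pcsgDelay > *pcsDelay:
		return errExceedsPCS // the additional case proposed in this comment
	}
	return nil
}

func main() {
	pcs := durationPtr(time.Hour)
	fmt.Println(validatePCSGDelay(pcs, durationPtr(30*time.Minute))) // <nil>
	fmt.Println(validatePCSGDelay(pcs, durationPtr(2*time.Hour)))
	fmt.Println(validatePCSGDelay(nil, durationPtr(30*time.Minute)))
}
```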

Contributor Author

See https://docs.google.com/document/d/11RvFMS1l5RH_FY54G6wd1Q0wSJ2RCobiPRAs2vAhzgM/edit?disco=AAAByqL0jIQ. I discussed with @nvrohanv and @athreesh and while I don't see why someone would do this, we're not going to babysit this.

Collaborator

I agree with @shayasoolin. PCSG termination delay should not be more than PCS. We need to provide API where results are deterministic.

Contributor Author

Let's chat with @nvrohanv and @athreesh. I'm ambivalent tbh, but we did a final pass on the requirements and this was the call.

Contributor

@nvrohanv nvrohanv Jan 24, 2026

Yes, the reason I don't think we should babysit this is that we assume that if you choose to do gang termination, you know what you are doing. In that scenario I can understand someone being more tolerant when one out of 3 PCSGs is down, giving it a longer amount of time to recover, while having less tolerance when the PCS minAvailable is breached, because then the system is not functional. Since there's a valid use case for setting it higher, and we assume this is a power-user feature, I think we should provide full flexibility.

@athreesh

LGTM!

  • How will existing users be notified that terminationDelay default changed from 4h to disabled? Should release notes include migration guidance?

  • Do we need user-facing documentation in docs/user-guide/? I guess I see it in operator/api-reference so should be fine

athreesh previously approved these changes Jan 21, 2026
@gflarity
Contributor Author

  • How will existing users be notified that terminationDelay default changed from 4h to disabled? Should release notes include migration guidance?

Good question, I've just updated the release notes section in the PR description with some guidance. I believe this will be included in the release notes when we cut a release? CC: @sanjaychatterjee.

@gflarity
Contributor Author

  • Do we need user-facing documentation in docs/user-guide/? I guess I see it in operator/api-reference so should be fine

Maybe? I'll take a stab at adding something. I believe @nvrohanv is working on a user guide PR, I might just wait for that to merge then follow the style/approach there.

@sanjaychatterjee
Collaborator

I will review it this week.

// TerminationDelay overrides the PodCliqueSet-level terminationDelay for this scaling group.
// Can only be set if PodCliqueSetTemplateSpec.TerminationDelay is set (gang termination is enabled).
// When set, this value is used instead of the PodCliqueSet-level terminationDelay for gang termination
// decisions affecting this scaling group's replicas.
Collaborator

Do we need validation that this value should be less than or equal to the PCS one?

@sanjaychatterjee
Collaborator

Can we create a GREP for this work? It is important to highlight the reasoning behind this work. You can use issue #277 as the GREP number. For the new GREP template look at PR #362.

@sanjaychatterjee
Collaborator

  • How will existing users be notified that terminationDelay default changed from 4h to disabled? Should release notes include migration guidance?

Good question, I've just updated the release notes section in the PR description with some guidance. I believe this will be included in the release notes when we cut a release? CC: @sanjaychatterjee.

Thanks for the release notes. Additionally, we need a GREP and usage docs as well.

@gflarity
Contributor Author

gflarity commented Jan 24, 2026

Can we create a GREP for this work? It is important to highlight the reasoning behind this work. You can use issue #277 as the GREP number. For the new GREP template look at PR #362.

Oh, I included gang-termination.md as a GREP and used the MNNVL doc as a template. I can take a look at the template PR and follow that layout, sure.

@nvrohanv
Contributor

  • Do we need user-facing documentation in docs/user-guide/? I guess I see it in operator/api-reference so should be fine

Maybe? I'll take a stab at adding something. I believe @nvrohanv is working on a user guide PR, I might just wait for that to merge then follow the style/approach there.

We do need the documentation, but I can take the action item of adding that once this is merged in. In general I think the flow should be:

  1. feature gets merged in and api documentation is updated as part of its pr
  2. docs/user-guide gets updated with proper guide in separate pr

nvrohanv previously approved these changes Jan 24, 2026
@unmarshall unmarshall added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 25, 2026
Collaborator

@unmarshall unmarshall left a comment

1/n reviews

Individual `PodCliqueScalingGroupConfig` entries can override the PCS-level `terminationDelay`:

- If PCS-level `terminationDelay` is nil, gang termination is disabled for the entire PCS
- If PCS-level `terminationDelay` is set, each PCSG can optionally override it with its own delay
Collaborator

Why did we choose to have this behavior? If we have TerminationDelay at PCS and PCSG level then it should be respected when defined and termination delay is enabled. The issue is that nil value of PCS termination delay has been taken as an indication that this feature has been enabled or disabled. This IMHO is not so nice and also not very intuitive.

From API design perspective when we define a delay at multiple levels, thus allowing overriding, then if PCS does not define TerminationDelay but this is defined at the PCSG level then it should be honored as long as this feature is enabled.

Contributor

Hmm, it seems a bit odd to do it that way as well: since minAvailable defaults to 1 (if I remember correctly), it seems like you should be getting gang semantics throughout. If anything, I would lean towards validating that you can't set it on just a PCSG. Something we should discuss more.

Collaborator

For a PCLQ, if minAvailable is not set, it defaults to the PCLQ template replicas. For a PCSG it defaults to 1. Gang semantics (scheduling and termination) apply to identified PodGangs. Currently there are only 2 types of PodGangs:

Base PodGang
Comprises:

  • minAvailable replicas of stand-alone PCLQs
  • minAvailable replicas of the PCSG. Within the PCSG, if a constituent PCLQ defines minAvailable, then only that many pods.
    This means that only this many pods are required for a functional application. If the number goes below that and stays that way for terminationDelay, then it's time to terminate the gang.

Scaled PodGang
Similar behavior applies to a scaled PodGang, where only minAvailable of the constituent PCLQs is considered to determine whether the minAvailableBreached condition is true.

Now what happens when you do not define TerminationDelay? I believe we then need to have a default (very large) termination delay.

Consider the following PCS composition:

  • Standalone PCLQ - router
  • PCSG - comprises decode & prefill PCLQs

Case #1
You define a terminationDelay of 1hr on PCS and 2hr on PCSG.
Now what does it mean for the base PodGang?
If router has minAvailableBreached set to true and it has already exceeded 1hr then it will gang terminate the base PodGang. A higher termination delay on PCSG would not come into play here.

What is the behavior for the scaled PodGang?
This is relatively simple. The PCS termination delay will not come into effect. For all scaled PodGangs, only the termination delay defined at the PCSG will be considered.

Case #2
You define a terminationDelay of 2hr on PCS and 1hr on PCSG.
Now what does it mean for the base PodGang?
For base pod gang only PCS termination delay will apply. So if the PCSG PCLQs have their minAvailableBreached condition set for more than 1hr but less than 2hr, there will not be any gang termination. From the API perspective this behavior is utterly confusing.

So as soon as we introduce 2 level termination delay we should consider the behavior carefully.
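The two cases can be traced with a small sketch. It models only the behavior as described in this comment (base PodGang uses the PCS-level delay, scaled PodGangs use the PCSG-level delay); the type and function names are hypothetical, and the 90-minute breach is an example value chosen to fall between the two delays.

```go
package main

import (
	"fmt"
	"time"
)

type gangKind int

const (
	baseGang gangKind = iota
	scaledGang
)

// applicableDelay picks the delay per the description above:
// the base PodGang uses the PCS-level delay, scaled PodGangs use the PCSG-level delay.
func applicableDelay(kind gangKind, pcsDelay, pcsgDelay time.Duration) time.Duration {
	if kind == scaledGang {
		return pcsgDelay
	}
	return pcsDelay
}

// terminates reports whether a gang whose minAvailable has been breached for
// breachedFor would be gang terminated.
func terminates(kind gangKind, breachedFor, pcsDelay, pcsgDelay time.Duration) bool {
	return breachedFor >= applicableDelay(kind, pcsDelay, pcsgDelay)
}

func main() {
	breach := 90 * time.Minute // example: between 1hr and 2hr

	// Case #1: 1hr on PCS, 2hr on PCSG.
	fmt.Println("case 1 base:  ", terminates(baseGang, breach, time.Hour, 2*time.Hour))   // true
	fmt.Println("case 1 scaled:", terminates(scaledGang, breach, time.Hour, 2*time.Hour)) // false

	// Case #2: 2hr on PCS, 1hr on PCSG.
	fmt.Println("case 2 base:  ", terminates(baseGang, breach, 2*time.Hour, time.Hour))   // false
	fmt.Println("case 2 scaled:", terminates(scaledGang, breach, 2*time.Hour, time.Hour)) // true
}
```

In Case #2 the base PodGang survives a 90-minute breach of the PCSG's PCLQs even though the PCSG asked for 1hr, which is the confusing outcome called out above.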

Collaborator

@unmarshall unmarshall left a comment

2/n review comments

@gflarity gflarity force-pushed the gang_termination_fix branch from f880b67 to abf2f36 Compare January 29, 2026 02:58
Labels

do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/bug Categorizes issue or PR as related to a bug. kind/enhancement Categorizes issue or PR as related to a new feature, enhancement or improvement

Development

Successfully merging this pull request may close these issues.

Gang Termination Doesn't Work

6 participants