Gang termination fix #353
Conversation
shayasoolin
left a comment
Great job!
```go
allErrs = append(allErrs, field.Forbidden(fldPath.Index(i).Child("terminationDelay"), "terminationDelay can only be set on PodCliqueScalingGroupConfig when PodCliqueSetTemplateSpec.terminationDelay is set (gang termination is enabled)"))
} else if scalingGroupConfig.TerminationDelay.Duration <= 0 {
	allErrs = append(allErrs, field.Invalid(fldPath.Index(i).Child("terminationDelay"), scalingGroupConfig.TerminationDelay, "terminationDelay must be greater than 0"))
}
```
and another else-if case, to validate that the PCSG-specific termination delay is not greater than the PCS one.
See https://docs.google.com/document/d/11RvFMS1l5RH_FY54G6wd1Q0wSJ2RCobiPRAs2vAhzgM/edit?disco=AAAByqL0jIQ. I discussed with @nvrohanv and @athreesh and while I don't see why someone would do this, we're not going to babysit this.
I agree with @shayasoolin. The PCSG termination delay should not be more than the PCS one. We need to provide an API where results are deterministic.
Yes, the reason I don't think we should babysit this is that we assume if you choose gang termination, you know what you are doing. In that scenario I can understand someone being more tolerant: if one out of 3 PCSGs is down, they give it a longer amount of time to recover; whereas if the PCS minAvailable is breached, they have less tolerance because the system is not functional. Since there's a valid use case for setting it higher, and we assume this is a power-user feature, I think we should provide full flexibility.
LGTM!
Good question, I've just updated the release notes section in the PR description with some guidance. I believe this will be included in the release notes when we cut a release? CC: @sanjaychatterjee.
Maybe? I'll take a stab at adding something. I believe @nvrohanv is working on a user guide PR; I might just wait for that to merge and then follow the style/approach there.
I will review it this week.
Force-pushed from b04bbb2 to 0c07d49
```go
// TerminationDelay overrides the PodCliqueSet-level terminationDelay for this scaling group.
// Can only be set if PodCliqueSetTemplateSpec.TerminationDelay is set (gang termination is enabled).
// When set, this value is used instead of the PodCliqueSet-level terminationDelay for gang termination
// decisions affecting this scaling group's replicas.
```
Do we need validation that this value should be less than or equal to the PCS one?
operator/internal/controller/podcliqueset/components/podcliquesetreplica/gangterminate.go (resolved)
operator/internal/controller/podcliquescalinggroup/components/podclique/sync.go (resolved)
Thanks for the release notes. Additionally, we need a GREP and usage docs as well.
Oh, I included gang-termination.md as a GREP and used the MNNVL doc as a template. I can take a look at the template PR and follow that layout, sure.
We do need the documentation, but I can take the action item of adding that once this is merged. In general I think the flow should be
unmarshall
left a comment
1/n reviews
docs/designs/gang-termination.md (outdated)
Individual `PodCliqueScalingGroupConfig` entries can override the PCS-level `terminationDelay`:
- If PCS-level `terminationDelay` is nil, gang termination is disabled for the entire PCS
- If PCS-level `terminationDelay` is set, each PCSG can optionally override it with its own delay
Why did we choose this behavior? If we have TerminationDelay at both the PCS and PCSG levels, then it should be respected wherever it is defined and termination delay is enabled. The issue is that a nil PCS termination delay has been taken as the indication of whether this feature is enabled or disabled. This IMHO is not so nice and also not very intuitive.
From an API design perspective, when we define a delay at multiple levels and thus allow overriding, then if the PCS does not define TerminationDelay but it is defined at the PCSG level, it should be honored as long as the feature is enabled.
Hmm, it seems a bit odd to do it that way as well, because since minAvailable defaults to 1 (if I remember correctly), it seems like you should be getting gang semantics throughout. If anything, I would lean towards validating that you can't set it on just a PCSG. Something we should discuss more.
For a PCLQ, if minAvailable is not set, it defaults to the PCLQ template replicas. For a PCSG it defaults to 1. Gang semantics (scheduling and termination) apply to identified pod-gangs. Currently there are only 2 types of PodGangs:
Base PodGang
Comprises of:
- `minAvailable` replicas of stand-alone PCLQs
- `minAvailable` replicas of the PCSG. Within a PCSG, if a constituent PCLQ defines `minAvailable`, then only that many pods.

This means that only these many pods are required for a functional application. If the number goes below that and stays there for terminationDelay seconds, then it's time to terminate the gang.
Scaled PodGang
Similar behavior applies to the scaled PodGang, where only `minAvailable` of the constituent PCLQs are considered to determine whether the minAvailableBreached condition is true.
Now what happens when you do not define TerminationDelay - I believe we then need to have a default (very large) termination delay.
Consider the following PCS composition:
- Standalone PCLQ - router
- PCSG - comprises of decode & prefill PCLQs
Case #1
You define a terminationDelay of 1hr on PCS and 2hr on PCSG.
Now what does it mean for the base PodGang?
If router has minAvailableBreached set to true and it has already exceeded 1hr then it will gang terminate the base PodGang. A higher termination delay on PCSG would not come into play here.
What is the behavior for the scaled PodGang?
This is relatively simple. PCS termination delay will not come into effect. For all Scaled podgangs termination delay defined at the PCSG will only be considered.
Case #2
You define a terminationDelay of 2hr on PCS and 1hr on PCSG.
Now what does it mean for the base PodGang?
For base pod gang only PCS termination delay will apply. So if the PCSG PCLQs have their minAvailableBreached condition set for more than 1hr but less than 2hr, there will not be any gang termination. From the API perspective this behavior is utterly confusing.
So as soon as we introduce 2 level termination delay we should consider the behavior carefully.
unmarshall
left a comment
2/n review comments
Co-authored-by: Madhav Bhargava <[email protected]> Signed-off-by: Geoff Flarity <[email protected]>
Force-pushed from f880b67 to abf2f36
What type of PR is this?
/kind feature
/kind bug
/kind api
What this PR does / why we need it:
This PR fixes and improves gang termination behavior in Grove with several key changes:
- Gang termination is now opt-in - `terminationDelay` no longer defaults to 4 hours. When nil, gang termination is disabled entirely for the PodCliqueSet.
- Only ready pods count for breach detection - Previously, breach calculation considered scheduled pods; now it only counts ready pods, which more accurately reflects actual workload availability.
- PCSG-level `terminationDelay` override - Individual PodCliqueScalingGroups can now override the PCS-level termination delay, allowing finer-grained control over gang termination timing.
- Bug fix: nil pointer when base gang doesn't exist - Fixed a crash when the base gang no longer exists after gang termination.
Which issue(s) this PR fixes:
Fixes #277
Special notes for your reviewer:
- `docs/designs/gang-termination.md` - it provides a high-level overview of the gang termination feature after these changes. Happy to update the code alongside any design doc feedback.
- `terminationDelay` can only be set when the PCS-level `terminationDelay` is set.

Does this PR introduce an API change?
Additional documentation e.g., enhancement proposals, usage docs, etc.: