[SPARK-52915] Support TTL for Spark apps #290
Conversation
### What changes were proposed in this pull request?
This PR adds support for configuring the TTL for Spark apps after they stop. Working with `resourceRetainPolicy` and `resourceRetainDurationMillis`, it enhances the garbage collection mechanism at the custom resource level.
### Why are the changes needed?
Introducing a TTL helps users configure garbage collection for apps more effectively.
### Does this PR introduce _any_ user-facing change?
New configurable field `spec.applicationTolerations.ttlAfterStopMillis` added to the SparkApplication CRD.
### How was this patch tested?
CIs - including a new unit test and a revised e2e scenario.
### Was this patch authored or co-authored using generative AI tooling?
No
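For illustration, a minimal sketch of where the new field sits in the spec, next to the existing retention settings (values are examples only, mirroring the docs snippet discussed in review below):

```yaml
spec:
  applicationTolerations:
    resourceRetainPolicy: Always
    # Secondary resources are garbage collected 10 minutes after the app stops
    resourceRetainDurationMillis: 600000
    # The SparkApplication custom resource itself is garbage collected
    # 30 minutes after the app stops
    ttlAfterStopMillis: 1800000
```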
cc @peter-toth can you please take a look?
@jiangzho, I will try to take a look at this PR tomorrow.
peter-toth
left a comment
I'm not familiar with this codebase, but the intent makes sense and the implementation looks good to me.
# Secondary resources would be garbage collected 10 minutes after app termination
resourceRetainDurationMillis: 600000
# Garbage collect the SparkApplication custom resource itself 30 minutes after termination
ttlAfterStopMillis: 1800000
The default value is -1 in the code, isn't it?
Yes, it is. These are only example values placed in this snippet. I can add a chart for the actual default values.
level after it stops. When set to a non-negative value, the Spark operator would garbage collect the
application (and therefore all its associated resources) after the given timeout. If the application
is configured to restart, `resourceRetainPolicy`, `resourceRetainDurationMillis` and
`ttlAfterStopMillis` would be applied only to the last attempt.
What does this mean, @jiangzho?
- `ttlAfterStopMillis` is ignored when `resourceRetainDurationMillis` configuration exists?
- Or, `max(ttlAfterStopMillis, resourceRetainDurationMillis)` is applied?
I read the code. So, `Math.min(resourceRetainDurationMillis, ttlAfterStopMillis)` is applied, right? I believe we need to rewrite this sentence. Given that logic, it looks inaccurate to me.
`ttlAfterStopMillis` would be applied only to the last attempt.
Ah, sorry for the slightly misleading statement - it actually refers to another perspective.
- All 3 fields `ttlAfterStopMillis`, `resourceRetainDurationMillis` and `resourceRetainPolicy` apply to the last attempt only if the app is configured to restart.
- When an app is configured to restart, all resources related to one single attempt (driver, executor, ...) would be released before making the next attempt, regardless of the values configured in the above 3 fields - except for the last attempt, when no restart is expected.
For example, if I do configure
applicationTolerations:
  restartConfig:
    restartPolicy: OnFailure
    maxRestartAttempts: 1
  resourceRetainPolicy: Always
  resourceRetainDurationMillis: 30000
  ttlAfterStopMillis: 60000
and my app ends up with status like
status:
  ... # the 1st attempt
  "5":
    currentStateSummary: Failed
    lastTransitionTime: "2025-07-30T22:43:15.293414Z"
  "6":
    currentStateSummary: ScheduledToRestart
    lastTransitionTime: "2025-07-30T22:43:15.406645Z"
  ... # the 2nd attempt
  "11":
    currentStateSummary: Succeeded
    lastTransitionTime: "2025-07-30T22:44:15.503645Z"
  "12":
    currentStateSummary: TerminatedWithoutReleaseResources
    lastTransitionTime: "2025-07-30T22:44:15.503645Z"
The retain policy only takes effect after state 12 - resources are always released between attempts (here, between states 5 and 6).
Thanks for calling out the `Math.min(resourceRetainDurationMillis, ttlAfterStopMillis)` part - I forgot to mention it in the doc. I'll add it in the chart mentioned above.
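For reference, a rough sketch of how the two durations would interact under that `Math.min` behavior (values here are illustrative, not defaults):

```yaml
applicationTolerations:
  resourceRetainPolicy: Always
  # Secondary resources would be retained for up to 10 minutes after the app stops ...
  resourceRetainDurationMillis: 600000
  # ... but the SparkApplication CR (and everything it owns) is garbage collected
  # after 5 minutes, so the effective retention is min(600000, 300000) = 300000 ms
  ttlAfterStopMillis: 300000
```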
dongjoon-hyun
left a comment
I left a few comments, @jiangzho. Could you address them first?
@dongjoon-hyun - would you mind giving this a second round of review, please?
@dongjoon-hyun gentle nudge on this
Thanks @jiangzho for the fix and @dongjoon-hyun for the review! Merged to