[SPARK-52648] Add support for maximal retain duration for Spark application resources #268
Conversation
E2E failures could be a result of parallelism. I'm trying to reproduce this and updating this to draft in the meantime.
The e2e tests have been fixed; the underlying cause was that I had not updated the staged CRD yaml file with the new field. I'm thinking about some future options (that can be dealt with separately from this JIRA).
For now, updating the yaml file unblocks tests for this feature.
      type: integer
    type: object
+   resourceRetainDurationMillis:
+     type: integer
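For context, a hedged sketch of where the new property could sit in the CRD's OpenAPI schema; only the field name resourceRetainDurationMillis and its integer type come from the diff above, while the nesting under applicationTolerations is assumed from the PR description:

```yaml
# Assumed schema layout for illustration; only resourceRetainDurationMillis
# (type: integer) is taken from the actual diff.
applicationTolerations:
  type: object
  properties:
    resourceRetainDurationMillis:
      type: integer
```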
What is going to happen for the old CRDs (v1alpha and v1beta)?
The first commit introduces the feature to v1 only, and users would need to upgrade to v1.
I do agree that technically our operator still supports all versions and this can be introduced to all of them. I'll update them all.
Applied the same to all versions
Oh, no, it seems that you misunderstood the comment, @jiangzho.
They are immutable. You should not change them.
What I asked about is the real-time behavior inside the K8s cluster.
To be clear, we load the old CRDs and store them in v1 format. I was wondering what the default stored value would be for the old CRDs.
Sorry for the confusion - I'll ensure a default value for them to keep the behavior consistent.
The default of -1 would give unlimited retention duration for old custom resources, as before.
Thank you for confirming, @jiangzho.
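To make the agreed-upon default concrete, here is a minimal sketch of the applicationTolerations block that older custom resources effectively end up with; the -1 default and its unlimited-retention meaning come from the thread above, everything else is illustrative:

```yaml
# Custom resources created from older API versions that never set the field
# are stored in v1 with the default below, preserving the previous behavior.
applicationTolerations:
  resourceRetainDurationMillis: -1   # -1 means no retain-duration limit
```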
  to avoid operator attempt to delete driver pod and driver resources if app fails. Similarly,
- if resourceRetentionPolicy is set to `Always`, operator would not delete driver resources
+ if resourceRetainPolicy is set to `Always`, operator would not delete driver resources
  when app ends. Note that this applies only to operator-created resources (driver pod, SparkConf
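As a hedged illustration of how the two settings are meant to work together (the field names and the `Always` value come from the doc text and this PR; the duration value is made up):

```yaml
# Retain driver resources after the app ends, but let the operator clean them
# up once the configured retain duration has elapsed.
applicationTolerations:
  resourceRetainPolicy: Always
  resourceRetainDurationMillis: 3600000   # illustrative: retain for at most 1 hour
```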
Ditto. Please handle this typo fix (resourceRetentionPolicy -> resourceRetainPolicy) in an independent PR.
Removed the unrelated typo fixes and raised #284
I merged #284.
dongjoon-hyun left a comment:
This is a good example to show how we gracefully add a new field. Thank you for proposing this.
To @jiangzho, I'm waiting for your PR for this comment.
cc @peter-toth
dongjoon-hyun left a comment:
+1, LGTM (Pending CIs). Thank you, @jiangzho.
Merged to main.
What changes were proposed in this pull request?
This PR adds support for configuring the maximal retain duration for Spark apps. Working with the resourceRetainPolicy, it enhances the garbage collection mechanism.
Why are the changes needed?
The current resourceRetainPolicy provides flexibility for retaining Spark app resources after the app terminates. Introducing a maximal retain duration adds a protection layer that keeps terminated resources (pods, config maps, etc.) from taking up quota in the cluster.
Does this PR introduce any user-facing change?
New configurable field spec.applicationTolerations.resourceRetainDurationMillis added to the SparkApplication CRD.
How was this patch tested?
CIs, including a new unit test and an e2e scenario.
Was this patch authored or co-authored using generative AI tooling?
No
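For readers who want to try the feature, a minimal sketch of a SparkApplication manifest using the new field; the apiVersion/group, metadata, and all spec fields other than applicationTolerations are assumptions and may differ from the actual CRD:

```yaml
apiVersion: spark.apache.org/v1          # assumed group/version
kind: SparkApplication
metadata:
  name: retain-duration-example          # hypothetical name
spec:
  # ... application fields (image, main resource, Spark conf, etc.) omitted ...
  applicationTolerations:
    resourceRetainPolicy: Always             # keep resources after the app ends
    resourceRetainDurationMillis: 600000     # illustrative: clean up after 10 minutes
```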