
Conversation

Contributor

@jiangzho jiangzho commented Jul 2, 2025

What changes were proposed in this pull request?

This PR adds support for configuring a maximum retain duration for Spark apps. Together with resourceRetainPolicy, it enhances the garbage collection mechanism.

Why are the changes needed?

The current resourceRetainPolicy provides flexibility for retaining Spark app resources after the app terminates. Introducing a maximum retain duration adds a layer of protection that prevents terminated resources (pods, config maps, etc.) from taking up quota in the cluster.

Does this PR introduce any user-facing change?

Yes. A new configurable field, spec.applicationTolerations.resourceRetainDurationMillis, is added to the SparkApplication CRD.
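
For illustration, a minimal SparkApplication manifest using the new field could look like the sketch below. The spec.applicationTolerations.resourceRetainDurationMillis path comes from this PR; the apiVersion string, the resourceRetainPolicy value, and the metadata are assumptions for the example.

```yaml
# Hypothetical manifest sketch; apiVersion and the resourceRetainPolicy value are assumptions.
apiVersion: spark.apache.org/v1
kind: SparkApplication
metadata:
  name: retain-duration-example
spec:
  applicationTolerations:
    # Keep driver resources after the app terminates, per resourceRetainPolicy...
    resourceRetainPolicy: Always
    # ...but let the operator garbage-collect them after 1 hour (value is in milliseconds).
    resourceRetainDurationMillis: 3600000
```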

How was this patch tested?

CIs, including a new unit test and an e2e scenario.

Was this patch authored or co-authored using generative AI tooling?

No

@jiangzho jiangzho marked this pull request as draft July 4, 2025 00:20
@jiangzho jiangzho marked this pull request as ready for review July 4, 2025 00:38
@jiangzho jiangzho marked this pull request as draft July 7, 2025 18:26
Contributor Author

jiangzho commented Jul 7, 2025

The e2e failures could be a result of parallelism; I'm trying to reproduce this and converting the PR back to a draft in the meantime.

@jiangzho jiangzho force-pushed the retentionDuration branch 3 times, most recently from 0c6b1af to 344b02d July 7, 2025 22:48
@github-actions github-actions bot added the BUILD label Jul 7, 2025
@jiangzho jiangzho force-pushed the retentionDuration branch from 344b02d to facd46d July 7, 2025 22:57
@jiangzho jiangzho marked this pull request as ready for review July 7, 2025 23:07
Contributor Author

jiangzho commented Jul 7, 2025

The e2e tests have been fixed; the underlying cause was that I had not updated the staged CRD YAML file with the new field.

I'm thinking of some future options (which can be handled separately from this JIRA):

  • Add a developer guide doc reminding that the staged YAML needs to be updated manually, and/or
  • Have another PRB check that fails when an inconsistency is detected between the staged and the generated YAML (see the sketch below), and/or
  • Stage only the old (v1alpha1/v1beta1) versions, then merge them with the latest generated version (v1) and move the merged version to the Helm chart

For now, updating the YAML file unblocks the tests for this feature.
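
A rough sketch of option 2 as a GitHub Actions step is shown here; the make target and file globs are placeholders, and the repo's actual CRD generation command may differ.

```yaml
# Hypothetical PRB step: regenerate the CRD YAML and fail the build if it no
# longer matches the staged copy. Target and path names are placeholders.
- name: Verify staged CRD is up to date
  run: |
    make generate-crds                         # placeholder for the project's CRD generation task
    git diff --exit-code -- '*.yml' '*.yaml'   # a non-zero exit fails the PR build
```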

type: integer
type: object
resourceRetainDurationMillis:
type: integer
Member

What is going to happen to the old CRDs (v1alpha1 and v1beta1)?

Contributor Author

The first commit introduces the feature to v1 only, and users would need to upgrade to v1.

I do agree that technically our operator still supports all versions and this can be introduced to all of them. I'll update them all.

Contributor Author

Applied the same to all versions

Member

Oh no, it seems that you misunderstood the comment, @jiangzho.

They are immutable. You should not change them.

What I asked about is the real-time behavior inside the K8s cluster.

Member

To be clear, we load the old CRDs and store them in v1 format. I was wondering what the default stored value would be for the old CRDs.
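
For reference, this is how Kubernetes CRD versioning typically expresses that load-old/store-as-v1 behavior; the exact served/storage flags for this operator's CRD are an assumption in this sketch.

```yaml
# Sketch: older versions remain served, while v1 is the storage version,
# so objects created as v1alpha1/v1beta1 are persisted in v1 form.
versions:
  - name: v1alpha1
    served: true
    storage: false
  - name: v1beta1
    served: true
    storage: false
  - name: v1
    served: true
    storage: true
```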

Contributor Author

@jiangzho jiangzho Jul 11, 2025

Sorry for the confusion; I'll ensure a default value for them to keep the behavior consistent.

Contributor Author

The default of -1 gives an unlimited retention duration for old custom resources, matching the previous behavior.
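
As a sketch, the generated CRD schema for the field could then carry that default; the exact schema layout is an assumption here, while the integer type and the -1 default come from the discussion above.

```yaml
resourceRetainDurationMillis:
  type: integer
  # -1 disables the duration-based cleanup, i.e. unlimited retention as before.
  default: -1
```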

Member

Thank you for confirming, @jiangzho.

to avoid operator attempt to delete driver pod and driver resources if app fails. Similarly,
if resourceRetentionPolicy is set to `Always`, operator would not delete driver resources
when app ends. Note that this applies only to operator-created resources (driver pod, SparkConf
if resourceRetainPolicy is set to `Always`, operator would not delete driver resources
Member

Ditto. Please handle this typo fix in an independent PR: resourceRetentionPolicy -> resourceRetainPolicy

Contributor Author

Removed the unrelated typo fixes and raised #284

Member

I merged #284.

Member

@dongjoon-hyun dongjoon-hyun left a comment

This is a good example of how to gracefully add a new field. Thank you for proposing this.

@jiangzho jiangzho force-pushed the retentionDuration branch from facd46d to 49a9cea July 8, 2025 22:35
* introduce the new field to both v1alpha1 and v1beta1
* remove unrelated typo fix
* remove scala version from new e2e
* update image for new e2e scenario
* style fix
@jiangzho jiangzho force-pushed the retentionDuration branch from 1b295ac to e898a4e July 10, 2025 18:35
@dongjoon-hyun
Member

To @jiangzho, I'm waiting for your PR for this comment.

@dongjoon-hyun
Member

cc @peter-toth

@jiangzho jiangzho marked this pull request as draft July 11, 2025 23:45
@jiangzho jiangzho force-pushed the retentionDuration branch from 8f2c589 to 72a6bfc July 12, 2025 00:09
@jiangzho jiangzho force-pushed the retentionDuration branch from 72a6bfc to 8f50680 July 12, 2025 00:13
@jiangzho jiangzho marked this pull request as ready for review July 12, 2025 00:23
Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM (pending CIs). Thank you, @jiangzho.

@dongjoon-hyun
Member

Merged to main.

