
Conversation

Contributor

@jiangzho jiangzho commented Jul 2, 2025

What changes were proposed in this pull request?

This PR adds support for configuring a maximum retain duration for Spark apps. Together with resourceRetainPolicy, it enhances the garbage collection mechanism.

Why are the changes needed?

The current resourceRetainPolicy provides flexibility for retaining Spark app resources after the app terminates. Introducing a maximum retain duration adds a layer of protection that prevents terminated resources (pods, config maps, etc.) from taking up quota in the cluster.

Does this PR introduce any user-facing change?

Yes. A new configurable field, spec.applicationTolerations.resourceRetainDurationMillis, is added to the SparkApplication CRD.
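
For illustration, a minimal SparkApplication manifest using the new field could look like the sketch below. The spec.applicationTolerations.resourceRetainDurationMillis path comes from this PR; the apiVersion string, the resourceRetainPolicy value, and the metadata are assumptions for the example.

```yaml
# Hypothetical manifest sketch; apiVersion and the resourceRetainPolicy value are assumptions.
apiVersion: spark.apache.org/v1
kind: SparkApplication
metadata:
  name: retain-duration-example
spec:
  applicationTolerations:
    # Keep driver resources after the app terminates, per resourceRetainPolicy...
    resourceRetainPolicy: Always
    # ...but let the operator garbage-collect them after 1 hour (value is in milliseconds).
    resourceRetainDurationMillis: 3600000
```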

How was this patch tested?

CIs, including a new unit test and an e2e scenario.

Was this patch authored or co-authored using generative AI tooling?

No

@jiangzho jiangzho marked this pull request as draft July 4, 2025 00:20
@jiangzho jiangzho marked this pull request as ready for review July 4, 2025 00:38
@jiangzho jiangzho marked this pull request as draft July 7, 2025 18:26
Contributor Author

jiangzho commented Jul 7, 2025

The e2e failures could be a result of parallelism; I'm trying to reproduce this and converting the PR back to a draft in the meantime.

@jiangzho jiangzho force-pushed the retentionDuration branch 3 times, most recently from 0c6b1af to 344b02d July 7, 2025 22:48
@github-actions github-actions bot added the BUILD label Jul 7, 2025
@jiangzho jiangzho force-pushed the retentionDuration branch from 344b02d to facd46d July 7, 2025 22:57
@jiangzho jiangzho marked this pull request as ready for review July 7, 2025 23:07
Contributor Author

jiangzho commented Jul 7, 2025

The e2e tests have been fixed; the underlying cause was that I had not updated the staged CRD YAML file with the new field.

I'm thinking of some future options (which can be handled separately from this JIRA):

  • Add a developer guide doc reminding that the staged YAML needs to be updated manually, and/or
  • Have another PRB check that fails when an inconsistency is detected between the staged and the generated YAML (see the sketch below), and/or
  • Stage only the old (v1alpha1/v1beta1) versions, then merge them with the latest generated version (v1) and move the merged version to the Helm chart

For now, updating the YAML file unblocks the tests for this feature.
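
A rough sketch of option 2 as a GitHub Actions step is shown here; the make target and file globs are placeholders, and the repo's actual CRD generation command may differ.

```yaml
# Hypothetical PRB step: regenerate the CRD YAML and fail the build if it no
# longer matches the staged copy. Target and path names are placeholders.
- name: Verify staged CRD is up to date
  run: |
    make generate-crds                         # placeholder for the project's CRD generation task
    git diff --exit-code -- '*.yml' '*.yaml'   # a non-zero exit fails the PR build
```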

type: integer
type: object
resourceRetainDurationMillis:
type: integer
Member

What is going to happen to the old CRDs (v1alpha1 and v1beta1)?

Contributor Author

The first commit introduces the feature to v1 only, and users would need to upgrade to v1.

I do agree that technically our operator still supports all versions and this can be introduced to all of them. I'll update them all.

Contributor Author

Applied the same to all versions

Member

Oh no, it seems that you misunderstood the comment, @jiangzho.

They are immutable. You should not change them.

What I asked about is the real-time behavior inside the K8s cluster.

Member

To be clear, we load the old CRDs and store them in v1 format. I was wondering what the default stored value would be for the old CRDs.
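
For reference, this is how Kubernetes CRD versioning typically expresses that load-old/store-as-v1 behavior; the exact served/storage flags for this operator's CRD are an assumption in this sketch.

```yaml
# Sketch: older versions remain served, while v1 is the storage version,
# so objects created as v1alpha1/v1beta1 are persisted in v1 form.
versions:
  - name: v1alpha1
    served: true
    storage: false
  - name: v1beta1
    served: true
    storage: false
  - name: v1
    served: true
    storage: true
```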

Contributor Author

@jiangzho jiangzho Jul 11, 2025

Sorry for the confusion; I'll ensure a default value for them to keep the behavior consistent.

Contributor Author

The default of -1 gives an unlimited retention duration for old custom resources, matching the previous behavior.
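
As a sketch, the generated CRD schema for the field could then carry that default; the exact schema layout is an assumption here, while the integer type and the -1 default come from the discussion above.

```yaml
resourceRetainDurationMillis:
  type: integer
  # -1 disables the duration-based cleanup, i.e. unlimited retention as before.
  default: -1
```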

Member

Thank you for confirming, @jiangzho.

to avoid operator attempt to delete driver pod and driver resources if app fails. Similarly,
if resourceRetentionPolicy is set to `Always`, operator would not delete driver resources
when app ends. Note that this applies only to operator-created resources (driver pod, SparkConf
if resourceRetainPolicy is set to `Always`, operator would not delete driver resources
Member

Ditto. Please handle this typo fix in an independent PR: resourceRetentionPolicy -> resourceRetainPolicy

Contributor Author

Removed the unrelated typo fixes and raised #284

Member

I merged #284.

Member

@dongjoon-hyun dongjoon-hyun left a comment

This is a good example of how to gracefully add a new field. Thank you for proposing this.

@jiangzho jiangzho force-pushed the retentionDuration branch from facd46d to 49a9cea July 8, 2025 22:35
* introduce the new field to both v1alpha1 and v1beta1
* remove unrelated typo fix
* remove scala version from new e2e
* update image for new e2e scenario
* style fix
@jiangzho jiangzho force-pushed the retentionDuration branch from 1b295ac to e898a4e July 10, 2025 18:35
@dongjoon-hyun
Member

To @jiangzho, I'm waiting for your PR for this comment.

@dongjoon-hyun
Member

cc @peter-toth

@jiangzho jiangzho marked this pull request as draft July 11, 2025 23:45
@jiangzho jiangzho force-pushed the retentionDuration branch from 8f2c589 to 72a6bfc July 12, 2025 00:09
@jiangzho jiangzho force-pushed the retentionDuration branch from 72a6bfc to 8f50680 July 12, 2025 00:13
@jiangzho jiangzho marked this pull request as ready for review July 12, 2025 00:23
Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM (pending CIs). Thank you, @jiangzho.

@dongjoon-hyun
Member

Merged to main.

