Conversation

@jiangzho (Contributor)

### What changes were proposed in this pull request?

This PR adds support for configuring a TTL for Spark apps after they stop. Working with `resourceRetainPolicy` and `resourceRetainDurationMillis`, it enhances the garbage collection mechanism at the custom resource level.

### Why are the changes needed?

Introducing a TTL helps users configure garbage collection for apps more effectively.

### Does this PR introduce _any_ user-facing change?

Yes. A new configurable field, `spec.applicationTolerations.ttlAfterStopMillis`, is added to the `SparkApplication` CRD.
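
For illustration, a minimal sketch of where the new field sits in the spec (the `apiVersion` and `metadata` values here are hypothetical, for illustration only):

```yaml
apiVersion: spark.apache.org/v1alpha1   # assumed API version, not taken from this PR
kind: SparkApplication
metadata:
  name: example-app                     # hypothetical name
spec:
  applicationTolerations:
    ttlAfterStopMillis: 1800000         # GC this custom resource 30 minutes after the app stops
```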

### How was this patch tested?

CI, including a new unit test and a revised e2e scenario.

### Was this patch authored or co-authored using generative AI tooling?

No
@jiangzho (Contributor Author)

cc @peter-toth, can you please take a look?

@peter-toth (Contributor)

@jiangzho, I will try to take a look at this PR tomorrow.

@peter-toth (Contributor) left a comment

I'm not familiar with this codebase, but the intent makes sense and the implementation looks good to me.

cc @dongjoon-hyun

```yaml
# Secondary resources would be garbage collected 10 minutes after app termination
resourceRetainDurationMillis: 600000
# Garbage collect the SparkApplication custom resource itself 30 minutes after termination
ttlAfterStopMillis: 1800000
```
@dongjoon-hyun (Member)

The default value is -1 in the code, isn't it?

@jiangzho (Contributor Author)

Yes, it is. This is only an example value placed in this snippet. I can add a chart for the actual default values.
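
For reference, the only default confirmed in this thread is `ttlAfterStopMillis: -1`; since the doc text quoted below says the operator garbage collects only when the value is non-negative, a negative default presumably leaves the TTL disabled:

```yaml
applicationTolerations:
  ttlAfterStopMillis: -1   # default per this discussion; negative values disable TTL-based GC
```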

> level after it stops. When set to a non-negative value, Spark operator would garbage collect the
> application (and therefore all its associated resources) after given timeout. If the application
> is configured to restart, `resourceRetainPolicy`, `resourceRetainDurationMillis` and
> `ttlAfterStopMillis` would be applied only to the last attempt.
@dongjoon-hyun (Member)

What does this mean, @jiangzho?

- `ttlAfterStopMillis` is ignored when the `resourceRetainDurationMillis` configuration exists?
- Or, `max(ttlAfterStopMillis, resourceRetainDurationMillis)` is applied?

@dongjoon-hyun (Member)

I read the code. So `Math.min(resourceRetainDurationMillis, ttlAfterStopMillis)` is applied, right? I believe we need to rewrite this sentence. Given that logic, it looks inaccurate to me.

> `ttlAfterStopMillis` would be applied only to the last attempt.
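
To make the interplay concrete, here is a minimal sketch assuming the `Math.min` reading above (all values hypothetical):

```yaml
applicationTolerations:
  resourceRetainDurationMillis: 600000   # retain secondary resources for 10 minutes
  ttlAfterStopMillis: 300000             # but delete the SparkApplication CR after 5 minutes
# Deleting the custom resource also removes the resources it owns, so the
# secondary resources effectively live for min(600000, 300000) = 300000 ms
# after termination.
```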

@jiangzho (Contributor Author)

Ah, sorry for the slightly misleading statement; it actually refers to a different aspect.

- All three fields, `ttlAfterStopMillis`, `resourceRetainDurationMillis`, and `resourceRetainPolicy`, apply only to the last attempt if the app is configured to restart.
  - When an app is configured to restart, all resources related to a single attempt (driver, executor, ...) would be released before making the next attempt, regardless of the values configured in the above three fields, except for the last attempt, when no restart is expected.

For example, if I configure

```yaml
applicationTolerations:
  restartConfig:
    restartPolicy: OnFailure
    maxRestartAttempts: 1
  resourceRetainPolicy: Always
  resourceRetainDurationMillis: 30000
  ttlAfterStopMillis: 60000
```

and my app ends up with a status like

```yaml
status:
  ... # the 1st attempt
  "5":
    currentStateSummary: Failed
    lastTransitionTime: "2025-07-30T22:43:15.293414Z"
  "6":
    currentStateSummary: ScheduledToRestart
    lastTransitionTime: "2025-07-30T22:43:15.406645Z"
  ... # the 2nd attempt
  "11":
    currentStateSummary: Succeeded
    lastTransitionTime: "2025-07-30T22:44:15.503645Z"
  "12":
    currentStateSummary: TerminatedWithoutReleaseResources
    lastTransitionTime: "2025-07-30T22:44:15.503645Z"
```

The retain policy only takes effect after state 12; between attempts (i.e., between states 5 and 6), resources are always released.

Thanks for calling out the `Math.min(resourceRetainDurationMillis, ttlAfterStopMillis)` part; I forgot to mention it in the doc. I'll add it to the chart mentioned above.
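
Putting the example above on a timeline (a sketch assuming the semantics described in this thread; t is measured from the final termination at state 12):

```yaml
# t = 0s  : attempt 2 terminates (state 12); resources are retained (resourceRetainPolicy: Always)
# t = 30s : resourceRetainDurationMillis elapses; secondary resources (driver, executors, ...) are GC'd
# t = 60s : ttlAfterStopMillis elapses; the SparkApplication custom resource itself is GC'd
```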

@dongjoon-hyun (Member) left a comment

I left a few comments, @jiangzho. Could you address them first?

@jiangzho (Contributor Author) commented Aug 1, 2025

@dongjoon-hyun, would you mind giving this a second round of review, please?

@jiangzho (Contributor Author) commented Aug 6, 2025

@dongjoon-hyun, gentle nudge on this.

@peter-toth (Contributor)

Thanks @jiangzho for the fix and @dongjoon-hyun for the review!

Merged to main (0.5.0).
