[SPARK-52915] Support TTL for Spark apps #290
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -293,6 +293,8 @@ applicationTolerations: | |
| resourceRetainPolicy: OnFailure | ||
| # Secondary resources would be garbage collected 10 minutes after app termination | ||
| resourceRetainDurationMillis: 600000 | ||
| # Garbage collect the SparkApplication custom resource itself 30 minutes after termination | ||
| ttlAfterStopMillis: 1800000 | ||
| ``` | ||
|
|
||
| to prevent the operator from attempting to delete the driver pod and driver resources if the app fails. Similarly, | ||
|
|
@@ -302,7 +304,54 @@ possible to configure `resourceRetainDurationMillis` to define the maximal retai | |
| these resources. Note that this applies only to operator-created resources (driver pod, SparkConf | ||
| ConfigMap, etc.). You may also want to tune `spark.kubernetes.driver.service.deleteOnTermination` | ||
| and `spark.kubernetes.executor.deleteOnTermination` to control the behavior of driver-created | ||
| resources. | ||
| resources. `ttlAfterStopMillis` controls garbage collection of the SparkApplication itself | ||
| after it stops. When set to a non-negative value, the Spark operator garbage collects the | ||
| application (and therefore all its associated resources) after the given timeout. If the | ||
| application is configured to restart, `resourceRetainPolicy`, `resourceRetainDurationMillis` and | ||
| `ttlAfterStopMillis` apply only to the last attempt. | ||
|
Member
What does this mean, @jiangzho ?
Member
I read the code. So,
Contributor
Author
Ah, sorry for the slightly misleading statement; it actually refers to another perspective.
For example, if I do configure and my app ends up with status like The retain policy only takes effect after state Thanks for calling out the
||
|
|
||
| For example, if an app with the configuration below: | ||
|
|
||
| ```yaml | ||
| applicationTolerations: | ||
| restartConfig: | ||
| restartPolicy: OnFailure | ||
| maxRestartAttempts: 1 | ||
| resourceRetainPolicy: Always | ||
| resourceRetainDurationMillis: 30000 | ||
| ttlAfterStopMillis: 60000 | ||
| ``` | ||
|
|
||
| ends up with status like: | ||
|
|
||
| ```yaml | ||
| status: | ||
| #... the 1st attempt | ||
| "5": | ||
| currentStateSummary: Failed | ||
| "6": | ||
| currentStateSummary: ScheduledToRestart | ||
| # ...the 2nd attempt | ||
| "11": | ||
| currentStateSummary: Succeeded | ||
| "12": | ||
| currentStateSummary: TerminatedWithoutReleaseResources | ||
| ``` | ||
|
|
||
| The retain policy only takes effect after the final state `12`. Secondary resources are always | ||
| released between attempts, that is, between states `5` and `6`. The TTL is likewise calculated | ||
| from the last state. | ||
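To make the timing concrete, the sketch below computes the two cleanup deadlines for the example configuration above, measured from the final state transition. It is an illustration only: `CleanupDeadlineSketch`, `deadline`, and the timestamp are hypothetical and not part of the operator's code. The only behavior taken from the text is that a negative value disables a setting and that both timers are based on the last state.

```java
import java.time.Instant;
import java.util.Optional;

/** Hypothetical illustration; not the operator's actual implementation. */
final class CleanupDeadlineSketch {
  private CleanupDeadlineSketch() {}

  /**
   * Returns the instant at which cleanup becomes due, measured from the final
   * state transition, or empty when the setting is disabled (negative).
   */
  static Optional<Instant> deadline(Instant finalStateTransition, long millis) {
    if (millis < 0L) {
      return Optional.empty();
    }
    return Optional.of(finalStateTransition.plusMillis(millis));
  }

  public static void main(String[] args) {
    // Assumed timestamp of state "12" in the example status.
    Instant finalState = Instant.parse("2025-01-01T00:00:00Z");
    // resourceRetainDurationMillis: 30000 -> secondary resources due 30 seconds later.
    System.out.println(deadline(finalState, 30_000L));
    // ttlAfterStopMillis: 60000 -> the SparkApplication itself due 60 seconds later.
    System.out.println(deadline(finalState, 60_000L));
    // A negative value disables the timer, so no deadline exists.
    System.out.println(deadline(finalState, -1L));
  }
}
```

Nothing here is scheduled relative to states `5` or `6`, which matches the statement that the retain settings are not consulted between attempts.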
|
|
||
| | Field | Type | Default Value | Description | | ||
| |-----------------------------------------------------------|-----------------------------------|---------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | ||
| | .spec.applicationTolerations.resourceRetainPolicy | `Always` / `OnFailure` / `Never` | Never | Configure operator to delete / retain secondary resources for an app after it terminates. | | ||
| | .spec.applicationTolerations.resourceRetainDurationMillis | integer | -1 | Time in milliseconds to wait before releasing **secondary resources** after termination. A negative value disables the retention duration check for secondary resources. | | ||
| | .spec.applicationTolerations.ttlAfterStopMillis | integer | -1 | Time-to-live in milliseconds for the SparkApplication and **all its associated secondary resources**. If set to a negative value, the application is retained and not garbage collected by the operator. | | ||
|
|
||
| Note that `ttlAfterStopMillis` applies to the app as well as its secondary resources. If both | ||
| `resourceRetainDurationMillis` and `ttlAfterStopMillis` are set to non-negative values and the | ||
| latter is smaller, it takes precedence: the operator removes all resources related to this app | ||
| after `ttlAfterStopMillis`. | ||
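The precedence rule can be sketched as a small helper. This is a minimal sketch rather than the operator's implementation: `RetainDurationSketch` and its methods are hypothetical names, and the logic is inferred only from the rule above, namely that a negative value disables a setting and the smaller enabled timeout wins.

```java
/** Hypothetical helper illustrating the precedence rule; not the operator's code. */
final class RetainDurationSketch {
  private RetainDurationSketch() {}

  /** A setting counts as enabled when it is non-negative. */
  static boolean isEnabled(long millis) {
    return millis >= 0L;
  }

  /**
   * Returns how long secondary resources are kept after the app stops,
   * or -1 when neither timeout is enabled.
   */
  static long effectiveRetainDurationMillis(
      long resourceRetainDurationMillis, long ttlAfterStopMillis) {
    boolean retainEnabled = isEnabled(resourceRetainDurationMillis);
    boolean ttlEnabled = isEnabled(ttlAfterStopMillis);
    if (retainEnabled && ttlEnabled) {
      // Deleting the app removes its secondary resources too, so the TTL caps retention.
      return Math.min(resourceRetainDurationMillis, ttlAfterStopMillis);
    }
    if (retainEnabled) {
      return resourceRetainDurationMillis;
    }
    if (ttlEnabled) {
      return ttlAfterStopMillis;
    }
    return -1L;
  }

  public static void main(String[] args) {
    // Retain for 10 minutes, TTL of 30 minutes: prints 600000.
    System.out.println(effectiveRetainDurationMillis(600_000L, 1_800_000L));
    // Retain for 30 minutes, TTL of 10 minutes: prints 600000.
    System.out.println(effectiveRetainDurationMillis(1_800_000L, 600_000L));
  }
}
```

Taking the minimum keeps the two settings independent: a shorter `resourceRetainDurationMillis` still releases secondary resources early, while a shorter `ttlAfterStopMillis` removes everything at once.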
|
|
||
| ## Spark Cluster | ||
|
|
||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,70 @@ | ||
| /* | ||
| * Licensed to the Apache Software Foundation (ASF) under one | ||
| * or more contributor license agreements. See the NOTICE file | ||
| * distributed with this work for additional information | ||
| * regarding copyright ownership. The ASF licenses this file | ||
| * to you under the Apache License, Version 2.0 (the | ||
| * "License"); you may not use this file except in compliance | ||
| * with the License. You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * | ||
| * Unless required by applicable law or agreed to in writing, | ||
| * software distributed under the License is distributed on an | ||
| * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
| * KIND, either express or implied. See the License for the | ||
| * specific language governing permissions and limitations | ||
| * under the License. | ||
| */ | ||
|
|
||
| package org.apache.spark.k8s.operator.spec; | ||
|
|
||
| import static org.junit.jupiter.api.Assertions.*; | ||
|
|
||
| import org.junit.jupiter.api.Test; | ||
|
|
||
| class ApplicationTolerationsTest { | ||
| private final ApplicationTolerations withRetainDurationOnly = | ||
| ApplicationTolerations.builder().resourceRetainDurationMillis(10L).build(); | ||
| private final ApplicationTolerations withTTLOnly = | ||
| ApplicationTolerations.builder().ttlAfterStopMillis(10L).build(); | ||
| private final ApplicationTolerations withNeitherRetainDurationNorTtl = | ||
| ApplicationTolerations.builder().build(); | ||
| private final ApplicationTolerations withRetainDurationGreaterThanTtl = | ||
| ApplicationTolerations.builder() | ||
| .resourceRetainDurationMillis(20L) | ||
| .ttlAfterStopMillis(10L) | ||
| .build(); | ||
| private final ApplicationTolerations withRetainDurationShorterThanTtl = | ||
| ApplicationTolerations.builder() | ||
| .resourceRetainDurationMillis(10L) | ||
| .ttlAfterStopMillis(20L) | ||
| .build(); | ||
|
|
||
| @Test | ||
| void computeEffectiveRetainDurationMillis() { | ||
| assertEquals(10L, withRetainDurationOnly.computeEffectiveRetainDurationMillis()); | ||
| assertEquals(10L, withTTLOnly.computeEffectiveRetainDurationMillis()); | ||
| assertEquals(-1L, withNeitherRetainDurationNorTtl.computeEffectiveRetainDurationMillis()); | ||
| assertEquals(10L, withRetainDurationGreaterThanTtl.computeEffectiveRetainDurationMillis()); | ||
| assertEquals(10L, withRetainDurationShorterThanTtl.computeEffectiveRetainDurationMillis()); | ||
| } | ||
|
|
||
| @Test | ||
| void isRetainDurationEnabled() { | ||
| assertTrue(withRetainDurationOnly.isRetainDurationEnabled()); | ||
| assertTrue(withTTLOnly.isRetainDurationEnabled()); | ||
| assertFalse(withNeitherRetainDurationNorTtl.isRetainDurationEnabled()); | ||
| assertTrue(withRetainDurationGreaterThanTtl.isRetainDurationEnabled()); | ||
| assertTrue(withRetainDurationShorterThanTtl.isRetainDurationEnabled()); | ||
| } | ||
|
|
||
| @Test | ||
| void isTTLEnabled() { | ||
| assertFalse(withRetainDurationOnly.isTTLEnabled()); | ||
| assertTrue(withTTLOnly.isTTLEnabled()); | ||
| assertFalse(withNeitherRetainDurationNorTtl.isTTLEnabled()); | ||
| assertTrue(withRetainDurationGreaterThanTtl.isTTLEnabled()); | ||
| assertTrue(withRetainDurationShorterThanTtl.isTTLEnabled()); | ||
| } | ||
| } |
The default value is `-1` in the code, isn't it?
Yes, it is. This is only an example value placed in this snippet. I can add a chart for the actual default values.