Commit c8a6ca8

[FLINK-35493][snapshot] Add historical cleanup for FlinkStateSnapshot CRs
1 parent d03b816 commit c8a6ca8

17 files changed: 1041 additions, 312 deletions


docs/content/docs/custom-resource/snapshots.md

Lines changed: 27 additions & 12 deletions
@@ -168,28 +168,43 @@ There is no guarantee on the timely execution of the periodic snapshots as they
168168
The operator automatically keeps track of the snapshot history triggered by upgrade, manual and periodic snapshot operations.
169169
This is necessary so cleanup can be performed by the operator for old snapshots.
170170

171-
Users can control the cleanup behaviour by specifying a maximum age and maximum count for the savepoint and checkpoint resources in the history.
171+
{{< hint info >}}
172+
Snapshot cleanup happens lazily and only when the Flink resource associated with the snapshot is running.
173+
It is therefore very likely that savepoints live beyond the max age configuration.
174+
{{< /hint >}}
175+
176+
#### Savepoints
172177

178+
Users can control the cleanup behaviour by specifying maximum age and maximum count for savepoints.
179+
If a max age is specified, FlinkStateSnapshot resources of savepoint type will be cleaned up based on the `metadata.creationTimestamp` field.
180+
Snapshots will be cleaned up regardless of their status, but the operator will always keep at least 1 completed FlinkStateSnapshot for every Flink job at all times.
181+
182+
Example configuration:
173183
```
174184
kubernetes.operator.savepoint.history.max.age: 24 h
175185
kubernetes.operator.savepoint.history.max.count: 5
176-
177-
kubernetes.operator.checkpoint.history.max.age: 24 h
178-
kubernetes.operator.checkpoint.history.max.count: 5
179186
```
180187

188+
To also dispose of savepoint data on savepoint cleanup, set `kubernetes.operator.savepoint.dispose-on-delete: true`.
189+
This config will set `spec.savepoint.disposeOnDelete` to true for FlinkStateSnapshot CRs created by upgrade, periodic and manual savepoints created using `savepointTriggerNonce`.
190+
191+
To disable automatic savepoint cleanup by the operator you can set `kubernetes.operator.savepoint.cleanup.enabled: false`.
192+
193+
#### Checkpoints
194+
195+
FlinkStateSnapshots of checkpoint type will always be cleaned up. It's not possible to set max age for them.
196+
The maximum number of checkpoint resources retained is determined by the Flink configuration `state.checkpoints.num-retained`.
197+
181198
{{< hint warning >}}
182-
Checkpoint history history cleanup is only supported if FlinkStateSnapshot resources are enabled.
199+
Checkpoint cleanup is only supported if FlinkStateSnapshot resources are enabled.
183200
This operation will only delete the FlinkStateSnapshot CR, and will never delete any checkpoint data on the filesystem.
184201
{{< /hint >}}
185202

186-
{{< hint info >}}
187-
Savepoint cleanup happens lazily and only when the Flink resource associated with the snapshot is running.
188-
It is therefore very likely that savepoints live beyond the max age configuration.
189-
{{< /hint >}}
190203

191-
To also dispose of savepoint data on savepoint cleanup, set `kubernetes.operator.savepoint.dispose-on-delete: true`.
192-
This config will set `spec.savepoint.disposeOnDelete` to true for FlinkStateSnapshot CRs created by periodic savepoints and manual ones created using `savepointTriggerNonce`.
204+
### Snapshot History For Legacy Savepoints
193205

194-
To disable savepoint/checkpoint cleanup by the operator you can set `kubernetes.operator.savepoint.cleanup.enabled: false` and `kubernetes.operator.checkpoint.cleanup.enabled: false`.
206+
Legacy savepoints found in FlinkDeployment/FlinkSessionJob CRs under the deprecated `status.jobStatus.savepointInfo.savepointHistory` will be cleaned up:
207+
- For max age, it will be cleaned up when its trigger timestamp exceeds max age
208+
- For max count and FlinkStateSnapshot resources **disabled**, it will be cleaned up when `savepointHistory` exceeds max count
209+
- For max count and FlinkStateSnapshot resources **enabled**, it will be cleaned up when `savepointHistory` + number of FlinkStateSnapshot CRs related to the job exceed max count
195210
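The savepoint retention rules described in this documentation hunk (max age, max count, and always keeping one completed snapshot) can be sketched roughly as follows. This is an illustration only, not the operator's implementation: the `Snapshot` record, the `selectForCleanup` method, and the choice to protect the *newest* completed snapshot are assumptions made for the example, and the sketch requires Java 16+ for records.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class SavepointCleanupSketch {
    // Hypothetical stand-in for a savepoint-type FlinkStateSnapshot CR.
    record Snapshot(String name, Instant created, boolean completed) {}

    // Returns the names of snapshots that would be cleaned up, newest first.
    static List<String> selectForCleanup(
            List<Snapshot> snapshots, Duration maxAge, int maxCount, Instant now) {
        List<Snapshot> newestFirst = new ArrayList<>(snapshots);
        newestFirst.sort(Comparator.comparing(Snapshot::created).reversed());

        // Assumption: the newest completed snapshot is the one that is always kept.
        Snapshot keep =
                newestFirst.stream().filter(Snapshot::completed).findFirst().orElse(null);

        List<String> toDelete = new ArrayList<>();
        // Reserve one slot of the max count for the protected snapshot, if any.
        int retained = keep != null ? 1 : 0;
        for (Snapshot s : newestFirst) {
            if (s == keep) {
                continue;
            }
            boolean tooOld = s.created().isBefore(now.minus(maxAge));
            boolean overCount = retained >= maxCount;
            if (tooOld || overCount) {
                toDelete.add(s.name()); // removed regardless of its status
            } else {
                retained++;
            }
        }
        return toDelete;
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2024-06-02T12:00:00Z");
        List<Snapshot> history = List.of(
                new Snapshot("a", Instant.parse("2024-05-30T12:00:00Z"), true),
                new Snapshot("b", Instant.parse("2024-06-02T08:00:00Z"), false),
                new Snapshot("c", Instant.parse("2024-06-02T10:00:00Z"), true),
                new Snapshot("d", Instant.parse("2024-06-02T11:00:00Z"), true));
        // With max age 24 h and max count 2, this prints [b, a].
        System.out.println(selectForCleanup(history, Duration.ofHours(24), 2, now));
    }
}
```

In this sketch a lone completed snapshot is never selected, even when it is older than the max age, which mirrors the "keep at least 1 completed FlinkStateSnapshot" rule stated in the docs.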

docs/layouts/shortcodes/generated/dynamic_section.html

Lines changed: 3 additions & 3 deletions
@@ -156,7 +156,7 @@
156156
<td><h5>kubernetes.operator.savepoint.cleanup.enabled</h5></td>
157157
<td style="word-wrap: break-word;">true</td>
158158
<td>Boolean</td>
159-
<td>Whether to enable clean up of savepoint history.</td>
159+
<td>Whether to enable clean up of savepoint FlinkStateSnapshot resources. Savepoint state will be disposed of as well if the snapshot CR spec is configured as such. For automatic savepoints this can be configured via the kubernetes.operator.savepoint.dispose-on-delete config option.</td>
160160
</tr>
161161
<tr>
162162
<td><h5>kubernetes.operator.savepoint.dispose-on-delete</h5></td>
@@ -174,13 +174,13 @@
174174
<td><h5>kubernetes.operator.savepoint.history.max.age</h5></td>
175175
<td style="word-wrap: break-word;">1 d</td>
176176
<td>Duration</td>
177-
<td>Maximum age for savepoint history entries to retain. Due to lazy clean-up, the most recent savepoint may live longer than the max age.</td>
177+
<td>Maximum age for savepoint FlinkStateSnapshot resources to retain. Due to lazy clean-up, the most recent savepoint may live longer than the max age.</td>
178178
</tr>
179179
<tr>
180180
<td><h5>kubernetes.operator.savepoint.history.max.count</h5></td>
181181
<td style="word-wrap: break-word;">10</td>
182182
<td>Integer</td>
183-
<td>Maximum number of savepoint history entries to retain.</td>
183+
<td>Maximum number of savepoint FlinkStateSnapshot resources entries to retain.</td>
184184
</tr>
185185
<tr>
186186
<td><h5>kubernetes.operator.savepoint.trigger.grace-period</h5></td>

docs/layouts/shortcodes/generated/kubernetes_operator_config_configuration.html

Lines changed: 5 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -354,7 +354,7 @@
354354
<td><h5>kubernetes.operator.savepoint.cleanup.enabled</h5></td>
355355
<td style="word-wrap: break-word;">true</td>
356356
<td>Boolean</td>
357-
<td>Whether to enable clean up of savepoint history.</td>
357+
<td>Whether to enable clean up of savepoint FlinkStateSnapshot resources. Savepoint state will be disposed of as well if the snapshot CR spec is configured as such. For automatic savepoints this can be configured via the kubernetes.operator.savepoint.dispose-on-delete config option.</td>
358358
</tr>
359359
<tr>
360360
<td><h5>kubernetes.operator.savepoint.dispose-on-delete</h5></td>
@@ -372,25 +372,25 @@
372372
<td><h5>kubernetes.operator.savepoint.history.max.age</h5></td>
373373
<td style="word-wrap: break-word;">1 d</td>
374374
<td>Duration</td>
375-
<td>Maximum age for savepoint history entries to retain. Due to lazy clean-up, the most recent savepoint may live longer than the max age.</td>
375+
<td>Maximum age for savepoint FlinkStateSnapshot resources to retain. Due to lazy clean-up, the most recent savepoint may live longer than the max age.</td>
376376
</tr>
377377
<tr>
378378
<td><h5>kubernetes.operator.savepoint.history.max.age.threshold</h5></td>
379379
<td style="word-wrap: break-word;">(none)</td>
380380
<td>Duration</td>
381-
<td>Maximum age threshold for savepoint history entries to retain.</td>
381+
<td>Maximum age threshold for FlinkStateSnapshot resources to retain.</td>
382382
</tr>
383383
<tr>
384384
<td><h5>kubernetes.operator.savepoint.history.max.count</h5></td>
385385
<td style="word-wrap: break-word;">10</td>
386386
<td>Integer</td>
387-
<td>Maximum number of savepoint history entries to retain.</td>
387+
<td>Maximum number of savepoint FlinkStateSnapshot resources entries to retain.</td>
388388
</tr>
389389
<tr>
390390
<td><h5>kubernetes.operator.savepoint.history.max.count.threshold</h5></td>
391391
<td style="word-wrap: break-word;">(none)</td>
392392
<td>Integer</td>
393-
<td>Maximum number threshold of savepoint history entries to retain.</td>
393+
<td>Maximum number threshold of savepoint FlinkStateSnapshot resources to retain.</td>
394394
</tr>
395395
<tr>
396396
<td><h5>kubernetes.operator.savepoint.trigger.grace-period</h5></td>

docs/layouts/shortcodes/generated/system_advanced_section.html

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -84,13 +84,13 @@
8484
<td><h5>kubernetes.operator.savepoint.history.max.age.threshold</h5></td>
8585
<td style="word-wrap: break-word;">(none)</td>
8686
<td>Duration</td>
87-
<td>Maximum age threshold for savepoint history entries to retain.</td>
87+
<td>Maximum age threshold for FlinkStateSnapshot resources to retain.</td>
8888
</tr>
8989
<tr>
9090
<td><h5>kubernetes.operator.savepoint.history.max.count.threshold</h5></td>
9191
<td style="word-wrap: break-word;">(none)</td>
9292
<td>Integer</td>
93-
<td>Maximum number threshold of savepoint history entries to retain.</td>
93+
<td>Maximum number threshold of savepoint FlinkStateSnapshot resources to retain.</td>
9494
</tr>
9595
<tr>
9696
<td><h5>kubernetes.operator.startup.stop-on-informer-error</h5></td>

flink-autoscaler/src/main/java/org/apache/flink/autoscaler/utils/DateTimeUtils.java

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -60,4 +60,14 @@ public static String kubernetes(Instant instant) {
6060
ZonedDateTime dateTime = instant.atZone(ZoneId.systemDefault());
6161
return dateTime.format(DateTimeFormatter.ISO_INSTANT);
6262
}
63+
64+
/**
65+
* Parses a Kubernetes-compatible datetime.
66+
*
67+
* @param datetime datetime in Kubernetes format
68+
* @return time parsed
69+
*/
70+
public static Instant parseKubernetes(String datetime) {
71+
return Instant.parse(datetime);
72+
}
6373
}
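As a hedged illustration of how a timestamp parsed this way might feed a max-age check on `metadata.creationTimestamp`, the sketch below uses only `java.time`. The class `SnapshotAgeSketch` and the method `isOlderThan` are hypothetical names for this example and are not part of the commit.

```java
import java.time.Duration;
import java.time.Instant;

class SnapshotAgeSketch {
    // Hypothetical helper: true if the resource was created more than maxAge before now.
    static boolean isOlderThan(String creationTimestamp, Duration maxAge, Instant now) {
        // Kubernetes timestamps such as "2024-06-01T11:00:00Z" are ISO-8601 instants,
        // so Instant.parse accepts them directly.
        Instant created = Instant.parse(creationTimestamp);
        return created.isBefore(now.minus(maxAge));
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2024-06-02T12:00:00Z");
        // Created 25 hours before "now": prints true.
        System.out.println(isOlderThan("2024-06-01T11:00:00Z", Duration.ofHours(24), now));
        // Created 12 hours before "now": prints false.
        System.out.println(isOlderThan("2024-06-02T00:00:00Z", Duration.ofHours(24), now));
    }
}
```

Passing `now` explicitly, rather than calling `Instant.now()` inside the helper, keeps the check deterministic and easy to test.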

flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/config/FlinkOperatorConfiguration.java

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -115,6 +115,7 @@ public static FlinkOperatorConfiguration fromConfiguration(Configuration operato
115115
operatorConfig.get(
116116
KubernetesOperatorConfigOptions
117117
.OPERATOR_SAVEPOINT_HISTORY_MAX_AGE_THRESHOLD);
118+
118119
Boolean exceptionStackTraceEnabled =
119120
operatorConfig.get(
120121
KubernetesOperatorConfigOptions.OPERATOR_EXCEPTION_STACK_TRACE_ENABLED);

flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/config/KubernetesOperatorConfigOptions.java

Lines changed: 39 additions & 35 deletions
Original file line numberDiff line numberDiff line change
@@ -210,35 +210,70 @@ public static String operatorConfigKey(String key) {
210210
.withDescription(
211211
"Whether to enable recovery of missing/deleted jobmanager deployments.");
212212

213+
@Documentation.Section(SECTION_DYNAMIC)
214+
public static final ConfigOption<Boolean> OPERATOR_JOB_SAVEPOINT_DISPOSE_ON_DELETE =
215+
operatorConfig("savepoint.dispose-on-delete")
216+
.booleanType()
217+
.defaultValue(false)
218+
.withDescription(
219+
"Savepoint data for FlinkStateSnapshot resources created by the operator during upgrades and periodic savepoints will be disposed of automatically when the generated Kubernetes resource is deleted.");
220+
221+
@Documentation.Section(SECTION_DYNAMIC)
222+
public static final ConfigOption<SavepointFormatType> OPERATOR_SAVEPOINT_FORMAT_TYPE =
223+
operatorConfig("savepoint.format.type")
224+
.enumType(SavepointFormatType.class)
225+
.defaultValue(SavepointFormatType.DEFAULT)
226+
.withDescription(
227+
"Type of the binary format in which a savepoint should be taken.");
228+
229+
@Documentation.Section(SECTION_DYNAMIC)
230+
public static final ConfigOption<CheckpointType> OPERATOR_CHECKPOINT_TYPE =
231+
operatorConfig("checkpoint.type")
232+
.enumType(CheckpointType.class)
233+
.defaultValue(CheckpointType.FULL)
234+
.withDescription("Type of checkpoint.");
235+
213236
@Documentation.Section(SECTION_DYNAMIC)
214237
public static final ConfigOption<Boolean> OPERATOR_SAVEPOINT_CLEANUP_ENABLED =
215238
operatorConfig("savepoint.cleanup.enabled")
216239
.booleanType()
217240
.defaultValue(true)
218-
.withDescription("Whether to enable clean up of savepoint history.");
241+
.withDescription(
242+
String.format(
243+
"Whether to enable clean up of savepoint FlinkStateSnapshot resources. Savepoint state will be disposed of as well if the snapshot CR spec is configured as such. For automatic savepoints this can be configured via the %s config option.",
244+
OPERATOR_JOB_SAVEPOINT_DISPOSE_ON_DELETE.key()));
219245

220246
@Documentation.Section(SECTION_DYNAMIC)
221247
public static final ConfigOption<Integer> OPERATOR_SAVEPOINT_HISTORY_MAX_COUNT =
222248
operatorConfig("savepoint.history.max.count")
223249
.intType()
224250
.defaultValue(10)
225-
.withDescription("Maximum number of savepoint history entries to retain.");
251+
.withDescription(
252+
"Maximum number of savepoint FlinkStateSnapshot resources entries to retain.");
226253

227254
@Documentation.Section(SECTION_ADVANCED)
228255
public static final ConfigOption<Integer> OPERATOR_SAVEPOINT_HISTORY_MAX_COUNT_THRESHOLD =
229256
ConfigOptions.key(OPERATOR_SAVEPOINT_HISTORY_MAX_COUNT.key() + ".threshold")
230257
.intType()
231258
.noDefaultValue()
232259
.withDescription(
233-
"Maximum number threshold of savepoint history entries to retain.");
260+
"Maximum number threshold of savepoint FlinkStateSnapshot resources to retain.");
234261

235262
@Documentation.Section(SECTION_DYNAMIC)
236263
public static final ConfigOption<Duration> OPERATOR_SAVEPOINT_HISTORY_MAX_AGE =
237264
operatorConfig("savepoint.history.max.age")
238265
.durationType()
239266
.defaultValue(Duration.ofHours(24))
240267
.withDescription(
241-
"Maximum age for savepoint history entries to retain. Due to lazy clean-up, the most recent savepoint may live longer than the max age.");
268+
"Maximum age for savepoint FlinkStateSnapshot resources to retain. Due to lazy clean-up, the most recent savepoint may live longer than the max age.");
269+
270+
@Documentation.Section(SECTION_ADVANCED)
271+
public static final ConfigOption<Duration> OPERATOR_SAVEPOINT_HISTORY_MAX_AGE_THRESHOLD =
272+
ConfigOptions.key(OPERATOR_SAVEPOINT_HISTORY_MAX_AGE.key() + ".threshold")
273+
.durationType()
274+
.noDefaultValue()
275+
.withDescription(
276+
"Maximum age threshold for FlinkStateSnapshot resources to retain.");
242277

243278
@Documentation.Section(SECTION_SYSTEM)
244279
public static final ConfigOption<Boolean> OPERATOR_EXCEPTION_STACK_TRACE_ENABLED =
@@ -280,14 +315,6 @@ public static String operatorConfigKey(String key) {
280315
.withDescription(
281316
"Key-Value pair where key is the REGEX to filter through the exception messages and value is the string to be included in CR status error label field if the REGEX matches. Expected format: headerKey1:headerValue1,headerKey2:headerValue2.");
282317

283-
@Documentation.Section(SECTION_ADVANCED)
284-
public static final ConfigOption<Duration> OPERATOR_SAVEPOINT_HISTORY_MAX_AGE_THRESHOLD =
285-
ConfigOptions.key(OPERATOR_SAVEPOINT_HISTORY_MAX_AGE.key() + ".threshold")
286-
.durationType()
287-
.noDefaultValue()
288-
.withDescription(
289-
"Maximum age threshold for savepoint history entries to retain.");
290-
291318
@Documentation.Section(SECTION_DYNAMIC)
292319
public static final ConfigOption<Map<String, String>> JAR_ARTIFACT_HTTP_HEADER =
293320
operatorConfig("user.artifacts.http.header")
@@ -438,29 +465,6 @@ public static String operatorConfigKey(String key) {
438465
.withDescription(
439466
"Max allowed checkpoint age for initiating last-state upgrades on running jobs. If a checkpoint is not available within the desired age (and nothing in progress) a savepoint will be triggered.");
440467

441-
@Documentation.Section(SECTION_DYNAMIC)
442-
public static final ConfigOption<Boolean> OPERATOR_JOB_SAVEPOINT_DISPOSE_ON_DELETE =
443-
operatorConfig("savepoint.dispose-on-delete")
444-
.booleanType()
445-
.defaultValue(false)
446-
.withDescription(
447-
"Savepoint data for FlinkStateSnapshot resources created by the operator during upgrades and periodic savepoints will be disposed of automatically when the generated Kubernetes resource is deleted.");
448-
449-
@Documentation.Section(SECTION_DYNAMIC)
450-
public static final ConfigOption<SavepointFormatType> OPERATOR_SAVEPOINT_FORMAT_TYPE =
451-
operatorConfig("savepoint.format.type")
452-
.enumType(SavepointFormatType.class)
453-
.defaultValue(SavepointFormatType.DEFAULT)
454-
.withDescription(
455-
"Type of the binary format in which a savepoint should be taken.");
456-
457-
@Documentation.Section(SECTION_DYNAMIC)
458-
public static final ConfigOption<CheckpointType> OPERATOR_CHECKPOINT_TYPE =
459-
operatorConfig("checkpoint.type")
460-
.enumType(CheckpointType.class)
461-
.defaultValue(CheckpointType.FULL)
462-
.withDescription("Type of checkpoint.");
463-
464468
@Documentation.Section(SECTION_ADVANCED)
465469
public static final ConfigOption<Boolean> OPERATOR_HEALTH_PROBE_ENABLED =
466470
operatorConfig("health.probe.enabled")

flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/controller/FlinkDeploymentController.java

Lines changed: 17 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -19,6 +19,7 @@
1919

2020
import org.apache.flink.api.common.JobStatus;
2121
import org.apache.flink.kubernetes.operator.api.FlinkDeployment;
22+
import org.apache.flink.kubernetes.operator.api.FlinkStateSnapshot;
2223
import org.apache.flink.kubernetes.operator.api.status.FlinkDeploymentStatus;
2324
import org.apache.flink.kubernetes.operator.api.status.JobManagerDeploymentStatus;
2425
import org.apache.flink.kubernetes.operator.exception.DeploymentFailedException;
@@ -31,6 +32,7 @@
3132
import org.apache.flink.kubernetes.operator.service.FlinkResourceContextFactory;
3233
import org.apache.flink.kubernetes.operator.utils.EventRecorder;
3334
import org.apache.flink.kubernetes.operator.utils.EventSourceUtils;
35+
import org.apache.flink.kubernetes.operator.utils.KubernetesClientUtils;
3436
import org.apache.flink.kubernetes.operator.utils.StatusRecorder;
3537
import org.apache.flink.kubernetes.operator.utils.ValidatorUtils;
3638
import org.apache.flink.kubernetes.operator.validation.FlinkResourceValidator;
@@ -49,6 +51,8 @@
4951
import org.slf4j.Logger;
5052
import org.slf4j.LoggerFactory;
5153

54+
import java.util.ArrayList;
55+
import java.util.List;
5256
import java.util.Map;
5357
import java.util.Optional;
5458
import java.util.Set;
@@ -197,9 +201,19 @@ private void handleRecoveryFailed(
197201
@Override
198202
public Map<String, EventSource> prepareEventSources(
199203
EventSourceContext<FlinkDeployment> context) {
200-
return EventSourceInitializer.nameEventSources(
201-
EventSourceUtils.getSessionJobInformerEventSource(context),
202-
EventSourceUtils.getDeploymentInformerEventSource(context));
204+
List<EventSource> eventSources = new ArrayList<>();
205+
eventSources.add(EventSourceUtils.getSessionJobInformerEventSource(context));
206+
eventSources.add(EventSourceUtils.getDeploymentInformerEventSource(context));
207+
208+
if (KubernetesClientUtils.isCrdInstalled(FlinkStateSnapshot.class)) {
209+
eventSources.add(
210+
EventSourceUtils.getStateSnapshotForFlinkResourceInformerEventSource(context));
211+
} else {
212+
LOG.warn(
213+
"Could not initialize informer for snapshots as the CRD has not been installed!");
214+
}
215+
216+
return EventSourceInitializer.nameEventSources(eventSources.toArray(EventSource[]::new));
203217
}
204218

205219
@Override
