|
| 1 | +--- |
| 2 | +title: "Snapshots" |
| 3 | +weight: 3 |
| 4 | +type: docs |
| 5 | +aliases: |
| 6 | + - /custom-resource/snapshots.html |
| 7 | +--- |
| 8 | +<!-- |
| 9 | +Licensed to the Apache Software Foundation (ASF) under one |
| 10 | +or more contributor license agreements. See the NOTICE file |
| 11 | +distributed with this work for additional information |
| 12 | +regarding copyright ownership. The ASF licenses this file |
| 13 | +to you under the Apache License, Version 2.0 (the |
| 14 | +"License"); you may not use this file except in compliance |
| 15 | +with the License. You may obtain a copy of the License at |
| 16 | +
|
| 17 | + http://www.apache.org/licenses/LICENSE-2.0 |
| 18 | +
|
| 19 | +Unless required by applicable law or agreed to in writing, |
| 20 | +software distributed under the License is distributed on an |
| 21 | +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| 22 | +KIND, either express or implied. See the License for the |
| 23 | +specific language governing permissions and limitations |
| 24 | +under the License. |
| 25 | +--> |
| 26 | + |
| 27 | +# Snapshots |
| 28 | + |
| 29 | +To create, list and delete snapshots you can use the custom resource called FlinkStateSnapshot. |
| 30 | +The operator will use the same controller flow as in the case of FlinkDeployment and FlinkSessionJob to trigger the savepoint/checkpoint and observe its status. |
| 31 | + |
| 32 | +This feature deprecates the old `savepointInfo` and `checkpointInfo` fields found in the Flink resource CR status, along with the spec fields `initialSavepointPath`, `savepointTriggerNonce` and `checkpointTriggerNonce`. |
| 33 | +It is enabled by default using the configuration option `kubernetes.operator.snapshot.resource.enabled`. |
| 34 | +If you set this to false, the operator will keep using the deprecated status fields to track snapshots. |
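As a minimal sketch of opting out, the option named above could be set to `false`. Where exactly it is placed is an assumption here: it is written as a plain key-value entry, as one would add to the operator's default configuration.

```yaml
# Assumption: this entry lives in the operator's configuration.
# With "false", snapshots are tracked through the deprecated status fields.
kubernetes.operator.snapshot.resource.enabled: false
```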
| 35 | + |
| 36 | +## Overview |
| 37 | + |
| 38 | +To create a savepoint or checkpoint, exactly one of the spec fields `savepoint` or `checkpoint` must be present. |
| 39 | +Furthermore, in case of a savepoint you can signal to the operator that the savepoint already exists using the `alreadyExists` field, and the operator will mark it as a successful snapshot in the next reconciliation phase. |
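To illustrate the `alreadyExists` case, the sketch below registers a savepoint that was taken outside of this CR. The resource name and the path are hypothetical placeholders; the field names are the ones used in the savepoint example on this page.

```yaml
apiVersion: flink.apache.org/v1beta1
kind: FlinkStateSnapshot
metadata:
  name: example-existing-savepoint   # hypothetical name
spec:
  jobReference:
    kind: FlinkDeployment
    name: example-deployment
  savepoint:
    alreadyExists: true   # nothing is triggered; the snapshot is marked successful on reconciliation
    path: /flink-data/savepoints-custom/savepoint-example   # hypothetical path of the existing savepoint
```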
| 40 | + |
| 41 | +You can also instruct the Operator to start a new FlinkDeployment/FlinkSessionJob from an existing snapshot CR by using `flinkStateSnapshotReference` in the job spec. |
| 42 | + |
| 43 | +## Examples |
| 44 | + |
| 45 | +### Savepoint |
| 46 | + |
| 47 | +```yaml |
| 48 | +apiVersion: flink.apache.org/v1beta1 |
| 49 | +kind: FlinkStateSnapshot |
| 50 | +metadata: |
| 51 | + name: example-savepoint |
| 52 | +spec: |
| 53 | + backoffLimit: 1 # retry count, -1 for infinite, 0 for no retries (default: -1) |
| 54 | + jobReference: |
| 55 | + kind: FlinkDeployment # FlinkDeployment or FlinkSessionJob |
| 56 | + name: example-deployment # name of the resource |
| 57 | + savepoint: |
| 58 | + alreadyExists: false # optional (default: false), if true, the path is considered to already exist and state will be COMPLETED on first reconciliation |
| 59 | + disposeOnDelete: true # optional (default: true), dispose of savepoint when this FlinkStateSnapshot is removed, job needs to be running |
| 60 | + formatType: CANONICAL # optional (default: CANONICAL), format type of savepoint |
| 61 | + path: /flink-data/savepoints-custom # optional (default: job savepoint path) |
| 62 | +``` |
| 63 | +
|
| 64 | +### Checkpoint |
| 65 | +
|
| 66 | +```yaml |
| 67 | +apiVersion: flink.apache.org/v1beta1 |
| 68 | +kind: FlinkStateSnapshot |
| 69 | +metadata: |
| 70 | + name: example-checkpoint |
| 71 | +spec: |
| 72 | + backoffLimit: 1 |
| 73 | + jobReference: |
| 74 | + kind: FlinkDeployment |
| 75 | + name: example-deployment |
| 76 | + checkpoint: {} |
| 77 | +``` |
| 78 | +
|
| 79 | +### Start job from existing snapshot |
| 80 | +
|
| 81 | +```yaml |
| 82 | + job: |
| 83 | + flinkStateSnapshotReference: |
| 84 | + namespace: flink # not required if it's in the same namespace |
| 85 | + name: example-savepoint |
| 86 | +``` |
| 87 | +
|
| 88 | +{{< hint warning >}} |
| 89 | +While it is possible to start a job from a FlinkStateSnapshot with checkpoint type, checkpoint data is owned by Flink, and might be deleted by Flink anytime after triggering the checkpoint. |
| 90 | +{{< /hint >}} |
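For context, the fragment above sits inside the `job` section of a Flink resource. The following is a hedged sketch of a FlinkDeployment using it; the resource name and jar are placeholders, and the other fields a real deployment needs are deliberately left out.

```yaml
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: example-deployment-restored   # hypothetical name
spec:
  # image, flinkVersion, jobManager, taskManager, etc. omitted for brevity
  job:
    jarURI: local:///opt/flink/examples/streaming/StateMachineExample.jar   # placeholder jar
    flinkStateSnapshotReference:
      name: example-savepoint   # FlinkStateSnapshot CR in the same namespace
```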
| 91 | +
|
| 92 | +
|
| 93 | +## Snapshot CR lifecycle |
| 94 | +
|
| 95 | +### Snapshot creation |
| 96 | +
|
| 97 | +When a new FlinkStateSnapshot CR is created, in the first reconciliation phase the operator will trigger the savepoint/checkpoint for the linked deployment via REST API. |
| 98 | +The resulting trigger ID will be added to the CR Status. |
| 99 | +
|
| 100 | +In the next observation phase the operator will check all the in-progress snapshots and query their state. |
| 101 | +If the snapshot was successful, the path will be added to the CR Status. |
| 102 | +
|
| 103 | +If the triggered snapshot is a savepoint and `spec.savepoint.alreadyExists` is set to true, on the first reconciliation the operator will populate its `status` fields with `COMPLETED` state, and copy the savepoint path found in the spec to `status.path`. |
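As a rough illustration of the `alreadyExists` case, the status of such a CR might look like the sketch below after the first reconciliation. The `path` field is named in the text above; the `state` field name and the path value are assumptions made for illustration, and a real status may contain further fields.

```yaml
status:
  state: COMPLETED   # assumed field name for the snapshot state
  path: /flink-data/savepoints-custom/savepoint-example   # copied from spec.savepoint.path (hypothetical value)
```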
| 104 | + |
| 105 | +### Snapshot errors |
| 106 | + |
| 107 | +If the operator encountered any errors during snapshot observation/reconciliation, the `error` field will be populated in the CR status and the `failures` field will be incremented by 1. |
| 108 | +If the backoff limit specified in the spec is reached, the snapshot will enter a `FAILED` state, and won't be retried. |
| 109 | +If it's not reached, the operator will retry the snapshot with an exponentially increasing delay (10s, 20s, 40s, ...). |
| 110 | + |
| 111 | +In case of any error there will also be a new Event generated for the snapshot resource containing the error message. |
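The retry behaviour is controlled by `spec.backoffLimit`. The sketch below, with a hypothetical resource name, disables retries entirely so that a failed snapshot is not attempted again; the value semantics in the comment are taken from the savepoint example earlier on this page.

```yaml
apiVersion: flink.apache.org/v1beta1
kind: FlinkStateSnapshot
metadata:
  name: example-savepoint-no-retry   # hypothetical name
spec:
  backoffLimit: 0   # 0 = no retries, -1 = retry indefinitely, N > 0 = retry count
  jobReference:
    kind: FlinkDeployment
    name: example-deployment
  savepoint: {}     # all savepoint fields are optional; defaults apply
```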
| 112 | + |
| 113 | +{{< hint info >}} |
| 114 | +For checkpoints, after the operator has ensured that the checkpoint was successful, it will attempt to fetch its final path via the Flink REST API. |
| 115 | +Any errors experienced during this step will generate a Kubernetes event, but will not populate the `error` field, and will mark the checkpoint as `COMPLETED`. |
| 116 | +The `path` field will stay empty though. |
| 117 | +{{< /hint >}} |
| 118 | + |
| 119 | +### Snapshot abandonment |
| 120 | + |
| 121 | +If the referenced Flink job can't be found or is stopped after triggering a snapshot, the state of the snapshot will be `ABANDONED` and won't be retried. |
| 122 | + |
| 123 | +### Savepoint disposal on deletion |
| 124 | + |
| 125 | +In case of savepoints, if `spec.savepoint.disposeOnDelete` is true, the operator will automatically dispose of the savepoint on the filesystem when the CR gets deleted. |
| 126 | +This however requires the referenced Flink resource to be alive, as this operation is done using the Flink REST API. |
| 127 | + |
| 128 | +This feature is not available for checkpoints. |
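If the savepoint data should outlive the CR, for example when it is kept as a long-term backup, disposal can be switched off per snapshot. A minimal sketch with a hypothetical resource name:

```yaml
apiVersion: flink.apache.org/v1beta1
kind: FlinkStateSnapshot
metadata:
  name: example-savepoint-keep-data   # hypothetical name
spec:
  jobReference:
    kind: FlinkDeployment
    name: example-deployment
  savepoint:
    disposeOnDelete: false   # deleting this CR leaves the savepoint files in place
```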
| 129 | + |
| 130 | + |
| 131 | +## Triggering snapshots |
| 132 | + |
| 133 | +Upgrade savepoints are triggered automatically by the system during the upgrade process as we have seen in the previous sections. |
| 134 | +In this case, the savepoint path will also be recorded in the `upgradeSnapshotReference` job status field, which the operator will use when restarting the job. |
| 135 | + |
| 136 | +For backup, job forking and other purposes, savepoints and checkpoints can be triggered manually or periodically by the operator. Generally speaking, however, these will not be used during upgrades and are not required for correct operation. |
| 137 | + |
| 138 | +### Manual Snapshot Triggering |
| 139 | + |
| 140 | +Users can trigger snapshots manually by setting a new (different/random) value for `savepointTriggerNonce` or `checkpointTriggerNonce` in the job specification: |
| 141 | + |
| 142 | +```yaml |
| 143 | + job: |
| 144 | + ... |
| 145 | + savepointTriggerNonce: 123 |
| 146 | + checkpointTriggerNonce: 123 |
| 147 | + ... |
| 148 | +``` |
| 149 | + |
| 150 | +Changing the nonce value will trigger a new snapshot. If FlinkStateSnapshot resources are enabled, a new snapshot CR will be automatically created. |
| 151 | +If disabled, information about pending and last snapshots is stored in the FlinkDeployment/FlinkSessionJob CR status. |
| 152 | + |
| 153 | +### Periodic Snapshot Triggering |
| 154 | + |
| 155 | +The operator also supports periodic snapshot triggering through the following config options, which can be set on a per-job level: |
| 156 | + |
| 157 | +```yaml |
| 158 | + flinkConfiguration: |
| 159 | + ... |
| 160 | + kubernetes.operator.periodic.savepoint.interval: 6h |
| 161 | + kubernetes.operator.periodic.checkpoint.interval: 6h |
| 162 | +``` |
| 163 | + |
| 164 | +There is no guarantee on the timely execution of periodic snapshots, as they might be delayed by an unhealthy job status or other interfering user operations. |
| 165 | + |
| 166 | +### Snapshot History |
| 167 | + |
| 168 | +The operator automatically keeps track of the snapshot history triggered by upgrade, manual and periodic snapshot operations. |
| 169 | +This is necessary so cleanup can be performed by the operator for old snapshots. |
| 170 | + |
| 171 | +Users can control the cleanup behaviour by specifying a maximum age and maximum count for the savepoint and checkpoint resources in the history. |
| 172 | + |
| 173 | +```yaml |
| 174 | +kubernetes.operator.savepoint.history.max.age: 24 h |
| 175 | +kubernetes.operator.savepoint.history.max.count: 5 |
| 176 | + |
| 177 | +kubernetes.operator.checkpoint.history.max.age: 24 h |
| 178 | +kubernetes.operator.checkpoint.history.max.count: 5 |
| 179 | +``` |
| 180 | +
|
| 181 | +{{< hint warning >}} |
| 182 | +Checkpoint history cleanup is only supported if FlinkStateSnapshot resources are enabled. |
| 183 | +This operation will only delete the FlinkStateSnapshot CR, and will never delete any checkpoint data on the filesystem. |
| 184 | +{{< /hint >}} |
| 185 | +
|
| 186 | +{{< hint info >}} |
| 187 | +Savepoint cleanup happens lazily and only when the Flink resource associated with the snapshot is running. |
| 188 | +It is therefore very likely that savepoints live beyond the max age configuration. |
| 189 | +{{< /hint >}} |
| 190 | +
|
| 191 | +To also dispose of savepoint data on savepoint cleanup, set `kubernetes.operator.savepoint.dispose-on-delete: true`. |
| 192 | +This config will set `spec.savepoint.disposeOnDelete` to true for FlinkStateSnapshot CRs created by periodic savepoints and manual ones created using `savepointTriggerNonce`. |
| 193 | +
|
| 194 | +To disable savepoint/checkpoint cleanup by the operator you can set `kubernetes.operator.savepoint.cleanup.enabled: false` and `kubernetes.operator.checkpoint.cleanup.enabled: false`. |
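As a short sketch, the two options named above would be set as follows to turn off cleanup for both snapshot types. Where they are placed (operator configuration versus a resource's `flinkConfiguration`) is left open here and depends on your setup.

```yaml
# Keep all savepoint and checkpoint history entries; the operator performs no cleanup.
kubernetes.operator.savepoint.cleanup.enabled: false
kubernetes.operator.checkpoint.cleanup.enabled: false
```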
| 195 | +
|