Skip to content

Commit 38cee0e

Browse files
committed
Excluding the e2e B/G test execution from older Flink versions
Updated e2e yaml specs for Blue/Green tests Relocating the Blue/Green configuration map. Cleaning up and refactoring test code. Handling errors when fetching a savepoint. Handling exceptions when triggering a savepoint. Adding an error property to the FlinkBlueGreenDeploymentStatus to capture failure descriptions. Adding extra savepoint "carry over" feature Addressing edge error cases when patching an existing FlinkDeployment resource. Addressing PR comments More consistent reconcile result (UpdateControl) handling. Clearer comments Consolidated and organized the B/G utility methods. Taking savepoint also for LAST-STATE, removed the last checkpoint usage. Adjusted the semantics of the Diff class. PR comments. If a spec change comes in mid transition, we apply it right away. Removing redundant BlueGreenDiffType cases Addressing PR comments. Corrected abort/delay logic. Added the e2e tests to ci.yml. Missing Licenses. Improving/adding E2E tests for blue/green deployments. Checkstyle fixes. Updated unit test to assert Savepointing. Checkstyle fixes Adding support for Savepointing before transItion in the case of UpgradeMode.SAVEPOINT Optimized the B/G unit tests Triggering a full transition only when needed, otherwise just patch the child FlinkDeployment. Unit test added and simplified assertions. Optimizing the State Handling Introducing a Blue/Green State Machine Refactoring for clarity (added BlueGreenTransitionContext) - Refactoring (splitting) the Blue/Green controller logic from the Controller to a State Machine and Util methods. - Created a comparator for BlueGreenDeploymentSpec Added Blue/Green Deployments E2E test Optimized/simplified the reconciliation logic for first deployments. Clearer log statements. Fixing configOption default value management and log message formatting [FLINK-37515] FLIP-503: Basic support for Blue/Green deployments
1 parent b03ebae commit 38cee0e

File tree

41 files changed

+15377
-5
lines changed

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

41 files changed

+15377
-5
lines changed

.github/workflows/ci.yml

Lines changed: 18 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -172,6 +172,8 @@ jobs:
172172
- test_flink_operator_ha.sh
173173
- test_snapshot.sh
174174
- test_batch_job.sh
175+
- test_bluegreen_laststate.sh
176+
- test_bluegreen_stateless.sh
175177
exclude:
176178
- flink-version: v1_16
177179
test: test_autoscaler.sh
@@ -187,6 +189,10 @@ jobs:
187189
test: test_snapshot.sh
188190
- flink-version: v1_16
189191
test: test_batch_job.sh
192+
- flink-version: v1_16
193+
test: test_bluegreen_laststate.sh
194+
- flink-version: v1_16
195+
test: test_bluegreen_stateless.sh
190196
- flink-version: v1_17
191197
test: test_dynamic_config.sh
192198
- flink-version: v1_17
@@ -197,6 +203,10 @@ jobs:
197203
test: test_snapshot.sh
198204
- flink-version: v1_17
199205
test: test_batch_job.sh
206+
- flink-version: v1_17
207+
test: test_bluegreen_laststate.sh
208+
- flink-version: v1_17
209+
test: test_bluegreen_stateless.sh
200210
- flink-version: v1_18
201211
test: test_dynamic_config.sh
202212
- flink-version: v1_18
@@ -207,8 +217,16 @@ jobs:
207217
test: test_snapshot.sh
208218
- flink-version: v1_18
209219
test: test_batch_job.sh
220+
- flink-version: v1_18
221+
test: test_bluegreen_laststate.sh
222+
- flink-version: v1_18
223+
test: test_bluegreen_stateless.sh
210224
- flink-version: v1_19
211225
test: test_snapshot.sh
226+
- flink-version: v1_19
227+
test: test_bluegreen_laststate.sh
228+
- flink-version: v1_19
229+
test: test_bluegreen_stateless.sh
212230
uses: ./.github/workflows/e2e.yaml
213231
with:
214232
java-version: 17

docs/content/docs/custom-resource/reference.md

Lines changed: 55 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -57,6 +57,23 @@ This serves as a full reference for FlinkDeployment and FlinkSessionJob custom r
5757
| Parameter | Type | Docs |
5858
| ----------| ---- | ---- |
5959

60+
### FlinkBlueGreenDeploymentConfigOptions
61+
**Class**: org.apache.flink.kubernetes.operator.api.spec.FlinkBlueGreenDeploymentConfigOptions
62+
63+
**Description**: Configuration options to be used by the Flink Blue/Green Deployments.
64+
65+
| Parameter | Type | Docs |
66+
| ----------| ---- | ---- |
67+
68+
### FlinkBlueGreenDeploymentSpec
69+
**Class**: org.apache.flink.kubernetes.operator.api.spec.FlinkBlueGreenDeploymentSpec
70+
71+
**Description**: Spec that describes a Flink application with blue/green deployment capabilities.
72+
73+
| Parameter | Type | Docs |
74+
| ----------| ---- | ---- |
75+
| template | org.apache.flink.kubernetes.operator.api.spec.FlinkDeploymentTemplateSpec | |
76+
6077
### FlinkDeploymentSpec
6178
**Class**: org.apache.flink.kubernetes.operator.api.spec.FlinkDeploymentSpec
6279

@@ -78,6 +95,17 @@ This serves as a full reference for FlinkDeployment and FlinkSessionJob custom r
7895
| logConfiguration | java.util.Map<java.lang.String,java.lang.String> | Log configuration overrides for the Flink deployment. Format logConfigFileName -> configContent. |
7996
| mode | org.apache.flink.kubernetes.operator.api.spec.KubernetesDeploymentMode | Deployment mode of the Flink cluster, native or standalone. |
8097

98+
### FlinkDeploymentTemplateSpec
99+
**Class**: org.apache.flink.kubernetes.operator.api.spec.FlinkDeploymentTemplateSpec
100+
101+
**Description**: Template Spec that describes a Flink application managed by the blue/green controller.
102+
103+
| Parameter | Type | Docs |
104+
| ----------| ---- | ---- |
105+
| metadata | io.fabric8.kubernetes.api.model.ObjectMeta | |
106+
| configuration | java.util.Map<java.lang.String,java.lang.String> | |
107+
| spec | org.apache.flink.kubernetes.operator.api.spec.FlinkDeploymentSpec | |
108+
81109
### FlinkSessionJobSpec
82110
**Class**: org.apache.flink.kubernetes.operator.api.spec.FlinkSessionJobSpec
83111

@@ -290,6 +318,33 @@ This serves as a full reference for FlinkDeployment and FlinkSessionJob custom r
290318
| UNKNOWN | Checkpoint format unknown, if the checkpoint was not triggered by the operator. |
291319
| description | org.apache.flink.configuration.description.InlineElement | |
292320

321+
### FlinkBlueGreenDeploymentState
322+
**Class**: org.apache.flink.kubernetes.operator.api.status.FlinkBlueGreenDeploymentState
323+
324+
**Description**: Enumeration of the possible states of the blue/green transition.
325+
326+
| Value | Docs |
327+
| ----- | ---- |
328+
| INITIALIZING_BLUE | We use this state while initializing for the first time, always with a "Blue" deployment type. |
329+
| ACTIVE_BLUE | Identifies the system is running normally with a "Blue" deployment type. |
330+
| ACTIVE_GREEN | Identifies the system is running normally with a "Green" deployment type. |
331+
| TRANSITIONING_TO_BLUE | Identifies the system is transitioning from "Green" to "Blue". |
332+
| TRANSITIONING_TO_GREEN | Identifies the system is transitioning from "Blue" to "Green". |
333+
334+
### FlinkBlueGreenDeploymentStatus
335+
**Class**: org.apache.flink.kubernetes.operator.api.status.FlinkBlueGreenDeploymentStatus
336+
337+
**Description**: Last observed status of the Flink Blue/Green deployment.
338+
339+
| Parameter | Type | Docs |
340+
| ----------| ---- | ---- |
341+
| jobStatus | org.apache.flink.kubernetes.operator.api.status.JobStatus | |
342+
| blueGreenState | org.apache.flink.kubernetes.operator.api.status.FlinkBlueGreenDeploymentState | The state of the blue/green transition. |
343+
| lastReconciledSpec | java.lang.String | Last reconciled (serialized) deployment spec. |
344+
| lastReconciledTimestamp | java.lang.String | Timestamp of last reconciliation. |
345+
| abortTimestamp | java.lang.String | Computed from abortGracePeriodMs, timestamp after which the deployment should be aborted. |
346+
| deploymentReadyTimestamp | java.lang.String | Timestamp when the deployment became READY/STABLE. Used to determine when to delete it. |
347+
293348
### FlinkDeploymentReconciliationStatus
294349
**Class**: org.apache.flink.kubernetes.operator.api.status.FlinkDeploymentReconciliationStatus
295350

Lines changed: 77 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,77 @@
1+
################################################################################
2+
# Licensed to the Apache Software Foundation (ASF) under one
3+
# or more contributor license agreements. See the NOTICE file
4+
# distributed with this work for additional information
5+
# regarding copyright ownership. The ASF licenses this file
6+
# to you under the Apache License, Version 2.0 (the
7+
# "License"); you may not use this file except in compliance
8+
# with the License. You may obtain a copy of the License at
9+
#
10+
# http://www.apache.org/licenses/LICENSE-2.0
11+
#
12+
# Unless required by applicable law or agreed to in writing, software
13+
# distributed under the License is distributed on an "AS IS" BASIS,
14+
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15+
# See the License for the specific language governing permissions and
16+
# limitations under the License.
17+
################################################################################
18+
19+
apiVersion: flink.apache.org/v1beta1
20+
kind: FlinkBlueGreenDeployment
21+
metadata:
22+
name: basic-bluegreen-example
23+
spec:
24+
configuration:
25+
kubernetes.operator.bluegreen.deployment-deletion.delay: "1s"
26+
template:
27+
spec:
28+
image: flink:1.20
29+
flinkVersion: v1_20
30+
flinkConfiguration:
31+
rest.port: "8081"
32+
execution.checkpointing.interval: "10s"
33+
execution.checkpointing.storage: "filesystem"
34+
state.backend.incremental: "true"
35+
state.checkpoints.dir: "file:///flink-data/checkpoints"
36+
state.savepoints.dir: "file:///flink-data/savepoints"
37+
state.checkpoints.num-retained: "5"
38+
taskmanager.numberOfTaskSlots: "1"
39+
serviceAccount: flink
40+
jobManager:
41+
resource:
42+
memory: 1G
43+
cpu: 1
44+
podTemplate:
45+
spec:
46+
containers:
47+
- name: flink-main-container
48+
volumeMounts:
49+
- mountPath: /flink-data/checkpoints
50+
name: checkpoint-volume
51+
- mountPath: /flink-data/savepoints
52+
name: savepoint-volume
53+
volumes:
54+
- name: checkpoint-volume
55+
hostPath:
56+
# directory location on host
57+
path: /tmp/flink/checkpoints
58+
type: Directory
59+
- name: savepoint-volume
60+
hostPath:
61+
# directory location on host
62+
path: /tmp/flink/savepoints
63+
type: Directory
64+
taskManager:
65+
resource:
66+
memory: 2G
67+
cpu: 1
68+
job:
69+
jarURI: local:///opt/flink/examples/streaming/StateMachineExample.jar
70+
parallelism: 1
71+
entryClass: org.apache.flink.streaming.examples.statemachine.StateMachineExample
72+
args:
73+
- "--error-rate"
74+
- "0.15"
75+
- "--sleep"
76+
- "30"
77+
upgradeMode: last-state
Lines changed: 51 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,51 @@
1+
################################################################################
2+
# Licensed to the Apache Software Foundation (ASF) under one
3+
# or more contributor license agreements. See the NOTICE file
4+
# distributed with this work for additional information
5+
# regarding copyright ownership. The ASF licenses this file
6+
# to you under the Apache License, Version 2.0 (the
7+
# "License"); you may not use this file except in compliance
8+
# with the License. You may obtain a copy of the License at
9+
#
10+
# http://www.apache.org/licenses/LICENSE-2.0
11+
#
12+
# Unless required by applicable law or agreed to in writing, software
13+
# distributed under the License is distributed on an "AS IS" BASIS,
14+
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15+
# See the License for the specific language governing permissions and
16+
# limitations under the License.
17+
################################################################################
18+
19+
apiVersion: flink.apache.org/v1beta1
20+
kind: FlinkBlueGreenDeployment
21+
metadata:
22+
name: basic-bluegreen-example
23+
spec:
24+
configuration:
25+
kubernetes.operator.bluegreen.deployment-deletion.delay: "2s"
26+
template:
27+
spec:
28+
image: flink:1.20
29+
flinkVersion: v1_20
30+
flinkConfiguration:
31+
rest.port: "8081"
32+
taskmanager.numberOfTaskSlots: "1"
33+
serviceAccount: flink
34+
jobManager:
35+
resource:
36+
memory: 1G
37+
cpu: 1
38+
taskManager:
39+
resource:
40+
memory: 2G
41+
cpu: 1
42+
job:
43+
jarURI: local:///opt/flink/examples/streaming/StateMachineExample.jar
44+
parallelism: 1
45+
entryClass: org.apache.flink.streaming.examples.statemachine.StateMachineExample
46+
args:
47+
- "--error-rate"
48+
- "0.15"
49+
- "--sleep"
50+
- "30"
51+
upgradeMode: stateless
Lines changed: 75 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,75 @@
1+
#!/usr/bin/env bash
2+
################################################################################
3+
# Licensed to the Apache Software Foundation (ASF) under one
4+
# or more contributor license agreements. See the NOTICE file
5+
# distributed with this work for additional information
6+
# regarding copyright ownership. The ASF licenses this file
7+
# to you under the Apache License, Version 2.0 (the
8+
# "License"); you may not use this file except in compliance
9+
# with the License. You may obtain a copy of the License at
10+
#
11+
# http://www.apache.org/licenses/LICENSE-2.0
12+
#
13+
# Unless required by applicable law or agreed to in writing, software
14+
# distributed under the License is distributed on an "AS IS" BASIS,
15+
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
16+
# See the License for the specific language governing permissions and
17+
# limitations under the License.
18+
################################################################################
19+
20+
# This script tests the Flink Blue/Green Deployments support as follows:
21+
# - Create a FlinkBlueGreenDeployment which automatically starts a "Blue" FlinkDeployment
22+
# - Once this setup is stable, we trigger a transition which will create the "Green" FlinkDeployment
23+
# - Once it's stable, verify the "Blue" FlinkDeployment is torn down
24+
# - Perform additional validation(s) before exiting
25+
26+
SCRIPT_DIR=$(dirname "$(readlink -f "$0")")
27+
source "${SCRIPT_DIR}/utils.sh"
28+
29+
CLUSTER_ID="basic-bluegreen-example"
30+
BG_CLUSTER_ID=$CLUSTER_ID
31+
BLUE_CLUSTER_ID="basic-bluegreen-example-blue"
32+
GREEN_CLUSTER_ID="basic-bluegreen-example-green"
33+
34+
APPLICATION_YAML="${SCRIPT_DIR}/data/bluegreen-laststate.yaml"
35+
APPLICATION_IDENTIFIER="flinkbgdep/$CLUSTER_ID"
36+
BLUE_APPLICATION_IDENTIFIER="flinkdep/$BLUE_CLUSTER_ID"
37+
GREEN_APPLICATION_IDENTIFIER="flinkdep/$GREEN_CLUSTER_ID"
38+
TIMEOUT=300
39+
40+
#echo "BG_CLUSTER_ID " $BG_CLUSTER_ID
41+
#echo "BLUE_CLUSTER_ID " $BLUE_CLUSTER_ID
42+
#echo "APPLICATION_IDENTIFIER " $APPLICATION_IDENTIFIER
43+
#echo "BLUE_APPLICATION_IDENTIFIER " $BLUE_APPLICATION_IDENTIFIER
44+
45+
#on_exit cleanup_and_exit "$APPLICATION_YAML" $TIMEOUT $BG_CLUSTER_ID
46+
47+
retry_times 5 30 "kubectl apply -f $APPLICATION_YAML" || exit 1
48+
49+
sleep 1
50+
wait_for_jobmanager_running $BLUE_CLUSTER_ID $TIMEOUT
51+
wait_for_status $BLUE_APPLICATION_IDENTIFIER '.status.lifecycleState' STABLE ${TIMEOUT} || exit 1
52+
wait_for_status $APPLICATION_IDENTIFIER '.status.jobStatus.state' RUNNING ${TIMEOUT} || exit 1
53+
wait_for_status $APPLICATION_IDENTIFIER '.status.blueGreenState' ACTIVE_BLUE ${TIMEOUT} || exit 1
54+
55+
blue_job_id=$(kubectl get -oyaml flinkdep/basic-bluegreen-example-blue | yq '.status.jobStatus.jobId')
56+
57+
echo "Giving a chance for checkpoints to be generated..."
58+
sleep 5
59+
kubectl patch flinkbgdep ${BG_CLUSTER_ID} --type merge --patch '{"spec":{"template":{"spec":{"flinkConfiguration":{"rest.port":"8082","state.checkpoints.num-retained":"51"}}}}}'
60+
61+
wait_for_status $GREEN_APPLICATION_IDENTIFIER '.status.lifecycleState' STABLE ${TIMEOUT} || exit 1
62+
kubectl wait --for=delete deployment --timeout=${TIMEOUT}s --selector="app=${BLUE_CLUSTER_ID}"
63+
wait_for_status $APPLICATION_IDENTIFIER '.status.jobStatus.state' RUNNING ${TIMEOUT} || exit 1
64+
wait_for_status $APPLICATION_IDENTIFIER '.status.blueGreenState' ACTIVE_GREEN ${TIMEOUT} || exit 1
65+
66+
green_initialSavepointPath=$(kubectl get -oyaml $GREEN_APPLICATION_IDENTIFIER | yq '.spec.job.initialSavepointPath')
67+
68+
if [[ $green_initialSavepointPath == '/flink-data/checkpoints/'$blue_job_id* ]]; then
69+
echo 'Green deployment started from the expected initialSavepointPath: ' $green_initialSavepointPath
70+
else
71+
echo 'Unexpected initialSavepointPath: ' $green_initialSavepointPath
72+
exit 1
73+
fi;
74+
75+
echo "Successfully run the Flink Blue/Green Deployments test"
Lines changed: 60 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,60 @@
1+
#!/usr/bin/env bash
2+
################################################################################
3+
# Licensed to the Apache Software Foundation (ASF) under one
4+
# or more contributor license agreements. See the NOTICE file
5+
# distributed with this work for additional information
6+
# regarding copyright ownership. The ASF licenses this file
7+
# to you under the Apache License, Version 2.0 (the
8+
# "License"); you may not use this file except in compliance
9+
# with the License. You may obtain a copy of the License at
10+
#
11+
# http://www.apache.org/licenses/LICENSE-2.0
12+
#
13+
# Unless required by applicable law or agreed to in writing, software
14+
# distributed under the License is distributed on an "AS IS" BASIS,
15+
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
16+
# See the License for the specific language governing permissions and
17+
# limitations under the License.
18+
################################################################################
19+
20+
# This script tests the Flink Blue/Green Deployments support as follows:
21+
# - Create a FlinkBlueGreenDeployment which automatically starts a "Blue" FlinkDeployment
22+
# - Once this setup is stable, we trigger a transition which will create the "Green" FlinkDeployment
23+
# - Once it's stable, verify the "Blue" FlinkDeployment is torn down
24+
# - Perform additional validation(s) before exiting
25+
26+
SCRIPT_DIR=$(dirname "$(readlink -f "$0")")
27+
source "${SCRIPT_DIR}/utils.sh"
28+
29+
CLUSTER_ID="basic-bluegreen-example"
30+
BG_CLUSTER_ID=$CLUSTER_ID
31+
BLUE_CLUSTER_ID="basic-bluegreen-example-blue"
32+
GREEN_CLUSTER_ID="basic-bluegreen-example-green"
33+
34+
APPLICATION_YAML="${SCRIPT_DIR}/data/bluegreen-stateless.yaml"
35+
APPLICATION_IDENTIFIER="flinkbgdep/$CLUSTER_ID"
36+
BLUE_APPLICATION_IDENTIFIER="flinkdep/$BLUE_CLUSTER_ID"
37+
GREEN_APPLICATION_IDENTIFIER="flinkdep/$GREEN_CLUSTER_ID"
38+
TIMEOUT=300
39+
40+
on_exit cleanup_and_exit "$APPLICATION_YAML" $TIMEOUT $BG_CLUSTER_ID
41+
42+
retry_times 5 30 "kubectl apply -f $APPLICATION_YAML" || exit 1
43+
44+
sleep 1
45+
wait_for_jobmanager_running $BLUE_CLUSTER_ID $TIMEOUT
46+
wait_for_status $BLUE_APPLICATION_IDENTIFIER '.status.lifecycleState' STABLE ${TIMEOUT} || exit 1
47+
wait_for_status $APPLICATION_IDENTIFIER '.status.jobStatus.state' RUNNING ${TIMEOUT} || exit 1
48+
wait_for_status $APPLICATION_IDENTIFIER '.status.blueGreenState' ACTIVE_BLUE ${TIMEOUT} || exit 1
49+
50+
blue_job_id=$(kubectl get -oyaml flinkdep/basic-bluegreen-example-blue | yq '.status.jobStatus.jobId')
51+
52+
echo "PATCHING B/G deployment..."
53+
kubectl patch flinkbgdep ${BG_CLUSTER_ID} --type merge --patch '{"spec":{"template":{"spec":{"flinkConfiguration":{"rest.port":"8082","taskmanager.numberOfTaskSlots":"2"}}}}}'
54+
55+
wait_for_status $GREEN_APPLICATION_IDENTIFIER '.status.lifecycleState' STABLE ${TIMEOUT} || exit 1
56+
kubectl wait --for=delete deployment --timeout=${TIMEOUT}s --selector="app=${BLUE_CLUSTER_ID}"
57+
wait_for_status $APPLICATION_IDENTIFIER '.status.jobStatus.state' RUNNING ${TIMEOUT} || exit 1
58+
wait_for_status $APPLICATION_IDENTIFIER '.status.blueGreenState' ACTIVE_GREEN ${TIMEOUT} || exit 1
59+
60+
echo "Successfully run the Flink Blue/Green Deployments test"

0 commit comments

Comments
 (0)