Skip to content

Commit 0398adc

Browse files
committed
Rebasing from main
Excluding the e2e B/G test execution from older Flink versions Updated e2e yaml specs for Blue/Green tests Relocating the Blue/Green configuration map. Cleaning up and refactoring test code. Handling errors when fetching a savepoint. Handling exceptions when triggering a savepoint. Adding an error property to the FlinkBlueGreenDeploymentStatus to capture failure descriptions. Adding extra savepoint "carry over" feature Addressing edge error cases when patching an existing FlinkDeployment resource. Addressing PR comments More consistent reconcile result (UpdateControl) handling. Clearer comments Consolidated and organized the B/G utility methods. Taking savepoint also for LAST-STATE, removed the last checkpoint usage. Adjusted the semantics of the Diff class. PR comments. If a spec change comes in mid transition, we apply it right away. Removing redundant BlueGreenDiffType cases Addressing PR comments. Corrected abort/delay logic. Added the e2e tests to ci.yml. Missing Licenses. Improving/adding E2E tests for blue/green deployments. Checkstyle fixes. Updated unit test to assert Savepointing. Checkstyle fixes Adding support for Savepointing before transItion in the case of UpgradeMode.SAVEPOINT Optimized the B/G unit tests Triggering a full transition only when needed, otherwise just patch the child FlinkDeployment. Unit test added and simplified assertions. Optimizing the State Handling Introducing a Blue/Green State Machine Refactoring for clarity (added BlueGreenTransitionContext) - Refactoring (splitting) the Blue/Green controller logic from the Controller to a State Machine and Util methods. - Created a comparator for BlueGreenDeploymentSpec Added Blue/Green Deployments E2E test Optimized/simplified the reconciliation logic for first deployments. Clearer log statements. Fixing configOption default value management and log message formatting [FLINK-37515] FLIP-503: Basic support for Blue/Green deployments
1 parent a57579c commit 0398adc

File tree

41 files changed

+15367
-5
lines changed

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

41 files changed

+15367
-5
lines changed

.github/workflows/ci.yml

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -16,6 +16,7 @@
1616
# limitations under the License.
1717
################################################################################
1818

19+
1920
# We need to specify repo related information here since Apache INFRA doesn't differentiate
2021
# between several workflows with the same names while preparing a report for GHA usage
2122
# https://infra-reports.apache.org/#ghactions
@@ -219,11 +220,17 @@ jobs:
219220
- test_flink_operator_ha.sh
220221
- test_snapshot.sh
221222
- test_batch_job.sh
223+
- test_bluegreen_laststate.sh
224+
- test_bluegreen_stateless.sh
222225
exclude:
223226
- mode: standalone
224227
test: test_autoscaler.sh
225228
- flink-version: v1_19
226229
test: test_snapshot.sh
230+
- flink-version: v1_19
231+
test: test_bluegreen_laststate.sh
232+
- flink-version: v1_19
233+
test: test_bluegreen_stateless.sh
227234
uses: ./.github/workflows/e2e.yaml
228235
with:
229236
java-version: 17

docs/content/docs/custom-resource/reference.md

Lines changed: 55 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -73,6 +73,23 @@ This serves as a full reference for FlinkDeployment and FlinkSessionJob custom r
7373
| Parameter | Type | Docs |
7474
| ----------| ---- | ---- |
7575

76+
### FlinkBlueGreenDeploymentConfigOptions
77+
**Class**: org.apache.flink.kubernetes.operator.api.spec.FlinkBlueGreenDeploymentConfigOptions
78+
79+
**Description**: Configuration options to be used by the Flink Blue/Green Deployments.
80+
81+
| Parameter | Type | Docs |
82+
| ----------| ---- | ---- |
83+
84+
### FlinkBlueGreenDeploymentSpec
85+
**Class**: org.apache.flink.kubernetes.operator.api.spec.FlinkBlueGreenDeploymentSpec
86+
87+
**Description**: Spec that describes a Flink application with blue/green deployment capabilities.
88+
89+
| Parameter | Type | Docs |
90+
| ----------| ---- | ---- |
91+
| template | org.apache.flink.kubernetes.operator.api.spec.FlinkDeploymentTemplateSpec | |
92+
7693
### FlinkDeploymentSpec
7794
**Class**: org.apache.flink.kubernetes.operator.api.spec.FlinkDeploymentSpec
7895

@@ -94,6 +111,17 @@ This serves as a full reference for FlinkDeployment and FlinkSessionJob custom r
94111
| logConfiguration | java.util.Map<java.lang.String,java.lang.String> | Log configuration overrides for the Flink deployment. Format logConfigFileName -> configContent. |
95112
| mode | org.apache.flink.kubernetes.operator.api.spec.KubernetesDeploymentMode | Deployment mode of the Flink cluster, native or standalone. |
96113

114+
### FlinkDeploymentTemplateSpec
115+
**Class**: org.apache.flink.kubernetes.operator.api.spec.FlinkDeploymentTemplateSpec
116+
117+
**Description**: Template Spec that describes a Flink application managed by the blue/green controller.
118+
119+
| Parameter | Type | Docs |
120+
| ----------| ---- | ---- |
121+
| metadata | io.fabric8.kubernetes.api.model.ObjectMeta | |
122+
| configuration | java.util.Map<java.lang.String,java.lang.String> | |
123+
| spec | org.apache.flink.kubernetes.operator.api.spec.FlinkDeploymentSpec | |
124+
97125
### FlinkSessionJobSpec
98126
**Class**: org.apache.flink.kubernetes.operator.api.spec.FlinkSessionJobSpec
99127

@@ -308,6 +336,33 @@ This serves as a full reference for FlinkDeployment and FlinkSessionJob custom r
308336
| UNKNOWN | Checkpoint format unknown, if the checkpoint was not triggered by the operator. |
309337
| description | org.apache.flink.configuration.description.InlineElement | |
310338

339+
### FlinkBlueGreenDeploymentState
340+
**Class**: org.apache.flink.kubernetes.operator.api.status.FlinkBlueGreenDeploymentState
341+
342+
**Description**: Enumeration of the possible states of the blue/green transition.
343+
344+
| Value | Docs |
345+
| ----- | ---- |
346+
| INITIALIZING_BLUE | We use this state while initializing for the first time, always with a "Blue" deployment type. |
347+
| ACTIVE_BLUE | Identifies the system is running normally with a "Blue" deployment type. |
348+
| ACTIVE_GREEN | Identifies the system is running normally with a "Green" deployment type. |
349+
| TRANSITIONING_TO_BLUE | Identifies the system is transitioning from "Green" to "Blue". |
350+
| TRANSITIONING_TO_GREEN | Identifies the system is transitioning from "Blue" to "Green". |
351+
352+
### FlinkBlueGreenDeploymentStatus
353+
**Class**: org.apache.flink.kubernetes.operator.api.status.FlinkBlueGreenDeploymentStatus
354+
355+
**Description**: Last observed status of the Flink Blue/Green deployment.
356+
357+
| Parameter | Type | Docs |
358+
| ----------| ---- | ---- |
359+
| jobStatus | org.apache.flink.kubernetes.operator.api.status.JobStatus | |
360+
| blueGreenState | org.apache.flink.kubernetes.operator.api.status.FlinkBlueGreenDeploymentState | The state of the blue/green transition. |
361+
| lastReconciledSpec | java.lang.String | Last reconciled (serialized) deployment spec. |
362+
| lastReconciledTimestamp | java.lang.String | Timestamp of last reconciliation. |
363+
| abortTimestamp | java.lang.String | Computed from abortGracePeriodMs, timestamp after which the deployment should be aborted. |
364+
| deploymentReadyTimestamp | java.lang.String | Timestamp when the deployment became READY/STABLE. Used to determine when to delete it. |
365+
311366
### FlinkDeploymentReconciliationStatus
312367
**Class**: org.apache.flink.kubernetes.operator.api.status.FlinkDeploymentReconciliationStatus
313368

Lines changed: 77 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,77 @@
1+
################################################################################
2+
# Licensed to the Apache Software Foundation (ASF) under one
3+
# or more contributor license agreements. See the NOTICE file
4+
# distributed with this work for additional information
5+
# regarding copyright ownership. The ASF licenses this file
6+
# to you under the Apache License, Version 2.0 (the
7+
# "License"); you may not use this file except in compliance
8+
# with the License. You may obtain a copy of the License at
9+
#
10+
# http://www.apache.org/licenses/LICENSE-2.0
11+
#
12+
# Unless required by applicable law or agreed to in writing, software
13+
# distributed under the License is distributed on an "AS IS" BASIS,
14+
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15+
# See the License for the specific language governing permissions and
16+
# limitations under the License.
17+
################################################################################
18+
19+
apiVersion: flink.apache.org/v1beta1
20+
kind: FlinkBlueGreenDeployment
21+
metadata:
22+
name: basic-bluegreen-example
23+
spec:
24+
configuration:
25+
kubernetes.operator.bluegreen.deployment-deletion.delay: "1s"
26+
template:
27+
spec:
28+
image: flink:1.20
29+
flinkVersion: v1_20
30+
flinkConfiguration:
31+
rest.port: "8081"
32+
execution.checkpointing.interval: "10s"
33+
execution.checkpointing.storage: "filesystem"
34+
state.backend.incremental: "true"
35+
state.checkpoints.dir: "file:///flink-data/checkpoints"
36+
state.savepoints.dir: "file:///flink-data/savepoints"
37+
state.checkpoints.num-retained: "5"
38+
taskmanager.numberOfTaskSlots: "1"
39+
serviceAccount: flink
40+
jobManager:
41+
resource:
42+
memory: 1G
43+
cpu: 1
44+
podTemplate:
45+
spec:
46+
containers:
47+
- name: flink-main-container
48+
volumeMounts:
49+
- mountPath: /flink-data/checkpoints
50+
name: checkpoint-volume
51+
- mountPath: /flink-data/savepoints
52+
name: savepoint-volume
53+
volumes:
54+
- name: checkpoint-volume
55+
hostPath:
56+
# directory location on host
57+
path: /tmp/flink/checkpoints
58+
type: Directory
59+
- name: savepoint-volume
60+
hostPath:
61+
# directory location on host
62+
path: /tmp/flink/savepoints
63+
type: Directory
64+
taskManager:
65+
resource:
66+
memory: 2G
67+
cpu: 1
68+
job:
69+
jarURI: local:///opt/flink/examples/streaming/StateMachineExample.jar
70+
parallelism: 1
71+
entryClass: org.apache.flink.streaming.examples.statemachine.StateMachineExample
72+
args:
73+
- "--error-rate"
74+
- "0.15"
75+
- "--sleep"
76+
- "30"
77+
upgradeMode: last-state
Lines changed: 51 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,51 @@
1+
################################################################################
2+
# Licensed to the Apache Software Foundation (ASF) under one
3+
# or more contributor license agreements. See the NOTICE file
4+
# distributed with this work for additional information
5+
# regarding copyright ownership. The ASF licenses this file
6+
# to you under the Apache License, Version 2.0 (the
7+
# "License"); you may not use this file except in compliance
8+
# with the License. You may obtain a copy of the License at
9+
#
10+
# http://www.apache.org/licenses/LICENSE-2.0
11+
#
12+
# Unless required by applicable law or agreed to in writing, software
13+
# distributed under the License is distributed on an "AS IS" BASIS,
14+
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15+
# See the License for the specific language governing permissions and
16+
# limitations under the License.
17+
################################################################################
18+
19+
apiVersion: flink.apache.org/v1beta1
20+
kind: FlinkBlueGreenDeployment
21+
metadata:
22+
name: basic-bluegreen-example
23+
spec:
24+
configuration:
25+
kubernetes.operator.bluegreen.deployment-deletion.delay: "2s"
26+
template:
27+
spec:
28+
image: flink:1.20
29+
flinkVersion: v1_20
30+
flinkConfiguration:
31+
rest.port: "8081"
32+
taskmanager.numberOfTaskSlots: "1"
33+
serviceAccount: flink
34+
jobManager:
35+
resource:
36+
memory: 1G
37+
cpu: 1
38+
taskManager:
39+
resource:
40+
memory: 2G
41+
cpu: 1
42+
job:
43+
jarURI: local:///opt/flink/examples/streaming/StateMachineExample.jar
44+
parallelism: 1
45+
entryClass: org.apache.flink.streaming.examples.statemachine.StateMachineExample
46+
args:
47+
- "--error-rate"
48+
- "0.15"
49+
- "--sleep"
50+
- "30"
51+
upgradeMode: stateless
Lines changed: 75 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,75 @@
1+
#!/usr/bin/env bash
2+
################################################################################
3+
# Licensed to the Apache Software Foundation (ASF) under one
4+
# or more contributor license agreements. See the NOTICE file
5+
# distributed with this work for additional information
6+
# regarding copyright ownership. The ASF licenses this file
7+
# to you under the Apache License, Version 2.0 (the
8+
# "License"); you may not use this file except in compliance
9+
# with the License. You may obtain a copy of the License at
10+
#
11+
# http://www.apache.org/licenses/LICENSE-2.0
12+
#
13+
# Unless required by applicable law or agreed to in writing, software
14+
# distributed under the License is distributed on an "AS IS" BASIS,
15+
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
16+
# See the License for the specific language governing permissions and
17+
# limitations under the License.
18+
################################################################################
19+
20+
# This script tests the Flink Blue/Green Deployments support as follows:
21+
# - Create a FlinkBlueGreenDeployment which automatically starts a "Blue" FlinkDeployment
22+
# - Once this setup is stable, we trigger a transition which will create the "Green" FlinkDeployment
23+
# - Once it's stable, verify the "Blue" FlinkDeployment is torn down
24+
# - Perform additional validation(s) before exiting
25+
26+
SCRIPT_DIR=$(dirname "$(readlink -f "$0")")
27+
source "${SCRIPT_DIR}/utils.sh"
28+
29+
CLUSTER_ID="basic-bluegreen-example"
30+
BG_CLUSTER_ID=$CLUSTER_ID
31+
BLUE_CLUSTER_ID="basic-bluegreen-example-blue"
32+
GREEN_CLUSTER_ID="basic-bluegreen-example-green"
33+
34+
APPLICATION_YAML="${SCRIPT_DIR}/data/bluegreen-laststate.yaml"
35+
APPLICATION_IDENTIFIER="flinkbgdep/$CLUSTER_ID"
36+
BLUE_APPLICATION_IDENTIFIER="flinkdep/$BLUE_CLUSTER_ID"
37+
GREEN_APPLICATION_IDENTIFIER="flinkdep/$GREEN_CLUSTER_ID"
38+
TIMEOUT=300
39+
40+
#echo "BG_CLUSTER_ID " $BG_CLUSTER_ID
41+
#echo "BLUE_CLUSTER_ID " $BLUE_CLUSTER_ID
42+
#echo "APPLICATION_IDENTIFIER " $APPLICATION_IDENTIFIER
43+
#echo "BLUE_APPLICATION_IDENTIFIER " $BLUE_APPLICATION_IDENTIFIER
44+
45+
#on_exit cleanup_and_exit "$APPLICATION_YAML" $TIMEOUT $BG_CLUSTER_ID
46+
47+
retry_times 5 30 "kubectl apply -f $APPLICATION_YAML" || exit 1
48+
49+
sleep 1
50+
wait_for_jobmanager_running $BLUE_CLUSTER_ID $TIMEOUT
51+
wait_for_status $BLUE_APPLICATION_IDENTIFIER '.status.lifecycleState' STABLE ${TIMEOUT} || exit 1
52+
wait_for_status $APPLICATION_IDENTIFIER '.status.jobStatus.state' RUNNING ${TIMEOUT} || exit 1
53+
wait_for_status $APPLICATION_IDENTIFIER '.status.blueGreenState' ACTIVE_BLUE ${TIMEOUT} || exit 1
54+
55+
blue_job_id=$(kubectl get -oyaml flinkdep/basic-bluegreen-example-blue | yq '.status.jobStatus.jobId')
56+
57+
echo "Giving a chance for checkpoints to be generated..."
58+
sleep 5
59+
kubectl patch flinkbgdep ${BG_CLUSTER_ID} --type merge --patch '{"spec":{"template":{"spec":{"flinkConfiguration":{"rest.port":"8082","state.checkpoints.num-retained":"51"}}}}}'
60+
61+
wait_for_status $GREEN_APPLICATION_IDENTIFIER '.status.lifecycleState' STABLE ${TIMEOUT} || exit 1
62+
kubectl wait --for=delete deployment --timeout=${TIMEOUT}s --selector="app=${BLUE_CLUSTER_ID}"
63+
wait_for_status $APPLICATION_IDENTIFIER '.status.jobStatus.state' RUNNING ${TIMEOUT} || exit 1
64+
wait_for_status $APPLICATION_IDENTIFIER '.status.blueGreenState' ACTIVE_GREEN ${TIMEOUT} || exit 1
65+
66+
green_initialSavepointPath=$(kubectl get -oyaml $GREEN_APPLICATION_IDENTIFIER | yq '.spec.job.initialSavepointPath')
67+
68+
if [[ $green_initialSavepointPath == '/flink-data/checkpoints/'$blue_job_id* ]]; then
69+
echo 'Green deployment started from the expected initialSavepointPath: ' $green_initialSavepointPath
70+
else
71+
echo 'Unexpected initialSavepointPath: ' $green_initialSavepointPath
72+
exit 1
73+
fi;
74+
75+
echo "Successfully run the Flink Blue/Green Deployments test"
Lines changed: 60 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,60 @@
1+
#!/usr/bin/env bash
2+
################################################################################
3+
# Licensed to the Apache Software Foundation (ASF) under one
4+
# or more contributor license agreements. See the NOTICE file
5+
# distributed with this work for additional information
6+
# regarding copyright ownership. The ASF licenses this file
7+
# to you under the Apache License, Version 2.0 (the
8+
# "License"); you may not use this file except in compliance
9+
# with the License. You may obtain a copy of the License at
10+
#
11+
# http://www.apache.org/licenses/LICENSE-2.0
12+
#
13+
# Unless required by applicable law or agreed to in writing, software
14+
# distributed under the License is distributed on an "AS IS" BASIS,
15+
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
16+
# See the License for the specific language governing permissions and
17+
# limitations under the License.
18+
################################################################################
19+
20+
# This script tests the Flink Blue/Green Deployments support as follows:
21+
# - Create a FlinkBlueGreenDeployment which automatically starts a "Blue" FlinkDeployment
22+
# - Once this setup is stable, we trigger a transition which will create the "Green" FlinkDeployment
23+
# - Once it's stable, verify the "Blue" FlinkDeployment is torn down
24+
# - Perform additional validation(s) before exiting
25+
26+
SCRIPT_DIR=$(dirname "$(readlink -f "$0")")
27+
source "${SCRIPT_DIR}/utils.sh"
28+
29+
CLUSTER_ID="basic-bluegreen-example"
30+
BG_CLUSTER_ID=$CLUSTER_ID
31+
BLUE_CLUSTER_ID="basic-bluegreen-example-blue"
32+
GREEN_CLUSTER_ID="basic-bluegreen-example-green"
33+
34+
APPLICATION_YAML="${SCRIPT_DIR}/data/bluegreen-stateless.yaml"
35+
APPLICATION_IDENTIFIER="flinkbgdep/$CLUSTER_ID"
36+
BLUE_APPLICATION_IDENTIFIER="flinkdep/$BLUE_CLUSTER_ID"
37+
GREEN_APPLICATION_IDENTIFIER="flinkdep/$GREEN_CLUSTER_ID"
38+
TIMEOUT=300
39+
40+
on_exit cleanup_and_exit "$APPLICATION_YAML" $TIMEOUT $BG_CLUSTER_ID
41+
42+
retry_times 5 30 "kubectl apply -f $APPLICATION_YAML" || exit 1
43+
44+
sleep 1
45+
wait_for_jobmanager_running $BLUE_CLUSTER_ID $TIMEOUT
46+
wait_for_status $BLUE_APPLICATION_IDENTIFIER '.status.lifecycleState' STABLE ${TIMEOUT} || exit 1
47+
wait_for_status $APPLICATION_IDENTIFIER '.status.jobStatus.state' RUNNING ${TIMEOUT} || exit 1
48+
wait_for_status $APPLICATION_IDENTIFIER '.status.blueGreenState' ACTIVE_BLUE ${TIMEOUT} || exit 1
49+
50+
blue_job_id=$(kubectl get -oyaml flinkdep/basic-bluegreen-example-blue | yq '.status.jobStatus.jobId')
51+
52+
echo "PATCHING B/G deployment..."
53+
kubectl patch flinkbgdep ${BG_CLUSTER_ID} --type merge --patch '{"spec":{"template":{"spec":{"flinkConfiguration":{"rest.port":"8082","taskmanager.numberOfTaskSlots":"2"}}}}}'
54+
55+
wait_for_status $GREEN_APPLICATION_IDENTIFIER '.status.lifecycleState' STABLE ${TIMEOUT} || exit 1
56+
kubectl wait --for=delete deployment --timeout=${TIMEOUT}s --selector="app=${BLUE_CLUSTER_ID}"
57+
wait_for_status $APPLICATION_IDENTIFIER '.status.jobStatus.state' RUNNING ${TIMEOUT} || exit 1
58+
wait_for_status $APPLICATION_IDENTIFIER '.status.blueGreenState' ACTIVE_GREEN ${TIMEOUT} || exit 1
59+
60+
echo "Successfully run the Flink Blue/Green Deployments test"

0 commit comments

Comments
 (0)