
Conversation

@nishita-09 (Contributor) commented Jul 10, 2025

What is the purpose of the change

This pull request adds a configuration that allows session cluster cleanup to block when unmanaged jobs are present. That is, FlinkDeployment deletion is blocked in the presence of FlinkSessionJobs, as well as when jobs submitted through the CLI are in non-terminal states.

Brief change log


  • Added getUnmanagedJobs() method to detect CLI-submitted jobs not managed by FlinkSessionJob resources
  • Modified cleanupInternal() to check for unmanaged jobs when blocking is enabled
  • Improved error messages and logging.
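The detection logic described in the change log can be sketched as follows. This is a minimal, self-contained illustration only; `JobInfo` and `Status` are hypothetical stand-ins for the operator's actual job overview types, not the PR's real classes:

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class UnmanagedJobCheck {

    // Hypothetical stand-in for Flink's job status values; the boolean
    // marks whether the state is terminal.
    enum Status {
        RUNNING(false), RESTARTING(false), FINISHED(true), CANCELED(true), FAILED(true);
        final boolean terminal;
        Status(boolean terminal) { this.terminal = terminal; }
    }

    // Hypothetical stand-in for a job overview entry.
    record JobInfo(String jobId, Status status) {}

    // Collect the IDs of all jobs that are not in a terminal state; if the
    // resulting set is non-empty, cleanup would be blocked.
    static Set<String> getNonTerminalJobIds(List<JobInfo> jobs) {
        return jobs.stream()
                .filter(j -> !j.status().terminal)
                .map(JobInfo::jobId)
                .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        var jobs = List.of(
                new JobInfo("a1", Status.RUNNING),
                new JobInfo("b2", Status.FINISHED),
                new JobInfo("c3", Status.RESTARTING));
        System.out.println(getNonTerminalJobIds(jobs));
    }
}
```

Returning the set of IDs rather than a plain boolean mirrors the observability goal discussed later in this thread: the IDs can be embedded in the emitted event message.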

Verifying this change

This change added tests and can be verified as follows:

  • Added a testGetUnmanagedJobs test for validating getUnmanagedJobs

    • The test verifies that the method correctly identifies jobs that are not controlled by a FlinkSessionJob and are in a non-terminal state.
  • Manually verified the change by running a cluster with 2 JobManagers and submitting a SessionJob as well as CLI jobs.

    • Config: session.block-on-unmanaged-jobs: true

    • Deleted flinkDeployment -> Generates Event CleanupFailed in flinkDeployment due to presence of sessionjobs

    • Deleted flinkSessionJob -> Generates Event CleanupFailed in flinkDeployment due to presence of unmanaged Jobs -> flinkSessionJob was deleted.

    • Cancelled CLI submitted job -> Generates Event Cleanup after ReconcileInterval -> CLI job was cancelled and then the flinkDeployment was cleaned up.

  • Config: session.block-on-unmanaged-jobs: false

    • Manually verified the change by running a cluster with 2 JobManagers and submitting a SessionJob as well as CLI jobs.

    • Deleted flinkDeployment -> Generates Event CleanupFailed in flinkDeployment due to presence of sessionjobs

    • Deleted flinkSessionJob -> Generates Event Cleanup despite running CLI jobs being present -> the SessionJob is deleted, followed by the flinkDeployment.
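For reference, a minimal sketch of the flag exercised in the verification above. Only the key name `session.block-on-unmanaged-jobs` comes from this PR's description; the exact prefix and where it is set (operator configuration vs. per-resource flinkConfiguration) are not shown in this thread and may differ in the merged version:

```yaml
# Sketch only: enable blocking of session cluster cleanup while any
# unmanaged (e.g. CLI-submitted) job is in a non-terminal state.
session.block-on-unmanaged-jobs: "true"
```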

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., are there any changes to the CustomResourceDescriptors: no
  • Core observer or reconciler logic that is regularly executed: yes

Documentation

  • Does this pull request introduce a new feature? yes
  • If yes, how is the feature documented? not documented

@nishita-09 nishita-09 changed the title [FLINK-28648][operator] Fix Cleanup Process for Session Cluster [FLINK-28648][operator] Allow session deletion to block on any running job Jul 10, 2025
@nishita-09 nishita-09 changed the title [FLINK-28648][operator] Allow session deletion to block on any running job [FLINK-28648][Kubernetes Operator] Allow session deletion to block on any running job Jul 10, 2025
@nishita-09 (Contributor, Author):

@gyfora The build was failing due to missing documentation for the added configuration. I have added it in the new commit now.

var context = TestUtils.createContextWithReadyFlinkDeployment(kubernetesClient);
var resourceContext = getResourceContext(deployment, context);

// Use reflection to access the private getUnmanagedJobs method
Contributor:
We should not use reflection for this; we can make the method protected with the @VisibleForTesting annotation.
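The reviewer's suggestion can be sketched as below. The annotation here is a local stand-in for Flink's `org.apache.flink.annotation.VisibleForTesting` so the example stays self-contained; the class and method body are hypothetical, not the PR's actual code:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.Set;

public class SessionReconcilerSketch {

    // Local stand-in for org.apache.flink.annotation.VisibleForTesting.
    @Retention(RetentionPolicy.SOURCE)
    @Target(ElementType.METHOD)
    @interface VisibleForTesting {}

    // Instead of keeping the method private and reaching it via reflection
    // in tests, widen its visibility and mark the intent with the annotation.
    @VisibleForTesting
    protected Set<String> getNonTerminalJobIds() {
        // Hypothetical placeholder; the real implementation would query the
        // cluster's job overview.
        return Set.of();
    }
}
```

Tests in the same package can then call the method directly, and the annotation documents that the widened visibility exists only for testing.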

* the Flink cluster but are not managed by FlinkSessionJob resources.
*/
private Set<JobID> getUnmanagedJobs(
FlinkResourceContext<FlinkDeployment> ctx, Set<FlinkSessionJob> sessionJobs) {
Contributor:
I don't think we need to pass sessionJobs here. If the flag is enabled, any running job should simply block the deletion. We can simplify this logic a lot.

Contributor:
We can simply replace this method with something like getNonTerminalJobIds() or a boolean anyNonTerminalJobs().

That would be enough for this feature.

Contributor (Author):
Sure, I will simplify this further. Thanks for the review!

Contributor (Author):
@gyfora I have pushed another commit to address the comments here. I have stuck with getNonTerminalJobIds() so that the Event contains the list of non-terminal job IDs, for better observability for the user.

Comment on lines 135 to 137
LOG.info(
"Starting unmanaged job detection for session cluster: {}",
ctx.getResource().getMetadata().getName());
Contributor:
This should be at debug level. Also, there is no need to include the resource name/info in the log message; it is already in the MDC.

Comment on lines 222 to 230
LOG.warn(error);
if (eventRecorder.triggerEvent(
deployment,
EventRecorder.Type.Warning,
EventRecorder.Reason.CleanupFailed,
EventRecorder.Component.Operator,
error,
ctx.getKubernetesClient())) {
LOG.warn(error);
Contributor:
You are logging the error twice; you don't need to log it at all, as the event triggering already logs it.

@nishita-09 nishita-09 requested a review from gyfora July 14, 2025 09:30
@nishita-09 (Contributor, Author):

@gyfora I ran these tests locally and they seem to pass. I am not sure what is causing the issue here; can you have a look at it if possible?

@gyfora (Contributor) left a comment:
Added 2 minor comments, otherwise looks good!

Comment on lines 200 to 208
if (eventRecorder.triggerEvent(
deployment,
EventRecorder.Type.Warning,
EventRecorder.Reason.CleanupFailed,
EventRecorder.Component.Operator,
error,
ctx.getKubernetesClient())) {
LOG.warn(error);
}
Contributor:
We can remove the if branch and the logging; event triggering already creates logs, so I don't think we need both.

Contributor (Author):
@gyfora I see this pattern for the sessionjob event as well; should we remove it there too?

Comment on lines 215 to 218
var conf = ctx.getDeployConfig(ctx.getResource().getSpec());
ctx.getFlinkService()
.deleteClusterDeployment(
deployment.getMetadata(), deployment.getStatus(), conf, true);
Contributor:
I think there is a bug in this existing logic: instead of getting the deploy config, we should just use ctx.getObserveConfig() here.

@nishita-09 nishita-09 requested a review from gyfora July 21, 2025 10:09
@nishita-09 (Contributor, Author):
> Added 2 minor comments, otherwise looks good!

@gyfora Thank you for reviewing; I have made the changes. Please trigger the workflows on your end. Also, let me know whether we should remove the LOG.warn() statement in the sessionjob section too.

@gyfora (Contributor) commented Jul 21, 2025

> Added 2 minor comments, otherwise looks good!
>
> @gyfora Thank you for reviewing, I have made the changes. Please do trigger the workflows on your end. Also do let me know if we should also remove the LOG.warn() statement in sessionjob section too?

We can leave it as is for now :)

@gyfora gyfora merged commit ef02fa8 into apache:main Jul 21, 2025
121 checks passed
