Conversation

@luca-p-castelli (Contributor) commented on Feb 21, 2025

What is the purpose of the change

This pull request fixes a bug where, during the observation phase, the operator tries to observe savepoint information for batch jobs and fails because checkpointing is not enabled for batch jobs.

More information in: https://issues.apache.org/jira/browse/FLINK-37370

Brief change log

  1. Modifies JobDetailsInfo and downstream tests to include job-type.
  2. Adds a method to FlinkService and AbstractFlinkService to fetch JobDetailsInfo for the given job.
  3. Adds a method to SnapshotObserver to check whether a job is a batch job; it is used to skip observing checkpoints/savepoints for batch jobs (sketched below).
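
The intent, roughly, is that the observer fetches the job's JobDetailsInfo through the Flink service, reads the new job-type field, and returns early before checkpoint/savepoint observation when the job is a batch job. A minimal sketch follows; the names getJobDetailsInfo, isBatchJob, and the JobType accessor are assumptions for illustration, not the exact signatures introduced by this PR.

```java
// Illustrative sketch only: method names and return types are assumptions,
// not the exact signatures introduced by this PR.
private boolean isBatchJob(FlinkResourceContext<?> ctx, String jobId) throws Exception {
    var details =
            ctx.getFlinkService()
                    .getJobDetailsInfo(JobID.fromHexString(jobId), ctx.getObserveConfig());
    // The new "job-type" field distinguishes BATCH from STREAMING jobs.
    return details.getJobType() == JobType.BATCH;
}

// Conceptually, in SnapshotObserver before checkpoint/savepoint observation:
//     if (isBatchJob(ctx, jobId)) {
//         return; // batch jobs have no checkpointing, nothing to observe
//     }
```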

Some questions

  1. There is a local version of JobDetailsInfo in the operator code. The existing comment says it can be removed once the client is upgraded to 1.20, so I tried to remove the local version, but I ran into issues with the non-null requirement for slotSharingGroupId and jobType, particularly with the flink15 and flink16 compatibility tests. For now I've kept the local version and added job-type to it (see the sketch after this list). Open to suggestions if you think there is a better way to handle this.
  2. When running a batch job there are still some exceptions logged as warnings that originate in populateStateSize, also related to checkpointing not being enabled for batch jobs. I didn't change anything there since those warnings/exceptions don't break anything. Open to suggestions if you think we shouldn't leave this as is.
  3. JobAutoScalerImplTest.testMetricReporting seemed flaky when running tests locally. Has this been observed before?
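
For question 1, a minimal sketch of what adding job-type to the local JobDetailsInfo copy could look like, assuming the local class is Jackson-deserialized like the upstream one; the field name, annotations, and nullability shown here are assumptions for illustration, not the committed code:

```java
// Sketch only: annotation style, field type, and nullability are assumptions about
// the operator's local JobDetailsInfo copy, not the committed code.
public class JobDetailsInfo {

    public static final String FIELD_NAME_JOB_TYPE = "job-type";

    // Kept nullable so responses from older Flink versions (1.15/1.16) that do not
    // report a job type can still be deserialized in the compatibility tests.
    @JsonProperty(FIELD_NAME_JOB_TYPE)
    @Nullable
    private final String jobType;

    // ... other JobDetailsInfo fields and constructor parameters omitted for brevity

    @JsonCreator
    public JobDetailsInfo(@JsonProperty(FIELD_NAME_JOB_TYPE) @Nullable String jobType) {
        this.jobType = jobType;
    }

    @Nullable
    public String getJobType() {
        return jobType;
    }
}
```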

Verifying this change

  • All existing tests pass (a couple of tests were modified to include job-type)
  • Tested locally with the manifest attached to the issue to make sure the job successfully reaches the FINISHED job status

If you think my approach is good, as a next step I'm happy to add some tests for the new logic I've introduced. Where would you like to see tests?

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., are there any changes to the CustomResourceDescriptors: no
  • Core observer or reconciler logic that is regularly executed: yes

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

@luca-p-castelli marked this pull request as ready for review on February 21, 2025, 17:49
                    });
} catch (Exception e) {
    if (ExceptionUtils.findThrowable(e, RestClientException.class)
            .map(ex -> ex.getMessage().contains("Checkpointing has not been enabled"))

Contributor:

Is message always not null here?


@luca-p-castelli (Contributor, Author) replied on Feb 24, 2025:

Good point. I'm not sure. I'll add logic to check for null.
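
One null-safe variant (a sketch only, not necessarily the exact change that was committed) maps to the message first and then filters, so a null message simply makes the condition false:

```java
// Sketch of a null-safe variant: Optional.map on a null message yields empty,
// so the filter is never invoked and the condition is simply false.
if (ExceptionUtils.findThrowable(e, RestClientException.class)
        .map(RestClientException::getMessage)
        .filter(msg -> msg.contains("Checkpointing has not been enabled"))
        .isPresent()) {
    LOG.warn(
            "Checkpointing not enabled for job {}, skipping checkpoint observation",
            jobId,
            e);
}
```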

Comment on lines 450 to 469
    ctx.getFlinkService()
            .getLastCheckpoint(JobID.fromHexString(jobId), ctx.getObserveConfig())
            .ifPresentOrElse(
                    snapshot -> jobStatus.setUpgradeSavepointPath(snapshot.getLocation()),
                    () -> {
                        if (ReconciliationUtils.isJobCancelled(status)) {
                            // For cancelled jobs the observed savepoint is always definite,
                            // so if empty we know the job doesn't have any
                            // checkpoints/savepoints
                            jobStatus.setUpgradeSavepointPath(null);
                        }
                    });
} catch (Exception e) {
    if (ExceptionUtils.findThrowable(e, RestClientException.class)
            .map(ex -> ex.getMessage().contains("Checkpointing has not been enabled"))
            .orElse(false)) {
        LOG.warn(
                "Checkpointing not enabled for job {}, skipping checkpoint observation",
                jobId,
                e);

Contributor:

I wonder if the try/catch logic should be part of getLastCheckpoint in the Flink service. That way, anywhere else we call this in the future, we get consistently good behaviour for batch jobs.


Contributor (Author):

Yeah, I thought about that. I'm open to either. We would just have getLastCheckpoint catch the exception and return Optional.empty().


Contributor:

I think that would make sense 👍
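
A sketch of that follow-up, assuming getLastCheckpoint keeps an Optional<Savepoint> return type and may throw; the helper queryLastCheckpoint is hypothetical and only stands in for the existing REST lookup. The service swallows the "checkpointing not enabled" error itself and returns Optional.empty(), so every caller gets the same batch-friendly behaviour:

```java
// Sketch only: the real FlinkService#getLastCheckpoint signature, return type, and
// internals may differ; the point is to handle the batch case inside the service.
public Optional<Savepoint> getLastCheckpoint(JobID jobId, Configuration conf) throws Exception {
    try {
        return queryLastCheckpoint(jobId, conf); // hypothetical helper doing the REST call
    } catch (Exception e) {
        boolean checkpointingDisabled =
                ExceptionUtils.findThrowable(e, RestClientException.class)
                        .map(RestClientException::getMessage)
                        .filter(msg -> msg.contains("Checkpointing has not been enabled"))
                        .isPresent();
        if (checkpointingDisabled) {
            // Batch jobs have no checkpointing; report "no checkpoint" instead of failing,
            // so every caller gets consistent behaviour.
            LOG.warn("Checkpointing not enabled for job {}, returning no last checkpoint", jobId);
            return Optional.empty();
        }
        throw e;
    }
}
```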

@gyfora merged commit 9eb3c38 into apache:main on Feb 25, 2025
118 checks passed