[FLINK-37730] Improve exception recording ts initialization + 2.0 compatibility #983

gyfora · 2025-05-26T14:19:02Z

What is the purpose of the change

Initialize the last triggered event timestamp correctly from Kubernetes events, and fix Flink 2.0 compatibility.

Verifying this change

Manually verified (Flink 1.18-2.0) + Unit tests

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): no
The public API, i.e., are there any changes to the CustomResourceDescriptors: no
Core observer or reconciler logic that is regularly executed: no

vsantwana · 2025-05-27T06:19:47Z

...-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/JobStatusObserver.java

+                        lastExceptionTs =
+                                EventUtils.findLastJobExceptionTsFromK8s(
+                                                ctx.getKubernetesClient(), resource)
+                                        .orElse(Instant.now().minus(MAX_K8S_EVENT_AGE));


Suggested change

-                                        .orElse(Instant.now().minus(MAX_K8S_EVENT_AGE));
+                                        .orElse(k8sExpirationTs);

gyfora · 2025-05-27

Good catch. I cleaned up and simplified the duplicated code in the method in a new commit, please check :)
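To make the suggestion concrete: computing the cutoff once and reusing it as the fallback keeps the filter and the default consistent, whereas a second Instant.now() call yields a slightly later cutoff. A minimal sketch, not the operator's actual code; the class name, the helper method, and the one-hour value for MAX_K8S_EVENT_AGE are assumptions made for illustration.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;

final class ExceptionTsFallback {
    // Assumed value for illustration; the operator's real constant may differ.
    static final Duration MAX_K8S_EVENT_AGE = Duration.ofHours(1);

    // Returns the timestamp of the last recorded exception event if one was
    // found, otherwise the shared cutoff that the caller computed once.
    static Instant initialLastExceptionTs(
            Optional<Instant> lastEventTsFromK8s, Instant k8sExpirationTs) {
        return lastEventTsFromK8s.orElse(k8sExpirationTs);
    }

    public static void main(String[] args) {
        // Compute the cutoff a single time and reuse it everywhere.
        Instant k8sExpirationTs = Instant.now().minus(MAX_K8S_EVENT_AGE);

        // Hypothetical stand-in for the result of
        // EventUtils.findLastJobExceptionTsFromK8s(...) when no event exists.
        Optional<Instant> noEventFound = Optional.empty();

        System.out.println(initialLastExceptionTs(noEventFound, k8sExpirationTs));
    }
}
```

Passing the cutoff in as a parameter also makes the fallback easy to unit test with fixed timestamps.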

vsantwana · 2025-05-27T06:21:18Z

Thanks @gyfora for the PR!
Left two very minor comments

rmetzger · 2025-05-27T06:34:57Z

I tested this PR on a dev env yesterday, and it all works (against Flink 1.19)

vsantwana

LGTM!
Thanks @gyfora

mxm · 2025-06-02T07:32:47Z

...-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/JobStatusObserver.java

+                    if (maxJobExceptionTs.isBefore(k8sExpirationTs)) {
+                        // If the last job exception was a long time ago, then there is no point in
+                        // checking in k8s.
+                        lastExceptionTs = maxJobExceptionTs;


Any reason for this optimization? It complicates the code by adding another setting. It also requires the user to tune yet another setting. There is no harm in calling out to the k8s API regularly to fetch events.

gyfora · 2025-06-02

There is no config for this (nothing to tune), and the optimization can be very important when the operator starts up: the cache is empty at that point, so it would fetch events for every single job. In most cases this filter eliminates that completely, which greatly reduces the API server load at startup.

mxm · 2025-06-02

Fair point. The value is hardcoded. We would only query for the jobs with exceptions, but those could still amount to quite a few jobs.
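The trade-off discussed in this thread can be sketched as follows. This is a simplified model under stated assumptions, not the code in JobStatusObserver: the class, the resolve method, and the Supplier standing in for the Kubernetes event lookup are hypothetical, the one-hour event age is assumed, and the fallback branch is inferred from the diff snippets quoted in this conversation.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

final class LastExceptionTsInit {
    // Decides where exception reporting should resume for one job.
    static Instant resolve(
            Instant maxJobExceptionTs,
            Instant k8sExpirationTs,
            Supplier<Optional<Instant>> k8sLookup) {
        if (maxJobExceptionTs.isBefore(k8sExpirationTs)) {
            // The newest job exception is older than any event Kubernetes
            // would still retain, so the API server is not queried at all.
            return maxJobExceptionTs;
        }
        // Otherwise ask Kubernetes, falling back to the shared cutoff.
        return k8sLookup.get().orElse(k8sExpirationTs);
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2025-06-02T08:00:00Z");
        Instant k8sExpirationTs = now.minus(Duration.ofHours(1)); // assumed event age

        // Count how often the (simulated) API server would be hit.
        AtomicInteger apiCalls = new AtomicInteger();
        Supplier<Optional<Instant>> lookup = () -> {
            apiCalls.incrementAndGet();
            return Optional.empty();
        };

        // Newest exception per job: two stale jobs and one recent one.
        List<Instant> newestExceptionPerJob = List.of(
                now.minus(Duration.ofHours(5)),
                now.minus(Duration.ofHours(2)),
                now.minus(Duration.ofMinutes(10)));

        for (Instant ts : newestExceptionPerJob) {
            resolve(ts, k8sExpirationTs, lookup);
        }
        System.out.println("k8s lookups: " + apiCalls.get()); // prints "k8s lookups: 1"
    }
}
```

In this model only the job with a recent exception triggers a lookup, which matches the argument that the filter matters most at startup, when the cache is empty and every job would otherwise be queried.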

[hotfix] Make exception reporting compatible with Flink 2.0

46ee95c

gyfora force-pushed the FLINK-37730 branch from f65f16a to 4b297e2 Compare May 26, 2025 14:33

[FLINK-37730] Improve exception recording ts initialization

4cf7f8a

gyfora force-pushed the FLINK-37730 branch from 4b297e2 to 4cf7f8a Compare May 26, 2025 14:56

vsantwana reviewed May 27, 2025

vsantwana reviewed May 27, 2025

gyfora added 2 commits May 27, 2025 09:16

cleanup

71d9581

Improve logging

4876184

vsantwana approved these changes May 27, 2025

rmetzger approved these changes May 27, 2025

gyfora merged commit 3c60c3c into apache:main May 27, 2025
130 checks passed

mxm reviewed Jun 2, 2025
