
Conversation

@nishita-09
Contributor

@nishita-09 nishita-09 commented Jul 11, 2025

What is the purpose of the change

Currently the lifecycle state shows STABLE even if an application deployment has been deleted and stays in a MISSING / reconciling state. This change fixes the lifecycle status of the deployment so that it is reported as FAILED when the JM is in a MISSING/ERROR state with an error such as "configmaps have been deleted", which indicates a TERMINAL error.

Brief change log

  • Added logic to detect unrecoverable FlinkDeployment scenarios
  • Inserted after the existing job status FAILED check
  • Checks the JobManager deployment status (see the sketch after this list):
    • MISSING JobManager deployment with a terminal error ({"type":"org.apache.flink.kubernetes.operator.exception.UpgradeFailureException","message":"HA metadata not available to restore from last state. It is possible that the job has finished or terminally failed, or the configmaps have been deleted. ","additionalMetadata":{},"throwableList":[]}) → return FAILED lifecycle state
    • MISSING JobManager deployment (no error or a recoverable error) → return DEPLOYED lifecycle state
    • ERROR JobManager deployment without a terminal error → return DEPLOYED lifecycle state
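
Read together, the three branches amount to the minimal Java sketch below. The isTerminalError helper name and the exact placement inside the lifecycle-state computation are assumptions for illustration, not the PR's literal code; JobManagerDeploymentStatus and ResourceLifecycleState are the operator's existing enums.

    // Hedged sketch; helper name and exact placement are assumed.
    JobManagerDeploymentStatus jmStatus = status.getJobManagerDeploymentStatus();
    String error = status.getError();
    if (jmStatus == JobManagerDeploymentStatus.MISSING
            || jmStatus == JobManagerDeploymentStatus.ERROR) {
        // A terminal error (e.g. the configmaps have been deleted) means the
        // deployment cannot recover on its own, so report FAILED.
        if (error != null && isTerminalError(error)) {
            return ResourceLifecycleState.FAILED;
        }
        // Otherwise the operator may still reconcile the deployment back to health.
        return ResourceLifecycleState.DEPLOYED;
    }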

Verifying this change

This change added tests and can be verified as follows:

  • Added unit tests in ResourceLifecycleMetricsTest to validate the lifecycle status of a FlinkDeployment (a hedged test sketch follows the screenshots below)
  • Manually verified the change by running a cluster with 3 application Flink clusters:
    • Application Deployment 1 -> invalid image -> ERROR JobManager deployment status -> DEPLOYED lifecycle status of the FlinkDeployment
    • Application Deployment 2 -> valid image, JobManager deployment deleted -> MISSING JM status -> FAILED lifecycle status of the FlinkDeployment
    • Application Deployment 3 -> valid image, invalid node selector -> JobManager deployment status DEPLOYING -> hence the lifecycle status is STABLE
(Five screenshots attached, showing the JobManager deployment statuses and the resulting lifecycle states of the three test deployments.)
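
On the unit-test side, the MISSING-JM-with-terminal-error case might look roughly like the following; TestUtils.buildApplicationCluster() and the setter/accessor names are assumptions based on the operator's test utilities, not the exact test added in this PR.

    @Test
    void lifecycleStateIsFailedForMissingJmWithTerminalError() {
        // Hypothetical sketch; builder and accessor names are assumed.
        FlinkDeployment deployment = TestUtils.buildApplicationCluster();
        deployment.getStatus()
                .setJobManagerDeploymentStatus(JobManagerDeploymentStatus.MISSING);
        deployment.getStatus().setError(
                "HA metadata not available to restore from last state. It is possible "
                        + "that the job has finished or terminally failed, or the "
                        + "configmaps have been deleted.");
        assertEquals(
                ResourceLifecycleState.FAILED,
                deployment.getStatus().getLifecycleState());
    }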

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., are there any changes to the CustomResourceDescriptors: no
  • Core observer or reconciler logic that is regularly executed: yes

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

@nishita-09 nishita-09 changed the title [FLINK-32033][Kubernetes-Operator] Fix Lifecycle status in case of MISSING/ERROR JM status [FLINK-32033][Kubernetes-Operator] Fix Lifecycle Status of FlinkDeployment Resource in case of MISSING/ERROR JM status Jul 11, 2025
nishita-pattanayak added 2 commits July 13, 2025 12:57
…SSING/ERROR JM status with unrecoverable error
…SSING/ERROR JM status with unrecoverable error
Comment on lines 103 to 110
&& (error.toLowerCase()
.contains(
"it is possible that the job has finished or terminally failed, or the configmaps have been deleted")
|| error.toLowerCase().contains("manual restore required")
|| error.toLowerCase().contains("ha metadata not available")
|| error.toLowerCase()
.contains(
"ha data is not available to make stateful upgrades"))) {
Contributor

Why are we checking this specific error?
In any case, we are the ones triggering this error, so please create a constant in the AbstractJobReconciler and use that here.

Contributor Author

@gyfora This seems to be the only case where we know that the cluster cannot recover on its own and needs a manual restore, hence the check. I will set this as a constant instead for cleaner code.

Contributor Author

@gyfora

  1. There are multiple instances where "HA metadata not available" is written in different forms, e.g. "HA metadata not available" and "HA data is not available". Should we maintain uniformity by changing these exception messages to use a constant (now that one is available)?

  2. Also, flink-operator-api currently does not have flink-operator as a dependency -> to use the constants in AbstractJobReconciler we would have to import it as a dependency, since the status change logic resides in flink-operator-api.
    Should I still go ahead with this?

Contributor

If possible, let's use a single constant, and we can keep that constant in the operator api module so the reconciler can use it.

Contributor Author

@gyfora
I have added 3 constants for error messages that are frequently used and indicate a terminal state, and referenced them in the reconcilers to maintain uniformity. I have also tried to keep the net changes minimal (although a few error messages will differ slightly). Do let me know if this looks good.
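
For reference, the shape of that change is roughly as follows; the constant names and values here are hypothetical, and the actual identifiers in the PR may differ.

    // Hypothetical constants in the operator-api module (names/values assumed):
    public static final String MSG_HA_METADATA_NOT_AVAILABLE =
            "HA metadata is not available to restore from last state.";
    public static final String MSG_JOB_FINISHED_OR_CONFIGMAPS_DELETED =
            "It is possible that the job has finished or terminally failed, "
                    + "or the configmaps have been deleted.";
    public static final String MSG_MANUAL_RESTORE_REQUIRED = "Manual restore required.";

    // Both the reconcilers and the lifecycle-state check can then reference the
    // same constants instead of duplicating free-form strings:
    private static boolean isTerminalError(String error) {
        String err = error.toLowerCase();
        return err.contains(MSG_JOB_FINISHED_OR_CONFIGMAPS_DELETED.toLowerCase())
                || err.contains(MSG_HA_METADATA_NOT_AVAILABLE.toLowerCase())
                || err.contains(MSG_MANUAL_RESTORE_REQUIRED.toLowerCase());
    }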

@nishita-09 nishita-09 requested a review from gyfora July 14, 2025 14:38
@nishita-09
Contributor Author

nishita-09 commented Jul 21, 2025

@gyfora I have corrected the tests that were failing in the above run due to the string changes. Validated locally; all tests passed. Can you please rerun the build?

@nishita-09
Contributor Author

@gyfora The branch was not up to date, so I had to update it; this will need to be built again. Also, kindly let me know if there are any further comments on the changes; I will get on top of them, if any.

@gyfora gyfora merged commit d5f4753 into apache:main Jul 28, 2025
121 checks passed