Conversation

@jiangzho (Contributor)

### What changes were proposed in this pull request?

This PR adds exception handling to the status recorder when persisting status. If an exception is encountered while updating app status, the reconciler finishes the current reconcile loop and requeues a new one instead of blindly throwing the exception.

### Why are the changes needed?

SparkAppReconciler does not handle exceptions when updating app status - these failed reconcile loops may end up in endless retries if the status update failure is caused by conflicts.

For example, we observe exceptions like the following when an app is starting:

```
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PUT at: https://kind-control-plane.vsl:6443/apis/spark.apache.org/v1/namespaces/default/sparkapplications/spark-example-retain-duration/status. Message: Operation cannot be fulfilled on sparkapplications.spark.apache.org "spark-example-retain-duration": the object has been modified; please apply your changes to the latest version and try again. Received status: Status(apiVersion=v1, code=409, details=StatusDetails(causes=[], group=spark.apache.org, kind=sparkapplications, name=spark-example-retain-duration, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=Operation cannot be fulfilled on sparkapplications.spark.apache.org "spark-example-retain-duration": the object has been modified; please apply your changes to the latest version and try again, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Conflict, status=Failure, additionalProperties={}).
at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:642) ~[spark-kubernetes-operator-0.5.0-SNAPSHOT-all.jar:?]
at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:622) ~[spark-kubernetes-operator-0.5.0-SNAPSHOT-all.jar:?]
at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.assertResponseCode(OperationSupport.java:582) ~[spark-kubernetes-operator-0.5.0-SNAPSHOT-all.jar:?]
at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.lambda$handleResponse$0(OperationSupport.java:549) ~[spark-kubernetes-operator-0.5.0-SNAPSHOT-all.jar:?]
at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:646) ~[?:?]
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510) ~[?:?]
at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147) ~[?:?]
at io.fabric8.kubernetes.client.http.StandardHttpClient.lambda$completeOrCancel$10(StandardHttpClient.java:141) ~[spark-kubernetes-operator-0.5.0-SNAPSHOT-all.jar:?]
```

Why is this happening? Because a reconcile can be triggered again by a driver pod status update while another reconcile is in progress. Without proper exception handling, this keeps recurring, wasting CPU time and confusing users in the logs.

Another corner case fixed by this patch: the current AppInitStep would mark the app as SchedulingFailure even if the resources were requested as expected and only the status update failed. This patch handles exceptions from resource creation and from the status update separately.

To spell this out: if an exception is thrown while updating app status (which typically happens at the end of each reconcile), the operator should properly finish this reconcile loop and start a new one. App status is fetched from the cache at the beginning of each reconcile, and our reconcile steps are already designed to be idempotent.
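
As a rough, hypothetical sketch of that pattern (the class, method, and type names below are illustrative only, not the operator's actual API), the init step can treat the two failure modes differently and turn a status-update conflict into a requeue rather than a thrown exception:

```java
import java.time.Duration;
import java.util.HashSet;
import java.util.Set;

/**
 * Minimal hypothetical sketch, not the operator's real classes: only a genuine
 * resource-creation failure becomes SchedulingFailure, while a failure persisting
 * status finishes the current loop and requeues, relying on idempotent steps.
 */
class ReconcileSketch {

  /** Stand-in for the reconciler's outcome: requeue after a delay, with a note. */
  record Outcome(Duration requeueAfter, String note) {}

  private final Set<String> requestedDrivers = new HashSet<>();

  Outcome reconcileInit(String appName) {
    try {
      requestDriverResources(appName);
    } catch (RuntimeException e) {
      // Only a genuine resource-creation failure should surface as SchedulingFailure.
      return new Outcome(Duration.ofSeconds(30), "SchedulingFailure: " + e.getMessage());
    }
    try {
      persistStatus(appName, "DriverRequested");
    } catch (RuntimeException e) {
      // e.g. a 409 conflict on the status subresource: resources were created fine,
      // so do not fail the app - finish this loop and requeue. The next loop reloads
      // status from the cache and the idempotent steps pick up where they left off.
      return new Outcome(Duration.ofSeconds(2), "status update conflict, requeue");
    }
    return new Outcome(Duration.ZERO, "proceed to next step");
  }

  /** Idempotent resource request: re-running it for the same app is a no-op. */
  void requestDriverResources(String appName) {
    requestedDrivers.add(appName);
  }

  /** Placeholder for the status update that can throw, e.g., on a 409 conflict. */
  void persistStatus(String appName, String state) { /* no-op in this sketch */ }

  boolean isDriverRequested(String appName) {
    return requestedDrivers.contains(appName);
  }
}
```

The point of the sketch is the control flow only: resource creation is the one place that can yield SchedulingFailure, and a failed status write merely defers to the next idempotent reconcile loop.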

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Tested from two perspectives:

1. CIs - also added a new unit test to validate update idempotency in the init step (a hypothetical sketch of such a check follows this list)
2. Checked logs: with this patch, no more exceptions like the above are thrown while running apps
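
For illustration only, a hypothetical JUnit 5 idempotency check against the ReconcileSketch above (not the actual test added in this PR) could look like this:

```java
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertTrue;

import org.junit.jupiter.api.Test;

/**
 * Hypothetical sketch of an idempotency check: running the init step twice for the
 * same app must neither fail nor change the outcome.
 */
class ReconcileSketchTest {

  @Test
  void initStepIsIdempotent() {
    ReconcileSketch sketch = new ReconcileSketch();

    ReconcileSketch.Outcome first = sketch.reconcileInit("spark-example");
    ReconcileSketch.Outcome second = sketch.reconcileInit("spark-example");

    // Re-running the step converges to the same decision and keeps resources requested.
    assertEquals(first.note(), second.note());
    assertTrue(sketch.isDriverRequested("spark-example"));
  }
}
```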

### Was this patch authored or co-authored using generative AI tooling?

No

@dongjoon-hyun (Member) left a comment

+1, LGTM (Pending CIs). Thank you, @jiangzho .

@dongjoon-hyun (Member) left a comment

Could you fix the integration test failures, @jiangzho ?

@jiangzho marked this pull request as draft September 25, 2025 17:08
@dongjoon-hyun marked this pull request as ready for review September 25, 2025 20:48
@dongjoon-hyun marked this pull request as draft September 25, 2025 20:48
@jiangzho marked this pull request as ready for review September 25, 2025 21:43
@jiangzho (Contributor, Author)

Thanks @dongjoon-hyun for the quick review!

Figured out the failure was caused by the requeue handling in one of the running steps - it led to a 2-minute latency between Succeeded and ResourceReleased. It's nice to see our e2e catches that.

@dongjoon-hyun (Member) left a comment

+1, LGTM. Ya, it's great to catch it. Thank you, @jiangzho .

Merged to main for v0.5.
