[SPARK-53706] App reconcile steps should properly handle exceptions in status update #341
Conversation
What changes were proposed in this pull request?
This PR adds exception handling to the status recorder when persisting status. If an exception is encountered while updating app status, the reconciler finishes the current reconcile loop and requeues a new one instead of blindly throwing the exception.
Why are the changes needed?
SparkAppReconciler does not handle exceptions when updating app status; if the status update fails because of conflicts, the failed reconcile loops may end up in endless retries.
For example, we observe exceptions like the following while an app is starting:
```
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PUT at: https://kind-control-plane.vsl:6443/apis/spark.apache.org/v1/namespaces/default/sparkapplications/spark-example-retain-duration/status. Message: Operation cannot be fulfilled on sparkapplications.spark.apache.org "spark-example-retain-duration": the object has been modified; please apply your changes to the latest version and try again. Received status: Status(apiVersion=v1, code=409, details=StatusDetails(causes=[], group=spark.apache.org, kind=sparkapplications, name=spark-example-retain-duration, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=Operation cannot be fulfilled on sparkapplications.spark.apache.org "spark-example-retain-duration": the object has been modified; please apply your changes to the latest version and try again, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Conflict, status=Failure, additionalProperties={}).
  at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:642) ~[spark-kubernetes-operator-0.5.0-SNAPSHOT-all.jar:?]
  at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:622) ~[spark-kubernetes-operator-0.5.0-SNAPSHOT-all.jar:?]
  at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.assertResponseCode(OperationSupport.java:582) ~[spark-kubernetes-operator-0.5.0-SNAPSHOT-all.jar:?]
  at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.lambda$handleResponse$0(OperationSupport.java:549) ~[spark-kubernetes-operator-0.5.0-SNAPSHOT-all.jar:?]
  at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:646) ~[?:?]
  at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510) ~[?:?]
  at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147) ~[?:?]
  at io.fabric8.kubernetes.client.http.StandardHttpClient.lambda$completeOrCancel$10(StandardHttpClient.java:141) ~[spark-kubernetes-operator-0.5.0-SNAPSHOT-all.jar:?]
```
Why is this happening? Because a reconcile can be triggered again by a driver pod status update while another reconcile is in progress. Without proper exception handling, this keeps recurring, wasting CPU time and cluttering the logs for users.
Another corner case fixed by this patch: the current AppInitStep would mark an app as SchedulingFailure even if the resources were requested as expected and only the status update failed. This patch handles exceptions for resource creation and status update separately.
We should handle this more gracefully: if an exception is thrown while updating app status (which typically happens at the end of each reconcile), the operator should properly finish the current reconcile loop and start a new one. App status is fetched from the cache at the beginning of each reconcile, and our reconcile steps are already designed to be idempotent.
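To make that control flow concrete, here is a minimal illustrative sketch of the pattern described above. Every name in it (ReconcileExceptionHandlingSketch, ReconcileOutcome, ResourceClient, StatusRecorder, initStep, the "DriverRequested" state) is a hypothetical stand-in rather than the operator's actual API: a resource-creation failure is still treated as a scheduling failure, while a status-update failure merely ends the current loop with a requeue.
```java
import io.fabric8.kubernetes.client.KubernetesClientException;

// Illustrative sketch only: all names below are hypothetical and do not correspond to
// spark-kubernetes-operator's actual classes or methods.
public class ReconcileExceptionHandlingSketch {

  /** How a single reconcile pass ends. */
  enum ReconcileOutcome { COMPLETED, REQUEUE, SCHEDULING_FAILURE }

  /** Stand-in for the component that requests driver resources from the API server. */
  interface ResourceClient {
    void createDriverResources(String appName);
  }

  /** Stand-in for the component that persists application status to the API server. */
  interface StatusRecorder {
    void persistStatus(String appName, String newState);
  }

  /**
   * Init-step sketch: resource creation and status update are guarded separately, so a
   * 409 conflict on the status update no longer marks the app as a scheduling failure.
   * A status-update failure just ends the current loop and asks for a requeue; the next
   * reconcile re-reads the status from the cache and retries the idempotent update.
   */
  static ReconcileOutcome initStep(ResourceClient resources, StatusRecorder recorder, String app) {
    try {
      resources.createDriverResources(app);
    } catch (KubernetesClientException e) {
      // Only a failure to request resources counts as a scheduling failure.
      return ReconcileOutcome.SCHEDULING_FAILURE;
    }
    try {
      recorder.persistStatus(app, "DriverRequested");
      return ReconcileOutcome.COMPLETED;
    } catch (KubernetesClientException e) {
      // Resources were requested successfully; do not fail the app, just requeue.
      return ReconcileOutcome.REQUEUE;
    }
  }
}
```
The design point is that the status update is the last action of a pass, so abandoning it loses nothing: the next pass re-reads the cached status and repeats the idempotent steps.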
Does this PR introduce any user-facing change?
No
How was this patch tested?
Tested from two perspectives:
1. CIs - also added a new unit test to validate update idempotency in the init step.
2. Checked logs: with this patch, exceptions like the one above are no longer thrown while running apps.
Was this patch authored or co-authored using generative AI tooling?
No
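The unit test added in this PR is not reproduced here. Purely as an illustration of the behavior being validated (a status-update conflict causes a requeue, and the next idempotent pass completes without requesting resources twice), a test against the hypothetical stubs from the sketch above could look like this:
```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import io.fabric8.kubernetes.client.KubernetesClientException;
import java.util.HashSet;
import java.util.Set;
import org.junit.jupiter.api.Test;

// Hypothetical sketch reusing the stub types from the block above; it is NOT the unit
// test added in this PR, only an illustration of "requeue, then retry idempotently".
class InitStepRequeueSketchTest {

  @Test
  void statusUpdateConflictIsRetriedOnNextReconcile() {
    Set<String> created = new HashSet<>();
    // Stub resource client with create-or-get semantics, so re-running the step is harmless.
    ReconcileExceptionHandlingSketch.ResourceClient resources = created::add;

    // Stub recorder that fails with a conflict on the first call, then succeeds.
    var recorder = new ReconcileExceptionHandlingSketch.StatusRecorder() {
      int calls = 0;
      @Override
      public void persistStatus(String app, String newState) {
        if (calls++ == 0) {
          throw new KubernetesClientException(
              "Conflict: the object has been modified; please apply your changes to the latest version and try again");
        }
      }
    };

    var first = ReconcileExceptionHandlingSketch.initStep(resources, recorder, "my-app");
    var second = ReconcileExceptionHandlingSketch.initStep(resources, recorder, "my-app");

    assertEquals(ReconcileExceptionHandlingSketch.ReconcileOutcome.REQUEUE, first);
    assertEquals(ReconcileExceptionHandlingSketch.ReconcileOutcome.COMPLETED, second);
    assertEquals(1, created.size()); // resources end up requested exactly once overall
  }
}
```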
dongjoon-hyun
left a comment
+1, LGTM (Pending CIs). Thank you, @jiangzho .
dongjoon-hyun
left a comment
Could you fix the integration test failures, @jiangzho ?
Thanks @dongjoon-hyun for the quick review! Figured out the failure was caused by the requeue handling in one of the running steps, which led to a 2-minute latency in between.
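For context on that latency: the difference is between waiting for a long periodic requeue interval and explicitly asking for a prompt retry after a status-update conflict. The sketch below is purely illustrative; the concrete delays and the mechanism the operator actually uses may differ.
```java
import java.time.Duration;

// Illustrative only: the operator's real requeue handling and intervals may differ.
public final class RequeueDelaySketch {

  /** Long periodic requeue; waiting this long after a conflict adds minutes of latency. */
  static final Duration PERIODIC_REQUEUE = Duration.ofMinutes(2);

  /** Short delay for retrying an idempotent status update that hit a 409 conflict. */
  static final Duration CONFLICT_RETRY = Duration.ofSeconds(2);

  /** Pick how soon to run the next reconcile after the current one finishes. */
  static Duration nextRequeueDelay(boolean statusUpdateConflicted) {
    return statusUpdateConflicted ? CONFLICT_RETRY : PERIODIC_REQUEUE;
  }
}
```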
dongjoon-hyun
left a comment
+1, LGTM. Ya, it's great to catch it. Thank you, @jiangzho .
Merged to main for v0.5.