Conversation

@jiangzho (Contributor)

### What changes were proposed in this pull request?

This PR adds exception handling to the status recorder when persisting status. If an exception is encountered while updating app status, the reconciler finishes the current reconcile loop and requeues a new one instead of blindly throwing the exception.

### Why are the changes needed?

SparkAppReconciler does not handle exceptions when updating app status - these failed reconcile loops may end up in endless retries if the status update failure is caused by conflicts.

For example, we observe exceptions like the following when an app is starting:

```
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PUT at: https://kind-control-plane.vsl:6443/apis/spark.apache.org/v1/namespaces/default/sparkapplications/spark-example-retain-duration/status. Message: Operation cannot be fulfilled on sparkapplications.spark.apache.org "spark-example-retain-duration": the object has been modified; please apply your changes to the latest version and try again. Received status: Status(apiVersion=v1, code=409, details=StatusDetails(causes=[], group=spark.apache.org, kind=sparkapplications, name=spark-example-retain-duration, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=Operation cannot be fulfilled on sparkapplications.spark.apache.org "spark-example-retain-duration": the object has been modified; please apply your changes to the latest version and try again, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Conflict, status=Failure, additionalProperties={}).
at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:642) ~[spark-kubernetes-operator-0.5.0-SNAPSHOT-all.jar:?]
at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:622) ~[spark-kubernetes-operator-0.5.0-SNAPSHOT-all.jar:?]
at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.assertResponseCode(OperationSupport.java:582) ~[spark-kubernetes-operator-0.5.0-SNAPSHOT-all.jar:?]
at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.lambda$handleResponse$0(OperationSupport.java:549) ~[spark-kubernetes-operator-0.5.0-SNAPSHOT-all.jar:?]
at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:646) ~[?:?]
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510) ~[?:?]
at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147) ~[?:?]
at io.fabric8.kubernetes.client.http.StandardHttpClient.lambda$completeOrCancel$10(StandardHttpClient.java:141) ~[spark-kubernetes-operator-0.5.0-SNAPSHOT-all.jar:?]
```

Why is this happening? Because a reconcile can be triggered again by a driver pod status update while another reconcile is in progress. Without proper exception handling, this keeps recurring, wasting CPU time and confusing users in the logs.

Another corner case fixed by this patch: the current AppInitStep would mark the app as SchedulingFailure even if the resources were requested as expected and only the status update failed. This patch handles exceptions from resource creation and from the status update separately.

To spell this out: if an exception is thrown while updating app status (which typically happens at the end of each reconcile), the operator should properly finish this reconcile loop and start a new one. App status is fetched from the cache at the beginning of each reconcile, and our reconcile steps are already designed to be idempotent.
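
As a rough, hypothetical sketch of that pattern (the class, method, and type names below are illustrative only, not the operator's actual API), the init step can treat the two failure modes differently and turn a status-update conflict into a requeue rather than a thrown exception:

```java
import java.time.Duration;
import java.util.HashSet;
import java.util.Set;

/**
 * Minimal hypothetical sketch, not the operator's real classes: only a genuine
 * resource-creation failure becomes SchedulingFailure, while a failure persisting
 * status finishes the current loop and requeues, relying on idempotent steps.
 */
class ReconcileSketch {

  /** Stand-in for the reconciler's outcome: requeue after a delay, with a note. */
  record Outcome(Duration requeueAfter, String note) {}

  private final Set<String> requestedDrivers = new HashSet<>();

  Outcome reconcileInit(String appName) {
    try {
      requestDriverResources(appName);
    } catch (RuntimeException e) {
      // Only a genuine resource-creation failure should surface as SchedulingFailure.
      return new Outcome(Duration.ofSeconds(30), "SchedulingFailure: " + e.getMessage());
    }
    try {
      persistStatus(appName, "DriverRequested");
    } catch (RuntimeException e) {
      // e.g. a 409 conflict on the status subresource: resources were created fine,
      // so do not fail the app - finish this loop and requeue. The next loop reloads
      // status from the cache and the idempotent steps pick up where they left off.
      return new Outcome(Duration.ofSeconds(2), "status update conflict, requeue");
    }
    return new Outcome(Duration.ZERO, "proceed to next step");
  }

  /** Idempotent resource request: re-running it for the same app is a no-op. */
  void requestDriverResources(String appName) {
    requestedDrivers.add(appName);
  }

  /** Placeholder for the status update that can throw, e.g., on a 409 conflict. */
  void persistStatus(String appName, String state) { /* no-op in this sketch */ }

  boolean isDriverRequested(String appName) {
    return requestedDrivers.contains(appName);
  }
}
```

The point of the sketch is the control flow only: resource creation is the one place that can yield SchedulingFailure, and a failed status write merely defers to the next idempotent reconcile loop.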

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Tested from two perspectives:

1. CIs - also added a new unit test to validate update idempotency in the init step (a hypothetical sketch of such a check follows this list)
2. Checked logs: with this patch, no more exceptions like the above are thrown while running apps
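
For illustration only, a hypothetical JUnit 5 idempotency check against the ReconcileSketch above (not the actual test added in this PR) could look like this:

```java
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertTrue;

import org.junit.jupiter.api.Test;

/**
 * Hypothetical sketch of an idempotency check: running the init step twice for the
 * same app must neither fail nor change the outcome.
 */
class ReconcileSketchTest {

  @Test
  void initStepIsIdempotent() {
    ReconcileSketch sketch = new ReconcileSketch();

    ReconcileSketch.Outcome first = sketch.reconcileInit("spark-example");
    ReconcileSketch.Outcome second = sketch.reconcileInit("spark-example");

    // Re-running the step converges to the same decision and keeps resources requested.
    assertEquals(first.note(), second.note());
    assertTrue(sketch.isDriverRequested("spark-example"));
  }
}
```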

### Was this patch authored or co-authored using generative AI tooling?

No

@dongjoon-hyun (Member) left a comment

+1, LGTM (Pending CIs). Thank you, @jiangzho .

@dongjoon-hyun (Member) left a comment

Could you fix the integration test failures, @jiangzho ?

@jiangzho marked this pull request as draft September 25, 2025 17:08
@dongjoon-hyun marked this pull request as ready for review September 25, 2025 20:48
@dongjoon-hyun marked this pull request as draft September 25, 2025 20:48
@jiangzho marked this pull request as ready for review September 25, 2025 21:43
@jiangzho (Contributor, Author)

Thanks @dongjoon-hyun for the quick review!

Figured out the failure was caused by the requeue handling in one of the running steps - it led to a 2-minute latency between Succeeded and ResourceReleased. It's nice to see our e2e catches that.

@dongjoon-hyun (Member) left a comment

+1, LGTM. Ya, it's great to catch it. Thank you, @jiangzho .

Merged to main for v0.5.
