Ensure that Operations are aborted when MapTaskExecutor is closed. Add tests around setup/teardown of DoFns #36631
Conversation
Force-pushed from 1633d31 to 2ae939a
Force-pushed from e1468a3 to 4bf5b7f
Error looks unrelated, sent out #36754

Run Java PreCommit
Assigning reviewers: R: @Abacn added as fallback since no labels match configuration.
Note: If you would like to opt out of this review, comment one of the available commands. The PR bot will only process comments in the main thread (not review comments).
Thanks, had a few comments. Would you mind updating the PR description to note which scenario this PR is supposed to fix?
Reading #18592 (comment), it seems this fixed teardown-on-setup-exception on the Dataflow streaming runner. Would it also apply to batch? However, I tested this PR with my old test case for #31381, and currently the behavior appears to remain the same as HEAD. In particular, I have 3 fused DoFns and throw in DoFn2's setup.
For batch, only DoFn2's teardown is invoked. However, I also notice that previously set-up DoFns are not re-set-up either, e.g.
```
@Setup3 null, 743 at 277fbd5a
@Setup2 null, 510 at 67875efd
@Teardown2 510 at 67875efd
Failure processing work item google.com:clouddfe;2025-11-12_09_44_38-12709838531006866747;5913875962390955537: Uncaught exception occurred during work unit execution. This will be retried.
Finished processing stage s01 with 1 errors in 0.87 seconds
Starting MapTask stage s01
@Setup2 null, 399 at 145c86d
@Setup null, 309 at 23ddbaee
@StartBundle3 743 at 277fbd5a
...
```
(job id 2025-11-12_09_44_38-12709838531006866747)
For streaming, the runner creates 20 DoFn instances at one time; one crashed at setup and its teardown was called, but DoFn2 was not re-set-up (I still get 20 counts for DoFn2 setup) (job id 2025-11-12_11_32_33-12848924584850197035).
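The batch behavior in the log above can be modeled with a small stand-alone simulation (plain Java; `Stage` and `trySetup` are illustrative names, not the Beam API): stages are set up in order, the failing stage is the only one torn down, and already-set-up instances stay cached and are reused on retry without a fresh `@Setup`.

```java
import java.util.List;

/** Stand-alone model (not Beam code) of the setup/teardown behavior in the logs. */
public class LifecycleSim {
  public static class Stage {
    public final String name;
    public boolean failInSetup;
    public int setups, teardowns;
    public Stage(String name, boolean failInSetup) {
      this.name = name;
      this.failInSetup = failInSetup;
    }
  }

  /**
   * Sets up each stage that does not already have a cached, set-up instance.
   * If a setup throws, only the failing stage's teardown is invoked; stages
   * that were already set up stay cached for reuse on the next attempt.
   */
  public static boolean trySetup(List<Stage> stages) {
    for (Stage s : stages) {
      if (s.setups > s.teardowns) {
        continue; // cached instance, reused as-is (no re-setup)
      }
      try {
        if (s.failInSetup) {
          throw new RuntimeException("setup failed: " + s.name);
        }
        s.setups++;
      } catch (RuntimeException e) {
        s.teardowns++; // only the failing stage is torn down
        return false;
      }
    }
    return true;
  }
}
```

Under this model, a retry after the failure re-runs setup only for the failed stage and any stages never reached, matching the reuse of the `277fbd5a` instance seen in the log.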
```java
} catch (RuntimeException exn) {
  for (Operation o : createdOperations) {
    try {
      o.abort();
```
We are now evaluating teardown for all ParDos when one throws. Given the change, would it be straightforward to also fix the case where the DoFn finished normally (e.g. by changing the catch to finally), or would that need further consideration?
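To make the catch-vs-finally distinction concrete, here is a stand-alone sketch (illustrative names only, not the actual Operation API): with catch, created operations are aborted only when construction fails; with finally, they would also be aborted on the success path, after the executor has been returned for reuse.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only; Op and the log strings are not Beam API. */
public class CleanupShapes {
  interface Op {
    void abort();
  }

  /** catch-shape: abort runs only when construction throws. */
  public static List<String> withCatch(boolean fail) {
    List<String> log = new ArrayList<>();
    List<Op> created = new ArrayList<>();
    try {
      created.add(() -> log.add("abort"));
      if (fail) throw new RuntimeException("construction failed");
      log.add("returned executor");
    } catch (RuntimeException exn) {
      for (Op o : created) o.abort();
    }
    return log;
  }

  /** finally-shape: abort also runs on success, after the executor is usable. */
  public static List<String> withFinally(boolean fail) {
    List<String> log = new ArrayList<>();
    List<Op> created = new ArrayList<>();
    try {
      created.add(() -> log.add("abort"));
      if (fail) throw new RuntimeException("construction failed");
      log.add("returned executor");
    } catch (RuntimeException exn) {
      // swallowed for the demo
    } finally {
      for (Op o : created) o.abort();
    }
    return log;
  }
}
```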
This is when creating the executor; if there is no error creating it, then we don't want to abort here, because the executor is then returned and reused many times. If we want to ensure we tear down DoFns, I think we need some timeout on the internal cache of DoFns in DoFnInstanceManagers.java.
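One possible shape for such a timeout, sketched as plain Java (the class and method names here are hypothetical, not the actual DoFnInstanceManagers API): cached instances that have sat idle longer than a TTL are evicted and torn down, while fresh instances are still handed back for reuse.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Consumer;
import java.util.function.LongSupplier;

/** Hypothetical idle-timeout pool; not the actual DoFnInstanceManagers API. */
public class IdleEvictingPool<T> {
  private static class Entry<T> {
    final T value;
    final long returnedAtMillis;
    Entry(T value, long returnedAtMillis) {
      this.value = value;
      this.returnedAtMillis = returnedAtMillis;
    }
  }

  private final Deque<Entry<T>> idle = new ArrayDeque<>();
  private final long ttlMillis;
  private final Consumer<T> teardown;
  private final LongSupplier clock; // injectable so tests can control time

  public IdleEvictingPool(long ttlMillis, Consumer<T> teardown, LongSupplier clock) {
    this.ttlMillis = ttlMillis;
    this.teardown = teardown;
    this.clock = clock;
  }

  /** Returns a cached instance if a fresh one exists, else null (caller creates + sets up). */
  public synchronized T tryGet() {
    evictExpired();
    Entry<T> e = idle.pollFirst();
    return e == null ? null : e.value;
  }

  /** Returns an instance to the pool, e.g. after finishBundle. */
  public synchronized void release(T value) {
    idle.addFirst(new Entry<>(value, clock.getAsLong()));
  }

  /** Tears down instances idle longer than the TTL (oldest entries are at the tail). */
  public synchronized void evictExpired() {
    long now = clock.getAsLong();
    while (!idle.isEmpty() && now - idle.peekLast().returnedAtMillis > ttlMillis) {
      teardown.accept(idle.pollLast().value);
    }
  }
}
```

`evictExpired` would also need to be driven by a periodic sweep, since a DoFn that is never requested again would otherwise only be evicted on the next `tryGet`.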
...org/apache/beam/runners/dataflow/worker/windmill/work/processing/StreamingWorkScheduler.java
…that previously created DoFns are torn down. Ensure that when we invalidate a WorkExecutor we close the MapTaskExecutor, which ends up calling teardown on the DoFns. This actually doesn't matter, because the Operations already tear down if there is an exception during processing, and SimpleParDoFn checks out and returns the actual DoFn only during bundle processing. Added test coverage of this.
Sorry, I've clarified in the description what this PR does and how our expectations of when setup/teardown are called were incorrect. So this fixes some missing abort calls, but those calls ended up being no-ops for ParDoOperation (though they may be needed for other operation types). This PR adds tests validating reuse of DoFns.
Hopefully this is clarified by the updated PR description. This is actually the expected and good behavior of the code. We don't need to throw away DoFn2 in the case where a DoFn had an error in setup. The DoFns are independent at this point, and we haven't interacted with the DoFn2 instance in a way that would prevent reusing it. We are still respecting DoFn2's lifecycle: instead of teardown/re-setup, we just cache and reuse the DoFn2 instance when recreating a new MapTaskExecutor.
...c/test/java/org/apache/beam/runners/dataflow/worker/IntrinsicMapTaskExecutorFactoryTest.java
Force-pushed from 4bf5b7f to b7448da
This PR began by adding unit tests to expose the problems reported in #18592.
The added tests did show that we are not calling Operation.abort in all cases.
After addressing these fixes, we call abort on Operations (including ParDoOperation) when they are being abandoned. Additional tests were then added at the DoFn level to verify that teardown was called, and surprisingly they showed that we were still not calling teardown. Further investigation shows that this is due to ParDoOperation containing a SimpleParDoFn, which lazily claims a DoFn from DoFnInstanceManager (which has an internal cache) only when a bundle begins processing, releasing the DoFn back to the cache in finishBundle. Since the cases where we were missing calls to ParDoOperation.abort (addressed by changes in this PR) were not between startBundle/finishBundle, ParDoOperation.abort is a no-op.
So these tests show that our expectations about when teardown would be called were incorrect, and that we are not leaking DoFns but are instead caching them and would reuse them on retry. The added StreamingDataflowWorker test verifies this behavior.
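The lazy-claim behavior described above can be modeled as follows (plain Java; the class and the pool are illustrative stand-ins, not the actual SimpleParDoFn/DoFnInstanceManager classes): the operation holds a DoFn only between startBundle and finishBundle, so abort outside a bundle has nothing to tear down, and the instance stays cached for reuse.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Illustrative model of SimpleParDoFn's lazy DoFn claim; not Beam code. */
public class LazyClaimOp {
  private final Deque<String> pool; // stands in for DoFnInstanceManager's cache
  private String claimed;           // non-null only between startBundle/finishBundle
  public int teardowns;

  public LazyClaimOp(Deque<String> pool) {
    this.pool = pool;
  }

  public void startBundle() {
    // Claim lazily, only when a bundle actually begins processing.
    claimed = pool.isEmpty() ? "newFn" : pool.poll();
  }

  public void finishBundle() {
    pool.push(claimed); // release back to the cache for reuse
    claimed = null;
  }

  /** Tears down only a currently claimed DoFn; otherwise a no-op. */
  public void abort() {
    if (claimed != null) {
      teardowns++; // a real implementation would invoke the DoFn's @Teardown here
      claimed = null;
    }
  }
}
```

This matches the finding above: the newly added abort calls happen outside startBundle/finishBundle, so they find no claimed DoFn and tear nothing down; the instances remain cached.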
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
- Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
- Update CHANGES.md with noteworthy changes.

See the Contributor Guide for more tips on how to make the review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.