Incorrectly written BodyExecutionCallback implementations lead to zombie executions #60

mrginglymus · 2021-02-14T13:08:44Z

I'm raising this PR as a minimal reproduction of the issue I noticed from a bug in the withChecks step of the checks-api plugin: jenkinsci/checks-api-plugin#83

The withChecks step does not handle all possible exceptions in onStart, onSuccess, and onFailure. If one of these methods throws, the job appears to fail correctly but the executor on which the step is running is not freed, and you end up with a zombie execution that must be killed manually.
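For illustration, here is a minimal plugin-side sketch of a callback that never lets an exception escape. `BodyCallback`, `Publisher`, and `SafeChecksCallback` are hypothetical stand-ins invented for this example. They are not the real `BodyExecutionCallback` API or the actual checks-api code, and this is only one possible shape for such a guard.

```java
import java.util.ArrayList;
import java.util.List;

public class DefensiveCallbackSketch {
    /** Hypothetical stand-in for a body callback; NOT the real Jenkins API. */
    interface BodyCallback {
        void onStart();
        void onSuccess(Object result);
        void onFailure(Throwable t);
    }

    /** Hypothetical publisher that may throw, for example on a network error. */
    interface Publisher {
        void publish(String status);
    }

    /** Callback that records publisher failures instead of propagating them. */
    static final class SafeChecksCallback implements BodyCallback {
        private final Publisher publisher;
        final List<String> errors = new ArrayList<>();

        SafeChecksCallback(Publisher publisher) {
            this.publisher = publisher;
        }

        @Override public void onStart() { safely("IN_PROGRESS"); }
        @Override public void onSuccess(Object result) { safely("SUCCESS"); }
        @Override public void onFailure(Throwable t) { safely("FAILURE"); }

        // Every callback funnels through here so no unchecked exception escapes.
        private void safely(String status) {
            try {
                publisher.publish(status);
            } catch (RuntimeException e) {
                errors.add(status + ": " + e.getMessage());
            }
        }
    }

    public static void main(String[] args) {
        // A publisher that always fails, to exercise the guard.
        SafeChecksCallback cb = new SafeChecksCallback(status -> {
            throw new IllegalStateException("publish failed");
        });
        cb.onStart();
        cb.onSuccess(null);
        cb.onFailure(new RuntimeException("body failed"));
        System.out.println(cb.errors);
    }
}
```

In a real step the recorded errors would more likely go to the build log than to an in-memory list; the list just keeps the sketch self-contained.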

Whilst it is obviously the responsibility of the plugin to behave correctly, this failure mode is particularly pernicious. If you run a Jenkins instance with on-demand cloud agents, then any zombie execution will prevent its cloud agent from being shut down until you manually kill it. I have seen three-day-old zombies on Monday morning from an executor that was unnecessarily up all weekend.

This PR does not (yet) contain a fix as I haven't looked into the plumbing of how this actually works. If someone can quickly identify the quick fix for this then that would be greatly appreciated, otherwise I will put on my spelunking gear.

mrginglymus · 2021-02-14T13:16:36Z

It's also worth noting that I first saw this issue from an unhandled exception in onSuccess and onFailure, both of which appear to suffer from the same issue.

dwnusbaum

@mrginglymus Thanks for the PR!

> If someone can quickly identify the quick fix for this then that would be greatly appreciated, otherwise I will put on my spelunking gear.

I think the fix would be over in https://github.com/jenkinsci/workflow-cps-plugin (and I would probably go ahead and move this PR over there as well and add the new test to CpsBodyExecutionTest).

I am not sure exactly what a fix would look like. I would probably start by adding some try/catch blocks in CpsBodyExecution.start, CpsBodyExecution$FailureAdapter.receive, and CpsBodyExecution$SuccessAdapter.receive that just log exceptions thrown by the callbacks, and then I would add a test that fails in each method to see what other changes would be needed to prevent the execution's state from being corrupted if an exception is thrown by a BodyExecutionCallback. I am not sure if we can handle this case in general though: I think that for some steps (maybe parallel), the callbacks may need to execute for the step to be able to be cleaned up successfully. I think we should be able to handle failures in onStart though by just immediately aborting the step.
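As a rough sketch of the idea (not the actual workflow-cps code), guarding each callback invocation could look something like the following. `BodyCallback`, `Outcome`, and `run` are hypothetical names invented here. The real `CpsBodyExecution` is asynchronous and considerably more involved, so this only illustrates "log and continue" for completion callbacks and "abort immediately" for `onStart`.

```java
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

public class GuardedCallbackSketch {
    private static final Logger LOGGER =
            Logger.getLogger(GuardedCallbackSketch.class.getName());

    /** Hypothetical stand-in for a body callback; NOT the real Jenkins API. */
    interface BodyCallback {
        void onStart();
        void onSuccess(Object result);
        void onFailure(Throwable t);
    }

    enum Outcome { SUCCEEDED, FAILED, ABORTED }

    /** Runs the body synchronously, guarding every callback invocation. */
    static Outcome run(Runnable body, List<BodyCallback> callbacks) {
        for (BodyCallback cb : callbacks) {
            try {
                cb.onStart();
            } catch (RuntimeException e) {
                // Assumption: a failing onStart aborts the step before the body runs.
                // Simplification: callbacks that already started are not notified.
                LOGGER.log(Level.WARNING, "onStart threw; aborting step", e);
                return Outcome.ABORTED;
            }
        }

        Throwable failure = null;
        try {
            body.run();
        } catch (RuntimeException e) {
            failure = e;
        }

        for (BodyCallback cb : callbacks) {
            try {
                if (failure == null) {
                    cb.onSuccess(null);
                } else {
                    cb.onFailure(failure);
                }
            } catch (RuntimeException e) {
                // Log and continue so one bad callback cannot leave the execution hanging.
                LOGGER.log(Level.WARNING, "completion callback threw; continuing", e);
            }
        }
        return failure == null ? Outcome.SUCCEEDED : Outcome.FAILED;
    }
}
```

This does not address the open question above: for steps whose callbacks must run for cleanup to succeed, merely logging the exception may not be enough.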

mrginglymus · 2021-02-16T21:43:34Z

Great, thanks for the pointers - I realised after raising this that this probably wasn't the right place. I'll move it over and have a play.

mrginglymus · 2021-03-03T15:01:40Z

Finally got round to raising jenkinsci/workflow-cps-plugin#422

Add test to demonstrate zombie executions

5a7a33f

mrginglymus mentioned this pull request Feb 14, 2021

Handle failure in publish invocation jenkinsci/checks-api-plugin#83

Closed

dwnusbaum reviewed Feb 15, 2021

mrginglymus closed this Mar 3, 2021
