Skip to content

Conversation

@mrginglymus
Copy link

I'm raising this PR as a minimal reproduction of the issue I noticed from a bug in the withChecks step of the checks-api plugin: jenkinsci/checks-api-plugin#83

The withChecks step does not handle all possible exceptions in onStart, onSuccess, and onFailure. If an exception gets thrown by one of these methods, while the job appears to fail correctly, the executor on which the step is running does not, and you end up with a execution that must be killed manually.

Whilst it is obviously the responsibility of the plugin to behave correctly, this failure mode is particularly pernicious. If you run a jenkins instance with on-demand cloud agents then any zombie executions will prevent that cloud agent from being shut down until you manually kill it. I have seen three day old zombies on Monday morning from an executor that was unnecessarily up all weekend.

This PR does not (yet) contain a fix as I haven't looked into the plumbing of how this actually works. If someone can quickly identify the quick fix for this then that would be greatly appreciated, otherwise I will put on my spelunking gear.

@mrginglymus
Copy link
Author

It's also worth noting that I first saw this issue from an unhandled in onSuccess and onFailure, both of which appear to also suffer from the same issue.

Copy link
Member

@dwnusbaum dwnusbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mrginglymus Thanks for the PR!

If someone can quickly identify the quick fix for this then that would be greatly appreciated, otherwise I will put on my spelunking gear.

I think the fix would be over in https://github.com/jenkinsci/workflow-cps-plugin (and I would probably go ahead and move this PR over there as well and add the new test to CpsBodyExecutionTest).

I am not sure exactly what a fix would look like. I would probably start by adding some try/catch blocks in CpsBodyExecution.start, CpsBodyExecution$FailureAdapter.receive, and CpsBodyExecution$SuccessAdapter.receive that just log exceptions thrown by the callbacks, and then I would add a test that fails in each method to see what other changes would be needed to prevent the execution's state from being corrupted if an exception is thrown by a BodyExecutionCallback. I am not sure if we can handle this case in general though: I think that for some steps (maybe parallel), the callbacks may need to execute for the step to be able to be cleaned up successfully. I think we should be able to handle failures in onStart though by just immediately aborting the step.

@mrginglymus
Copy link
Author

Great, thanks for the pointers - I realised after raising this that this probably wasn't the right place. I'll move it over and have a play.

@mrginglymus
Copy link
Author

Finally got round to raising jenkinsci/workflow-cps-plugin#422

@mrginglymus mrginglymus closed this Mar 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants