Fix ContainerExecProc joinWithTimeout deadlock #2789

Open
cronik wants to merge 1 commit into jenkinsci:master from cronik:bugfix/container-exec-proc-join-timeout

@cronik cronik commented Jan 7, 2026

This change updates the ContainerExecProc#kill method to force the finished countdown latch to decrement. On some high-load clusters, the joinWithTimeout timeout has been observed to elapse while the proc remains blocked.

joinWithTimeout calls the kill method if the task does not complete within the timeout.

https://github.com/jenkinsci/jenkins/blob/368f1ccbc967a85c0ff801f3729cb77a269afd41/core/src/main/java/hudson/Proc.java#L165

But if kill fails to count down the finished latch, the join method continues to wait indefinitely.

Forcing finished.countDown() after close should unblock join even if the Ctrl-C command did not trigger the exec listener. countDown is a no-op if the latch is already zero.
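
To illustrate the mechanism, here is a minimal sketch, not the plugin's actual code. `SketchProc` is a hypothetical stand-in for ContainerExecProc, and the `joinWithTimeout` helper is a simplified approximation of the watchdog pattern in hudson.Proc; both are assumptions made for illustration. The `forceCountDown` flag switches between a kill that never releases the latch and a kill that always does.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class LatchProcSketch {

    /** Hypothetical stand-in for ContainerExecProc; not the plugin's real class. */
    static final class SketchProc {
        private final CountDownLatch finished = new CountDownLatch(1);
        private final boolean forceCountDown;
        private volatile int exitCode = -1;

        SketchProc(boolean forceCountDown) {
            this.forceCountDown = forceCountDown;
        }

        /** Simulates the exec listener firing when the remote process exits. */
        void onExit(int code) {
            exitCode = code;
            finished.countDown();
        }

        /** Simulates a kill whose interrupt and close never reach the exec listener. */
        void kill() {
            if (forceCountDown) {
                // The proposed fix: always release the latch so join() can return.
                finished.countDown();
            }
        }

        int join() throws InterruptedException {
            finished.await(); // blocks until the latch reaches zero
            return exitCode;
        }
    }

    /** Simplified watchdog: kill the proc if join has not returned within the timeout. */
    static int joinWithTimeout(SketchProc proc, long timeout, TimeUnit unit)
            throws InterruptedException {
        CountDownLatch joined = new CountDownLatch(1);
        ExecutorService watchdog = Executors.newSingleThreadExecutor();
        try {
            watchdog.execute(() -> {
                try {
                    if (!joined.await(timeout, unit)) {
                        proc.kill();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            return proc.join();
        } finally {
            joined.countDown();     // tell the watchdog that join has returned
            watchdog.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        int code = joinWithTimeout(new SketchProc(true), 100, TimeUnit.MILLISECONDS);
        System.out.println("join returned " + code);
    }
}
```

With `new SketchProc(false)` the same call would block until the calling thread is interrupted, which matches the state in which the reported thread is parked in `CountDownLatch.await`.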

Fixes #2683

Testing done

Submitter checklist

  • Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or Jira
  • Link to relevant pull requests, esp. upstream and downstream changes
  • Ensure you have provided tests that demonstrate the feature works or the issue is fixed

@cronik cronik requested a review from a team as a code owner January 7, 2026 02:38

cronik commented Jan 10, 2026

Thread dump of deadlock

"org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution [#536]" Id=362583 Group=main WAITING on java.util.concurrent.CountDownLatch$Sync@3621b128
    at java.base@17.0.17-internal/jdk.internal.misc.Unsafe.park(Native Method)
    -  waiting on java.util.concurrent.CountDownLatch$Sync@3621b128
    at java.base@17.0.17-internal/java.util.concurrent.locks.LockSupport.park(LockSupport.java:211)
    at java.base@17.0.17-internal/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:715)
    at java.base@17.0.17-internal/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1047)
    at java.base@17.0.17-internal/java.util.concurrent.CountDownLatch.await(CountDownLatch.java:230)
    at PluginClassLoader for kubernetes//org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecProc.join(ContainerExecProc.java:100)
    at hudson.Proc.joinWithTimeout(Proc.java:172)
    at PluginClassLoader for kubernetes//org.csanchez.jenkins.plugins.kubernetes.pipeline.EphemeralContainerStepExecution.setDefaultRunAsUser(EphemeralContainerStepExecution.java:428)
    at PluginClassLoader for kubernetes//org.csanchez.jenkins.plugins.kubernetes.pipeline.EphemeralContainerStepExecution.startEphemeralContainer(EphemeralContainerStepExecution.java:184)
    at PluginClassLoader for kubernetes//org.csanchez.jenkins.plugins.kubernetes.pipeline.EphemeralContainerStepExecution.startEphemeralContainerWithRetry(EphemeralContainerStepExecution.java:112)
    at PluginClassLoader for kubernetes//org.csanchez.jenkins.plugins.kubernetes.pipeline.EphemeralContainerStepExecution$$Lambda$2125/0x0000000801d38b50.run(Unknown Source)
    at PluginClassLoader for workflow-step-api//org.jenkinsci.plugins.workflow.steps.GeneralNonBlockingStepExecution.lambda$run$0(GeneralNonBlockingStepExecution.java:77)
    at PluginClassLoader for workflow-step-api//org.jenkinsci.plugins.workflow.steps.GeneralNonBlockingStepExecution$$Lambda$2121/0x00000008013ae3b0.run(Unknown Source)
    at java.base@17.0.17-internal/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
    at java.base@17.0.17-internal/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base@17.0.17-internal/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base@17.0.17-internal/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base@17.0.17-internal/java.lang.Thread.run(Thread.java:840)

    Number of locked synchronizers = 1
    - java.util.concurrent.ThreadPoolExecutor$Worker@53a0676c


cronik commented Jan 13, 2026

I noticed that #1538 attempted to solve the same deadlock I observed. This PR takes a different approach to the problem.

This change updates the `ContainerExecProc#kill` method
to force the `finished` countdown latch to decrement. On
some high-load clusters, the `joinWithTimeout` timeout has
been observed to elapse while the proc remains blocked.

When `joinWithTimeout` is called, the `kill` method is called if the
task does not complete in time.

https://github.com/jenkinsci/jenkins/blob/368f1ccbc967a85c0ff801f3729cb77a269afd41/core/src/main/java/hudson/Proc.java#L165

But if `kill` fails to trigger the `finished` countdown latch then the
`join` method will continue to wait indefinitely.

https://github.com/jenkinsci/kubernetes-plugin/blob/676ab933d12ad8b25e4d7f78594a32066aad2569/src/main/java/org/csanchez/jenkins/plugins/kubernetes/pipeline/ContainerExecProc.java#L100

Forcing `finished.countDown()` after `close` should unblock the join even if the Ctrl-C command did not trigger the exec listener. `countDown` is a no-op if the latch is already zero.
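
The no-op claim can be checked with a small standalone sketch. The class and method names below are hypothetical and exist only for this illustration; the only real API used is `java.util.concurrent.CountDownLatch`.

```java
import java.util.concurrent.CountDownLatch;

final class CountDownIdempotenceSketch {

    /** Calls countDown() the given number of times on a latch of 1 and returns the remaining count. */
    static long countAfter(int calls) {
        CountDownLatch latch = new CountDownLatch(1);
        for (int i = 0; i < calls; i++) {
            latch.countDown(); // extra calls after the count reaches zero do nothing
        }
        return latch.getCount();
    }

    public static void main(String[] args) {
        System.out.println("count after 1 call:  " + countAfter(1));
        System.out.println("count after 3 calls: " + countAfter(3));
    }
}
```

Because the count never goes below zero and extra calls do not throw, calling `countDown` from both the exec listener and `kill` should be safe in either order.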
@cronik cronik force-pushed the bugfix/container-exec-proc-join-timeout branch from be5a61b to 7c90ef4 Compare March 22, 2026 21:30
@cronik cronik changed the title from "Fix ContainerExecProc joinWithTimeout" to "Fix ContainerExecProc joinWithTimeout deadlock" Mar 22, 2026
Development

Successfully merging this pull request may close these issues.

[JENKINS-72792] Builds hang sometimes for withMaven and ssh-agent steps execution