Fix linux-sandbox actions being killed prematurely when using virtual threads #29007

Open
jjudd wants to merge 1 commit into bazelbuild:master from lucidsoftware:jjudd-virtual-thread-linux-sandbox-fix

@jjudd jjudd commented Mar 16, 2026

This change fixes a bug where linux-sandbox actions that take longer than 30 seconds can be killed prematurely and cause the build to fail.

I created a minimal repro case with instructions for this bug here: https://github.com/lucidsoftware/bazel_virtual_thread_repro

There are two things that contribute to this bug:

Part one:
When using virtual threads with the JavaSubprocessFactory, the process is forked from the currently running virtual thread's carrier thread. The virtual thread then waits for the subprocess to complete, which causes the carrier thread to detach. If there is no other work for the carrier thread to perform, it becomes idle. If it is idle for long enough it can be killed by the ForkJoinPool it is a member of. The default virtual thread scheduler in JDK 25 uses a ForkJoinPool with an idle TTL of 30 seconds.

Part two:
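The linux-sandbox process asks the kernel to send it SIGKILL when its parent dies. This is set up like so: prctl(PR_SET_PDEATHSIG, SIGKILL). On Linux, the "parent" for this purpose is the thread that created the process, not the parent process as a whole, so the signal fires as soon as the forking thread exits.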

These two things combined cause problems for Bazel builds if you:

  • Use the linux-sandbox strategy
  • Use virtual threads via --experimental_async_execution
  • Have an action which takes longer than 30 seconds to complete

If the Bazel server isn't very busy while such an action is running, the carrier thread that forked it can be killed due to inactivity, which causes the linux-sandbox process to die, which causes the build to fail.

This is only an issue if you're using virtual threads. With platform threads, the forking thread itself waits for the subprocess to complete. The thread is considered busy while waiting, which prevents it from being killed.

This change fixes the issue by always forking from a long-lived platform thread. That way there is no risk of the platform thread being killed prematurely. The fix does so while still enabling the linux-sandbox processes to be properly killed when the Bazel server is killed.
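
As a rough illustration of the idea (not the code in this PR), forking from a long-lived platform thread could look something like the sketch below. The class name `PlatformThreadForker`, the thread name `subprocess-forker`, and the `lastForkingThread` field are all hypothetical and exist only for this example. It assumes a Linux-like system where the `true` command is available.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical helper: starts every subprocess from one long-lived platform thread. */
public final class PlatformThreadForker {
  // Hypothetical thread name, used only for illustration.
  public static final String FORKER_THREAD_NAME = "subprocess-forker";

  // Records the thread that most recently called ProcessBuilder.start().
  public static final AtomicReference<Thread> lastForkingThread = new AtomicReference<>();

  // The single worker of this executor has no idle timeout, so the thread that
  // forks a child does not exit while the child is still running.
  private static final ExecutorService FORKER =
      Executors.newSingleThreadExecutor(
          runnable -> {
            Thread t = new Thread(runnable, FORKER_THREAD_NAME);
            t.setDaemon(true); // do not block JVM shutdown
            return t;
          });

  /** Starts the process on the forker thread; the caller may be any kind of thread. */
  public static Process start(List<String> argv) throws IOException, InterruptedException {
    try {
      return FORKER
          .submit(
              () -> {
                lastForkingThread.set(Thread.currentThread());
                return new ProcessBuilder(argv).start();
              })
          .get();
    } catch (ExecutionException e) {
      if (e.getCause() instanceof IOException) {
        throw (IOException) e.getCause();
      }
      throw new IllegalStateException(e.getCause());
    }
  }

  public static void main(String[] args) throws Exception {
    Process p = start(List.of("true"));
    System.out.println("exit code: " + p.waitFor());
    System.out.println("forked from: " + lastForkingThread.get().getName());
  }
}
```

In this sketch the forking thread lives as long as the JVM, so a child that set PR_SET_PDEATHSIG would only receive the signal when the server process itself goes away, which is the behavior described above.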

@github-actions github-actions bot added the awaiting-review PR is awaiting review from an assigned reviewer label Mar 16, 2026

jjudd commented Mar 16, 2026

I didn't add a test with this change because I wasn't certain how to test this without the test taking 30+ seconds. Even if folks are open to a 30-second test, that 30-second constant is controlled by the JDK, so the test could be broken by a JDK update that changes it from 30 seconds to some other value.

If folks have any idea on how to test this effectively, please let me know and I can add a test for it.

@jjudd jjudd force-pushed the jjudd-virtual-thread-linux-sandbox-fix branch from e4aff0e to ee9c89e Compare March 16, 2026 07:11
@meisterT meisterT requested a review from coeuvre March 16, 2026 13:46

fmeum commented Mar 16, 2026

Nice catch!


coeuvre commented Mar 17, 2026

Nice catch indeed!

Would it be possible, and does it make sense, to add a unit test that asserts that the ProcessBuilder is called from a platform thread even when JavaSubprocessFactory.create is called from a virtual thread?
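
A check along these lines might be sketched as follows. This is only an illustration under stated assumptions: `ForkThreadCheck`, `createAndReportForkingThread`, and the `forker` thread are hypothetical stand-ins rather than Bazel code, and the sketch needs Java 21 or newer for the virtual-thread API and a system where the `true` command exists.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical check, not Bazel code. Needs Java 21+ for the virtual-thread API. */
public final class ForkThreadCheck {
  // Stand-in for the long-lived platform thread that performs the fork.
  private static final ExecutorService FORKER =
      Executors.newSingleThreadExecutor(
          runnable -> {
            Thread t = new Thread(runnable, "forker");
            t.setDaemon(true);
            return t;
          });

  /**
   * Stand-in for the factory's create method: hands ProcessBuilder.start() to FORKER
   * and returns the thread that actually ran it.
   */
  public static Thread createAndReportForkingThread() throws Exception {
    return FORKER
        .submit(
            () -> {
              new ProcessBuilder("true").start();
              return Thread.currentThread();
            })
        .get();
  }

  /** Calls the stand-in from a virtual thread; returns {callerThread, forkingThread}. */
  public static Thread[] observeThreads() throws Exception {
    Thread[] observed = new Thread[2];
    Throwable[] failure = new Throwable[1];
    Thread caller =
        Thread.ofVirtual()
            .start(
                () -> {
                  observed[0] = Thread.currentThread();
                  try {
                    observed[1] = createAndReportForkingThread();
                  } catch (Exception e) {
                    failure[0] = e;
                  }
                });
    caller.join(); // join() makes the writes above visible to this thread
    if (failure[0] != null) {
      throw new IllegalStateException(failure[0]);
    }
    return observed;
  }

  public static void main(String[] args) throws Exception {
    Thread[] observed = observeThreads();
    if (!observed[0].isVirtual()) {
      throw new AssertionError("caller should be a virtual thread");
    }
    if (observed[1].isVirtual()) {
      throw new AssertionError("fork should happen on a platform thread");
    }
    System.out.println("fork ran on platform thread: " + observed[1].getName());
  }
}
```

Such a check avoids waiting for the 30-second idle timeout, because it only asserts which thread performs the fork, not that the thread survives.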


jjudd commented Mar 17, 2026

Good idea. I'll go poke around at that today and see what I can come up with.

@iancha1992 iancha1992 added the team-Remote-Exec Issues and PRs for the Execution (Remote) team label Mar 17, 2026

jjudd commented Mar 18, 2026

I added a test to this PR that asserts the process builder is called from a virtual thread and the forking happens from the correct long-lived platform thread.

I think this should be ready for folks to take another look.
