Fix linux-sandbox actions being killed prematurely when using virtual threads#29007
Fix linux-sandbox actions being killed prematurely when using virtual threads#29007jjudd wants to merge 1 commit intobazelbuild:masterfrom
Conversation
|
I didn't add a test with this change becauase I wasn't certain how to test this without the test taking 30+ seconds. Even if folks are open to a 30 second test, that 30 second constant is controlled by the JDK, so the test could be broken by a JDK update that changes it from 30 seconds to some other value. If folks have any idea on how to test this effectively, please let me know and I can add a test for it. |
e4aff0e to
ee9c89e
Compare
|
Nice catch! |
|
Nice catch indeed! Would it possible / does it make sense to add a unit test that asserts the ProcessBuilder is called from a platform thread, even the JavaSubprocessFactory.create is called from virtual thread? |
|
Good idea. I'll go poke around at that today and see what I can come up with. |
… threads This change fixes a bug where linux-sandbox actions that take longer than 30 seconds can be killed prematurely and cause the build to fail. There are two things that contribute to this bug: Part one: When using virtual threads with the JavaSubprocessFactory, the process is forked from the currently running virtual thread's carrier thread. The virtual thread then waits for the subprocess to complete, which causes the carrier thread to detach. If there is no other work for the carrier thread to perform, it becomes idle. If it is idle for long enough it can be killed by the ForkJoinPool it is a member of. The default virtual thread scheduler in JDK 25 uses a ForkJoinPool with an idle TTL of 30 seconds. Part two: The linux-sandbox process sends SIGKILL to the child process when its parent dies. This is setup like so: `prctl(PR_SET_PDEATHSIG, SIGKILL)`. These two things combined cause problems for Bazel builds if you: * Use the linux-sandbox strategy * Use virtual threads via --experimental_async_execution * Have an action which takes longer than 30 seconds to complete If the Bazel server isn't very busy while that 30 second action is running, the parent carrier thread can be killed due to inactivity, which causes the linux-sandbox process to die, which causes the build to fail. This is only an issue if you're using virtual threads. When using platform threads they directly wait on the subprocess to complete. The thread is considered busy while waiting, which prevents it from being killed. This change fixes the issue by always forking from a long-lived platform thread. That way there is no risk of the platform thread being killed prematurely. The fix does so while still enabling the linux-sandbox processes to be properly killed when the Bazel server is killed.
ee9c89e to
a414f28
Compare
|
I added a test to this PR that asserts the process builder is called from a virtual thread and the forking happens from the correct long-lived platform thread. I think this should be ready for folks to take another look. |
This change fixes a bug where
linux-sandboxactions that take longer than 30 seconds can be killed prematurely and cause the build to fail.I created a minimal repro case with instructions for this bug here: https://github.com/lucidsoftware/bazel_virtual_thread_repro
There are two things that contribute to this bug:
Part one:
When using virtual threads with the
JavaSubprocessFactory, the process is forked from the currently running virtual thread's carrier thread. The virtual thread then waits for the subprocess to complete, which causes the carrier thread to detach. If there is no other work for the carrier thread to perform, it becomes idle. If it is idle for long enough it can be killed by theForkJoinPoolit is a member of. The default virtual thread scheduler in JDK 25 uses aForkJoinPoolwith an idle TTL of 30 seconds.Part two:
The
linux-sandboxprocess sendsSIGKILLto the child process when its parent dies. This is setup like so:prctl(PR_SET_PDEATHSIG, SIGKILL).These two things combined cause problems for Bazel builds if you:
linux-sandboxstrategy--experimental_async_executionIf the Bazel server isn't very busy while that 30 second action is running, the parent carrier thread can be killed due to inactivity, which causes the
linux-sandboxprocess to die, which causes the build to fail.This is only an issue if you're using virtual threads. When using platform threads they directly wait on the subprocess to complete. The thread is considered busy while waiting, which prevents it from being killed.
This change fixes the issue by always forking from a long-lived platform thread. That way there is no risk of the platform thread being killed prematurely. The fix does so while still enabling the
linux-sandboxprocesses to be properly killed when the Bazel server is killed.