Replace process_stubs.c with pure OCaml implementation for OCaml 5 multi-domain safety #12757
Conversation
Use run_in_new_pool for filter phases to avoid idle domains during eval execution

The persistent pool (via run_with_pool) kept domain workers alive during eval interpretation and callbacks, causing subprocess/IPC failures in eval tests. Revert to run_in_new_pool, which tears down the pool after each parallel section, matching the pattern that works on development. The sctx.pool and run_with_pool remain available for generators (e.g. genjvm) that use them during code generation. Also restore the EnableParallelAnalyzer guard, since the analyzer was specifically identified as leaking when run in parallel (#12414).

Co-authored-by: Simn <634365+Simn@users.noreply.github.com>
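The teardown-per-section pattern this commit describes can be sketched with raw OCaml 5 domains. This is a minimal illustration, not the compiler's actual Parallel module; the helper name run_in_new_section is hypothetical:

```ocaml
(* Sketch of the run_in_new_pool idea: spawn worker domains for one
   parallel section, then join them all before returning, so no idle
   domain worker survives into eval interpretation. The name
   run_in_new_section is illustrative only. *)
let run_in_new_section (tasks : (unit -> 'a) list) : 'a list =
  tasks
  |> List.map Domain.spawn   (* start one worker domain per task *)
  |> List.map Domain.join    (* teardown: join every worker before returning *)

let () =
  let results = run_in_new_section [ (fun () -> 1 + 1); (fun () -> 2 * 3) ] in
  assert (results = [2; 6])
```

A persistent pool (the run_with_pool variant) would instead keep the spawned domains alive between sections, which is exactly what this commit avoids during eval.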
You can't be serious...
@copilot Let me set a few constraints here:
Two changes to fix eval subprocess failures when domainslib pool domains are alive: 1. Add caml_enter_blocking_section/caml_leave_blocking_section to all blocking process operations (read, write, waitpid, WaitForSingleObject, ReadFile, WriteFile). Without these, the calling domain holds its runtime lock during blocking I/O, preventing stop-the-world GC from proceeding when worker domains need it. This causes GC stalls that lead to pipe I/O corruption and timing issues. 2. Replace fork()+exec() with posix_spawn() on Unix. OCaml 5's Unix.fork explicitly refuses to fork when domains are spawned (caml_domain_is_multicore check). process_stubs.c bypassed this safety check by calling fork() directly. In multi-threaded programs, fork() can deadlock when worker threads hold internal locks (malloc arenas, glibc internals). posix_spawn() avoids this by using clone(CLONE_VM|CLONE_VFORK) internally on Linux. Co-authored-by: Simn <634365+Simn@users.noreply.github.com>
Reverted filters.ml to the base branch state.

Two issues found:
Fix in commit b34f405:
@copilot This makes me wonder if we can just get rid of the C stubs entirely:

module HaxeProcess = struct
type t = {
pid : int;
pin : out_channel;
pout : in_channel;
perr : in_channel;
mutable stop_stdin : bool;
}
let create comm cmd =
let (child_stdin_r, child_stdin_w) = Unix.pipe ~cloexec:true () in
let (child_stdout_r, child_stdout_w) = Unix.pipe ~cloexec:true () in
let (child_stderr_r, child_stderr_w) = Unix.pipe ~cloexec:true () in
let shell, args =
if Sys.win32 then
"cmd.exe", [|"cmd.exe"; "/c"; cmd|]
else
"/bin/sh", [|"/bin/sh"; "-c"; cmd|]
in
let pid = Unix.create_process_env shell args (Unix.environment()) child_stdin_r child_stdout_w child_stderr_w in
Unix.close child_stdin_r;
Unix.close child_stdout_w;
Unix.close child_stderr_w;
let pin = Unix.out_channel_of_descr child_stdin_w in
let pout = Unix.in_channel_of_descr child_stdout_r in
let perr = Unix.in_channel_of_descr child_stderr_r in
{
pid;
pin;
pout;
perr;
stop_stdin = false;
}
let connect_stdin proc stdin_pipe =
let stdin_fd = Unix.descr_of_in_channel stdin_pipe in
Thread.create (fun () ->
let buf = Bytes.create 1024 in
(try while not proc.stop_stdin do
let readable, _, _ = Unix.select [stdin_fd] [] [] 0.05 in
if readable <> [] then begin
let i = Unix.read stdin_fd buf 0 1024 in
if i = 0 then raise Exit;
output proc.pin buf 0 i;
flush proc.pin
end
done with _ -> ());
close_out_noerr proc.pin
) ()
let register_stdout proc f =
let buf = Bytes.create 1024 in
Thread.create (fun() -> read_content proc.pout buf f) ()
let register_stderr proc f =
let buf = Bytes.create 1024 in
Thread.create (fun() -> read_content proc.perr buf f) ()
end
(** Runs a shell command in server mode, forwarding stdin from the client
and capturing stdout/stderr through the socket protocol.
Uses [Unix.create_process_env] (not [Sys.command]) so we can connect
the child's stdin to the client's forwarded data and properly signal
EOF when the client closes its end. *)
let run_command comm cmd =
let open HaxeProcess in
let {pid; pin; pout; perr;} as proc = HaxeProcess.create comm cmd in
let tin = match comm.stdin with
| Some stdin_pipe ->
Some (HaxeProcess.connect_stdin proc stdin_pipe)
| None ->
close_out_noerr pin;
None
in
let tout = HaxeProcess.register_stdout proc comm.write_out in
let terr = HaxeProcess.register_stderr proc comm.write_err in
Thread.join tout;
Thread.join terr;
close_in_noerr pout;
close_in_noerr perr;
let _, status = Unix.waitpid [] pid in
proc.stop_stdin <- true;
(match tin with Some t -> Thread.join t | None -> ());
	match status with Unix.WEXITED c | Unix.WSIGNALED c | Unix.WSTOPPED c -> c

Let's try to generalize this to the point where we can use it for both. Please see if that makes sense!
Replace process_stubs.c with pure OCaml implementation

Remove the C stubs entirely and implement Process using Unix.create_process, which is domain-safe under OCaml 5. This also refactors PipeThings.run_command to use Process.run instead of duplicating the pipe+spawn logic.

The old C stubs had two issues in OCaml 5:
1. Missing caml_enter_blocking_section around blocking I/O
2. Direct fork(), which is unsafe with multiple domains

Unix.create_process handles both correctly (it uses posix_spawn internally on modern systems and properly releases the runtime lock).

Co-authored-by: Simn <634365+Simn@users.noreply.github.com>
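As a hedged sketch of the approach (assuming a Unix system with /bin/sh; this is not the PR's actual process.ml), spawning via Unix.create_process never calls fork() from OCaml code:

```ocaml
(* Sketch: capture a child's stdout via Unix.create_process plus a pipe.
   Unlike raw fork()+exec(), create_process is safe to call while other
   domains are running. Assumes a Unix system with /bin/sh available. *)
let () =
  let r, w = Unix.pipe ~cloexec:true () in
  let pid =
    Unix.create_process "/bin/sh" [| "/bin/sh"; "-c"; "echo hi" |]
      Unix.stdin w Unix.stderr
  in
  Unix.close w;                        (* parent keeps only the read end *)
  let ic = Unix.in_channel_of_descr r in
  let line = input_line ic in
  close_in ic;
  let _, status = Unix.waitpid [] pid in
  assert (line = "hi");
  assert (status = Unix.WEXITED 0)
```

The same pipe wiring generalizes to stdin and stderr, which is how both the eval Process module and PipeThings.run_command can share one implementation.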
Fix error handling: convert Unix.Unix_error to Failure for eval compatibility

The old C stubs used caml_failwith for all errors, which eval catches as Dynamic. Unix.create_process raises Unix.Unix_error instead, which eval couldn't catch. Convert all Unix errors to Failure to preserve backward compatibility.

Co-authored-by: Simn <634365+Simn@users.noreply.github.com>
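The conversion can be sketched as a small wrapper. The helper names unix_error_msg and wrap_unix are illustrative; the PR's actual formatting may differ:

```ocaml
(* Sketch: eval catches Failure (as Dynamic) but not Unix.Unix_error,
   so convert while keeping the function/arg/message details.
   Helper names are illustrative, not necessarily the PR's exact code. *)
let unix_error_msg err fn arg =
  Printf.sprintf "%s(%s): %s" fn arg (Unix.error_message err)

let wrap_unix f =
  try f () with
  | Unix.Unix_error (err, fn, arg) -> failwith (unix_error_msg err fn arg)

let () =
  match wrap_unix (fun () -> Unix.openfile "/nonexistent-path-xyz" [ Unix.O_RDONLY ] 0) with
  | _ -> assert false          (* the open must not succeed *)
  | exception Failure msg ->
    assert (String.length msg > 0)   (* details preserved in the message *)
```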
Implemented in 2dc8ae7 and 6c8f0c0. Replaced the C stubs with a pure OCaml implementation.

Changes:
One thing I had to handle:

Full RunCi macro tests pass (unit, misc, display, sourcemaps, null-safety, sys, connect_stdin, threads, third-party).
This is a nice Claude redemption arc after that initial failure. Getting rid of that smelly old C code is really good, and if it solves parallelism problems that's even better. I'm not convinced that it's really the root cause, but I did also notice the relationship to processes in particular.
@copilot The last CI failure is now the PHP tests on Mac. The offending code is this:

static public function commandResult(cmd:String, args:Array<String>):{
stdout:String,
stderr:String,
exitCode:Int
} {
final p = new Process(cmd, args);
final out = {
stdout: p.stdout.readAll().toString(),
stderr: p.stderr.readAll().toString(),
exitCode: p.exitCode()
	};
p.close();
return out;
}
final phpCmd = commandResult("php", ["-v"]);
We're supposed to catch this in evalStdLib:3417 via
Handle non-existent commands in Process.run without throwing

When Unix.create_process fails (e.g. command not found), return a process with pre-closed pipes (reads return EOF) and exit code 127, matching the old fork+exec behavior, where fork always succeeded and exec failure manifested as a non-zero exit code. Add a misc test to verify non-existent commands don't throw exceptions.

Co-authored-by: Simn <634365+Simn@users.noreply.github.com>
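The fallback can be sketched like this. run_or_127 is a hypothetical helper, not the PR's actual Process.run; it only shows the exit-code contract:

```ocaml
(* Sketch: if spawning fails, report exit code 127 (the shell's
   "command not found" code) instead of raising, mirroring the old
   fork+exec behavior where exec failure became a child exit status.
   run_or_127 is an illustrative name. *)
let run_or_127 prog args =
  match Unix.create_process prog args Unix.stdin Unix.stdout Unix.stderr with
  | pid ->
    let _, status = Unix.waitpid [] pid in
    (match status with
     | Unix.WEXITED c | Unix.WSIGNALED c | Unix.WSTOPPED c -> c)
  | exception Unix.Unix_error _ -> 127

let () = assert (run_or_127 "/no/such/command-xyz" [| "/no/such/command-xyz" |] = 127)
```

Either failure mode (an exec'd child exiting 127, or create_process itself raising) yields the same observable result, so callers like the PHP test above get a plain non-zero exit code rather than an uncatchable exception.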
Assert output in process-nonexistent misc test, guard kill against pid=0

The test now prints stdout_empty, stderr_empty, and exit_code values and asserts them via compile.hxml.stdout (expected: empty pipes, code 127). Also add a pid > 0 guard in Process.kill as defense-in-depth against sending signals to process group 0 for failed process objects.

Co-authored-by: Simn <634365+Simn@users.noreply.github.com>
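The guard itself is tiny; safe_kill is an illustrative name, not the PR's actual Process.kill:

```ocaml
(* Sketch: Unix.kill with pid 0 signals the caller's entire process
   group, so a failed process object (pid = 0) must never be signaled.
   safe_kill is an illustrative name. *)
let safe_kill pid signal =
  if pid > 0 then Unix.kill pid signal

let () =
  safe_kill 0 Sys.sigterm   (* no-op: the guard prevents signaling group 0 *)
```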
Preserve Unix.Unix_error info in all Process operations

- run: writes the error message to the stderr pipe so callers can read it
- read_stdout/read_stderr/write_stdin/close_stdin: include the Unix error details (function, arg, message) in Failure exceptions
- Add a unix_error_msg helper shared by all operations
- Test checks that stderr is non-empty (not a specific message string) for cross-platform compatibility

Co-authored-by: Simn <634365+Simn@users.noreply.github.com>
Squashed commit history:

* why...
* add sctx.pool
* Fix pool usage, enable parallel analyzer, fix data races
  - Add Parallel.run_with_pool to wrap operations in Task.run (fixes Unhandled(Wait) exception from missing effect handler)
  - Update all call sites to use run_with_pool instead of bare Lazy.force
  - Enable parallel analyzer by default (remove opt-in gate)
  - Make apply_params_stack domain-local via Domain.DLS (fixes data race)
  - Make alloc_mid use Atomic (consistency with alloc_var)
  - Protect DCE CfMaybeUsed writes with field_marker_mutex (fixes flag race)
* move apply_params_stack to uctx
* investigate if we're actually looking at the problem
* revert
* focus
* focus more
* Replace process_stubs.c with pure OCaml implementation for OCaml 5 multi-domain safety (#12757)
  * Initial plan
  * Use run_in_new_pool for filter phases to avoid idle domains during eval execution
  * Revert filters.ml to base branch state
  * Make process_stubs.c safe for OCaml 5 multi-domain programs
  * Replace process_stubs.c with pure OCaml implementation
  * Fix error handling: convert Unix.Unix_error to Failure for eval compatibility
  * catch Unix_error in callers instead
  * don't turn exit status into exceptions
  * make Eof extend Exception
  * catch constructor exception too
  * back to running everything
  * hmm
  * Handle non-existent commands in Process.run without throwing
  * Assert output in process-nonexistent misc test, guard kill against pid=0
  * Preserve Unix.Unix_error info in all Process operations
  * reformat
  * avoid some code duplication

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Simn <634365+Simn@users.noreply.github.com>
Co-authored-by: Simon Krajewski <simon@haxe.org>
The persistent domainslib pool (run_with_pool from sctx.pool) kept idle domain workers alive during eval interpretation, causing subprocess/IPC failures in CI (connect_stdin hangs, sourcemaps EOF, TestUnicode IPC corruption). Investigation revealed the root cause is in process_stubs.c, not the pool management.

Root Cause

process_stubs.c (the C implementation of eval's sys.io.Process) was written for single-threaded OCaml 4 and is unsafe in OCaml 5's multi-domain context:

GC stop-the-world stall (all platforms): Blocking operations (process_exit via waitpid/WaitForSingleObject, process_stdout_read via read/ReadFile, etc.) ran without caml_enter_blocking_section. The calling domain holds its runtime lock during the entire blocking call, no backup thread runs to handle GC interrupts, and when worker domains need a major GC (stop-the-world), the main domain can't respond; all worker domains stall waiting for GC, causing pipe I/O timing issues and data corruption.

fork() in a multi-threaded process (Unix): process_run called fork() directly, bypassing OCaml 5's Unix.fork, which explicitly refuses to fork in multi-domain programs (the caml_domain_is_multicore() check in OCaml 5.3's fork.c). On Linux, fork() triggers glibc's pthread_atfork handlers, which lock malloc arenas; if a worker thread holds any internal lock at that moment, fork deadlocks (this explains the connect_stdin 20-minute hang with orphan processes).

Changes
Replaces process_stubs.c (685 lines of C) entirely with a pure OCaml process.ml (~130 lines) using Unix.create_process, which is domain-safe under OCaml 5.

libs/extc/process.ml: Rewritten from C external declarations to a pure OCaml implementation using Unix.create_process, Unix.pipe, Unix.read/Unix.write, Unix.waitpid, Unix.kill. Maintains the same API (run, read_stdout, read_stderr, write_stdin, close_stdin, exit, pid, close, kill). All Unix.Unix_error information is preserved: I/O operations include the error details (function, arg, message) in Failure exceptions for eval compatibility, and run writes the error message to the stderr pipe when process creation fails (e.g. command not found), returning exit code 127 instead of throwing, matching the old fork+exec behavior. The kill function includes a pid > 0 guard as defense-in-depth against signaling process group 0.

libs/extc/process_stubs.c: Deleted.

libs/extc/dune: Removed foreign_stubs for process_stubs, added the unix library dependency.

src/compiler/pipeThings.ml: Refactored run_command to use Process.run instead of duplicating the pipe+spawn logic.

tests/misc/projects/process-nonexistent/: New misc test verifying that non-existent commands don't throw exceptions, produce empty stdout, non-empty stderr (with error info), and exit code 127, with cross-platform output assertion via the .stdout file.

evalStdLib.ml / evalValue.ml / Process.hx: No changes; the Process module API is unchanged.

filters.ml: No changes; run_with_pool and the parallel analyzer are kept as-is, and the Parallel.enable := false removal from reset is preserved from the base branch.