Replace process_stubs.c with pure OCaml implementation for OCaml 5 multi-domain safety #12757

Merged
Simn merged 17 commits into why-is-parallelism-disabled from copilot/sub-pr-12747-again
Mar 7, 2026

Contributor

Copilot AI commented Mar 6, 2026

The persistent domainslib pool (run_with_pool from sctx.pool) kept idle domain workers alive during eval interpretation, causing subprocess/IPC failures in CI (connect_stdin hangs, sourcemaps EOF, TestUnicode IPC corruption). Investigation revealed the root cause is in process_stubs.c, not the pool management.

Root Cause

process_stubs.c (the C implementation of eval's sys.io.Process) was written for single-threaded OCaml 4 and is unsafe in OCaml 5's multi-domain context:

  1. GC stop-the-world stall (all platforms): Blocking operations (process_exit → waitpid/WaitForSingleObject, process_stdout_read → read/ReadFile, etc.) ran without caml_enter_blocking_section. The calling domain holds its runtime lock during the entire blocking call, no backup thread runs to handle GC interrupts, and when worker domains need a major GC (stop-the-world), the main domain can't respond. All worker domains then stall waiting for GC, causing pipe I/O timing issues and data corruption.

  2. fork() in multi-threaded process (Unix): process_run called fork() directly, bypassing OCaml 5's Unix.fork which explicitly refuses to fork in multi-domain programs (caml_domain_is_multicore() check in OCaml 5.3's fork.c). On Linux, fork() triggers glibc's pthread_atfork handlers which lock malloc arenas — if a worker thread holds any internal lock at that moment, fork deadlocks (explains the connect_stdin 20min hang with orphan processes).

Changes

Replaces process_stubs.c (685 lines of C) entirely with a pure OCaml process.ml (~130 lines) using Unix.create_process, which is domain-safe under OCaml 5.

  • libs/extc/process.ml: Rewritten from C external declarations to a pure OCaml implementation using Unix.create_process, Unix.pipe, Unix.read/Unix.write, Unix.waitpid, Unix.kill. Maintains the same API (run, read_stdout, read_stderr, write_stdin, close_stdin, exit, pid, close, kill). All Unix.Unix_error information is preserved: I/O operations include the error details (function, arg, message) in Failure exceptions for eval compatibility, and run writes the error message to the stderr pipe when process creation fails (e.g. command not found), returning exit code 127 instead of throwing — matching the old fork+exec behavior. The kill function includes a pid > 0 guard as defense-in-depth against signaling process group 0.
  • libs/extc/process_stubs.c: Deleted.
  • libs/extc/dune: Removed foreign_stubs for process_stubs, added unix library dependency.
  • src/compiler/pipeThings.ml: Refactored run_command to use Process.run instead of duplicating pipe+spawn logic.
  • tests/misc/projects/process-nonexistent/: New misc test verifying that non-existent commands don't throw exceptions, produce empty stdout, non-empty stderr (with error info), and exit code 127 — with cross-platform output assertion via .stdout file.
  • evalStdLib.ml/evalValue.ml/Process.hx: No changes — the Process module API is unchanged.
  • filters.ml: No changes — run_with_pool and parallel analyzer are kept as-is.
  • Data race fixes and the Parallel.enable := false removal from reset are preserved from the base branch.
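
For readers who have not seen the pattern, the general shape of a pipe-based process wrapper on top of Unix.create_process can be sketched as follows. This is only an illustration under stated assumptions, not the process.ml from this PR: the record type and the names run, read_all and exit_code are hypothetical, error handling is omitted, and it assumes a Unix-like system with /bin/sh.

```ocaml
(* Requires the unix library. Illustrative sketch only; names are hypothetical. *)

type proc = {
  pid : int;
  stdin_w : Unix.file_descr;   (* parent writes here -> child's stdin *)
  stdout_r : Unix.file_descr;  (* parent reads the child's stdout here *)
  stderr_r : Unix.file_descr;  (* parent reads the child's stderr here *)
}

(* Spawn [cmd] with three fresh pipes wired to its standard streams. *)
let run cmd args =
  let in_r, in_w = Unix.pipe ~cloexec:true () in
  let out_r, out_w = Unix.pipe ~cloexec:true () in
  let err_r, err_w = Unix.pipe ~cloexec:true () in
  let argv = Array.of_list (cmd :: args) in
  let pid = Unix.create_process cmd argv in_r out_w err_w in
  (* The parent must close the child's ends, or reads never see EOF. *)
  Unix.close in_r;
  Unix.close out_w;
  Unix.close err_w;
  { pid; stdin_w = in_w; stdout_r = out_r; stderr_r = err_r }

(* Read a descriptor until EOF and return everything as a string. *)
let read_all fd =
  let buf = Bytes.create 4096 in
  let acc = Buffer.create 4096 in
  let rec loop () =
    match Unix.read fd buf 0 (Bytes.length buf) with
    | 0 -> Buffer.contents acc
    | n -> Buffer.add_subbytes acc buf 0 n; loop ()
  in
  loop ()

(* Wait for the child and flatten the status into a single integer. *)
let exit_code p =
  match snd (Unix.waitpid [] p.pid) with
  | Unix.WEXITED c | Unix.WSIGNALED c | Unix.WSTOPPED c -> c

let () =
  let p = run "/bin/sh" ["-c"; "echo hello"] in
  Unix.close p.stdin_w;
  let out = read_all p.stdout_r in
  let err = read_all p.stderr_r in
  let code = exit_code p in
  Unix.close p.stdout_r;
  Unix.close p.stderr_r;
  Printf.printf "stdout=%S stderr=%S exit=%d\n" out err code
```

Draining stdout completely before touching stderr, as the usage above does, can block if the child fills the stderr pipe first. A real implementation would need threads or Unix.select for that case.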


…al execution

The persistent pool (via run_with_pool) kept domain workers alive during eval
interpretation and callbacks, causing subprocess/IPC failures in eval tests.
Revert to run_in_new_pool which tears down the pool after each parallel section,
matching the pattern that works on development. The sctx.pool and run_with_pool
remain available for generators (e.g. genjvm) that use them during code generation.

Also restore the EnableParallelAnalyzer guard since the analyzer was specifically
identified as leaking when run in parallel (#12414).

Co-authored-by: Simn <634365+Simn@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix default behavior to enable parallelism" to "Fix idle domain workers interfering with eval subprocess/IPC operations" Mar 6, 2026
Copilot finished work on behalf of Simn March 6, 2026 15:54
Member

Simn commented Mar 6, 2026

You can't be serious...

Member

Simn commented Mar 6, 2026

@copilot Let me set a few constraints here:

  1. We want to keep using run_with_pool and not create new pools.
  2. The analyzer has been considered safe by the most recent investigation, so we want to keep parallelism enabled for it.
  3. Your actual task is to investigate WHY there are problems because there shouldn't be any. Please analyze properly what actually might be happening and why it goes wrong.

Copilot AI and others added 2 commits March 6, 2026 17:39
Co-authored-by: Simn <634365+Simn@users.noreply.github.com>
Two changes to fix eval subprocess failures when domainslib pool
domains are alive:

1. Add caml_enter_blocking_section/caml_leave_blocking_section to all
   blocking process operations (read, write, waitpid, WaitForSingleObject,
   ReadFile, WriteFile). Without these, the calling domain holds its
   runtime lock during blocking I/O, preventing stop-the-world GC from
   proceeding when worker domains need it. This causes GC stalls that
   lead to pipe I/O corruption and timing issues.

2. Replace fork()+exec() with posix_spawn() on Unix. OCaml 5's
   Unix.fork explicitly refuses to fork when domains are spawned
   (caml_domain_is_multicore check). process_stubs.c bypassed this
   safety check by calling fork() directly. In multi-threaded programs,
   fork() can deadlock when worker threads hold internal locks (malloc
   arenas, glibc internals). posix_spawn() avoids this by using
   clone(CLONE_VM|CLONE_VFORK) internally on Linux.

Co-authored-by: Simn <634365+Simn@users.noreply.github.com>
Contributor Author

Copilot AI commented Mar 6, 2026

@copilot Let me set a few constraints here:

  1. We want to keep using run_with_pool and not create new pools.
    ...

Reverted filters.ml to keep run_with_pool and parallel analyzer. The root cause is in process_stubs.c, not the pool management.

Two issues found:

  1. GC stop-the-world stall: All blocking process operations (waitpid, ReadFile, read, WaitForSingleObject, WriteFile, write) ran without caml_enter_blocking_section. When worker domains need major GC, they send stop-the-world interrupts to all domains, but the main domain can't respond while blocked in C code → all workers stall waiting for GC.

  2. fork() in multi-domain process: process_run called fork() directly, bypassing OCaml 5's Unix.fork which explicitly refuses to fork when domains are spawned (see caml_domain_is_multicore() check in OCaml 5.3's fork.c). On Linux, fork() triggers glibc's pthread_atfork handlers which lock malloc arenas — if a worker thread holds one → fork deadlock (explains the connect_stdin 20min hang with orphan processes).

Fix in commit b34f405:

  • Added caml_enter_blocking_section/caml_leave_blocking_section around all blocking operations, using temporary C buffers
  • Replaced fork()+exec() with posix_spawn() on Unix (uses clone(CLONE_VM|CLONE_VFORK) internally, avoiding the multi-threaded fork problem)

Copilot AI changed the title from "Fix idle domain workers interfering with eval subprocess/IPC operations" to "Fix process_stubs.c for OCaml 5 multi-domain safety" Mar 6, 2026
Copilot finished work on behalf of Simn March 6, 2026 17:47
Member

Simn commented Mar 6, 2026

@copilot This makes me wonder if we can just get rid of process_stubs.c entirely and have a proper OCaml implementation. There appears to be a lot of overlap with run_command, so I just went ahead and started refactoring that. Allow me to throw some OCaml code at you:

module HaxeProcess = struct
	type t = {
		pid : int;
		pin : out_channel;
		pout : in_channel;
		perr : in_channel;
		mutable stop_stdin : bool;
	}

	let create comm cmd =
		let (child_stdin_r, child_stdin_w) = Unix.pipe ~cloexec:true () in
		let (child_stdout_r, child_stdout_w) = Unix.pipe ~cloexec:true () in
		let (child_stderr_r, child_stderr_w) = Unix.pipe ~cloexec:true () in
		let shell, args =
			if Sys.win32 then
				"cmd.exe", [|"cmd.exe"; "/c"; cmd|]
			else
				"/bin/sh", [|"/bin/sh"; "-c"; cmd|]
		in
		let pid = Unix.create_process_env shell args (Unix.environment()) child_stdin_r child_stdout_w child_stderr_w in
		Unix.close child_stdin_r;
		Unix.close child_stdout_w;
		Unix.close child_stderr_w;
		let pin = Unix.out_channel_of_descr child_stdin_w in
		let pout = Unix.in_channel_of_descr child_stdout_r in
		let perr = Unix.in_channel_of_descr child_stderr_r in
		{
			pid;
			pin;
			pout;
			perr;
			stop_stdin = false;
		}

	let connect_stdin proc stdin_pipe =
		let stdin_fd = Unix.descr_of_in_channel stdin_pipe in
		Thread.create (fun () ->
			let buf = Bytes.create 1024 in
			(try while not proc.stop_stdin do
				let readable, _, _ = Unix.select [stdin_fd] [] [] 0.05 in
				if readable <> [] then begin
					let i = Unix.read stdin_fd buf 0 1024 in
					if i = 0 then raise Exit;
					output proc.pin buf 0 i;
					flush proc.pin
				end
			done with _ -> ());
			close_out_noerr proc.pin
		) ()

	let register_stdout proc f =
		let buf = Bytes.create 1024 in
		Thread.create (fun() -> read_content proc.pout buf f) ()

	let register_stderr proc f =
		let buf = Bytes.create 1024 in
		Thread.create (fun() -> read_content proc.perr buf f) ()
end

(** Runs a shell command in server mode, forwarding stdin from the client
	and capturing stdout/stderr through the socket protocol.
	Uses [Unix.create_process_env] (not [Sys.command]) so we can connect
	the child's stdin to the client's forwarded data and properly signal
	EOF when the client closes its end. *)
let run_command comm cmd =
	let open HaxeProcess in
	let {pid; pin; pout; perr;} as proc = HaxeProcess.create comm cmd in
	let tin = match comm.stdin with
		| Some stdin_pipe ->
			Some (HaxeProcess.connect_stdin proc stdin_pipe)
		| None ->
			close_out_noerr pin;
			None
	in
	let tout = HaxeProcess.register_stdout proc comm.write_out in
	let terr = HaxeProcess.register_stderr proc comm.write_err in
	Thread.join tout;
	Thread.join terr;
	close_in_noerr pout;
	close_in_noerr perr;
	let _, status = Unix.waitpid [] pid in
	proc.stop_stdin <- true;
	(match tin with Some t -> Thread.join t | None -> ());
	match status with Unix.WEXITED c | Unix.WSIGNALED c | Unix.WSTOPPED c -> c

Let's try to generalize this to the point where we can use it for both run_command and as a replacement for eval's StdNativeProcess. Note that this native process is not actually part of the public API, so we are free to remove it entirely. The only public API is sys.io.Process itself, which should not change its interface. We could move its implementation completely into the compiler though.

Please see if that makes sense!

Copilot AI and others added 2 commits March 6, 2026 18:35
Remove the C stubs entirely and implement Process using
Unix.create_process, which is domain-safe under OCaml 5.
This also refactors PipeThings.run_command to use Process.run
instead of duplicating the pipe+spawn logic.

The old C stubs had two issues in OCaml 5:
1. Missing caml_enter_blocking_section around blocking I/O
2. Direct fork() which is unsafe with multiple domains

Unix.create_process handles both correctly (it uses posix_spawn
internally on modern systems and properly releases the runtime lock).

Co-authored-by: Simn <634365+Simn@users.noreply.github.com>
…tibility

The old C stubs used caml_failwith for all errors, which eval catches
as Dynamic. Unix.create_process raises Unix.Unix_error instead, which
eval couldn't catch. Convert all Unix errors to Failure to preserve
backward compatibility.

Co-authored-by: Simn <634365+Simn@users.noreply.github.com>
Contributor Author

Copilot AI commented Mar 6, 2026

@copilot This makes me wonder if we can just get rid of process_stubs.c entirely and have a proper OCaml implementation. There appears to be a lot of overlap with run_command, so I...

Implemented in 2dc8ae7 and 6c8f0c0. Replaced process_stubs.c entirely with a pure OCaml process.ml (~85 lines vs 685 lines of C) using Unix.create_process.

Changes:

  • libs/extc/process.ml: Pure OCaml implementation, same API (run, read_stdout, read_stderr, write_stdin, close_stdin, exit, pid, close, kill)
  • libs/extc/process_stubs.c: Deleted
  • src/compiler/pipeThings.ml: run_command now uses Process.run instead of duplicating the pipe+spawn logic
  • evalStdLib.ml/evalValue.ml/Process.hx: No changes needed — API is unchanged

One thing I had to handle: Unix.create_process raises Unix.Unix_error instead of Failure, but eval's catch(e:Dynamic) only catches Failure. So all Unix.Unix_error exceptions are converted to Failure in the Process module.
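
A rough sketch of what such a conversion can look like is below. The helper name unix_error_msg appears in a later commit message in this PR, but its signature here is a guess, and wrap_unix is a purely hypothetical name. The "function(arg): message" format is modelled on the create_process(php) line in the CI log.

```ocaml
(* Requires the unix library. Sketch only; signatures are assumptions. *)

(* Format a Unix error as "function(arg): message". *)
let unix_error_msg err fn arg =
  Printf.sprintf "%s(%s): %s" fn arg (Unix.error_message err)

(* Run [f] and re-raise any Unix.Unix_error as Failure with the details kept. *)
let wrap_unix f =
  try f () with
  | Unix.Unix_error (err, fn, arg) -> failwith (unix_error_msg err fn arg)

let () =
  (* Opening a path that does not exist makes Unix.openfile raise Unix_error. *)
  match wrap_unix (fun () ->
      Unix.openfile "/nonexistent/dir/file" [Unix.O_RDONLY] 0) with
  | fd -> Unix.close fd; print_endline "unexpectedly opened"
  | exception Failure msg -> print_endline msg
```

Keeping all three components of Unix_error in the Failure string means the caller still sees which call failed and on what argument, even though the exception type is flattened.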

Full RunCi macro tests pass (unit, misc, display, sourcemaps, null-safety, sys, connect_stdin, threads, third-party).

Copilot AI changed the title from "Fix process_stubs.c for OCaml 5 multi-domain safety" to "Replace process_stubs.c with pure OCaml implementation for OCaml 5 multi-domain safety" Mar 6, 2026
Copilot finished work on behalf of Simn March 6, 2026 18:45
Member

Simn commented Mar 6, 2026

This is a nice Claude redemption arc after that initial failure. Getting rid of that smelly old C code is really good, and if it solves parallelism problems that's even better. I'm not convinced that it's really the root cause but I did also notice the relationship to processes in particular.

Member

Simn commented Mar 6, 2026

Also that Eof failure is a pre-existing problem: because the class didn't extend Exception, a catch(e:Eof) wouldn't work after Exception.caught wrapped it in ValueException. It probably rarely ever surfaced because we didn't close our pipes properly before.

Member

Simn commented Mar 6, 2026

@copilot The last CI failure is now the PHP tests on Mac. The offending code is this:

	static public function commandResult(cmd:String, args:Array<String>):{
		stdout:String,
		stderr:String,
		exitCode:Int
	} {
		final p = new Process(cmd, args);
		final out = {
			stdout: p.stdout.readAll().toString(),
			stderr: p.stderr.readAll().toString(),
			exitCode: p.exitCode()
		}
		p.close();
		return out;
	}

final phpCmd = commandResult("php", ["-v"]);

php isn't installed on Mac so this fails, which is expected, but what is unexpected is that it throws an exception:

Going to test: [php]
test php
Changing directory to /Users/runner/work/haxe/haxe/tests/unit
Command: haxelib [git,utest,https://github.com/haxe-utest/utest]
Installing utest from https://github.com/haxe-utest/utest
Cloning Git from https://github.com/haxe-utest/utest
Syncing submodules for Git
Downloading/updating submodules for Git
  Current version is now git
Done
Command exited with 0 in 2s: haxelib [git,utest,https://github.com/haxe-utest/utest]
Error:  RunCi.hx

    | Uncaught exception create_process(php): No such file or directory

Error: Process completed with exit code 1.

We're supposed to catch this in evalStdLib:3417 via let proc = process_catch (fun () -> Process.run cmd args) () in, but apparently we don't. Please see if you can reproduce that and figure out what's going on, then let's add a misc test to test that we don't get exceptions from non-existent commands and can instead test for an exit code != 0.
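
To make the desired behaviour concrete, here is a minimal sketch of "a missing command yields exit code 127 rather than an exception". It is not the code in evalStdLib or process.ml; the outcome type and the names try_spawn and exit_code are hypothetical. Depending on how the runtime spawns the child, a missing executable may surface either as Unix.Unix_error from Unix.create_process or as a child that exits with 127, so the sketch maps both cases to the same result.

```ocaml
(* Requires the unix library. All names below are hypothetical. *)

type outcome =
  | Spawned of int      (* pid of a running child *)
  | Failed of string    (* formatted Unix error; no child exists *)

(* Try to start [cmd], capturing a spawn failure instead of raising. *)
let try_spawn cmd args =
  let argv = Array.of_list (cmd :: args) in
  try Spawned (Unix.create_process cmd argv Unix.stdin Unix.stdout Unix.stderr)
  with Unix.Unix_error (err, fn, arg) ->
    Failed (Printf.sprintf "%s(%s): %s" fn arg (Unix.error_message err))

(* A failed spawn reports 127, the shell convention for "command not found". *)
let exit_code = function
  | Failed _ -> 127
  | Spawned pid ->
    (match snd (Unix.waitpid [] pid) with
     | Unix.WEXITED c | Unix.WSIGNALED c | Unix.WSTOPPED c -> c)

let () =
  let code = exit_code (try_spawn "no-such-command-for-this-sketch" []) in
  Printf.printf "exit code: %d\n" code
```

A caller can then test for a non-zero exit code without wrapping the constructor in a try/catch.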

Copilot AI and others added 3 commits March 6, 2026 21:28
When Unix.create_process fails (e.g. command not found), return a
process with pre-closed pipes (reads return EOF) and exit code 127,
matching the old fork+exec behavior where fork always succeeded and
exec failure manifested as a non-zero exit code.

Add misc test to verify non-existent commands don't throw exceptions.

Co-authored-by: Simn <634365+Simn@users.noreply.github.com>
The test now prints stdout_empty, stderr_empty, and exit_code values
and asserts them via compile.hxml.stdout (expected: empty pipes, code 127).

Also add pid > 0 guard in Process.kill as defense-in-depth against
sending signals to process group 0 for failed process objects.

Co-authored-by: Simn <634365+Simn@users.noreply.github.com>
- run: writes error message to stderr pipe so callers can read it
- read_stdout/read_stderr/write_stdin/close_stdin: include the
  Unix error details (function, arg, message) in Failure exceptions
- Add unix_error_msg helper shared by all operations
- Test checks stderr is non-empty (not a specific message string)
  for cross-platform compatibility

Co-authored-by: Simn <634365+Simn@users.noreply.github.com>
Copilot finished work on behalf of Simn March 6, 2026 21:35
Simn marked this pull request as ready for review March 7, 2026 05:14
Simn merged commit ddf1aaa into why-is-parallelism-disabled Mar 7, 2026
195 of 197 checks passed
Simn deleted the copilot/sub-pr-12747-again branch March 7, 2026 05:55
Simn added a commit that referenced this pull request Mar 7, 2026
* why...

* add sctx.pool

* Fix pool usage, enable parallel analyzer, fix data races

- Add Parallel.run_with_pool to wrap operations in Task.run (fixes
  Unhandled(Wait) exception from missing effect handler)
- Update all call sites to use run_with_pool instead of bare Lazy.force
- Enable parallel analyzer by default (remove opt-in gate)
- Make apply_params_stack domain-local via Domain.DLS (fixes data race)
- Make alloc_mid use Atomic (consistency with alloc_var)
- Protect DCE CfMaybeUsed writes with field_marker_mutex (fixes flag race)

Co-authored-by: Simn <634365+Simn@users.noreply.github.com>

* move apply_params_stack to uctx

* investigate if we're actually looking at the problem

* revert

* focus

* focus more

* Replace process_stubs.c with pure OCaml implementation for OCaml 5 multi-domain safety (#12757)

* Initial plan

* Use run_in_new_pool for filter phases to avoid idle domains during eval execution

The persistent pool (via run_with_pool) kept domain workers alive during eval
interpretation and callbacks, causing subprocess/IPC failures in eval tests.
Revert to run_in_new_pool which tears down the pool after each parallel section,
matching the pattern that works on development. The sctx.pool and run_with_pool
remain available for generators (e.g. genjvm) that use them during code generation.

Also restore the EnableParallelAnalyzer guard since the analyzer was specifically
identified as leaking when run in parallel (#12414).

Co-authored-by: Simn <634365+Simn@users.noreply.github.com>

* Revert filters.ml to base branch state

Co-authored-by: Simn <634365+Simn@users.noreply.github.com>

* Make process_stubs.c safe for OCaml 5 multi-domain programs

Two changes to fix eval subprocess failures when domainslib pool
domains are alive:

1. Add caml_enter_blocking_section/caml_leave_blocking_section to all
   blocking process operations (read, write, waitpid, WaitForSingleObject,
   ReadFile, WriteFile). Without these, the calling domain holds its
   runtime lock during blocking I/O, preventing stop-the-world GC from
   proceeding when worker domains need it. This causes GC stalls that
   lead to pipe I/O corruption and timing issues.

2. Replace fork()+exec() with posix_spawn() on Unix. OCaml 5's
   Unix.fork explicitly refuses to fork when domains are spawned
   (caml_domain_is_multicore check). process_stubs.c bypassed this
   safety check by calling fork() directly. In multi-threaded programs,
   fork() can deadlock when worker threads hold internal locks (malloc
   arenas, glibc internals). posix_spawn() avoids this by using
   clone(CLONE_VM|CLONE_VFORK) internally on Linux.

Co-authored-by: Simn <634365+Simn@users.noreply.github.com>

* Replace process_stubs.c with pure OCaml implementation

Remove the C stubs entirely and implement Process using
Unix.create_process, which is domain-safe under OCaml 5.
This also refactors PipeThings.run_command to use Process.run
instead of duplicating the pipe+spawn logic.

The old C stubs had two issues in OCaml 5:
1. Missing caml_enter_blocking_section around blocking I/O
2. Direct fork() which is unsafe with multiple domains

Unix.create_process handles both correctly (it uses posix_spawn
internally on modern systems and properly releases the runtime lock).

Co-authored-by: Simn <634365+Simn@users.noreply.github.com>

* Fix error handling: convert Unix.Unix_error to Failure for eval compatibility

The old C stubs used caml_failwith for all errors, which eval catches
as Dynamic. Unix.create_process raises Unix.Unix_error instead, which
eval couldn't catch. Convert all Unix errors to Failure to preserve
backward compatibility.

Co-authored-by: Simn <634365+Simn@users.noreply.github.com>

* catch Unix_error in callers instead

* don't turn exit status into exceptions

* make Eof extend Exception

* catch constructor exception too

* back to running everything

* hmm

* Handle non-existent commands in Process.run without throwing

When Unix.create_process fails (e.g. command not found), return a
process with pre-closed pipes (reads return EOF) and exit code 127,
matching the old fork+exec behavior where fork always succeeded and
exec failure manifested as a non-zero exit code.

Add misc test to verify non-existent commands don't throw exceptions.

Co-authored-by: Simn <634365+Simn@users.noreply.github.com>

* Assert output in process-nonexistent misc test, guard kill against pid=0

The test now prints stdout_empty, stderr_empty, and exit_code values
and asserts them via compile.hxml.stdout (expected: empty pipes, code 127).

Also add pid > 0 guard in Process.kill as defense-in-depth against
sending signals to process group 0 for failed process objects.

Co-authored-by: Simn <634365+Simn@users.noreply.github.com>

* Preserve Unix.Unix_error info in all Process operations

- run: writes error message to stderr pipe so callers can read it
- read_stdout/read_stderr/write_stdin/close_stdin: include the
  Unix error details (function, arg, message) in Failure exceptions
- Add unix_error_msg helper shared by all operations
- Test checks stderr is non-empty (not a specific message string)
  for cross-platform compatibility

Co-authored-by: Simn <634365+Simn@users.noreply.github.com>

* reformat

* avoid some code duplication

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Simn <634365+Simn@users.noreply.github.com>
Co-authored-by: Simon Krajewski <simon@haxe.org>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Simn <634365+Simn@users.noreply.github.com>