Skip to content

Enable parallelism again#12747

Merged
Simn merged 12 commits intodevelopmentfrom
why-is-parallelism-disabled
Mar 7, 2026
Merged

Enable parallelism again#12747
Simn merged 12 commits intodevelopmentfrom
why-is-parallelism-disabled

Conversation

@Simn
Copy link
Member

@Simn Simn commented Mar 5, 2026

This was randomly disabled in some ninja commit last year and I never noticed it. We even added -D disable-parallelism to optionally disable it, so I'm a little confused by it's disabled by default. But either way, let's get this working properly.

@Simn
Copy link
Member Author

Simn commented Mar 5, 2026

@copilot Please survey all the Parallel occurrences in filters.ml and trace through the code to find potential data races. Also, we've experienced some memory leaks in the past, but it is unclear if that was our fault or an upstream problem.

Copy link
Contributor

Copilot AI commented Mar 5, 2026

@Simn I've opened a new pull request, #12748, to work on those changes. Once the pull request is ready, I'll request review from you.

@Simn Simn force-pushed the why-is-parallelism-disabled branch from 38c78d6 to ade2f8c Compare March 5, 2026 20:57
@Simn Simn force-pushed the why-is-parallelism-disabled branch from b123bb1 to a7bee8b Compare March 5, 2026 21:03
Copilot AI and others added 2 commits March 5, 2026 22:34
- Add Parallel.run_with_pool to wrap operations in Task.run (fixes
  Unhandled(Wait) exception from missing effect handler)
- Update all call sites to use run_with_pool instead of bare Lazy.force
- Enable parallel analyzer by default (remove opt-in gate)
- Make apply_params_stack domain-local via Domain.DLS (fixes data race)
- Make alloc_mid use Atomic (consistency with alloc_var)
- Protect DCE CfMaybeUsed writes with field_marker_mutex (fixes flag race)

Co-authored-by: Simn <634365+Simn@users.noreply.github.com>
@Simn
Copy link
Member Author

Simn commented Mar 5, 2026

@copilot I remembered (well, CI reminded me) that there was still a problem with domains and Eval. There's a comment about this in filters.ml:

	(* Note: We cannot have a thread pool up during the before/after_save callbacks because Eval's thread handling
	   currently does not get along with it. This is why we need a separate pool for this operation. *)

Unfortunately, I don't remember what the actual root cause was and why I concluded that there's an interference. Mixing threads and domain is probably not a particularly good idea, but at the same time I can't tell why it wouldn't work. Could you check if you can identify the problem?

And while you're there, I would like to have your assessment on whether or not it would be feasible to make Eval use domains instead of threads for its vthread.

Copy link
Contributor

Copilot AI commented Mar 5, 2026

@Simn I've opened a new pull request, #12751, to work on those changes. Once the pull request is ready, I'll request review from you.

Simn added a commit that referenced this pull request Mar 6, 2026
@Simn
Copy link
Member Author

Simn commented Mar 6, 2026

@copilot This PR aims to enable parallelism for the filtering stage of the compiler again. We have found that there is a problem with the eval target in particular which we haven't been able to track down yet. Here are examples of failing CI runs:

https://github.com/HaxeFoundation/haxe/actions/runs/22768814450/job/66045252148?pr=12747
https://github.com/HaxeFoundation/haxe/actions/runs/22768814450/job/66044060318?pr=12747
https://github.com/HaxeFoundation/haxe/actions/runs/22768814450/job/66047042680?pr=12747

Note: the actual Eval run-time is not affected by parallelism and still runs single-threaded. Also, note that these particular failures never occured on the development branch, so they really do seem related to the changes here.

Copy link
Contributor

Copilot AI commented Mar 6, 2026

@Simn I've opened a new pull request, #12757, to work on those changes. Once the pull request is ready, I'll request review from you.

Copilot AI and others added 2 commits March 7, 2026 06:55
…lti-domain safety (#12757)

* Initial plan

* Use run_in_new_pool for filter phases to avoid idle domains during eval execution

The persistent pool (via run_with_pool) kept domain workers alive during eval
interpretation and callbacks, causing subprocess/IPC failures in eval tests.
Revert to run_in_new_pool which tears down the pool after each parallel section,
matching the pattern that works on development. The sctx.pool and run_with_pool
remain available for generators (e.g. genjvm) that use them during code generation.

Also restore the EnableParallelAnalyzer guard since the analyzer was specifically
identified as leaking when run in parallel (#12414).

Co-authored-by: Simn <634365+Simn@users.noreply.github.com>

* Revert filters.ml to base branch state

Co-authored-by: Simn <634365+Simn@users.noreply.github.com>

* Make process_stubs.c safe for OCaml 5 multi-domain programs

Two changes to fix eval subprocess failures when domainslib pool
domains are alive:

1. Add caml_enter_blocking_section/caml_leave_blocking_section to all
   blocking process operations (read, write, waitpid, WaitForSingleObject,
   ReadFile, WriteFile). Without these, the calling domain holds its
   runtime lock during blocking I/O, preventing stop-the-world GC from
   proceeding when worker domains need it. This causes GC stalls that
   lead to pipe I/O corruption and timing issues.

2. Replace fork()+exec() with posix_spawn() on Unix. OCaml 5's
   Unix.fork explicitly refuses to fork when domains are spawned
   (caml_domain_is_multicore check). process_stubs.c bypassed this
   safety check by calling fork() directly. In multi-threaded programs,
   fork() can deadlock when worker threads hold internal locks (malloc
   arenas, glibc internals). posix_spawn() avoids this by using
   clone(CLONE_VM|CLONE_VFORK) internally on Linux.

Co-authored-by: Simn <634365+Simn@users.noreply.github.com>

* Replace process_stubs.c with pure OCaml implementation

Remove the C stubs entirely and implement Process using
Unix.create_process, which is domain-safe under OCaml 5.
This also refactors PipeThings.run_command to use Process.run
instead of duplicating the pipe+spawn logic.

The old C stubs had two issues in OCaml 5:
1. Missing caml_enter_blocking_section around blocking I/O
2. Direct fork() which is unsafe with multiple domains

Unix.create_process handles both correctly (it uses posix_spawn
internally on modern systems and properly releases the runtime lock).

Co-authored-by: Simn <634365+Simn@users.noreply.github.com>

* Fix error handling: convert Unix.Unix_error to Failure for eval compatibility

The old C stubs used caml_failwith for all errors, which eval catches
as Dynamic. Unix.create_process raises Unix.Unix_error instead, which
eval couldn't catch. Convert all Unix errors to Failure to preserve
backward compatibility.

Co-authored-by: Simn <634365+Simn@users.noreply.github.com>

* catch Unix_error in callers instead

* don't turn exit status into exceptions

* make Eof extend Exception

* catch constructor exception too

* back to running everything

* hmm

* Handle non-existent commands in Process.run without throwing

When Unix.create_process fails (e.g. command not found), return a
process with pre-closed pipes (reads return EOF) and exit code 127,
matching the old fork+exec behavior where fork always succeeded and
exec failure manifested as a non-zero exit code.

Add misc test to verify non-existent commands don't throw exceptions.

Co-authored-by: Simn <634365+Simn@users.noreply.github.com>

* Assert output in process-nonexistent misc test, guard kill against pid=0

The test now prints stdout_empty, stderr_empty, and exit_code values
and asserts them via compile.hxml.stdout (expected: empty pipes, code 127).

Also add pid > 0 guard in Process.kill as defense-in-depth against
sending signals to process group 0 for failed process objects.

Co-authored-by: Simn <634365+Simn@users.noreply.github.com>

* Preserve Unix.Unix_error info in all Process operations

- run: writes error message to stderr pipe so callers can read it
- read_stdout/read_stderr/write_stdin/close_stdin: include the
  Unix error details (function, arg, message) in Failure exceptions
- Add unix_error_msg helper shared by all operations
- Test checks stderr is non-empty (not a specific message string)
  for cross-platform compatibility

Co-authored-by: Simn <634365+Simn@users.noreply.github.com>

* reformat

* avoid some code duplication

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Simn <634365+Simn@users.noreply.github.com>
Co-authored-by: Simon Krajewski <simon@haxe.org>
@Simn Simn marked this pull request as ready for review March 7, 2026 07:41
@Simn
Copy link
Member Author

Simn commented Mar 7, 2026

Looks like the damn process thing really was the root cause of these weird internal issues we've been seeing. I'm not sure if this also resolves the memory leak but I suppose there's only one way to find out.

@Simn Simn merged commit cabfd42 into development Mar 7, 2026
99 checks passed
@Simn Simn deleted the why-is-parallelism-disabled branch March 7, 2026 08:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants