
Intermittent hang after all tests complete: execnet _thread_receiver blocked on dead worker pipes (Python 3.14t, --dist loadfile) #1313

@clemlesne

Summary

pytest-xdist 3.8.0 intermittently hangs after all tests have passed when using `--dist loadfile` with 10 workers on Python 3.14t (free-threaded). The main thread loops forever in `dsession.loop_once()` → `queue.get(timeout=2.0)` → `Empty` → retry, because `_active_nodes` is never emptied. Meanwhile, 10 execnet `_thread_receiver` threads are stuck in `gateway_base.read()` on dead worker pipes, so they never report the worker death events that would unregister the nodes.

Environment

  • Python: 3.14.2 free-threading build (cpython-3.14.2+freethreaded-macos-aarch64-none)
  • pytest: 8.4.2
  • pytest-xdist: 3.8.0
  • execnet: 2.1.2
  • OS: macOS 26.3 (Darwin 25.3.0, Apple Silicon)
  • GIL: Irrelevant — reproduces with both PYTHON_GIL=0 (default) and PYTHON_GIL=1

Configuration

```toml
# pyproject.toml
[tool.pytest.ini_options]
addopts = ["-n", "auto", "--dist", "loadfile"]
```

Reproduction

```bash
# ~50% reproduction rate on a 710-test suite
# Individual test files never hang, only the full suite
uv run pytest tests/ -m "unit and not integration" --no-cov -q

# Workaround: serial mode always passes
uv run pytest tests/ -m "unit and not integration" --no-cov -q -n 0
```

The hang occurs at ~96% completion (after ~680/710 tests have passed). Progress output stops and pytest never exits. CPU usage drops to 0%.

Thread dump (faulthandler)

Captured via `faulthandler.dump_traceback_later(120)`:

Main thread — caught during `queue.get(timeout=2.0)` wait inside the infinite retry loop. The loop never exits because `_active_nodes` is never emptied (workers aren't unregistered since their receiver threads are stuck):

```
Thread 0x0000000200b07080 (most recent call first):
File "threading.py", line 373 in wait
File "queue.py", line 210 in get
File "xdist/dsession.py", line 154 in loop_once
File "xdist/dsession.py", line 138 in pytest_runtestloop
File "pluggy/_callers.py", line 121 in _multicall
File "_pytest/main.py", line 343 in _main
```

10 receiver threads — all identical, stuck in blocking `read()` on dead worker pipes:

```
Thread 0x000000017654b000 (most recent call first):
File "execnet/gateway_base.py", line 534 in read
File "execnet/gateway_base.py", line 567 in from_io
File "execnet/gateway_base.py", line 1160 in _thread_receiver
File "execnet/gateway_base.py", line 341 in run
File "execnet/gateway_base.py", line 411 in _perform_spawn
```

(All 10 receiver threads, one per worker, show the same trace: `_thread_receiver` → `from_io` → `read`)

macOS `sample` trace

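Native profiling is consistent with this: the sampled thread is parked in a condition-variable wait, which matches the main thread's timed lock acquire inside `queue.get(timeout=2.0)`: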

```
lock_PyThread_acquire_lock (in libpython3.14t.dylib) + 60
_PyMutex_LockTimed (in libpython3.14t.dylib) + 880
_pthread_cond_wait (in libsystem_pthread.dylib) + 1028
__psynch_cvwait (in libsystem_kernel.dylib) + 8
```

Analysis

The worker subprocesses have finished and exited, but execnet's `_thread_receiver` threads remain blocked on `gateway_base.py:534 read()` — a blocking read from the worker's pipe that never returns EOF. Since these threads never detect worker death, they never fire the shutdown event that would remove the node from `dsession._active_nodes`. The main thread's `loop_once()` keeps retrying `queue.get(timeout=2.0)` → `Empty` → checks `_active_nodes` (still populated) → loops forever.

This points to a race condition in worker cleanup, which would also explain the ~50% reproduction rate: if a worker subprocess exits but the parent process's `read()` call never sees EOF on the pipe, the receiver thread blocks indefinitely.
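One well-known way a pipe reader can miss EOF is sketched below. It is offered as a hypothesis, not a confirmed diagnosis of this hang. On POSIX, `read()` reports EOF only once every copy of the write end is closed, so a copy still held elsewhere keeps the reader blocked after the original writer is gone. The sketch uses `os.dup` as a stand-in for such a leaked copy; `readable` and `leaked_fd` are illustrative names.

```python
import os
import select


def readable(fd, timeout=0.2):
    """Return True if a read() on fd would return now (data or EOF)."""
    ready, _, _ = select.select([fd], [], [], timeout)
    return bool(ready)


read_fd, write_fd = os.pipe()
leaked_fd = os.dup(write_fd)   # stand-in for a copy held by someone else

os.close(write_fd)             # the "worker" closes its end and exits
eof_while_leaked = readable(read_fd)   # a blocking read() would hang here

os.close(leaked_fd)            # the last write end is closed
eof_after_close = readable(read_fd)
data = os.read(read_fd, 1)     # returns b"" (EOF) without blocking
os.close(read_fd)

print(eof_while_leaked, eof_after_close, data)  # False True b''
```

If this mechanism were involved, checking which processes still hold the pipe's write end at hang time (for example with `lsof`) would be a way to confirm or rule it out.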

The `dsession.loop_once()` while-loop at line 148 cannot break out because:

  1. `_active_nodes` is never emptied (workers not unregistered)
  2. `queue.get(timeout=2.0)` always raises `Empty` (no events arriving)
  3. The `continue` restarts the loop
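The three conditions above can be modeled in a few lines. This is a simplified stand-in, not the actual `dsession.py` source: `loop_until_idle`, `active_nodes`, `event_queue`, and the `"workerfinished"` event tuple are illustrative, and the `max_iterations` cap exists only so the sketch terminates.

```python
import queue


def loop_until_idle(active_nodes, event_queue, timeout=0.01, max_iterations=5):
    """Model of the retry loop; returns True if it exited on its own."""
    for _ in range(max_iterations):      # the real loop has no such cap
        if not active_nodes:             # exit condition: all nodes gone
            return True
        try:
            event, node = event_queue.get(timeout=timeout)
        except queue.Empty:
            continue                     # nothing arrived: retry
        if event == "workerfinished":    # illustrative event name
            active_nodes.discard(node)
    return not active_nodes


# Receiver threads are stuck, so no event is ever enqueued.
stuck = loop_until_idle({"gw0", "gw1"}, queue.Queue())

# Healthy case: each worker's shutdown event arrives and the loop exits.
events = queue.Queue()
for name in ("gw0", "gw1"):
    events.put(("workerfinished", name))
healthy = loop_until_idle({"gw0", "gw1"}, events)

print(stuck, healthy)  # False True
```

The only thing that ever shrinks `active_nodes` in this model is an event arriving on the queue, which is why silent receiver threads translate directly into an endless loop.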

Related issues

  • execnet #43 — "Test process hanging forever" — same class of bug (`waitall()` without timeout on dead workers). Partially fixed with timeout support but `_thread_receiver`'s blocking `read()` was not addressed.
  • pytest-xdist #884 — "worksteal + high core counts leads to hangs" — similar symptom (hang after tests complete) but different root cause (queue replacement race in worksteal scheduler). Fixed in 3.2.1.
  • pytest-xdist #1071 — "concurrent remote_exec deadlock for main_thread_only execmodel" — different deadlock in execnet's execmodel, fixed in 3.6.1.
  • scikit-learn #30007 — "Upgrade free-threading CI to run with pytest-freethreaded instead of pytest-xdist" — suggests pytest-xdist is not fully compatible with free-threaded Python.

Suggested fix

The root cause is in execnet (`gateway_base.py:534`). The `_thread_receiver` loop's `read()` call blocks indefinitely when a worker pipe doesn't deliver EOF on process exit. Options:

  1. execnet fix: Use non-blocking or timeout-based reads in `_thread_receiver`, or poll the worker process liveness.
  2. xdist mitigation: In `loop_once()`, add a worker liveness check after N consecutive `Empty` timeouts — if all workers' subprocesses have exited (via `os.waitpid` or similar), force-unregister them from `_active_nodes`.
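Both options can be sketched roughly as follows. Neither function is execnet or xdist code: `read_or_detect_death`, `prune_dead_nodes`, and the name-to-`Popen` mapping are hypothetical, `Popen.poll()` stands in for `os.waitpid`, and `select` on pipes limits this to POSIX. In a real mitigation, `prune_dead_nodes` would be called only after N consecutive `Empty` timeouts.

```python
import os
import select
import subprocess
import sys


def read_or_detect_death(fd, proc, nbytes, poll_interval=0.1):
    """Option 1: wait for data with a timeout, polling the worker in between.

    Returns the bytes read, or b"" on EOF or once the worker has exited
    with nothing left to read.
    """
    while True:
        ready, _, _ = select.select([fd], [], [], poll_interval)
        if ready:
            return os.read(fd, nbytes)   # data, or b"" on a real EOF
        if proc.poll() is not None:
            # Worker is gone: drain anything written just before exit,
            # otherwise treat the silent pipe as EOF instead of blocking.
            ready, _, _ = select.select([fd], [], [], 0)
            return os.read(fd, nbytes) if ready else b""


def prune_dead_nodes(active_nodes):
    """Option 2: drop nodes whose subprocess has exited; return their names."""
    dead = [name for name, proc in active_nodes.items()
            if proc.poll() is not None]
    for name in dead:
        del active_nodes[name]
    return dead


# Demo: the worker exits, but the parent still holds the write end open,
# so a plain blocking read() on read_fd would never see EOF.
read_fd, write_fd = os.pipe()
worker = subprocess.Popen([sys.executable, "-c", "pass"])
worker.wait()
result = read_or_detect_death(read_fd, worker, 4)
pruned = prune_dead_nodes({"gw0": worker})
os.close(read_fd)
os.close(write_fd)

print(result, pruned)  # b'' ['gw0']
```

Option 1 addresses the stuck thread itself, while option 2 only lets the main loop exit and would leave the receiver threads blocked until interpreter shutdown.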
