Skip to content

Hard to debug subprocess crashes when using concurrent.futures.ProcessPoolExecutorย #139462

@jberg5

Description

@jberg5

Feature or enhancement

Proposal:

Background

When using a concurrent.futures.ProcessPoolExecutor, if a subprocess crashes/terminates, the library gives you an exception like:

concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

This doesn't tell you two very important bits of information:

  • Which subprocess terminated
  • What exit code it returned

When running a nontrivial application, odds are that you'll run into a BrokenProcessPool, whether because a subprocess segfaulted, OOMed, or something else, and having some basic information about what exactly died and why would be incredibly helpful for correlating crashes with logs/telemetry/etc and aiding further investigation.

As a possible but suboptimal workaround, you can inspect the private _processes list upon failure to try and figure out what went wrong, e.g.

from concurrent.futures import as_completed, ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool
import ctypes
import os
import traceback

def worker_ok() -> str:
    print(f"worker_ok {os.getpid()=}", flush=True)
    return "ok"

def worker_segfault():
    print(f"worker_segfault {os.getpid()=}", flush=True)
    ctypes.string_at(0)

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=2) as executor:
        futures = [
            executor.submit(worker_ok),
            executor.submit(worker_segfault),
        ]
        for future in as_completed(futures):
            try:
                future.result()
            except BrokenProcessPool as e:
                # Need to look at private _processes to figure out what died and why
                for pid, p in executor._processes.items():
                    if p.exitcode:
                        print(f"Process {pid} failed with exitcode {p.exitcode}")
                traceback.print_exception(e)
worker_ok os.getpid()=37334
worker_segfault os.getpid()=37335
Process 37334 failed with exitcode -15
Process 37335 failed with exitcode -11
Traceback (most recent call last):
  File "/users/me/projects/repo/example.py", line 26, in <module>
    future.result()
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

Accessing _processes works, but it means reading a nonpublic part of the api, and writing slightly nontrivial exception handling code. It would be much better if the ProcessPoolExecutor automatically reported this kind of crash info when raising a BrokenProcessPool exception.

Proposal

There is already a concept in the codebase of a bpe.__cause__, which contains the cause of the BrokenProcessPool if one is known. When _ExecutorManagerThread._terminate_broken is called, it should loop through the processes and report any nonzero exit codes.

I've implemented the above approach, and running the example script above with my change, I now see:

worker_ok os.getpid()=37202
worker_segfault os.getpid()=37202
Lib.concurrent.futures.process._RemoteTraceback: Process 37202 terminated abruptly with exit code -11

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/jberg/Projects/cpython/example.py", line 26, in <module>
    future.result()
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
Lib.concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

Note the Lib.concurrent.futures.process._RemoteTraceback above - imo this is a much more informative traceback.

If this looks good to everyone, I'll raise a PR shortly.

Has this already been discussed elsewhere?

This is a minor feature, which does not need previous discussion elsewhere

Links to previous discussion of this feature:

No response

Linked PRs

Metadata

Metadata

Assignees

Labels

stdlibStandard Library Python modules in the Lib/ directorytopic-multiprocessingtype-featureA feature request or enhancement

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions