[Feature]: Allow errors in MPI worker process to propagate back to the leader #9921

@matiaslin

🚀 The feature, motivation and pitch

Current approach:
The current philosophy for MPI process error handling is to call MPI_Abort when we encounter an error in a worker process:

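    // Install a handler for each fatal signal of interest.
    for (int sig : {SIGABRT, SIGSEGV})
    {
        __sighandler_t previousHandler = nullptr;
        if (forwardAbortToParent)
        {
            // Kill the parent process first, then abort the whole MPI job.
            previousHandler = std::signal(sig,
                [](int signal)
                {
    #ifndef _WIN32
                    pid_t parentProcessId = getppid();
                    kill(parentProcessId, SIGKILL);
    #endif
                    MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
                });
        }
        else
        {
            // Abort the whole MPI job without touching the parent.
            previousHandler = std::signal(sig, [](int signal) { MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE); });
        }
        // ... (remainder of the loop body is not shown in this excerpt)
    }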

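MPI_Abort makes a best-effort attempt to abort every task in the given communicator, so an error in a single worker terminates the whole program early and abruptly (see: https://www.open-mpi.org/doc/v4.1/man3/MPI_Abort.3.php).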

Motivation:
Abrupt termination is undesirable for consumers of this library, because it gives them no chance to react. It would be better if these errors were propagated back to the leader process, so that the consumer can catch them and exit gracefully.

Proposal:
We add a mechanism to toggle between exiting early when an error is encountered in a worker process (i.e., the current behavior) and propagating the error back to the leader.
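To make the proposal concrete, here is a minimal sketch of what such a toggle could look like. Every name in it (`WorkerErrorPolicy`, `WorkerResult`, `runWorkerTask`, `throwIfWorkerFailed`) is hypothetical and not part of the library. The abort call is injected as a callback so the sketch runs without MPI, and the actual transport of the result from worker to leader (for example over an MPI message) is left out. It only covers errors that surface as C++ exceptions. Signals such as SIGSEGV cannot be safely turned into exceptions, so they would presumably still need the abort path.

```cpp
#include <cassert>
#include <cstdlib>
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical toggle: keep today's behavior or report the error to the leader.
enum class WorkerErrorPolicy
{
    kAbort,    // current behavior: abort the whole job
    kPropagate // proposed behavior: hand the error back to the leader
};

// Plain data describing the outcome of a worker task, so that it could be
// serialized and sent to the leader (the transport is not part of this sketch).
struct WorkerResult
{
    bool ok;
    int rank;
    std::string message;
};

// Runs `task` on a worker. `abortFn` is a stand-in for
// MPI_Abort(MPI_COMM_WORLD, code); a real MPI_Abort would not return.
WorkerResult runWorkerTask(int rank, WorkerErrorPolicy policy, std::function<void()> const& task,
    std::function<void(int)> const& abortFn)
{
    assert(rank >= 0);
    try
    {
        task();
        return WorkerResult{true, rank, ""};
    }
    catch (std::exception const& e)
    {
        if (policy == WorkerErrorPolicy::kAbort)
        {
            abortFn(EXIT_FAILURE);
        }
        // Reached under kPropagate, or when an injected abortFn returns.
        return WorkerResult{false, rank, e.what()};
    }
}

// Leader side: turn a failed worker result back into an exception that the
// consumer of the library can catch.
void throwIfWorkerFailed(WorkerResult const& result)
{
    if (!result.ok)
    {
        throw std::runtime_error(
            "worker rank " + std::to_string(result.rank) + " failed: " + result.message);
    }
}
```

With `kPropagate`, the leader would call `throwIfWorkerFailed` on each result it receives and let the consumer handle the exception. With `kAbort`, behavior would stay as it is today. Making the policy an explicit parameter keeps the current behavior as a safe default.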


Discussion related to this: #9284 (comment)

Any comments or suggestions are welcome. Would the maintainers be open to this kind of change?

Alternatives

No response

Additional context

No response
