Description
🚀 The feature, motivation and pitch
Current approach:
The current philosophy for MPI process error handling is to call MPI_Abort when we encounter an error in a worker process:
TensorRT-LLM/cpp/tensorrt_llm/runtime/utils/mpiUtils.cpp
Lines 193 to 210 in 24f9272
    for (int sig : {SIGABRT, SIGSEGV})
    {
        __sighandler_t previousHandler = nullptr;
        if (forwardAbortToParent)
        {
            previousHandler = std::signal(sig,
                [](int signal)
                {
    #ifndef _WIN32
                    pid_t parentProcessId = getppid();
                    kill(parentProcessId, SIGKILL);
    #endif
                    MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
                });
        }
        else
        {
            previousHandler = std::signal(sig, [](int signal) { MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE); });
MPI_Abort makes a best-effort attempt to abort every process in the communicator, so the whole program terminates early and abruptly (see: https://www.open-mpi.org/doc/v4.1/man3/MPI_Abort.3.php).
Motivation:
Abrupt termination is undesirable for consumers of this library. It would be better if these errors were propagated back to the leader process, so that the consumer can catch them and exit gracefully.
Proposal:
Add a mechanism to toggle between exiting early when an error is encountered in a worker process (i.e., the current behavior) and propagating the error back to the leader.
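To make the proposal concrete, here is a rough sketch of what such a toggle could look like. Everything in it is hypothetical: `WorkerErrorPolicy`, `WorkerError`, and `WorkerErrorHandler` are illustrative names that do not exist in TensorRT-LLM, and the abort and send steps are injected as callbacks so the sketch compiles without MPI. In a real integration the abort callback would presumably wrap `MPI_Abort(MPI_COMM_WORLD, code)` and the send callback would wrap whatever transport carries the error to the leader.

```cpp
#include <cstdlib>
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical policy switch (not an existing TensorRT-LLM type).
enum class WorkerErrorPolicy
{
    kAbort,    // current behavior: tear down the whole job immediately
    kPropagate // proposed behavior: report the error to the leader instead
};

// Hypothetical error record that a worker would forward to the leader.
struct WorkerError
{
    int rank;
    std::string message;
};

// Hypothetical handler that picks between the two behaviors.
class WorkerErrorHandler
{
public:
    using AbortFn = std::function<void(int exitCode)>;
    using SendFn = std::function<void(WorkerError const&)>;

    WorkerErrorHandler(WorkerErrorPolicy policy, AbortFn abortFn, SendFn sendToLeader)
        : mPolicy(policy)
        , mAbort(std::move(abortFn))
        , mSendToLeader(std::move(sendToLeader))
    {
    }

    // Called on a worker rank when an error surfaces as an exception.
    void handle(int rank, std::exception const& error) const
    {
        if (mPolicy == WorkerErrorPolicy::kAbort)
        {
            // Assumption: in production this callback wraps MPI_Abort.
            mAbort(EXIT_FAILURE);
            return;
        }
        // Assumption: in production this callback sends the record to the leader rank.
        mSendToLeader(WorkerError{rank, error.what()});
    }

private:
    WorkerErrorPolicy mPolicy;
    AbortFn mAbort;
    SendFn mSendToLeader;
};
```

A worker would wrap its main loop in a `try`/`catch` and call `handle(rank, e)` from the `catch` block. Note that this only covers errors that surface as C++ exceptions; whether signals such as SIGSEGV can be reported safely is a separate question that the sketch does not address.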
Discussion related to this: #9284 (comment)
Comments and suggestions are welcome. Would the maintainers be open to this kind of change?