[Feature]: Allow errors in MPI worker process to propagate back to the leader

### 🚀 The feature, motivation and pitch

**Current approach**:
Errors in an `MPI` worker process are currently handled by calling `MPI_Abort`; the signal handlers installed for `SIGABRT` and `SIGSEGV` abort the whole job: https://github.com/NVIDIA/TensorRT-LLM/blob/24f92721f25a19c77d9128a34c3a72f3a10533e9/cpp/tensorrt_llm/runtime/utils/mpiUtils.cpp#L193-L210
This terminates the program early and abruptly (see: https://www.open-mpi.org/doc/v4.1/man3/MPI_Abort.3.php).

**Motivation**:
Abrupt termination is undesirable for consumers of this library. It would be better if these errors were propagated back to the leader process, so that the consumer can catch them and exit gracefully.

**Proposal**:
Add a mechanism to toggle between exiting early when a worker process encounters an error (the current behavior) and propagating the error back to the leader.
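
To make the proposal concrete, here is a minimal sketch of what such a toggle could look like. Everything in it is hypothetical and not part of TensorRT-LLM: the names `WorkerErrorPolicy`, `WorkerResult`, `runWorker` and `checkWorkerResult` are made up for illustration, `abortFn` stands in for `MPI_Abort` so the sketch compiles without MPI, and the transport of the result to the leader (for example over MPI point-to-point messages) is left out. It also only covers errors that surface as C++ exceptions; signals such as `SIGSEGV`, which the existing handler deals with, would need separate treatment.

```cpp
#include <cstdlib>
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical toggle between the two behaviors discussed in this issue.
enum class WorkerErrorPolicy
{
    kAbort,     // current behavior: terminate the whole job
    kPropagate, // proposed behavior: report the error to the leader
};

// Hypothetical record a worker would send back to the leader.
struct WorkerResult
{
    int rank;
    bool failed;
    std::string message; // empty on success
};

// Runs `task` on a worker rank and applies the chosen policy on failure.
// `abortFn` is a stand-in for MPI_Abort (assumption: the real call does not return).
inline WorkerResult runWorker(int rank, WorkerErrorPolicy policy, std::function<void()> const& task,
    std::function<void(int)> const& abortFn)
{
    try
    {
        task();
        return WorkerResult{rank, false, ""};
    }
    catch (std::exception const& e)
    {
        if (policy == WorkerErrorPolicy::kAbort)
        {
            abortFn(EXIT_FAILURE);
        }
        // Reached under kPropagate, or if the stand-in abort returns.
        return WorkerResult{rank, true, e.what()};
    }
}

// Leader side: turn a failed worker result into an exception the consumer can catch.
inline void checkWorkerResult(WorkerResult const& result)
{
    if (result.failed)
    {
        throw std::runtime_error("worker " + std::to_string(result.rank) + " failed: " + result.message);
    }
}
```

With `kPropagate`, the leader would call `checkWorkerResult` on each result it receives and let the consumer decide how to shut down; with `kAbort`, behavior stays as it is today.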

-----------------------------------------------------------------------------------------------------------------------------
Discussion related to this: https://github.com/NVIDIA/TensorRT-LLM/discussions/9284#discussion-9155029

Any comments or suggestions are welcome. Would the maintainers be open to this kind of change?

### Alternatives

_No response_

### Additional context

_No response_

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and checked the [documentation](https://nvidia.github.io/TensorRT-LLM/) and [examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples) for answers to frequently asked questions.

Referenced snippet from `cpp/tensorrt_llm/runtime/utils/mpiUtils.cpp` (lines 193-210 of the permalink above; the range ends before the enclosing blocks close):

```cpp
for (int sig : {SIGABRT, SIGSEGV})
{
    __sighandler_t previousHandler = nullptr;
    if (forwardAbortToParent)
    {
        previousHandler = std::signal(sig,
            [](int signal)
            {
#ifndef _WIN32
                pid_t parentProcessId = getppid();
                kill(parentProcessId, SIGKILL);
#endif
                MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
            });
    }
    else
    {
        previousHandler = std::signal(sig, [](int signal) { MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE); });
```

[Feature]: Allow errors in MPI worker process to propagate back to the leader #9921