Bug description
`SIGTERMException` is not raised consistently across all ranks in DDP training because PyTorch Lightning does not handle SIGTERM well for distributed jobs. As a result, checkpointing on SIGTERM cannot be implemented reliably for DDP without workarounds in client code.
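For context, such a workaround has to re-synchronize the SIGTERM flag itself. A minimal sketch of one possible callback is shown below; the class name and checkpoint path are made up, and it assumes `trainer.received_sigterm` exposes the flag set by the signal connector:

```python
import torch
import torch.distributed as dist
from lightning.pytorch.callbacks import Callback


class SigtermCheckpoint(Callback):
    """Hypothetical workaround: agree on the SIGTERM flag across ranks at a
    hook every rank reaches for the same batch, then checkpoint together."""

    def __init__(self, ckpt_path: str = "on_sigterm.ckpt") -> None:
        self.ckpt_path = ckpt_path

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx) -> None:
        # The flag may be set on some ranks and not (yet) on others, so a
        # MAX all-reduce makes every rank see the same decision.
        flag = torch.tensor(int(getattr(trainer, "received_sigterm", False)), device=pl_module.device)
        if dist.is_available() and dist.is_initialized():
            dist.all_reduce(flag, op=dist.ReduceOp.MAX)
        if flag.item():
            trainer.save_checkpoint(self.ckpt_path)
            trainer.should_stop = True
```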
Issue
The `SIGTERMException` is raised in `on_advance_end`. When certain ranks proceed beyond this point and begin the next training step, they become deadlocked while waiting for the ranks that raised the exception. The complete SIGTERM handling logic is detailed in the section below; steps #6 - #8 are not executed consistently.
This can lead to the following deadlock condition:
- All ranks complete gradient sharing and optimization at step N-1.
- Rank 0 receives SIGTERM, enters the handler, and forwards the SIGTERM to other ranks.
- Meanwhile, other ranks finish step N-1 and begin step N. They wait for rank 0 to join.
- Rank 0 completes step N-1 and raises `SIGTERMException` in `on_advance_end`.
- Rank 0 never joins step N, and the other ranks never reach `on_advance_end` on step N, preventing them from raising `SIGTERMException`.
Schematically:
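The same mismatch can be illustrated outside of Lightning with a stripped-down `torch.distributed` script (illustrative only; the collectives stand in for DDP's gradient all-reduce):

```python
import os
import time

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank: int, world_size: int) -> None:
    # Single-node rendezvous; address and port are arbitrary free values.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # "Step N-1": every rank participates, so this collective completes.
    grad = torch.ones(1)
    dist.all_reduce(grad)

    if rank == 0:
        # Stand-in for rank 0 raising SIGTERMException in on_advance_end:
        # its process stays alive for a while but never joins step N.
        time.sleep(600)
        return

    # "Step N": the remaining ranks block here waiting for rank 0
    # (until the backend timeout or until rank 0's process exits).
    dist.all_reduce(grad)
    print(f"rank {rank} finished step N")  # not reached while rank 0 is absent


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```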
SIGTERM handling logic in PyTorch Lightning
1. Kubernetes API Server receives a request to abort a job.
2. Kubernetes API Server sends an abort request to kubelets on every node.
3. Kubelet sends a SIGTERM signal to the main process of the PyTorch container.
   - It waits for the grace period and then sends a KILL signal.
4. The PL `_SignalConnector` receives the SIGTERM on the local rank 0 (main process) ([github](https://github.com/Lightning-AI/pytorch-lightning/blob/2.5.0.post0/src/lightning/pytorch/trainer/connectors/signal_connector.py#L105-L113)).
   - It prints `[rank: 0] Received SIGTERM: ...`
   - It calls `strategy.launcher.kill`.
5. The [DDPStrategy](https://github.com/Lightning-AI/pytorch-lightning/blob/2.5.0.post0/src/lightning/pytorch/strategies/ddp.py) uses [_MultiProcessingLauncher](https://github.com/Lightning-AI/pytorch-lightning/blob/2.5.0.post0/src/lightning/pytorch/strategies/launchers/multiprocessing.py). The launcher passes the SIGTERM to ranks 1 - N-1 ([github](https://github.com/Lightning-AI/pytorch-lightning/blob/2.5.0.post0/src/lightning/pytorch/strategies/launchers/multiprocessing.py#L260-L266C39)).
   - It prints `Process <parent> is terminating <child> with 15.`
6. All ranks set `self.received_sigterm = True` ([github](https://github.com/Lightning-AI/pytorch-lightning/blob/2.5.0/src/lightning/pytorch/trainer/connectors/signal_connector.py#L113)).
   - It prints `[rank: N] Received SIGTERM: ...`
7. The PL `_TrainingEpochLoop.on_advance_end` raises `SIGTERMException` when the batch processing completes ([github](https://github.com/Lightning-AI/pytorch-lightning/blob/2.5.0.post0/src/lightning/pytorch/loops/training_epoch_loop.py#L385-L386)).
8. The exception is passed to the `on_exception` hook ([github](https://github.com/Lightning-AI/pytorch-lightning/blob/2.5.0.post0/src/lightning/pytorch/trainer/call.py#L76)).
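Taken together, steps 4-7 amount to a handler that only records the signal plus a check at a batch boundary that turns it into an exception. A simplified sketch of that pattern (not the actual PL source; names and structure are illustrative):

```python
import os
import signal


class SignalFlag:
    """Illustrative flag-then-raise pattern; `child_pids` stands in for the
    worker processes the launcher knows about on local rank 0."""

    def __init__(self, child_pids=()):
        self.received_sigterm = False
        self.child_pids = list(child_pids)
        signal.signal(signal.SIGTERM, self._handle_sigterm)

    def _handle_sigterm(self, signum, frame):
        # Steps 4 and 6: record the signal; on local rank 0 the launcher then
        # forwards it to the worker ranks (step 5) so they set their flag too.
        self.received_sigterm = True
        for pid in self.child_pids:
            os.kill(pid, signal.SIGTERM)


def on_advance_end(flag: SignalFlag) -> None:
    # Step 7: the loop raises only when a rank finishes its batch and reaches
    # this check; a rank already blocked in a collective for the next batch
    # never gets here, which is why steps 6-8 can diverge across ranks.
    if flag.received_sigterm:
        raise RuntimeError("SIGTERM received")  # PL raises SIGTERMException here
```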
What version are you seeing the problem on?
v2.5.0.post0
How to reproduce the bug
This issue can be consistently reproduced by introducing a 10-second sleep in the `on_train_batch_end` hook on rank 0, which guarantees that the deadlock condition described above is hit.
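A minimal callback for this (the name is illustrative) could look like the sketch below; attach it to the multi-device DDP `Trainer` used for training and send SIGTERM to the main process while a run is in progress:

```python
import time

from lightning.pytorch.callbacks import Callback


class DelayRankZero(Callback):
    """Repro helper: hold rank 0 in on_train_batch_end so the other ranks run
    ahead into the next step before rank 0 raises SIGTERMException."""

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        if trainer.global_rank == 0:
            time.sleep(10)
```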