Open
Labels: bug, environment: slurm, help wanted, ver: 2.0.x, ver: 2.1.x, ver: 2.2.x, ver: 2.5.x
Bug description
During training, whenever there is a keyboard interrupt, the fit loop raises a SIGTERMException:
pytorch-lightning/src/lightning/pytorch/loops/fit_loop.py
Lines 397 to 398 in 98005bb
    if trainer.received_sigterm:
        raise SIGTERMException
which results in an exit code of 0. Other scripts that rely on the training script's exit code then treat the run as if it had exited normally.
The issue comes from here:
pytorch-lightning/src/lightning/pytorch/utilities/exceptions.py
Lines 19 to 28 in 98005bb
    class SIGTERMException(SystemExit):
        """Exception used when a :class:`signal.SIGTERM` is sent to a process.
        This exception is raised by the loops at specific points. It can be used to write custom logic in the
        :meth:`lightning.pytorch.callbacks.callback.Callback.on_exception` method.
        For example, you could use the :class:`lightning.pytorch.callbacks.fault_tolerance.OnExceptionCheckpoint` callback
        that saves a checkpoint for you when this exception is raised.
        """
Raising a SystemExit in Python without specifying an exit code leaves its code set to None, which gets converted to 0. The fix would be to have:
    class SIGTERMException(SystemExit):
        """Exception used when a :class:`signal.SIGTERM` is sent to a process.
        This exception is raised by the loops at specific points. It can be used to write custom logic in the
        :meth:`lightning.pytorch.callbacks.callback.Callback.on_exception` method.
        For example, you could use the :class:`lightning.pytorch.callbacks.fault_tolerance.OnExceptionCheckpoint` callback
        that saves a checkpoint for you when this exception is raised.
        """
        code = 128 + 15  # see https://tldp.org/LDP/abs/html/exitcodes.html
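To illustrate the difference without importing Lightning, here is a minimal sketch that compares a bare `SystemExit` with a subclass carrying an explicit code. The names `SigtermExit` and `SIGTERM_EXIT_CODE` are hypothetical stand-ins introduced for this example, and passing the code through `__init__` is just one possible way to set it, not necessarily how the library should do it.

```python
import signal

# Exit status a shell conventionally reports for a process ended by SIGTERM: 128 + 15.
SIGTERM_EXIT_CODE = 128 + int(signal.SIGTERM)


class SigtermExit(SystemExit):
    """Hypothetical stand-in for SIGTERMException; not Lightning's class."""

    def __init__(self) -> None:
        # SystemExit stores its single argument in the ``code`` attribute,
        # which the interpreter uses as the process exit status.
        super().__init__(SIGTERM_EXIT_CODE)


bare = SystemExit()
explicit = SigtermExit()

print(bare.code)      # None, which the interpreter treats as exit status 0
print(explicit.code)  # 143
```

With a non-zero code like this, a wrapper script checking the exit status could tell an interrupted run apart from a completed one.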
What version are you seeing the problem on?
v2.0, v2.1, v2.2, master
How to reproduce the bug
In a Python console, run

    import pytorch_lightning as pl
    raise pl.utilities.exceptions.SIGTERMException

then, once the console has exited, run this in the shell:

    echo $?

Alternatively, start a training run, send a keyboard interrupt signal to it, and run echo $? to see the exit code.
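The same check can be scripted without a shell or a Lightning install. The sketch below runs two tiny child interpreters: one raises a `SystemExit` subclass with no code (mirroring the current class), the other adds the class-level `code` attribute proposed above. The helper `exit_status` and the child class name `E` are assumptions made up for this example.

```python
import subprocess
import sys


def exit_status(source: str) -> int:
    """Run ``source`` in a fresh Python interpreter and return its exit status."""
    return subprocess.run([sys.executable, "-c", source]).returncode


# A SystemExit subclass raised without a code, mirroring the class quoted above.
without_code = "class E(SystemExit): pass\nraise E"

# The same subclass with the proposed class-level ``code`` attribute.
with_code = "class E(SystemExit):\n    code = 128 + 15\nraise E"

status_without = exit_status(without_code)
status_with = exit_status(with_code)

# The first status is indistinguishable from a clean exit; the second is not.
print(status_without, status_with)
```

A caller such as a job scheduler or a wrapper script sees only these integer statuses, which is why a zero status on termination gets misread as success.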
cc @awaelchli