Closed
Labels: bug, distributed, help wanted, priority: 0
Description
🐛 Bug
The process reconciliation recently added to the error-handling path has a bug:
https://github.com/PyTorchLightning/pytorch-lightning/blob/16392a7de787a1c0c0163f9[…]95f69178d5aef3a9/pytorch_lightning/plugins/training_type/ddp.py
`_sync_dir` is initialized only during `call_children_scripts`, which is not called if the processes are created externally. Hence, this masks errors when training with Slurm, torch distributed elastic, or other custom cluster environments.
Users will see this error instead:
torch.save(True, os.path.join(sync_dir, f"{self.global_rank}.pl"))
TypeError: expected str, bytes or os.PathLike object, not NoneType
This exception is raised while the root-cause exception is being handled, so it is what users see first, which makes the original failure trickier to debug.
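The masking can be reproduced without Lightning or torch. The following is a minimal sketch, not the plugin's code: `reconcile` and `run_training` are hypothetical stand-ins, and `os.path.join` alone is enough to trigger the same `TypeError` when the sync directory is `None`.

```python
import os


def reconcile(sync_dir, global_rank):
    # Hypothetical stand-in for the reconciliation step; the reported code
    # passes a path built like this to torch.save.
    return os.path.join(sync_dir, f"{global_rank}.pl")


def run_training(sync_dir):
    try:
        raise RuntimeError("root cause: failure inside the training loop")
    except RuntimeError:
        # sync_dir is None when the processes were launched externally,
        # so this call raises a TypeError inside the handler.
        reconcile(sync_dir, global_rank=0)


try:
    run_training(sync_dir=None)
except TypeError as err:
    surfaced = err

# The TypeError is what propagates to the caller; the root cause is only
# reachable through the implicit exception chain (__context__).
print(type(surfaced).__name__, "raised while handling", repr(surfaced.__context__))
```

In a real traceback the root cause still appears, but above the "During handling of the above exception, another exception occurred" separator, where it is easy to overlook.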
Expected behavior
The exception-handling path should not raise exceptions of its own, so that the root error is clear to users.
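One possible shape for such a fix, sketched under assumptions rather than taken from the Lightning code base: guard the reconciliation step on the sync directory being set, then re-raise the original error. The names `reconcile_if_possible`, `run_training`, and `surfaced_error` are hypothetical, and a plain file write stands in for `torch.save`.

```python
import os
import tempfile


def reconcile_if_possible(sync_dir, global_rank):
    """Write a marker file for this rank, or do nothing without a sync dir."""
    if sync_dir is None:
        # Processes were created externally: nothing to reconcile here.
        return None
    marker = os.path.join(sync_dir, f"{global_rank}.pl")
    with open(marker, "w") as f:  # stand-in for torch.save(True, marker)
        f.write("failed")
    return marker


def run_training(sync_dir, global_rank=0):
    try:
        raise RuntimeError("root cause: failure inside the training loop")
    except RuntimeError:
        reconcile_if_possible(sync_dir, global_rank)
        raise  # re-raise the root cause unchanged


def surfaced_error(sync_dir):
    """Return whichever exception reaches the caller."""
    try:
        run_training(sync_dir)
    except Exception as err:
        return err


# Externally launched processes: the root cause is what the user sees.
print(type(surfaced_error(None)).__name__)

# Self-launched processes: the marker is written and the root cause still surfaces.
with tempfile.TemporaryDirectory() as tmp:
    print(type(surfaced_error(tmp)).__name__, os.listdir(tmp))
```

Skipping the step when no sync directory exists keeps the handler side-effect free in externally managed environments; whether those environments need a different reconciliation mechanism is a separate question.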
Environment
This is present in Lightning 1.4.