Errors are masked with DDP when processes are created external to lightning #8653

@ananthsub

🐛 Bug

The process reconciliation recently added to the error-handling path has a bug:
https://github.com/PyTorchLightning/pytorch-lightning/blob/16392a7de787a1c0c0163f9[…]95f69178d5aef3a9/pytorch_lightning/plugins/training_type/ddp.py

_sync_dir is initialized only in call_children_scripts, which is not called when the processes are created externally. As a result, errors are masked when training with Slurm, torch distributed elastic, or other custom cluster environments.

Users will see this issue instead:

torch.save(True, os.path.join(sync_dir, f"{self.global_rank}.pl")) 
TypeError: expected str, bytes or os.PathLike object, not NoneType

This exception is raised while handling the root-cause exception, which makes the original failure harder to debug.
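The masking can be reproduced without Lightning or torch. The following is a minimal sketch, not the plugin's code: `FakePlugin`, `reconcile`, and `train` are hypothetical names, and `os.path.join` alone stands in for the failing `torch.save` line. It assumes only that `_sync_dir` stays `None` when no launcher step has run.

```python
import os


class FakePlugin:
    """Hypothetical stand-in for the DDP plugin; not Lightning's class."""

    def __init__(self):
        # Stays None unless a launcher step (call_children_scripts in the
        # issue) sets it, which never happens for externally created processes.
        self._sync_dir = None
        self.global_rank = 0

    def reconcile(self):
        # Same path construction as in the reported traceback.
        return os.path.join(self._sync_dir, f"{self.global_rank}.pl")

    def train(self):
        try:
            raise RuntimeError("root cause: user training error")
        except RuntimeError:
            # Error-handling path: this raises a second exception.
            self.reconcile()
            raise


def run():
    try:
        FakePlugin().train()
    except Exception as err:
        return err


err = run()
print(type(err).__name__)              # TypeError: what the user sees last
print(type(err.__context__).__name__)  # RuntimeError: kept only as context
```

The root cause is still attached as `__context__`, so it does appear in the full traceback, but the final exception is the unrelated `TypeError`.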

Expected behavior

No exception should be raised in the exception-handling path, so that the root error is clear to users.
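One way to get this behavior is to make the reconciliation step a no-op when there is no sync directory. The sketch below is an assumption about what such a guard could look like, not the fix adopted in Lightning: `share_failure` and `train` are hypothetical names, and a plain file write stands in for `torch.save`.

```python
import os
import tempfile
from typing import Optional


def share_failure(sync_dir: Optional[str], global_rank: int) -> Optional[str]:
    """Hypothetical helper: write a per-rank marker file if possible.

    Returns the marker path, or None when there is no sync directory,
    so the caller can re-raise the original exception either way.
    """
    if sync_dir is None:
        # Processes were created externally; nothing to reconcile.
        return None
    path = os.path.join(sync_dir, f"{global_rank}.pl")
    with open(path, "w") as f:  # stand-in for torch.save(True, path)
        f.write("failed")
    return path


def train(sync_dir: Optional[str]) -> None:
    try:
        raise RuntimeError("root cause")
    except RuntimeError:
        share_failure(sync_dir, global_rank=0)
        raise  # the root cause propagates unchanged


# Exercise both cases: no sync directory, and a real temporary directory.
for d in (None, tempfile.mkdtemp()):
    try:
        train(d)
    except RuntimeError as err:
        print(repr(err), "context:", err.__context__)
```

In both cases the caller receives the original `RuntimeError` with no secondary exception attached.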

Environment

This is present on Lightning 1.4.

Labels

bug (Something isn't working), distributed (Generic distributed-related topic), help wanted (Open to be worked on), priority: 0 (High priority task)
