🐛 Bug
If a model has no trainable parameters and the user calls `trainer.test()`, Lightning raises an exception, because `DistributedDataParallel` construction fails when no parameters with `requires_grad=True` are found.

This error is compounded by the process reconciliation check: when `configure_ddp` raises, `_share_information_to_prevent_deadlock` in the DDP plugin is never called:
```python
def pre_dispatch(self):
    # move the model to the correct device
    self.model_to_device()

    if self.sync_batchnorm:
        self.model = self.configure_sync_batchnorm(self.model)

    # raises an exception here because of no trainable params
    self.configure_ddp()

    # `sync_dir` is initialized here, so this step is skipped
    self._share_information_to_prevent_deadlock()
```
As a result, `sync_dir` is never initialized.
As part of the exception handling, we call `reconciliate_processes`: https://github.com/PyTorchLightning/pytorch-lightning/blob/35876bb75f27eb8f44220afd4bfa757a0432d233/pytorch_lightning/trainer/trainer.py#L519

which causes the path creation below to fail:
```python
def reconciliate_processes(self, trace: str):
    if self.world_size < 2:
        return
    sync_dir = self._sync_dir
    # The cluster may be configured to periodically purge the `/tmp`
    # directory, in which case `sync_dir` may not exist anymore at this
    # point. Idempotently create it to ensure its existence.
    Path(sync_dir).mkdir(parents=True, exist_ok=True)
```
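With `sync_dir` still `None`, the `mkdir` call fails before it even touches the filesystem (the exact message varies by Python version):

```python
from pathlib import Path

Path(None)  # TypeError: argument should be a str or an os.PathLike object, not NoneType
```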
@ananthsub pointed this out in #9096.
To Reproduce
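A minimal sketch (untested; `FrozenModel` and `RandomDataset` are hypothetical names, and the `Trainer` flags assume the 1.4-era API). Any `LightningModule` whose parameters all have `requires_grad=False` should trigger it when run under DDP:

```python
import torch
from torch.utils.data import DataLoader, Dataset

import pytorch_lightning as pl


class RandomDataset(Dataset):
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        return torch.randn(4)


class FrozenModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 2)
        # freeze everything so that no parameter has requires_grad=True
        for p in self.parameters():
            p.requires_grad = False

    def test_step(self, batch, batch_idx):
        return self.layer(batch).sum()


if __name__ == "__main__":
    trainer = pl.Trainer(gpus=2, accelerator="ddp")
    # DistributedDataParallel construction raises:
    # "DistributedDataParallel is not needed when a module doesn't have
    #  any parameter that requires a gradient."
    trainer.test(FrozenModel(), DataLoader(RandomDataset(), batch_size=2))
```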
Expected behavior
- Initialize `sync_dir` before `configure_ddp` is called
- Check `sync_dir` is not `None` before creating the path (see the sketch below)
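A minimal sketch of the second check (a hypothetical patch to `reconciliate_processes`; names follow the snippet above):

```python
from pathlib import Path

from pytorch_lightning.utilities import rank_zero_warn


def reconciliate_processes(self, trace: str):
    if self.world_size < 2:
        return
    sync_dir = self._sync_dir
    # `configure_ddp` may have raised before `_share_information_to_prevent_deadlock`
    # ran, in which case `sync_dir` was never set. Bail out instead of crashing.
    if sync_dir is None:
        rank_zero_warn("`sync_dir` was never set; skipping process reconciliation.")
        return
    # The cluster may have purged `/tmp`; idempotently re-create the directory.
    Path(sync_dir).mkdir(parents=True, exist_ok=True)
```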
Environment
- PyTorch Lightning Version (e.g., 1.3.0):
- PyTorch Version (e.g., 1.8):
- Python version:
- OS (e.g., Linux):
- CUDA/cuDNN version:
- GPU models and configuration:
- How you installed PyTorch (`conda`, `pip`, source):
- If compiling from source, the output of `torch.__config__.show()`:
- Any other relevant information:
Additional context