
Reconciliate processes masks errors in DDP plugin if _sync_dir isn't initialized #9263

@four4fish

πŸ› Bug

If a model has no trainable parameters and the user calls trainer.test(), Lightning raises an exception because the DDP module construction fails with an error that no parameters with requires_grad were found.
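
For illustration, a minimal sketch of a module with no trainable parameters that triggers this (the FrozenModule name and the commented-out Trainer call are hypothetical, assuming a standard LightningModule run with the DDP plugin):

```python
from torch import nn
import pytorch_lightning as pl


class FrozenModule(pl.LightningModule):
    """Hypothetical example: every parameter has requires_grad=False,
    so DistributedDataParallel refuses to wrap the module."""

    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(4, 2)
        for p in self.parameters():
            p.requires_grad = False

    def test_step(self, batch, batch_idx):
        return self.layer(batch).sum()


# trainer = pl.Trainer(accelerator="ddp", gpus=2)
# trainer.test(FrozenModule(), dataloaders=some_test_loader)  # DDP wrapping raises here
```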

This error is compounded by the process reconciliation check: when configure_ddp raises an exception, _share_information_to_prevent_deadlock in the DDP plugin is never called.

```python
def pre_dispatch(self):
    # move the model to the correct device
    self.model_to_device()

    if self.sync_batchnorm:
        self.model = self.configure_sync_batchnorm(self.model)

    # raises an exception here because there are no trainable params
    self.configure_ddp()

    # _sync_dir is initialized here, but this line is never reached
    self._share_information_to_prevent_deadlock()
```

As a result, _sync_dir is never initialized.

As part of the exception handling, we call reconciliate_processes: https://github.com/PyTorchLightning/pytorch-lightning/blob/35876bb75f27eb8f44220afd4bfa757a0432d233/pytorch_lightning/trainer/trainer.py#L519

which causes this path creation to fail:

```python
def reconciliate_processes(self, trace: str):
    if self.world_size < 2:
        return

    sync_dir = self._sync_dir

    # The cluster may be configured to periodically purge the `/tmp`
    # directory, in which case `sync_dir` may not exist anymore at this
    # point. Idempotently create it to ensure its existence.
    Path(sync_dir).mkdir(parents=True, exist_ok=True)
```
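
Because _sync_dir is only assigned inside _share_information_to_prevent_deadlock, it is still None at this point, and Path(None) raises a TypeError that masks the original exception. A quick standalone demonstration of the failure:

```python
from pathlib import Path

sync_dir = None  # _sync_dir was never set because configure_ddp raised first
Path(sync_dir).mkdir(parents=True, exist_ok=True)
# TypeError: argument should be a str or an os.PathLike object ..., not <class 'NoneType'>
```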

@ananthsub pointed this out in #9096.

To Reproduce

Expected behavior

  1. Initialize _sync_dir before configure_ddp is called, or
  2. Check that _sync_dir is not None before creating the path (see the sketch below).
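
A minimal sketch of option 2 (hypothetical, not the actual patch), guarding reconciliate_processes in the DDP plugin against an uninitialized _sync_dir:

```python
from pathlib import Path

from pytorch_lightning.utilities import rank_zero_warn


def reconciliate_processes(self, trace: str):
    if self.world_size < 2:
        return

    sync_dir = self._sync_dir

    # configure_ddp may have raised before _share_information_to_prevent_deadlock
    # ran, in which case _sync_dir is still None. Skip the reconciliation rather
    # than masking the original exception with a TypeError from Path(None).
    if not sync_dir:
        rank_zero_warn("Error handling mechanism for deadlock detection is uninitialized. Skipping check.")
        return

    Path(sync_dir).mkdir(parents=True, exist_ok=True)
    # ... remaining reconciliation logic unchanged ...
```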

Environment

  • PyTorch Lightning Version (e.g., 1.3.0):
  • PyTorch Version (e.g., 1.8)
  • Python version:
  • OS (e.g., Linux):
  • CUDA/cuDNN version:
  • GPU models and configuration:
  • How you installed PyTorch (conda, pip, source):
  • If compiling from source, the output of torch.__config__.show():
  • Any other relevant information:

Additional context
