multigpu ddp: Code after fit executed many times #8678

@johngull


🐛 Bug

After training a model with Trainer.fit on a 4-GPU machine with accelerator="ddp", the code that follows the fit call is executed 3 (?) times.
I get 2 "FileNotFoundError" exceptions, followed by the printout that the weights were saved successfully.

To Reproduce

....
trainer = pl.Trainer(
    gpus=-1,
    precision=16 if train_opt.get("fp16", False) else 32,
    accelerator="ddp",
    accumulate_grad_batches=train_opt.get("grad_accum", 1),
    max_epochs=train_opt.get("epochs", 20),
    default_root_dir=train_opt.get("root_dir", None),
    callbacks=callbacks,
    logger=logger,
    log_every_n_steps=1,
)
....
trainer.fit(model, dataloaders[0], dataloaders[1])
if trainer.state.status != TrainerStatus.FINISHED:
    raise InterruptedError()

path = checkpoint_callback.best_model_path

os.makedirs(os.path.dirname(target_path), exist_ok=True)
model.load_state_dict(torch.load(str(path))["state_dict"])
torch.save(model.model.state_dict(), target_path)

Expected behavior

A single execution of the code after trainer.fit
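
As an illustration of that expectation, here is a minimal sketch of guarding the post-fit code so that only one process runs it. It assumes that each DDP process can read its rank from the `RANK` or `LOCAL_RANK` environment variable and that a process with neither set is rank 0. The helpers `is_rank_zero` and `export_best_weights` are hypothetical names, not part of pytorch-lightning.

```python
import os


def is_rank_zero(environ=None):
    """Return True when this process looks like the rank-0 DDP process."""
    env = os.environ if environ is None else environ
    # Assumption: workers carry their rank in RANK or LOCAL_RANK;
    # a process with neither variable set is treated as rank 0.
    for key in ("RANK", "LOCAL_RANK"):
        if key in env:
            return int(env[key]) == 0
    return True


def export_best_weights(save_fn, environ=None):
    """Call save_fn only on rank 0 and report whether it ran."""
    if not is_rank_zero(environ):
        return False
    save_fn()
    return True
```

In the script above, the block after trainer.fit (loading the best checkpoint and calling torch.save) would be the body of `save_fn`. Inside Lightning, checking `trainer.is_global_zero` should serve the same purpose. Note that `LOCAL_RANK` is per node, so on a multi-node run this sketch would let one process per node through unless `RANK` is set.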

Environment

  • CUDA:
    • GPU:
      • Tesla V100-SXM2-16GB
      • Tesla V100-SXM2-16GB
      • Tesla V100-SXM2-16GB
      • Tesla V100-SXM2-16GB
    • available: True
    • version: 10.1
  • Packages:
    • numpy: 1.18.5
    • pyTorch_debug: False
    • pyTorch_version: 1.6.0
    • pytorch-lightning: 1.4.0rc0
    • tqdm: 4.61.2
  • System:
    • OS: Linux
    • architecture:
      • 64bit
    • processor: x86_64
    • python: 3.7.7
    • version: #1 SMP Tue May 11 20:50:07 UTC 2021

Labels: bug, distributed, help wanted, priority: 1
