Closed
Labels: bug (Something isn't working), distributed (Generic distributed-related topic), help wanted (Open to be worked on), priority: 1 (Medium priority task)
Description
🐛 Bug
After training a model with Trainer.fit on a 4-GPU machine with accelerator="ddp", the code that follows the fit call is executed 3 (?) times.
I get two "FileNotFoundError" exceptions, followed by the message that the weights were saved successfully.
To Reproduce
....
trainer = pl.Trainer(
    gpus=-1,
    precision=16 if train_opt.get("fp16", False) else 32,
    accelerator="ddp",
    accumulate_grad_batches=train_opt.get("grad_accum", 1),
    max_epochs=train_opt.get("epochs", 20),
    default_root_dir=train_opt.get("root_dir", None),
    callbacks=callbacks,
    logger=logger,
    log_every_n_steps=1,
)
....
trainer.fit(model, dataloaders[0], dataloaders[1])
if trainer.state.status != TrainerStatus.FINISHED:
    raise InterruptedError()
path = checkpoint_callback.best_model_path
os.makedirs(os.path.dirname(target_path), exist_ok=True)
model.load_state_dict(torch.load(str(path))["state_dict"])
torch.save(model.model.state_dict(), target_path)
Expected behavior
A single execution of the code after trainer.fit
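A possible workaround, sketched under assumptions rather than offered as a confirmed fix: with accelerator="ddp" each GPU process runs the whole script, so the lines after trainer.fit run once per process. Restricting the export to the rank-0 process would give the single execution described above. The helper names below (is_rank_zero, export_once) are hypothetical, and reading the rank from the RANK / LOCAL_RANK environment variables is an assumption about how the processes were launched. Inside a Lightning script, checking trainer.is_global_zero is the more direct test.

```python
import os


def is_rank_zero(env=None):
    """Return True if this process looks like the rank-0 process.

    Assumption: the launcher exports RANK and/or LOCAL_RANK for each
    process. If neither is set, treat the run as single-process.
    """
    env = os.environ if env is None else env
    for key in ("RANK", "LOCAL_RANK"):
        if key in env:
            return int(env[key]) == 0
    return True


def export_once(save_fn, env=None):
    """Call save_fn only on rank 0 and report whether it was called.

    save_fn stands in for the post-fit steps from the report
    (load the best checkpoint, then torch.save the inner model).
    """
    if not is_rank_zero(env):
        return False  # other ranks skip the export entirely
    save_fn()
    return True
```

In the script from the report, the four lines after the status check would go into a function passed as save_fn, or simply sit under an `if trainer.is_global_zero:` block.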
Environment
- CUDA:
  - GPU:
    - Tesla V100-SXM2-16GB
    - Tesla V100-SXM2-16GB
    - Tesla V100-SXM2-16GB
    - Tesla V100-SXM2-16GB
  - available: True
  - version: 10.1
- Packages:
  - numpy: 1.18.5
  - pyTorch_debug: False
  - pyTorch_version: 1.6.0
  - pytorch-lightning: 1.4.0rc0
  - tqdm: 4.61.2
- System:
  - OS: Linux
  - architecture:
    - 64bit
  - processor: x86_64
  - python: 3.7.7
  - version: #1 SMP Tue May 11 20:50:07 UTC 2021