Is there any pytorch_lightning version where DDP is reliable? #8177
Unanswered
yllgl
asked this question in DDP / multi-GPU / multi-node
Replies: 1 comment 1 reply
-
Hi! Was there anything in the log files? DDP is quite hard to get right. Lightning is continuously fixing and improving its DDP support, since DDP is a very good and popular choice for multi-GPU training, and we add new tests with every bugfix to avoid regressions. In general, I recommend the latest PL version paired with the latest PyTorch version.
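For context on why multi-node DDP on SLURM is fiddly: each process must work out its global rank, local rank, and world size from SLURM environment variables, which is roughly what Lightning's SLURM integration does under the hood. A minimal, hypothetical sketch (the helper name is ours; the environment variable names are standard SLURM):

```python
import os

def slurm_ddp_info(env=None):
    """Derive DDP process layout from SLURM environment variables.

    Hypothetical helper for illustration; Lightning's SLURM plugin
    reads these same variables in a similar way.
    """
    env = env or os.environ
    return {
        "global_rank": int(env.get("SLURM_PROCID", 0)),  # rank across all nodes
        "local_rank": int(env.get("SLURM_LOCALID", 0)),  # rank within this node
        "world_size": int(env.get("SLURM_NTASKS", 1)),   # total number of processes
        "node_rank": int(env.get("SLURM_NODEID", 0)),    # which node this is
    }

# Example: the 6th process in a 2-node x 4-GPU job like the one in the question
fake_env = {"SLURM_PROCID": "5", "SLURM_LOCALID": "1",
            "SLURM_NTASKS": "8", "SLURM_NODEID": "1"}
print(slurm_ddp_info(fake_env))
# {'global_rank': 5, 'local_rank': 1, 'world_size': 8, 'node_rank': 1}
```

If any of these variables are missing or inconsistent with the Trainer's `num_nodes`/device settings, processes can hang or die silently, which is one common source of "unreliable DDP" reports.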
-
I've been using DDP since it's the officially recommended strategy. I'm running the program on a cluster with 2 nodes and 8 GPUs, but after running for about a day it exits automatically with the error `slurmstepd: error: STEP 3311924.0 ON gn30 CANCELLED AT 2021-06-27T02:22:26 DUE TO TIME LIMIT`. I asked the cluster administrator and he said there is no time limit, so this error should not occur. I found a lot of DDP bugs in the issues, so I would like to ask: is there a more stable DDP version of pytorch_lightning?

Environment:
- CUDA:
  - GPU:
    - Tesla K80
    - Tesla K80
    - Tesla K80
    - Tesla K80
  - available: True
  - version: 10.2
- packages:
  - numpy: 1.18.5
  - pyTorch_debug: False
  - pyTorch_version: 1.8.0
  - pytorch-lightning: 1.2.10
  - tqdm: 4.47.0
- system:
  - OS: Linux
  - architecture:
    - 64bit
    - ELF
  - processor: x86_64
  - python: 3.8.3
  - version: SMP Mon Apr 23 15:52:50 CST 2018
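For reference, a job shaped like the one described (2 nodes, 4 GPUs each) is typically submitted with an sbatch script along these lines. This is a hedged sketch, not the poster's actual script: `train.py`, the job name, and the log path are placeholders. Requesting `--time` explicitly makes the scheduler's limit visible in `scontrol show job`, rather than relying on a partition default that the administrator may not be aware of:

```shell
#!/bin/bash
#SBATCH --job-name=ddp-train
#SBATCH --nodes=2                 # two machines
#SBATCH --ntasks-per-node=4      # one task (process) per GPU
#SBATCH --gres=gpu:4             # four GPUs per node
#SBATCH --time=2-00:00:00        # request 2 days explicitly
#SBATCH --output=%x-%j.log       # keep logs; TIME LIMIT cancellations appear here

# srun launches one process per task; Lightning's DDP reads
# SLURM_NTASKS, SLURM_PROCID, etc. to set up the process group.
srun python train.py
```

A mismatch between the script's node/task counts and the Trainer's `num_nodes` and GPU settings is another frequent cause of stuck or aborted multi-node runs, so it is worth checking both sides agree.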