Is there any pytorch_lightning version where DDP is reliable? #8177
Unanswered
yllgl
asked this question in DDP / multi-GPU / multi-node
Replies: 1 comment 1 reply
-
Hi! Was there anything in the log files? DDP is quite hard to get right. Lightning is continuously fixing and improving its DDP support, since DDP is a very good and popular choice for multi-GPU training, and we add new tests with every bugfix to avoid regressions. In general, I recommend the latest PL version paired with the latest PyTorch version.
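For context on why multi-node DDP on SLURM is fiddly: each process must work out its global rank, local rank, and world size from SLURM environment variables, which is roughly what Lightning's SLURM integration does under the hood. A minimal, hypothetical sketch (the helper name is ours; the environment variable names are standard SLURM):

```python
import os

def slurm_ddp_info(env=None):
    """Derive DDP process layout from SLURM environment variables.

    Hypothetical helper for illustration; Lightning's SLURM plugin
    reads these same variables in a similar way.
    """
    env = env or os.environ
    return {
        "global_rank": int(env.get("SLURM_PROCID", 0)),  # rank across all nodes
        "local_rank": int(env.get("SLURM_LOCALID", 0)),  # rank within this node
        "world_size": int(env.get("SLURM_NTASKS", 1)),   # total number of processes
        "node_rank": int(env.get("SLURM_NODEID", 0)),    # which node this is
    }

# Example: the 6th process in a 2-node x 4-GPU job like the one in the question
fake_env = {"SLURM_PROCID": "5", "SLURM_LOCALID": "1",
            "SLURM_NTASKS": "8", "SLURM_NODEID": "1"}
print(slurm_ddp_info(fake_env))
# {'global_rank': 5, 'local_rank': 1, 'world_size': 8, 'node_rank': 1}
```

If any of these variables are missing or inconsistent with the Trainer's `num_nodes`/device settings, processes can hang or die silently, which is one common source of "unreliable DDP" reports.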
-
I've been using DDP since it's the officially recommended strategy. I'm running the program on a cluster with 2 nodes and 8 GPUs, but after running for about a day it exits automatically with the error `slurmstepd: error: STEP 3311924.0 ON gn30 CANCELLED AT 2021-06-27T02:22:26 DUE TO TIME LIMIT`. I asked the cluster administrator and he said there is no time limit, so this error should not occur. I found a lot of DDP bugs in the issues, so I would like to ask: is there a more stable DDP version of pytorch_lightning?

Environment:
- CUDA:
  - GPU:
    - Tesla K80
    - Tesla K80
    - Tesla K80
    - Tesla K80
  - available: True
  - version: 10.2
- packages:
  - numpy: 1.18.5
  - pyTorch_debug: False
  - pyTorch_version: 1.8.0
  - pytorch-lightning: 1.2.10
  - tqdm: 4.47.0
- system:
  - OS: Linux
  - architecture:
    - 64bit
    - ELF
  - processor: x86_64
  - python: 3.8.3
  - version: SMP Mon Apr 23 15:52:50 CST 2018
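For reference, a job shaped like the one described (2 nodes, 4 GPUs each) is typically submitted with an sbatch script along these lines. This is a hedged sketch, not the poster's actual script: `train.py`, the job name, and the log path are placeholders. Requesting `--time` explicitly makes the scheduler's limit visible in `scontrol show job`, rather than relying on a partition default that the administrator may not be aware of:

```shell
#!/bin/bash
#SBATCH --job-name=ddp-train
#SBATCH --nodes=2                 # two machines
#SBATCH --ntasks-per-node=4      # one task (process) per GPU
#SBATCH --gres=gpu:4             # four GPUs per node
#SBATCH --time=2-00:00:00        # request 2 days explicitly
#SBATCH --output=%x-%j.log       # keep logs; TIME LIMIT cancellations appear here

# srun launches one process per task; Lightning's DDP reads
# SLURM_NTASKS, SLURM_PROCID, etc. to set up the process group.
srun python train.py
```

A mismatch between the script's node/task counts and the Trainer's `num_nodes` and GPU settings is another frequent cause of stuck or aborted multi-node runs, so it is worth checking both sides agree.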