
Code freezes when using DDP in terminal or PyCharm #7454

@notprime

🐛 Bug

Hello again!
A few days ago I opened this issue:

Code freezes before validation sanity check when using DDP

Basically, DDP wasn't working, and that turned out to be related to Jupyter Notebook being unable to use ddp as an accelerator.
So, a few days later, I tried to re-run my script, first in PyCharm and then from the terminal (I only made a few changes, such as switching to MADGRAD as the optimizer, nothing more).
Even there I can't use DDP. I tried both gpus=2 and gpus=-1. This time the code freezes here:

COMET INFO: Experiment is live on comet.ml link

CometLogger will be initialized in offline mode
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Using native 16bit precision.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
COMET INFO: Experiment is live on comet.ml link

CometLogger will be initialized in offline mode
Using native 16bit precision.
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : link
COMET INFO:   Uploads:
COMET INFO:     environment details : 1
COMET INFO:     filename            : 1
COMET INFO:     installed packages  : 1
COMET INFO:     os packages         : 1
COMET INFO:     source_code         : 1
COMET INFO: ---------------------------
COMET WARNING: Empty mapping given to log_params({}); ignoring

  | Name    | Type       | Params
---------------------------------------
0 | encoder | Sequential | 1.3 M 
1 | decoder | Sequential | 1.3 M 
---------------------------------------
2.6 M     Trainable params
0         Non-trainable params
2.6 M     Total params
10.524    Total estimated model params size (MB)

Even though it says initializing ddp, only the first GPU is doing any work; the others sit at 0% utilization:


+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  TITAN Xp            Off  | 00000000:05:00.0  On |                  N/A |
| 31%   50C    P2    83W / 250W |   1114MiB / 12194MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  TITAN Xp            Off  | 00000000:06:00.0 Off |                  N/A |
| 24%   47C    P2    60W / 250W |     37MiB / 12196MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  TITAN Xp            Off  | 00000000:09:00.0 Off |                  N/A |
| 24%   46C    P2    64W / 250W |     13MiB / 12196MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  TITAN Xp            Off  | 00000000:0A:00.0 Off |                  N/A |
| 23%   40C    P2    63W / 250W |     13MiB / 12196MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Do you know what the problem could be?
The network isn't that big, so I could fall back to a single GPU, or use dp or ddp_spawn instead, but those are not recommended.
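
If it helps narrow things down, a Lightning-independent NCCL sanity check along the lines of the sketch below (generic code, not taken from my script) would show whether raw torch.distributed communication between two of these GPUs works at all:

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank: int, world_size: int) -> None:
    # One process per GPU: join the NCCL process group and do a single all-reduce.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    t = torch.ones(1, device=f"cuda:{rank}")
    dist.all_reduce(t)  # sums across ranks; every rank should end up with world_size
    print(f"rank {rank}: all_reduce -> {t.item()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2  # same number of processes as gpus=2 in the Trainer
    mp.spawn(worker, args=(world_size,), nprocs=world_size, join=True)
```

If this also hangs at init_process_group or all_reduce, the problem is likely at the NCCL/driver level rather than in Lightning itself.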

  • PyTorch Version: 1.8.1
  • OS: Ubuntu 18.04
  • How you installed PyTorch (conda, pip, source): conda
  • Python version: 3.8
  • CUDA/cuDNN version: 11.2
  • GPU models and configuration: 4 x TITAN Xp 12GB
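
For completeness, the setup is roughly equivalent to the minimal sketch below. This is not my actual script: the layer sizes, dataloader, and optimizer are placeholders (I use MADGRAD and real data in the real code), and accelerator="ddp" assumes the Lightning 1.x Trainer API.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl
from pytorch_lightning.loggers import CometLogger


class LitAutoEncoder(pl.LightningModule):
    """Placeholder module: encoder/decoder Sequential blocks as in the summary above."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 3))
        self.decoder = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, 28 * 28))

    def training_step(self, batch, batch_idx):
        (x,) = batch
        x_hat = self.decoder(self.encoder(x))
        return nn.functional.mse_loss(x_hat, x)

    def configure_optimizers(self):
        # MADGRAD in the real script; plain Adam here to keep the sketch self-contained
        return torch.optim.Adam(self.parameters(), lr=1e-3)


if __name__ == "__main__":
    train_loader = DataLoader(TensorDataset(torch.randn(256, 28 * 28)), batch_size=32)
    logger = CometLogger(save_dir="comet_logs")  # no API key -> offline mode, as in the log
    trainer = pl.Trainer(
        gpus=2,              # also tried gpus=-1
        accelerator="ddp",   # hangs right after "initializing ddp"
        precision=16,        # "Using native 16bit precision." in the log
        logger=logger,
        max_epochs=1,
    )
    trainer.fit(LitAutoEncoder(), train_loader)
```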

Labels

bug (Something isn't working) · distributed (Generic distributed-related topic) · help wanted (Open to be worked on) · priority: 1 (Medium priority task) · waiting on author (Waiting on user action, correction, or update)
