Closed
Labels: bug, distributed, help wanted, priority: 1, waiting on author
Description
🐛 Bug
Hello again!
Some days ago I opened this issue:
Code freezes before validation sanity check when using DDP
Basically, DDP wasn't working, and this turned out to be because Jupyter Notebook cannot use ddp as the accelerator.
So, a few days later, I re-ran my script, first in PyCharm and then in the terminal. I only made minor changes, such as using MADGRAD as the optimizer, nothing more.
Even there I can't use DDP. I tried both gpus=2 and gpus=-1. This time the code freezes here:
COMET INFO: Experiment is live on comet.ml link
CometLogger will be initialized in offline mode
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Using native 16bit precision.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
COMET INFO: Experiment is live on comet.ml link
CometLogger will be initialized in offline mode
Using native 16bit precision.
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO: Data:
COMET INFO: display_summary_level : 1
COMET INFO: url : link
COMET INFO: Uploads:
COMET INFO: environment details : 1
COMET INFO: filename : 1
COMET INFO: installed packages : 1
COMET INFO: os packages : 1
COMET INFO: source_code : 1
COMET INFO: ---------------------------
COMET WARNING: Empty mapping given to log_params({}); ignoring
| Name | Type | Params
---------------------------------------
0 | encoder | Sequential | 1.3 M
1 | decoder | Sequential | 1.3 M
---------------------------------------
2.6 M Trainable params
0 Non-trainable params
2.6 M Total params
10.524 Total estimated model params size (MB)
Even though the log says "initializing ddp", only the first GPU is active; the others are idle:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 TITAN Xp Off | 00000000:05:00.0 On | N/A |
| 31% 50C P2 83W / 250W | 1114MiB / 12194MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 TITAN Xp Off | 00000000:06:00.0 Off | N/A |
| 24% 47C P2 60W / 250W | 37MiB / 12196MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 TITAN Xp Off | 00000000:09:00.0 Off | N/A |
| 24% 46C P2 64W / 250W | 13MiB / 12196MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 TITAN Xp Off | 00000000:0A:00.0 Off | N/A |
| 23% 40C P2 63W / 250W | 13MiB / 12196MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
Do you know what the problem could be?
The network isn't that big, so I could use only 1 GPU, or I could use dp or ddp_spawn, but those are not recommended.
- PyTorch Version: 1.8.1
- OS: Ubuntu 18.04
- How you installed PyTorch (conda, pip, source): conda
- Python version: 3.8
- CUDA/cuDNN version: 11.2
- GPU models and configuration: 4 x TITAN Xp 12GB
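For reference, the Trainer settings described in this report can be sketched roughly as below. The actual training script is not included in the issue, so the helper names `trainer_kwargs` and `build_trainer` are hypothetical, and the sketch assumes a PyTorch Lightning 1.x release that still accepts `gpus=` and `accelerator="ddp"`. The `precision=16` value is inferred from the "Using native 16bit precision." log line.

```python
# Hypothetical reconstruction of the Trainer settings mentioned in this report;
# the real script is not shown in the issue.

def trainer_kwargs(gpus=2, accelerator="ddp"):
    """Build keyword arguments for pytorch_lightning.Trainer.

    Assumes a Lightning 1.x release that still accepts `gpus=` and
    `accelerator="ddp"`.
    """
    allowed = {"ddp", "ddp_spawn", "dp"}
    if accelerator not in allowed:
        raise ValueError(f"unexpected accelerator: {accelerator!r}")
    return {
        "gpus": gpus,                # 2 and -1 (all visible GPUs) were both tried
        "accelerator": accelerator,  # "ddp" is the mode that freezes
        "precision": 16,             # inferred from "Using native 16bit precision."
    }


def build_trainer(**overrides):
    """Create the Trainer; Lightning is imported lazily so trainer_kwargs works without it."""
    import pytorch_lightning as pl
    return pl.Trainer(**trainer_kwargs(**overrides))
```

Under these assumptions, `build_trainer(gpus=-1)` corresponds to the second configuration tried, and `build_trainer(accelerator="ddp_spawn")` to one of the fallbacks mentioned above.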