[help / suggestion needed!] Multi-GPU training freezes randomly after validation #7516
ConvMech asked this question in DDP / multi-GPU / multi-node
Problem:
I am using DDP for multi-GPU training. The training process may randomly stop after validation, without any error message. When I check with `nvidia-smi`, there should normally be 1 master process and 3 other processes, but when the problem happens only 2 other processes remain.
Device:
Training on Google Cloud Platform with 4 × V100 GPUs.
Log:
As shown below, one GPU stays at 0% utilization, and where there should normally be 4 processes there are only 3:
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:00:04.0 Off | 0 |
| N/A 41C P0 69W / 300W | 15412MiB / 16160MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... Off | 00000000:00:05.0 Off | 0 |
| N/A 36C P0 33W / 300W | 11MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... Off | 00000000:00:06.0 Off | 0 |
| N/A 40C P0 66W / 300W | 15926MiB / 16160MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... Off | 00000000:00:07.0 Off | 0 |
| N/A 38C P0 57W / 300W | 15926MiB / 16160MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 4628 C python 15407MiB |
| 2 N/A N/A 4720 C ...mmy/miniconda3/bin/python 15921MiB |
| 3 N/A N/A 4754 C ...mmy/miniconda3/bin/python 15921MiB |
+-----------------------------------------------------------------------------+
```
I am validating every 25% of an epoch, i.e. 4 times per epoch. The freeze has consistently happened right after a validation phase finishes:
```
Epoch 1: 25%|██████████▊ | 3997/15986 [2:09:02<6:27:05, 1.94s/it, loss=97.8, v_num=4xuc]
```
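For context, this is roughly the kind of setup I am running (a minimal sketch, assuming PyTorch Lightning, which matches the `v_num` field in the progress bar above; `MyModel`, `MyDataModule`, and the exact Trainer arguments are placeholders, not my real code):

```python
import pytorch_lightning as pl

from my_project import MyModel, MyDataModule  # hypothetical stand-ins for my real code

model = MyModel()
datamodule = MyDataModule()

trainer = pl.Trainer(
    gpus=4,                   # 4 x V100 on the GCP instance
    accelerator="ddp",        # one training process per GPU
    val_check_interval=0.25,  # validate every 25% of an epoch (4 times per epoch)
    max_epochs=10,            # placeholder value
)
trainer.fit(model, datamodule=datamodule)
```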
Thoughts:
Is there anything that could cause one of the multi-GPU processes to stop, such as exceeding GPU memory? If I want to debug this further, what should I do to collect more information? Any help would be greatly appreciated!
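To collect more information, one thing I am considering is dumping the Python stack of every surviving rank when the hang happens, plus turning on NCCL logging. A rough sketch (untested; `LOCAL_RANK`, the trace file names, and the timeout are assumptions on my side), placed at the top of the training script:

```python
import faulthandler
import os
import signal

# NCCL's own logging, so a hanging collective leaves a trace in the job logs.
# Needs to be set before the process group / NCCL is initialized.
os.environ.setdefault("NCCL_DEBUG", "INFO")

# Dump the Python traceback of every thread in this process to a per-rank file
# at a regular interval, so a stuck rank shows where it is waiting.
rank = os.environ.get("LOCAL_RANK", "0")  # set by the DDP launcher (assumption)
trace_file = open(f"hang_trace_rank{rank}.log", "w")
faulthandler.dump_traceback_later(timeout=1800, repeat=True, file=trace_file)

# Also allow an on-demand dump from the shell via `kill -USR1 <pid>`.
faulthandler.register(signal.SIGUSR1, file=trace_file, all_threads=True)
```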