[help / suggestion needed!] Multi-GPU training freezes randomly after validation #7516
ConvMech asked this question in DDP / multi-GPU / multi-node
Problem:
I am using DDP for multi-GPU training. The training process may randomly stop after validation, without any error message. When I check with `nvidia-smi`, there should normally be 1 master process and 3 other processes, but when the problem happens only 2 other processes remain.
Device:
Training on Google Cloud Platform with 4 × V100 GPUs.
Log:
As shown below, one GPU stays at 0% utilization, and where there should normally be 4 processes there are only 3:
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:00:04.0 Off | 0 |
| N/A 41C P0 69W / 300W | 15412MiB / 16160MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... Off | 00000000:00:05.0 Off | 0 |
| N/A 36C P0 33W / 300W | 11MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... Off | 00000000:00:06.0 Off | 0 |
| N/A 40C P0 66W / 300W | 15926MiB / 16160MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... Off | 00000000:00:07.0 Off | 0 |
| N/A 38C P0 57W / 300W | 15926MiB / 16160MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 4628 C python 15407MiB |
| 2 N/A N/A 4720 C ...mmy/miniconda3/bin/python 15921MiB |
| 3 N/A N/A 4754 C ...mmy/miniconda3/bin/python 15921MiB |
+-----------------------------------------------------------------------------+
```
I am validating every 25% of an epoch, i.e. 4 times per epoch. The freeze has consistently happened right after a validation phase finishes:
```
Epoch 1: 25%|██████████▊ | 3997/15986 [2:09:02<6:27:05, 1.94s/it, loss=97.8, v_num=4xuc]
```
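For context, this is roughly the kind of setup I am running (a minimal sketch, assuming PyTorch Lightning, which matches the `v_num` field in the progress bar above; `MyModel`, `MyDataModule`, and the exact Trainer arguments are placeholders, not my real code):

```python
import pytorch_lightning as pl

from my_project import MyModel, MyDataModule  # hypothetical stand-ins for my real code

model = MyModel()
datamodule = MyDataModule()

trainer = pl.Trainer(
    gpus=4,                   # 4 x V100 on the GCP instance
    accelerator="ddp",        # one training process per GPU
    val_check_interval=0.25,  # validate every 25% of an epoch (4 times per epoch)
    max_epochs=10,            # placeholder value
)
trainer.fit(model, datamodule=datamodule)
```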
Thoughts:
Is there anything that could cause one of the multi-GPU processes to stop, such as exceeding GPU memory? If I want to debug this further, what should I do to collect more information? Any help would be greatly appreciated!
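To collect more information, one thing I am considering is dumping the Python stack of every surviving rank when the hang happens, plus turning on NCCL logging. A rough sketch (untested; `LOCAL_RANK`, the trace file names, and the timeout are assumptions on my side), placed at the top of the training script:

```python
import faulthandler
import os
import signal

# NCCL's own logging, so a hanging collective leaves a trace in the job logs.
# Needs to be set before the process group / NCCL is initialized.
os.environ.setdefault("NCCL_DEBUG", "INFO")

# Dump the Python traceback of every thread in this process to a per-rank file
# at a regular interval, so a stuck rank shows where it is waiting.
rank = os.environ.get("LOCAL_RANK", "0")  # set by the DDP launcher (assumption)
trace_file = open(f"hang_trace_rank{rank}.log", "w")
faulthandler.dump_traceback_later(timeout=1800, repeat=True, file=trace_file)

# Also allow an on-demand dump from the shell via `kill -USR1 <pid>`.
faulthandler.register(signal.SIGUSR1, file=trace_file, all_threads=True)
```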