One GPU 50% higher memory utilization than others? #12634
Unanswered
exnx asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Replies: 1 comment
-
Duplicate of #12651. Let's track in the issue to avoid scattering information.
-
Hi, I am training ConvNext with 8 A100 GPUs on GCP. When I train the standard ConvNext-Base, all 8 GPUs are utilized fairly evenly. However, I have a new layer (from the S4 paper) that I am swapping in for the grouped convolution in ConvNext. It runs, but one GPU sits at 39/40 GB while the other 7 sit at 25/40 GB, which greatly limits my training batch size and, in turn, makes training much slower and more costly in cloud credits.
While investigating, here is something I found: the first GPU has 6 extra processes of roughly 2 GB each that presumably should have been distributed across the other GPUs but, for whatever reason, ended up stuck on the first GPU.
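If it helps anyone reading this, my unverified guess is that those extra ~2 GB processes are stray CUDA contexts created when each DDP rank touches cuda:0 before it has been pinned to its own device, for example by a torch.load of pretrained weights without a map_location. A minimal sketch of the pattern I am checking for (checkpoint_path is a placeholder, not my actual setup):

```python
import os
import torch

# Lightning's DDP launcher sets LOCAL_RANK per process.
local_rank = int(os.environ.get("LOCAL_RANK", 0))

# Pin this process to its own GPU before doing any CUDA work;
# otherwise its default CUDA context is created on cuda:0.
torch.cuda.set_device(local_rank)

# torch.load without map_location restores saved CUDA tensors onto the
# device they were saved from (often cuda:0), so every rank can end up
# with an extra ~1-2 GB allocation there. Loading to CPU first avoids it.
checkpoint_path = "s4_layer_weights.pt"  # placeholder path
state_dict = torch.load(checkpoint_path, map_location="cpu")
```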
Beyond that, I am not sure how else to go about finding the source of the issue.
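The only other thing I have tried is logging per-rank memory from inside the training step to see where the allocations diverge; log_gpu_memory below is just a small helper I wrote for that, not a Lightning API:

```python
import torch
import torch.distributed as dist

def log_gpu_memory(tag: str = "") -> None:
    """Print this rank's current and peak allocation on its own device."""
    rank = dist.get_rank() if dist.is_initialized() else 0
    device = torch.cuda.current_device()
    allocated = torch.cuda.memory_allocated(device) / 2**30
    reserved = torch.cuda.memory_reserved(device) / 2**30
    peak = torch.cuda.max_memory_allocated(device) / 2**30
    print(f"[rank {rank} | cuda:{device}] {tag} "
          f"allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB, "
          f"peak={peak:.2f} GiB")
```

Comparing those numbers with the per-process list from nvidia-smi at least shows whether the imbalance comes from the model itself or from the extra processes parked on GPU 0.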
Does anyone have a hint on how to track down the problem? Any tips at all would be hugely appreciated. Thanks for your help.
I am using:
PyTorch Lightning 1.6, Python 3.8, PyTorch 1.11, CUDA 11.3.2.
Best,
Eric
