GPU memory usage increased with the use of pl.metrics #6612
Unanswered
skyshine102
asked this question in
Lightning Trainer API: Trainer, LightningModule, LightningDataModule
-
I was training a simple ResNet18 model with PyTorch Lightning and the pl.metrics classes. I used wandb to track GPU memory usage and found something weird; see the figure below:

[Figure: wandb GPU memory usage over training steps, two curves — one run with pl.metrics and one without]

We can see that (1) GPU memory usage increased after the first epoch in both curves, and (2) with pl.metrics (I used MetricCollection in my implementation), GPU memory usage keeps increasing with more training steps.
I did realize that since v1.2 the metric objects no longer clear their global state between epochs, so I called
self.metric.reset()
to prevent state accumulation, but I still see this memory increase. The code is attached: https://gist.github.com/skyshine102/83643e5499b780433cb0cdd617c4857d
PyTorch Lightning version: 1.2.1
Does anyone know how (1) and (2) can happen?

Replies: 1 comment
-
OK, I made a stupid mistake: my "training_epoch_end" hook was wrong, so problem (2) has been fixed. But (1) remains, and I still don't know why.
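For reference, a minimal sketch of the pattern discussed in this thread: update the MetricCollection in training_step and compute, log, and reset it once in training_epoch_end, so the metric state is cleared between epochs and no autograd graph is retained inside it. The class name, metric choice, and logging calls below are illustrative assumptions, not the code from the linked gist.

```python
# Sketch only (assumptions, not the original gist): update a MetricCollection per step,
# then compute/log/reset it once per epoch in training_epoch_end.
import torch
import pytorch_lightning as pl
from pytorch_lightning.metrics import Accuracy, MetricCollection  # pl.metrics in v1.2.x


class LitResNet(pl.LightningModule):
    def __init__(self, backbone):
        super().__init__()
        self.backbone = backbone                      # e.g. a torchvision resnet18
        self.criterion = torch.nn.CrossEntropyLoss()
        self.train_metrics = MetricCollection({"train_acc": Accuracy()})

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self.backbone(x)
        loss = self.criterion(logits, y)
        # update() only accumulates metric state; detach the predictions so no
        # autograd graph is kept alive inside the metric buffers
        self.train_metrics.update(logits.softmax(dim=-1).detach(), y)
        self.log("train_loss", loss)
        return loss

    def training_epoch_end(self, outputs):
        # compute over the accumulated epoch state, log it, then clear the state
        self.log_dict(self.train_metrics.compute())
        self.train_metrics.reset()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=1e-2)
```

Updating the metric with detached tensors and calling reset() right after compute() keeps per-epoch metric state from accumulating across steps, which is the growth described in (2); it does not explain the one-time increase after the first epoch in (1).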