CPU memory leak when using PL on GCP TPUs #7702
Unanswered
Natithan asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
1 comment · 6 replies
-
Hi @Natithan, thanks for posting this, I will give it a look early next week. Did you give the new TPU VMs a try? Let me know if the problem still persists.
-
Hi,
I've been stuck for quite some time on this.
I am training a ViLBERT-like model, and because each training run takes quite a long time, I am running it on Google Cloud TPUs in the hope of speeding it up.
This doesn't work, however: during training the CPU memory slowly increases until the process is killed by the VM's OOM killer.
The TPU memory, on the other hand, is not filling up (see image below).
I have 64 GB of CPU RAM available and am using a v3-8 TPU.
As I didn't observe this increase in CPU memory while training on the GPUs in my lab, I suspect it has to do with how I am using the TPUs.
I'm using pytorch-lightning to move my code to TPUs, which is why I'm hoping someone on this forum can help me :).
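For context, here is a stripped-down sketch of how I set things up; the module below is a dummy stand-in for my actual model and the hyperparameters are placeholders, but the Trainer/TPU configuration is the relevant part:

```python
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset


# Dummy stand-in for my actual LightningModule, just to show the TPU setup.
class DummyModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.cross_entropy(self.layer(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


if __name__ == "__main__":
    dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 2, (1024,)))
    loader = DataLoader(dataset, batch_size=64, num_workers=4)
    # tpu_cores=8 makes Lightning spawn one process per core of the v3-8.
    trainer = pl.Trainer(tpu_cores=8, max_epochs=10)
    trainer.fit(DummyModule(), loader)
```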
I also checked the memory usage when using one TPU core (rather than 8), and there too the memory increases per step (see image).
For that run, I also tracked memory usage with mprof (see image).
What is odd is that this gave me a very different graph, one in which there apparently is no leak. I feel like this should give me a hint, but I'm not sure what :).
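In case it helps, this is roughly how per-step CPU memory can be recorded from inside the training loop; the psutil-based callback below is an approximation of what I'm doing, not my exact tracking code, and it only sees the RSS of the process it runs in (which might be part of why different tools show different curves):

```python
import psutil
import pytorch_lightning as pl


# Approximate sketch: print the resident set size (RSS) of the current process
# every 50 training batches. In the 8-core run, each spawned process only
# reports its own RSS.
class CPUMemoryMonitor(pl.Callback):
    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx, dataloader_idx):
        if batch_idx % 50 == 0:
            rss_mb = psutil.Process().memory_info().rss / 1024 ** 2
            print(f"step {trainer.global_step}: rss={rss_mb:.1f} MB")
```

It gets attached with `Trainer(..., callbacks=[CPUMemoryMonitor()])`.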
I've also tried using memory_profiler to see the memory changes line by line.
Because the program is multi-process, however, this shows quite a bit of seemingly unrelated memory fluctuation.
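For completeness, the line-by-line check looks roughly like this; the function here is illustrative only (not my actual training step), it just shows how memory_profiler's @profile decorator is applied:

```python
import torch
from memory_profiler import profile


# Illustrative only: @profile prints a per-line memory report each time the
# decorated function runs. With 8 spawned TPU processes, every process prints
# its own report, so the per-line increments are noisy and hard to attribute.
@profile
def run_step(model, batch):
    inputs, targets = batch
    outputs = model(inputs)
    loss = torch.nn.functional.cross_entropy(outputs, targets)
    return loss
```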
I did a bit of googling around; here are some related threads that didn't solve my problem:
If anyone has had a similar problem and could point me in the right direction (or suggest how best to find the cause), that would be great!