TPU training memory issues #9459
Unanswered
tuner007 asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Replies: 0 comments
Hi,
I am trying to train a large language model (xlm-roberta-large) on a TPU with 8 cores, but it goes OOM on both Kaggle and Colab.
With precision=16 and checkpoint_callback=True, training itself works, but it still goes OOM while saving the model.
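For reference, a minimal sketch of the setup described above, assuming Trainer flags from around the Lightning 1.4 release (`checkpoint_callback` was renamed `enable_checkpointing` in later versions; `MyLitModule` is a hypothetical LightningModule wrapping xlm-roberta-large):

```python
import pytorch_lightning as pl

# Hypothetical LightningModule wrapping xlm-roberta-large.
model = MyLitModule("xlm-roberta-large")

trainer = pl.Trainer(
    tpu_cores=8,               # train on all 8 TPU cores
    precision=16,              # 16-bit precision (mapped to bfloat16 on TPU)
    checkpoint_callback=True,  # default ModelCheckpoint; this is where it OOMs
)
trainer.fit(model)
```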
From the discussion here, the APIs added by dlibenzi might help with this. Have they already been incorporated into Lightning?
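For context, the torch_xla API usually pointed to for this problem (I believe this is what that discussion refers to, though I may be wrong) is `xm.save`, which moves tensors to CPU on the master ordinal before serializing, so the full checkpoint never has to materialize in TPU device memory. A minimal sketch, assuming a plain state_dict save and an illustrative path:

```python
import torch_xla.core.xla_model as xm

def save_checkpoint(model, path="xlmr-large.ckpt"):  # path is illustrative
    # xm.save performs a rendezvous, so it is safe to call from all 8 TPU
    # processes; only the master ordinal moves the tensors to CPU and writes.
    xm.save(model.state_dict(), path)
```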
A lot of people use Kaggle and Colab to train large language models, and TPU training helps a lot, particularly with Lightning since it hardly needs any code changes. This memory issue is the major blocker, though, and it forces me back to TF.
Thanks !!