TPU training memory issues #9459
Unanswered
tuner007 asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Replies: 0 comments
Hi,
I am trying to train a large language model (xlm-roberta-large) on a TPU with 8 cores, but it goes OOM on both Kaggle and Colab.
With precision=16 and checkpoint_callback=True, training itself works, but it still goes OOM while saving the model.
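For reference, a minimal sketch of the setup described above, assuming Trainer flags from around the Lightning 1.4 release (`checkpoint_callback` was renamed `enable_checkpointing` in later versions; `MyLitModule` is a hypothetical LightningModule wrapping xlm-roberta-large):

```python
import pytorch_lightning as pl

# Hypothetical LightningModule wrapping xlm-roberta-large.
model = MyLitModule("xlm-roberta-large")

trainer = pl.Trainer(
    tpu_cores=8,               # train on all 8 TPU cores
    precision=16,              # 16-bit precision (mapped to bfloat16 on TPU)
    checkpoint_callback=True,  # default ModelCheckpoint; this is where it OOMs
)
trainer.fit(model)
```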
From the discussion here, the APIs added by dlibenzi might help with this. Have they already been incorporated into Lightning?
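For context, the torch_xla API usually pointed to for this problem (I believe this is what that discussion refers to, though I may be wrong) is `xm.save`, which moves tensors to CPU on the master ordinal before serializing, so the full checkpoint never has to materialize in TPU device memory. A minimal sketch, assuming a plain state_dict save and an illustrative path:

```python
import torch_xla.core.xla_model as xm

def save_checkpoint(model, path="xlmr-large.ckpt"):  # path is illustrative
    # xm.save performs a rendezvous, so it is safe to call from all 8 TPU
    # processes; only the master ordinal moves the tensors to CPU and writes.
    xm.save(model.state_dict(), path)
```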
A lot of people use Kaggle and Colab to train large language models, and TPU training helps a lot, particularly with Lightning since it hardly needs any code changes. This memory issue is the major blocker, though, and it forces me back to TF.
Thanks !!