"resume from checkpoint" lead to CUDA out of memory #11563

Defiler24 · 2022-01-21T04:26:35Z

When I use "resume from checkpoint", I get a "CUDA out of memory" error.
When loading a checkpoint manually with torch.load(), setting map_location to "cpu" solves this problem.
What should I do in the "resume from checkpoint" scenario?
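
For reference, the manual torch.load() workaround mentioned above can be sketched as follows. This is only a minimal illustration: `load_checkpoint_on_cpu` is a hypothetical helper name, not part of PyTorch or Lightning, and it covers manual loading only. It does not by itself change how the Trainer restores a checkpoint.

```python
def load_checkpoint_on_cpu(path):
    """Load a checkpoint so that every tensor lands in CPU memory."""
    import torch  # imported lazily so the helper can be defined without torch installed

    # map_location="cpu" remaps tensors that were saved from CUDA devices
    # onto the CPU, so deserializing the file does not allocate GPU memory.
    return torch.load(path, map_location="cpu")
```

A call such as `ckpt = load_checkpoint_on_cpu("path/to/your.ckpt")` (placeholder path) returns the saved object; for a Lightning checkpoint this is normally a dict with the model weights under the "state_dict" key.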

Answered by Defiler24

Jan 24, 2022

I solved the problem after setting the strategy to 'ddp'.

four4fish · 2022-01-21T18:09:18Z

@Defiler24 Could you share which training strategy (plugin) you are using? Or could you share your code here?

1 reply

Defiler24 Jan 22, 2022
Author

I did not set a specific strategy, so the default should be DDP spawn.

data = DWheel(**vars(args))
model = MWheel(**vars(args))

trainer = Trainer(
    max_epochs=480,
    gpus=8,
    sync_batchnorm=True,
    resume_from_checkpoint="../version0/checkpoint/epoch=150.ckpt",
)
trainer.fit(model, data)

Defiler24 · 2022-01-24T08:04:32Z

Author

I solved the problem after setting the strategy to 'ddp'.
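
Applied to the Trainer call posted earlier in the thread, the fix might look roughly like the sketch below. The only change is the added strategy="ddp". `build_trainer_kwargs` and `resume_training` are hypothetical helper names introduced for illustration, and the keyword arguments assume the pytorch_lightning version used in the thread; newer releases may expect the checkpoint to be passed as trainer.fit(..., ckpt_path=...) instead.

```python
def build_trainer_kwargs(ckpt_path, num_gpus=8, max_epochs=480):
    """Collect the Trainer arguments from the thread, plus the 'ddp' strategy."""
    return {
        "max_epochs": max_epochs,
        "gpus": num_gpus,
        "sync_batchnorm": True,
        # The reported fix: request DDP explicitly instead of relying on the default.
        "strategy": "ddp",
        "resume_from_checkpoint": ckpt_path,
    }


def resume_training(model, data, ckpt_path):
    """Resume fitting from ckpt_path (assumes the Trainer API used in the thread)."""
    from pytorch_lightning import Trainer  # imported lazily; needs pytorch_lightning

    trainer = Trainer(**build_trainer_kwargs(ckpt_path))
    trainer.fit(model, data)
    return trainer
```

With the objects from the original snippet, this would be called as `resume_training(model, data, "../version0/checkpoint/epoch=150.ckpt")`.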

YueyangLiulyy · 2022-08-10T06:06:15Z

I have exactly the same issue; the only difference is that my model is trained on a single GPU. I did not specify any strategy either. Is there any solution? Thanks.
