Closed
Labels
bug (Something isn't working), needs triage (Waiting to be triaged by maintainers), ver: 2.5.x
Description
Update:
It looks like I initialized the model in the wrong hook. The issue is solved.
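The post doesn't say which hook was at fault or which one fixed it, so the following is only a hedged sketch of the deferred-construction pattern Lightning 2.x offers: build heavy submodules in `configure_model()` (guarded so repeated calls don't rebuild them) rather than in a hook that may run more than once per rank. The module name and layer sizes here are made up for illustration.

```python
import torch
import lightning as L


class LitModel(L.LightningModule):
    def __init__(self, hidden: int = 4096):
        super().__init__()
        self.save_hyperparameters()
        self.net = None  # defer construction to configure_model below

    def configure_model(self):
        # Called by the strategy before training starts; guard so repeated
        # calls do not rebuild (and duplicate) the network.
        if self.net is None:
            self.net = torch.nn.Sequential(
                torch.nn.Linear(self.hparams.hidden, self.hparams.hidden),
                torch.nn.ReLU(),
                torch.nn.Linear(self.hparams.hidden, 1),
            )

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.net(x).squeeze(-1), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```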
======
When the ckpt file is particularly large (over 2 GB), it seems that each rank reads the ckpt from disk separately, so multi-GPU jobs become severely IO-bound.
Why is the ckpt so large in the first place? My model's parameters are less than 150 MB.
This is a serious problem for me: an 8-GPU task on Slurm takes 20 minutes just to initialize all ranks, and every time a NaN appears in training I have to sit through that wait again for nothing.
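For anyone hitting the same symptom, a quick way to see why a checkpoint is 2 GB+ when the parameters are only ~150 MB is to load it on CPU and measure where the bytes are. This is a generic PyTorch sketch; the filename is a placeholder and the key layout is the usual Lightning checkpoint dict.

```python
import torch

# Placeholder path; point this at the oversized checkpoint.
# weights_only=False because Lightning checkpoints contain non-tensor objects.
ckpt = torch.load("last.ckpt", map_location="cpu", weights_only=False)

# Lightning checkpoints are plain dicts. Summing tensor sizes per top-level
# key usually shows whether the extra gigabytes live in the state_dict
# itself or somewhere else (optimizer states, callbacks, etc.).
for key, value in ckpt.items():
    if isinstance(value, dict):
        nbytes = sum(
            t.numel() * t.element_size()
            for t in value.values()
            if isinstance(t, torch.Tensor)
        )
        print(f"{key}: ~{nbytes / 1e6:.1f} MB in tensors")
    else:
        print(f"{key}: {type(value).__name__}")
```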
What version are you seeing the problem on?
v2.5