
Spending a lot of time loading a large ckpt #21017

@5o1

Description


Update:

It looks like I initialized the model in the wrong hook. The issue is now resolved.
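For reference, a minimal sketch of the kind of change I mean, assuming the heavy weights were being loaded in `__init__` (the module layout, helper names, and checkpoint path below are made up for illustration): defer the expensive construction to the `configure_model()` hook, which Lightning calls at the appropriate point during setup.

```python
import lightning as L
import torch
from torch import nn


class MyTask(L.LightningModule):
    def __init__(self, backbone_ckpt: str):
        super().__init__()
        self.save_hyperparameters()
        # Don't build or load the heavy model here; __init__ also runs in
        # contexts where the full weights are not actually needed.
        self.model = None

    def configure_model(self) -> None:
        # Lightning calls this once during setup on each rank; the guard
        # keeps it idempotent if the hook runs again.
        if self.model is not None:
            return
        self.model = nn.Linear(512, 512)  # stand-in for the real network
        state = torch.load(
            self.hparams.backbone_ckpt, map_location="cpu", weights_only=True
        )
        self.model.load_state_dict(state)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.model(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```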

======

When the ckpt file is particularly large (over 2 GB), it seems that each rank reads the ckpt from disk separately. As a result, multi-GPU jobs become severely IO-bound.
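To illustrate what I mean (I have not verified what Lightning does internally): if every rank ends up calling `torch.load` on the same multi-GB file, total read time grows with the number of ranks. A possible workaround sketch, assuming a process group is already initialized (the function name is hypothetical), is to read the file on rank 0 only and broadcast the state dict:

```python
import torch
import torch.distributed as dist


def load_state_dict_once(path: str) -> dict:
    """Read a large checkpoint on rank 0 only and broadcast it, so that
    N ranks do not all hit the shared filesystem for the same file."""
    if dist.is_available() and dist.is_initialized():
        obj = [None]
        if dist.get_rank() == 0:
            # mmap=True (PyTorch >= 2.1) maps the file instead of reading it eagerly.
            obj[0] = torch.load(path, map_location="cpu", mmap=True)
        # broadcast_object_list pickles the state dict and sends it to all ranks.
        dist.broadcast_object_list(obj, src=0)
        return obj[0]
    return torch.load(path, map_location="cpu")
```

Whether the broadcast is actually faster than N parallel reads depends on the interconnect versus the filesystem, so treat this as a sketch rather than a fix.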

Why is the ckpt so large in the first place? My model's parameters take less than 150 MB.

This is a serious problem for me. An 8-GPU job on Slurm takes 20 minutes just to initialize all the ranks. And every time training hits a NaN and I have to restart, I sit through the same wait again.

What version are you seeing the problem on?

v2.5

    Labels

    bug (Something isn't working), needs triage (Waiting to be triaged by maintainers), ver: 2.5.x
