Closed
Labels
bug (Something isn't working), needs triage (Waiting to be triaged by maintainers), ver: 2.5.x
Description
Update:
It looks like I initialized the model in the wrong hook. The issue is solved.
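The post doesn't say which hook was at fault or which one fixed it, so the following is only a hedged sketch of the deferred-construction pattern Lightning 2.x offers: build heavy submodules in `configure_model()` (guarded so repeated calls don't rebuild them) rather than in a hook that may run more than once per rank. The module name and layer sizes here are made up for illustration.

```python
import torch
import lightning as L


class LitModel(L.LightningModule):
    def __init__(self, hidden: int = 4096):
        super().__init__()
        self.save_hyperparameters()
        self.net = None  # defer construction to configure_model below

    def configure_model(self):
        # Called by the strategy before training starts; guard so repeated
        # calls do not rebuild (and duplicate) the network.
        if self.net is None:
            self.net = torch.nn.Sequential(
                torch.nn.Linear(self.hparams.hidden, self.hparams.hidden),
                torch.nn.ReLU(),
                torch.nn.Linear(self.hparams.hidden, 1),
            )

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.net(x).squeeze(-1), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```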
======
When the ckpt file is particularly large (over 2 GB), it seems that each rank reads the ckpt from disk separately, so multi-GPU jobs become severely IO-bound.
Why is the ckpt so large in the first place? My model's parameters are less than 150 MB.
This is a serious problem for me: an 8-GPU task on Slurm takes 20 minutes just to initialize all ranks, and every time a NaN appears in training I have to sit through that wait again for nothing.
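For anyone hitting the same symptom, a quick way to see why a checkpoint is 2 GB+ when the parameters are only ~150 MB is to load it on CPU and measure where the bytes are. This is a generic PyTorch sketch; the filename is a placeholder and the key layout is the usual Lightning checkpoint dict.

```python
import torch

# Placeholder path; point this at the oversized checkpoint.
# weights_only=False because Lightning checkpoints contain non-tensor objects.
ckpt = torch.load("last.ckpt", map_location="cpu", weights_only=False)

# Lightning checkpoints are plain dicts. Summing tensor sizes per top-level
# key usually shows whether the extra gigabytes live in the state_dict
# itself or somewhere else (optimizer states, callbacks, etc.).
for key, value in ckpt.items():
    if isinstance(value, dict):
        nbytes = sum(
            t.numel() * t.element_size()
            for t in value.values()
            if isinstance(t, torch.Tensor)
        )
        print(f"{key}: ~{nbytes / 1e6:.1f} MB in tensors")
    else:
        print(f"{key}: {type(value).__name__}")
```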
What version are you seeing the problem on?
v2.5