Unable to load deepspeed checkpoint #12246
-
Hey @lanx7! Is it possible to share a reproducible example using https://colab.research.google.com/github/PyTorchLightning/pytorch-lightning/blob/master/pl_examples/bug_report/bug_report_model.ipynb? Also, can you try with master too? This might have been fixed recently, and the next release is coming soon.
-
Hi @rohitgr7 and other collaborators. After some investigation, I found that the checkpoint file produced by 'zero_to_fp32.py' or convert_zero_checkpoint_to_fp32_state_dict() is a standard PyTorch checkpoint file rather than a PyTorch Lightning checkpoint. I have succeeded in loading the model using the standard PyTorch API.
However, I still have two issues.
The model's outputs are similar to the ones I got without DeepSpeed, but I am still suspicious because of some issues. Can you confirm whether this method is correct?
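For reference, here is a minimal sketch of how the consolidated file's contents could be normalized before loading them with the standard PyTorch API. It is not the exact code used in this thread. The helper name `extract_state_dict`, the `"module."` prefix, and the idea that the weights may sit under a `"state_dict"` key are assumptions for illustration; plain lists stand in for tensors so the snippet runs without a GPU or a real checkpoint.

```python
from collections import OrderedDict
from typing import Any, Mapping


def extract_state_dict(checkpoint: Mapping[str, Any], prefix: str = "") -> "OrderedDict[str, Any]":
    """Return a plain name -> weights mapping (hypothetical helper, not a library API)."""
    # Assumption: the weights are either stored directly at the top level
    # or wrapped under a "state_dict" key next to other metadata.
    weights = checkpoint["state_dict"] if "state_dict" in checkpoint else checkpoint
    cleaned: "OrderedDict[str, Any]" = OrderedDict()
    for name, value in weights.items():
        # Optionally strip a wrapper prefix so the names match the bare model.
        # The prefix is supplied by the caller; "module." below is only an example.
        if prefix and name.startswith(prefix):
            name = name[len(prefix):]
        cleaned[name] = value
    return cleaned


# Stand-in for the object returned by torch.load(...); lists replace tensors here.
fake_checkpoint = {
    "state_dict": {"module.layer.weight": [1.0, 2.0], "module.layer.bias": [0.0]},
    "epoch": 3,
}
state_dict = extract_state_dict(fake_checkpoint, prefix="module.")
print(list(state_dict.keys()))
```

With a real file, the result would typically be passed on as `model.load_state_dict(extract_state_dict(torch.load("final.pt", map_location="cpu")))`. Any mismatch between the checkpoint's parameter names and the model's then shows up as missing or unexpected keys, which is one way to check whether the loading method is correct.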
-
Hi.
I ran into an error while testing DeepSpeed Stage 3 and would like to ask for help.
I saved the checkpoint and consolidated it into a single file according to the guide.
When I loaded the file (final.pt) with the load_from_checkpoint() function, the following error occurred.
Please refer to my code below.
Library Version:
pytorch-ignite==0.4.8
pytorch-lightning==1.5.10
torch==1.10.2+cu113
torchaudio==0.10.2+cu113
torchmetrics==0.7.2
torchtext==0.11.2
torchvision==0.11.3+cu113
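One way to narrow down this kind of failure is to check what the consolidated file actually contains before calling load_from_checkpoint(). The sketch below is only an illustration: the helper name `looks_like_lightning_checkpoint` is hypothetical, and the marker keys are an assumption based on what Lightning 1.x checkpoints usually carry. Plain dictionaries stand in for the loaded file.

```python
from typing import Any, Mapping

# Assumption: a Lightning checkpoint stores its weights under "state_dict"
# and records the library version under "pytorch-lightning_version".
LIGHTNING_MARKERS = ("state_dict", "pytorch-lightning_version")


def looks_like_lightning_checkpoint(checkpoint: Any) -> bool:
    """Heuristic check (hypothetical helper): are all marker keys present?"""
    return isinstance(checkpoint, Mapping) and all(
        key in checkpoint for key in LIGHTNING_MARKERS
    )


# Stand-ins for objects returned by torch.load(...); lists replace tensors.
lightning_style = {
    "state_dict": {"layer.weight": [1.0]},
    "pytorch-lightning_version": "1.5.10",
    "epoch": 0,
}
plain_weights = {"layer.weight": [1.0], "layer.bias": [0.0]}

print(looks_like_lightning_checkpoint(lightning_style))  # True
print(looks_like_lightning_checkpoint(plain_weights))    # False
```

For the real file, the object to inspect would come from `torch.load("final.pt", map_location="cpu")`. If the check fails, the file is probably a plain weights file, and loading it with `model.load_state_dict(...)` may be more appropriate than load_from_checkpoint().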