I am using dpgen to train an ensemble of 4 models for 3,000,000 steps, but the HPC cluster queue only allows 2 days per job. How can I restart the training from the last step (1,000,000)? #1809
Replies: 3 comments
-
Restarting is supported by default.
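To illustrate what a restart can look like for a single training task, here is a minimal sketch in Python. It rests on two assumptions that should be checked against the installed versions: that the task directory contains TensorFlow-style checkpoint files named after `save_ckpt` (for example `model.ckpt.index`), and that DeePMD-kit's `dp train` accepts a `--restart` option taking the checkpoint prefix. The helper name `build_train_command` and the default file name `input.json` are hypothetical and not part of dpgen.

```python
from pathlib import Path


def build_train_command(task_dir, input_file="input.json", ckpt_prefix="model.ckpt"):
    """Return the argument list for a fresh or a restarted training run.

    Assumption: a usable checkpoint exists if `<ckpt_prefix>.index`
    is present in task_dir.
    """
    command = ["dp", "train", input_file]
    if (Path(task_dir) / f"{ckpt_prefix}.index").is_file():
        # Assumed flag: continue from the saved checkpoint instead of step 0.
        command += ["--restart", ckpt_prefix]
    return command


if __name__ == "__main__":
    # Hypothetical task directory; adjust to your own run.
    print(" ".join(build_train_command("iter.000000/00.train/000")))
```

Resubmitting the same job after the two-day limit would then pick up the checkpoint if one was written, and start from scratch otherwise.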
-
I have the same question. By default, dpgen restarts from the beginning of the last iteration recorded in record.dpgen, which is not what I want.
However, no example is provided. I am confused by the statement "such as removing the last iterations and recovering from one checkpoint" and have no idea how to do this. Could you please help me? Thanks a lot!
-
Question of @marcog2020460 : "How can I restart the training from the last step (1,000,000) to complete the remaining 2,000,000 steps?"
Question of @chenggoj : "You may also change it manually for your purpose, such as removing the last iterations and recovering from one checkpoint."
-
Summary
I am using dpgen to train an ensemble of 4 models for 3,000,000 steps ("stop_batch": 3000000), but the HPC cluster queues have a time limit of two days per run. How can I restart the training from the last step (1,000,000) in order to finish the remaining 2,000,000 steps?
I do not want my training to start from zero again.
-------------------------iter.000000 task 03--------------------------
: -------------------------iter.000000 task 04--
Please help me. I looked for answers on the internet before submitting this request.
How can I modify the param.json file?
DP-GEN Version
v0.12.0
Platform, Python Version, etc
Slurm HPC cluster
Details
"training": {
    "_set_prefix": "set",
    "stop_batch": 3000000,
    "_batch_size": "auto",
    "disp_file": "lcurve.out",
    "disp_freq": 1000,
    "numb_test": "5%",
    "save_freq": 1000,
    "save_ckpt": "model.ckpt",
    "disp_training": true,
    "time_training": true,
    "profiling": false,
    "profiling_file": "timeline.json",
    "_comment": "that's all"
}
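Before resubmitting, it can help to confirm how far the interrupted run got. The sketch below is one way to do that in Python, assuming the learning-curve file named by `disp_file` (`lcurve.out`) uses `#` for header lines and stores the step counter in the first column of each data row. The function names and the sample rows are invented for illustration.

```python
def last_logged_step(lcurve_text):
    """Return the step number of the last data row, or 0 if there is none."""
    last = 0
    for line in lcurve_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        last = int(line.split()[0])  # assumed: first column is the step
    return last


def remaining_steps(lcurve_text, stop_batch):
    """Steps still to run before reaching stop_batch (never negative)."""
    return max(stop_batch - last_logged_step(lcurve_text), 0)


# Invented sample in the assumed layout; real files have more columns.
sample = """#  batch  l2_tst  l2_trn
 999000  1.0e-01  1.1e-01
1000000  9.9e-02  1.0e-01
"""
print(remaining_steps(sample, 3_000_000))  # 2000000
```

With `save_freq` set to 1000, a checkpoint should be at most 1000 steps behind the last logged step. As far as I understand, `stop_batch` is the total target rather than the number of steps left, so it would stay at 3000000 for the restarted run; treat that as an assumption to verify on a short test job.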