Yes, I think `on_save_checkpoint` would be the cleanest solution.
A missing piece for fault-tolerant training is properly updating the progress bar when training resumes.
In our `ProgressBarBase` class we track progress independently of the loop. If I recall correctly, there was an idea to replace that with the progress tracking from the loops. Note, however, that if we do that, the progress bar will be locked to a particular loop structure (fit loop -> epoch loop) and the corresponding progress attributes. Is this acceptable? A new loop structure would potentially need a new progress bar callback. What are your thoughts on this?
Alternatives:
- `on_save_checkpoint` callback hook
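To make the `on_save_checkpoint` alternative concrete, here is a rough, framework-free sketch of the idea: the progress bar keeps its own counters, writes them into the checkpoint dictionary when a checkpoint is saved, and reads them back on resume. Everything in it is an assumption for illustration: `ResumableProgressTracker`, `ProgressState`, the `progress_bar_state` key, and the simplified hook signatures are hypothetical and are not the actual `ProgressBarBase` or callback API.

```python
from dataclasses import asdict, dataclass


@dataclass
class ProgressState:
    """Counters the progress bar tracks independently of the loop."""

    epoch: int = 0
    batches_seen: int = 0


class ResumableProgressTracker:
    """Hypothetical stand-in for a progress bar callback (not the real API)."""

    STATE_KEY = "progress_bar_state"  # hypothetical checkpoint key

    def __init__(self) -> None:
        self.state = ProgressState()

    def on_batch_end(self) -> None:
        # One more batch finished in the current epoch.
        self.state.batches_seen += 1

    def on_epoch_end(self) -> None:
        # Move to the next epoch and reset the per-epoch batch counter.
        self.state.epoch += 1
        self.state.batches_seen = 0

    def on_save_checkpoint(self, checkpoint: dict) -> None:
        # Store the counters as a plain dict so the checkpoint stays serializable.
        checkpoint[self.STATE_KEY] = asdict(self.state)

    def on_load_checkpoint(self, checkpoint: dict) -> None:
        # Restore the counters if present; older checkpoints may lack the key.
        saved = checkpoint.get(self.STATE_KEY)
        if saved is not None:
            self.state = ProgressState(**saved)
```

With this approach the progress bar does not read anything from the loops, so it stays independent of the loop structure; the cost is that the callback has to persist and restore its own state.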