Make asyncio checkpointing work if validate/fit is called more than once #20952
Let me know if this looks OK, and then I'll fix the mypy errors.
Any feedback? Is it worth spending time fixing the mypy errors and the merge conflict, or is there no interest in this? (I would say the more serious issue is the linked issue, which seems like it could result in degraded checkpoints.)
# CheckpointIO doesn't have a setup method, so we have to do something like this.
# We can't do setup in __init__ because if train or validate is called more than once, the
# teardown method deletes the executor.
I think we can put this in a function's docstring?
What does this PR do?
Currently, when async checkpointing is used, calling fit or validate more than once crashes, because the thread pool is shut down and never re-created.
This PR modifies the test to induce the crash and fixes it.
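For readers unfamiliar with the failure mode, here is a minimal sketch using only the standard library. It is not the code from this PR: `LazyExecutorIO`, `_ensure_setup`, and `save` are hypothetical names chosen for illustration, and the class only imitates the general shape of a checkpoint plugin whose `teardown` discards its executor. It shows that a `ThreadPoolExecutor` rejects work after `shutdown`, and that creating the pool lazily lets a second run succeed.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Optional


class LazyExecutorIO:
    """Hypothetical stand-in for an async checkpoint plugin (not Lightning's class)."""

    def __init__(self) -> None:
        # Do not create the executor here: teardown() discards it, and
        # __init__ runs only once even if fit/validate run several times.
        self._executor: Optional[ThreadPoolExecutor] = None

    def _ensure_setup(self) -> ThreadPoolExecutor:
        # Re-create the pool on demand after a previous teardown().
        if self._executor is None:
            self._executor = ThreadPoolExecutor(max_workers=1)
        return self._executor

    def save(self, fn: Callable[..., Any], *args: Any) -> Future:
        # Submit the (simulated) checkpoint write to the background thread.
        return self._ensure_setup().submit(fn, *args)

    def teardown(self) -> None:
        # Wait for pending work, then drop the pool so it can be rebuilt later.
        if self._executor is not None:
            self._executor.shutdown(wait=True)
            self._executor = None


# The failure mode: a pool that was shut down rejects new work.
pool = ThreadPoolExecutor(max_workers=1)
pool.shutdown(wait=True)
try:
    pool.submit(print, "never runs")
    crashed = False
except RuntimeError:
    crashed = True

# The lazy variant survives two "runs" separated by a teardown.
io = LazyExecutorIO()
results = []
for run in range(2):
    future = io.save(lambda x: x * 2, run)
    results.append(future.result())
    io.teardown()
```

Whether the pool is re-created on first use, as here, or in an explicit setup hook is a design choice; the lazy form is shown only because the review thread notes that CheckpointIO has no setup method.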
No.
Was this discussed/agreed via a GitHub issue? (not for typos and docs)
No, this is just a bugfix, not a behavior change. Should I create an issue?
Did you read the contributor guideline, Pull Request section?
Yes
Did you make sure your PR does only one thing, instead of bundling different changes together?
Yes
Did you make sure to update the documentation with your changes? (if necessary)
na
Did you write any new necessary tests? (not for typos and docs)
yes
Did you verify new and existing tests pass locally with your changes?
As best I could. I'm not very clear on the recommended setup for testing PyTorch Lightning locally, so I was only able to run the test I modified.
Did you list all the breaking changes introduced by this pull request?
na
Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)
Yes
📚 Documentation preview 📚: https://pytorch-lightning--20952.org.readthedocs.build/en/20952/