Skip to content

Conversation

@Borda
Copy link
Contributor

@Borda Borda commented Apr 4, 2025

Before submitting
  • Was this discussed/agreed via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

What does this PR do?

This pull request introduces an asynchronous model manager that processes upload and remove tasks in the background. By offloading these actions to a worker thread, it avoids blocking the training loop and achieves better resource utilization, ensuring that model performance or training speed is not impacted by file operations.

Additionally, the new approach uses a single queue but maintains separate counters for uploads and removals, making it simple to track task progress in each category. This design keeps the code streamlined and offers a clear way to monitor both types of background tasks while training.

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@Borda Borda requested review from Copilot and tchaton April 4, 2025 13:24
@Borda Borda added the enhancement New feature or request label Apr 4, 2025
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

src/litmodels/integrations/checkpoints.py:207

  • Duplicate definition of _remove_checkpoint detected; this may lead to maintenance challenges or ambiguous behavior. Please remove the redundant implementation.
def _remove_checkpoint(self, trainer: "pl.Trainer", filepath: str) -> None:

self.task_queue = queue.Queue()
self.upload_count = 0
self.remove_count = 0
self._worker = threading.Thread(target=self._worker_loop, daemon=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does it behaves with keyboard interruption ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will likely continue

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it should. But worth to see how it behaves in general. I think W&B still allows you to stop if you do multiple keyboard interrupt

Copy link
Member

@ethanwharris ethanwharris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool!

@Borda Borda merged commit 0c3709e into main Apr 4, 2025
42 checks passed
@Borda Borda deleted the upload/subprocess branch April 4, 2025 16:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants