Remove .lock download skipping, skip locks on force download #519
tchaton merged 8 commits into Lightning-AI:main
|
It looks like [...] Also, would checking the trace logs provide more insights for failing behind [...]
|
Hey @JackUrb. Any updates on this PR? The retry should already be taken care of, so something must be off with s5cmd. Do you see any errors? What if we just remove this block?
|
Hi folks, we're still testing this out. It has made most jobs fully stable, but not all of them. Will update once I've gotten to the bottom of things.
|
Hey @JackUrb. Feel free to revert the changes to src/litdata/streaming/downloader.py.
|
Hey @JackUrb. Any luck?
|
Sorry, am on PTO. The previous round of changes still hasn't gotten us to 100% stability though, some runs are still ending up with file not found errors. |
Let's revert my PR then.
|
Alright it seems we're stable again with the PR reverted and a |
|
Testing that on my next run, didn't start on a clean cache this time |
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@           Coverage Diff           @@
##            main    #519    +/-   ##
=======================================
- Coverage     79%     79%    -0%
=======================================
  Files         39      39
  Lines       5888    5892     +4
=======================================
+ Hits        4642    4644     +2
- Misses      1246    1248     +2
|
|
Hey @JackUrb. Is it ready to be reviewed and merged?
|
Hey @JackUrb. Any updates? Should we land this PR?
|
Feel free to commandeer/edit and merge, I'm using this version with tombstoning for debugging |
|
Hey @JackUrb, thanks for the contribution. I made a new release with your fix.
Before submitting
What does this PR do?
Part of #512, but maybe not the full story?
Under this change, we actually end up in a bricked state where the file cannot be downloaded because the file doesn't exist but the .lock does (though it is empty/unlocked). This suggests that, in the distributed case, this is both why force_download is required and why there are more lock count increments than decrements.
This PR removes the .lock early exits, and removes the lock increment in cases where the download is invoked by a force_download (as presumably it should already be locked).
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃
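For readers skimming the thread, here is a minimal Python sketch of the behavior described under "What does this PR do?". It is not the code in src/litdata/streaming/downloader.py: the function names, the `lock_counts` dictionary, and the `chunk-0.bin` file name are all hypothetical, chosen only to illustrate the bricked state (a leftover .lock with no data file) and the idea of not incrementing the lock count again on a forced download.

```python
import os
import tempfile


def should_skip_old(local_path: str) -> bool:
    """Hypothetical pre-change check: a leftover .lock also causes an early exit."""
    return os.path.exists(local_path) or os.path.exists(local_path + ".lock")


def should_skip_new(local_path: str) -> bool:
    """Hypothetical post-change check: only the data file itself short-circuits."""
    return os.path.exists(local_path)


def register_download(lock_counts: dict, chunk: str, force_download: bool) -> None:
    """Hypothetical counter. A forced download does not increment again, on the
    assumption that the original request already holds the lock."""
    if not force_download:
        lock_counts[chunk] = lock_counts.get(chunk, 0) + 1


with tempfile.TemporaryDirectory() as cache_dir:
    chunk_path = os.path.join(cache_dir, "chunk-0.bin")
    # Simulate the bricked state: an empty .lock exists, the data file does not.
    open(chunk_path + ".lock", "w").close()

    old_skips = should_skip_old(chunk_path)  # early exit, file is never fetched
    new_skips = should_skip_new(chunk_path)  # download proceeds

lock_counts = {}
register_download(lock_counts, "chunk-0.bin", force_download=False)
register_download(lock_counts, "chunk-0.bin", force_download=True)
```

In this sketch the old check skips the download even though the data file is missing, while the new check lets it proceed, and the counter is incremented once rather than twice, so a single decrement brings it back to zero.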