Skip to content

Remove .lock download skipping, skip locks on force download#519

Merged
tchaton merged 8 commits intoLightning-AI:mainfrom
JackUrb:ju-downloader-fix
Mar 26, 2025
Merged

Remove .lock download skipping, skip locks on force download#519
tchaton merged 8 commits intoLightning-AI:mainfrom
JackUrb:ju-downloader-fix

Conversation

@JackUrb
Copy link
Contributor

@JackUrb JackUrb commented Mar 17, 2025

Before submitting
  • Was this discussed/agreed via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

What does this PR do?

Part of #512, but maybe not the full story?

Under this change, we actually end up in a bricked state where the file cannot be downloaded because the file doesn't exist but the .lock does (though is empty/unlocked). This points to the fact that it's possible under the distributed case that this is why both force_download is required, and why there are more count lock increments than decrements.

This PR removes the .lock early exits, and removes the lock increment in cases where the download is invoked by a force_download (as presumably it should already be locked).

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@bhimrazy
Copy link
Collaborator

bhimrazy commented Mar 18, 2025

It looks like s5cmd already has a default retry mechanism with default 10 attempts. Could this be a different case?

Also, s5cmd provides trace logging, which might help in diagnosing the actual issue:

 --log value                    log level: (trace, debug, info, error) (default: info)

Would checking the trace logs provide more insights for failing behind s5cmd?

@tchaton
Copy link
Collaborator

tchaton commented Mar 20, 2025

Hey @JackUrb. Any updates on this PR ? The retry should be taken care of. Something is off with s5cmd then. Do you see any errors ?

What if we just remove this block ?

        if os.path.exists(local_filepath + ".lock"):
            return

@JackUrb
Copy link
Contributor Author

JackUrb commented Mar 20, 2025

Hi folks, we're still testing this out. It has made most jobs fully stable, but some not. Will update once I've gotten to the bottom of things

@JackUrb JackUrb marked this pull request as draft March 20, 2025 12:21
@tchaton
Copy link
Collaborator

tchaton commented Mar 20, 2025

Hey @JackUrb Feel free to revert the changes to src/litdata/streaming/downloader.py.

@tchaton
Copy link
Collaborator

tchaton commented Mar 21, 2025

Hey @JackUrb. Any luck ?

@JackUrb
Copy link
Contributor Author

JackUrb commented Mar 22, 2025

Sorry, am on PTO. The previous round of changes still hasn't gotten us to 100% stability though, some runs are still ending up with file not found errors.

@tchaton
Copy link
Collaborator

tchaton commented Mar 22, 2025

Sorry, am on PTO. The previous round of changes still hasn't gotten us to 100% stability though, some runs are still ending up with file not found errors.

Let s revert my PR then.

@JackUrb
Copy link
Contributor Author

JackUrb commented Mar 23, 2025

Alright it seems we're stable again with the PR reverted and a skip_cache option for force download, will push here.

@JackUrb JackUrb changed the title Allow download retries, remove .lock download skipping. Remove .lock download skipping, skip locks on force download Mar 23, 2025
@JackUrb JackUrb marked this pull request as ready for review March 23, 2025 22:50
@tchaton
Copy link
Collaborator

tchaton commented Mar 24, 2025

Hey @JackUrb Does the skip logic helps with this #512 ?

@JackUrb
Copy link
Contributor Author

JackUrb commented Mar 24, 2025

Testing that on my next run, didn't start on a clean cache this time

@codecov
Copy link

codecov bot commented Mar 24, 2025

Codecov Report

Attention: Patch coverage is 58.33333% with 5 lines in your changes missing coverage. Please review.

Project coverage is 79%. Comparing base (493fd24) to head (ed469da).
Report is 4 commits behind head on main.

Additional details and impacted files
@@         Coverage Diff         @@
##           main   #519   +/-   ##
===================================
- Coverage    79%    79%   -0%     
===================================
  Files        39     39           
  Lines      5888   5892    +4     
===================================
+ Hits       4642   4644    +2     
- Misses     1246   1248    +2     
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@tchaton
Copy link
Collaborator

tchaton commented Mar 24, 2025

Hey @JackUrb. Is it ready for review and being merged ?

@tchaton
Copy link
Collaborator

tchaton commented Mar 26, 2025

Hey @JackUrb. Any updates ? Should we land this PR ?

@JackUrb
Copy link
Contributor Author

JackUrb commented Mar 26, 2025

Feel free to commandeer/edit and merge, I'm using this version with tombstoning for debugging

@tchaton tchaton merged commit ca4fbe9 into Lightning-AI:main Mar 26, 2025
29 checks passed
@tchaton
Copy link
Collaborator

tchaton commented Mar 26, 2025

Hey @JackUrb for the contribution. I made a new release with your fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants