Reshuffle the data after reading from cache if shuffle_rows is true #817

Merged
kashishmittal55 merged 12 commits into uber:master from kashishmittal55:kashish/shuffle-rows
Dec 16, 2025

@kashishmittal55
Collaborator

If shuffle_rows is true, the rows should be shuffled again after they are read from the cache.
To do this, we reshuffle the data once it has been read. As a consequence, the rows are shuffled twice on the first pass, while the cache is being warmed up.
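
A minimal sketch of the behavior described above, not the actual petastorm code. The names `load_rows`, `read_fn`, and the dict-backed `cache` are hypothetical, and it assumes the loader already shuffles rows before they are stored in the cache, which is what makes the warm-up pass shuffle twice.

```python
import random


def load_rows(key, cache, read_fn, shuffle_rows, rng):
    """Return the rows for `key`, reshuffling after the cache lookup.

    All names here are illustrative; `cache` is a plain dict and
    `read_fn` stands in for the expensive read of a row group.
    """
    if key not in cache:
        rows = list(read_fn(key))
        if shuffle_rows:
            # First shuffle: runs only on a cache miss (cache warm-up).
            rng.shuffle(rows)
        cache[key] = rows

    # Copy so the order stored in the cache is never mutated.
    rows = list(cache[key])
    if shuffle_rows:
        # Second shuffle: runs on every read, hit or miss, so a cached
        # row group is not replayed in the same order each epoch.
        rng.shuffle(rows)
    return rows


if __name__ == "__main__":
    cache = {}
    rng = random.Random(0)
    for epoch in range(3):
        print(epoch, load_rows("rowgroup-0", cache, lambda _: range(6), True, rng))
```

Without the second shuffle, every cache hit would return the rows in whatever order was stored during warm-up.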

@codecov

codecov bot commented Nov 7, 2025

Codecov Report

❌ Patch coverage is 92.85714% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 86.44%. Comparing base (b6fbf92) to head (740c524).
⚠️ Report is 1 commits behind head on master.

Files with missing lines | Patch % | Lines
petastorm/arrow_reader_worker.py | 92.85% | 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #817      +/-   ##
==========================================
+ Coverage   86.39%   86.44%   +0.05%     
==========================================
  Files          84       84              
  Lines        5144     5158      +14     
  Branches      808      813       +5     
==========================================
+ Hits         4444     4459      +15     
+ Misses        558      557       -1     
  Partials      142      142              


@arushi297
Collaborator

@kashishmittal55 will the current reproducibility/randomness behavior remain intact with this change?

@kashishmittal55
Collaborator Author

> @kashishmittal55 will the current reproducibility/randomness behavior remain intact with this change?

Yes, that's correct. It should not affect reproducibility, because the same random variable is used across training runs.
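
To illustrate the reproducibility point, here is a small sketch. It assumes that "the same random variable" means a generator seeded identically in each training run; the function `epoch_orders` and the seed value are hypothetical and are not taken from the PR.

```python
import random


def epoch_orders(seed, cached_rows, num_epochs):
    """Shuffle the same cached rows once per epoch with a seeded generator."""
    rng = random.Random(seed)  # assumed: each run constructs this with the same seed
    orders = []
    for _ in range(num_epochs):
        rows = list(cached_rows)  # copy; the cached order stays untouched
        rng.shuffle(rows)
        orders.append(rows)
    return orders


# Two independent "training runs" with the same seed.
run_a = epoch_orders(42, range(8), 3)
run_b = epoch_orders(42, range(8), 3)
assert run_a == run_b  # identical sequence of row orders in both runs
```

Under that assumption, the extra shuffle changes the order in which rows are seen but not whether a run can be repeated exactly.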

@arushi297 arushi297 assigned arushi297 and unassigned arushi297 Dec 16, 2025
@arushi297 arushi297 self-requested a review December 16, 2025 19:21
@kashishmittal55 kashishmittal55 merged commit 3cae688 into uber:master Dec 16, 2025
6 checks passed
@kashishmittal55 kashishmittal55 deleted the kashish/shuffle-rows branch December 16, 2025 19:37