
Add option to pickler dumps() for best-effort determinism #34698

Merged: tvalentyn merged 5 commits into apache:master from AdrS:sort-sets on Apr 25, 2025

@AdrS (Contributor) commented Apr 21, 2025

The motivation for this change is that Flume caches pickled code, and non-determinism breaks the caching. While making pickling fully deterministic is infeasible, increasing determinism is still useful because it increases cache hits. Sets are a common source of non-determinism. This change sorts set elements to provide deterministic serialization. Because not all types provide a comparison operator, the sorting routine implements generic element comparison logic.

See: #34410
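
For readers unfamiliar with the idea, here is a minimal sketch of sorting set elements before pickling, using only the standard library. It is not the code added in this PR: `_sort_key`, `_SortedSet`, and `dumps_sorted` are hypothetical names, the comparison rules (type name first, then a type-specific value) are an assumption about what "generic element comparison" could look like, and only a top-level set is handled.

```python
import pickle


def _sort_key(element):
    """Hypothetical helper: build a key that orders mixed-type elements."""
    cls = type(element)
    type_name = f"{cls.__module__}.{cls.__qualname__}"
    if element is None:
        return (type_name, 0)
    if isinstance(element, (bool, int, float, str, bytes)):
        # Same-typed scalars compare natively (NaN is not handled here).
        return (type_name, element)
    if isinstance(element, tuple):
        # Compare tuples element by element using the same generic key.
        return (type_name, tuple(_sort_key(e) for e in element))
    if isinstance(element, frozenset):
        # Nested frozensets are ordered by their sorted element keys.
        return (type_name, tuple(sorted(_sort_key(e) for e in element)))
    # Assumed fallback: order other objects by their pickled bytes.
    return (type_name, pickle.dumps(element))


class _SortedSet:
    """Hypothetical wrapper that pickles a set with its elements sorted."""

    def __init__(self, elements):
        self._elements = elements

    def __reduce__(self):
        # Unpickling calls set(sorted_list), so the result is a plain set.
        return (set, (sorted(self._elements, key=_sort_key),))


def dumps_sorted(s):
    """Pickle a top-level set independently of its iteration order."""
    return pickle.dumps(_SortedSet(s))
```

With this sketch, two sets holding the same elements produce the same bytes regardless of insertion order or string hash seed, and `pickle.loads(dumps_sorted(s))` returns an ordinary `set`. Sets nested inside other objects would need a hook in the pickler itself, which this sketch does not attempt.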

@github-actions (Contributor) commented

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @tvalentyn for label python.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@tvalentyn (Contributor) commented

cc: @claudevdm

The FlumeRunner will enable this feature flag via its canary mechanism.
The option is localized to only serialization of transforms because that
is what the FlumeRunner wants to cache. It will not enable best-effort
deterministic serialization for other uses of pickling.
@github-actions github-actions bot added runners and removed runners labels Apr 24, 2025
@tvalentyn tvalentyn closed this Apr 25, 2025
@tvalentyn tvalentyn reopened this Apr 25, 2025
@tvalentyn (Contributor) commented

rerunning tests

codecov bot commented Apr 25, 2025

Codecov Report

Attention: Patch coverage is 97.93814% with 2 lines in your changes missing coverage. Please review.

Project coverage is 54.63%. Comparing base (10775fe) to head (f2e4c29).
Report is 57 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| sdks/python/apache_beam/internal/set_pickler.py | 97.43% | 2 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #34698      +/-   ##
============================================
+ Coverage     54.61%   54.63%   +0.02%     
  Complexity     1480     1480              
============================================
  Files          1008     1010       +2     
  Lines        159767   159960     +193     
  Branches       1079     1079              
============================================
+ Hits          87262    87402     +140     
- Misses        70403    70456      +53     
  Partials       2102     2102              
| Flag | Coverage Δ |
|---|---|
| python | 81.25% <97.93%> (-0.03%) ⬇️ |

Flags with carried forward coverage won't be shown.


@tvalentyn tvalentyn merged commit ae7bf20 into apache:master Apr 25, 2025
176 of 182 checks passed