fix: RangePartitioning with native shuffle #2258
…for native shuffle to consume. Added new test to represent apache#1906.
Codecov Report
❌ Patch coverage is
Additional details and impacted files

@@             Coverage Diff              @@
##               main    #2258      +/-   ##
=============================================
- Coverage     56.12%   44.29%   -11.83%
- Complexity      976     1106     +130
=============================================
  Files           119      143      +24
  Lines         11743    13373    +1630
  Branches       2251     2397     +146
=============================================
- Hits           6591     5924     -667
- Misses         4012     6420    +2408
+ Partials       1140     1029     -111
The next challenge is adding some flexibility for dictionary-encoded columns. The current approach with one schema is too rigid.
…ow to handle dictionary encoding.
Which issue does this PR close?
Closes #1906.
Rationale for this change
#1862 tried to implement RangePartitioning with native shuffle. The implementation didn't work because executors calculated their own partition boundaries.
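To illustrate the failure mode, here is a minimal Python sketch. It is not Comet or Spark code: `compute_boundaries` and `partition_for` are hypothetical helpers, and the evenly spaced boundary selection is a simplification of Spark's sampling. It shows that when each executor derives boundaries from its own rows, the same key can be routed to different partitions, whereas boundaries computed once and shared give every executor the same answer.

```python
from bisect import bisect_left


def compute_boundaries(sample, num_partitions):
    """Pick num_partitions - 1 boundary keys from a sample (hypothetical helper)."""
    ordered = sorted(sample)
    step = len(ordered) / num_partitions
    return [
        ordered[min(int(step * i), len(ordered) - 1)]
        for i in range(1, num_partitions)
    ]


def partition_for(key, boundaries):
    """Count the boundaries strictly below key; that count is the partition index."""
    return bisect_left(boundaries, key)


# Two executors each hold a different slice of the data (made-up values).
executor_a = [1, 2, 3, 4, 50]
executor_b = [50, 60, 70, 80, 90]

# Broken flow: each executor computes boundaries from its own rows only.
bounds_a = compute_boundaries(executor_a, 2)
bounds_b = compute_boundaries(executor_b, 2)
local_a = partition_for(50, bounds_a)
local_b = partition_for(50, bounds_b)

# Fixed flow: boundaries are computed once (here from all rows, standing in
# for a driver-side sample) and every executor uses the same list.
shared = compute_boundaries(executor_a + executor_b, 2)
shared_a = partition_for(50, shared)
shared_b = partition_for(50, shared)

print("per-executor boundaries:", local_a, local_b)
print("shared boundaries:", shared_a, shared_b)
```

With per-executor boundaries the key 50 is assigned to two different partitions, which breaks the global ordering that range partitioning is supposed to guarantee. With one shared boundary list both executors agree.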
What changes are included in this PR?
This PR changes the flow so that the driver calculates the boundaries, as Spark does. At a high level:
ShuffleExchangeExec
for using Spark's RangePartitioner
to calculate boundary rows.
How are these changes tested?
Remaining concerns