Skip to content

Conversation

mbutrovich
Copy link
Contributor

@mbutrovich mbutrovich commented Aug 28, 2025

Which issue does this PR close?

Closes #1906.

Rationale for this change

#1862 tried to implement RangePartitioning with native shuffle. The implementation didn't work because executors calculated their own partition boundaries.

What changes are included in this PR?

This modifies the flow for the driver to calculate the boundaries (like Spark). At a high level:

  • Hoist code from Spark's ShuffleExchangeExec for using Spark's RangePartitioner to calculate boundary rows.
  • Serialize boundary rows to native side.
  • Deserialize boundary rows and pass as part of the partitioning scheme. Each executor should have the boundary values now.

How are these changes tested?

Remaining concerns

  • What is the performance implication of iterating over these batches to calculate the bounds at the driver? Are we introducing significant overhead? It's what Spark does, but how does this approach affect Comet?
  • Should we remove the code for calculating boundary rows in native code? It's possible we (or someone else) might want this, so I left the tests and annotated them to allow dead code.
  • I change the default to enable this feature, but we can discuss setting it to false. For now I want to see how it does in CI.
  • I need to generate new golden plans if we decide to enable this feature by default.

@mbutrovich mbutrovich self-assigned this Aug 28, 2025
@codecov-commenter
Copy link

codecov-commenter commented Aug 28, 2025

Codecov Report

❌ Patch coverage is 86.66667% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 44.29%. Comparing base (f09f8af) to head (522ef80).
⚠️ Report is 430 commits behind head on main.

Files with missing lines Patch % Lines
...t/execution/shuffle/CometNativeShuffleWriter.scala 73.52% 4 Missing and 5 partials ⚠️
...t/execution/shuffle/CometShuffleExchangeExec.scala 96.29% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@              Coverage Diff              @@
##               main    #2258       +/-   ##
=============================================
- Coverage     56.12%   44.29%   -11.83%     
- Complexity      976     1106      +130     
=============================================
  Files           119      143       +24     
  Lines         11743    13373     +1630     
  Branches       2251     2397      +146     
=============================================
- Hits           6591     5924      -667     
- Misses         4012     6420     +2408     
+ Partials       1140     1029      -111     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@mbutrovich mbutrovich changed the title fix: RangePartitioning boundaries with native shuffle fix: RangePartitioning with native shuffle Aug 28, 2025
@mbutrovich
Copy link
Contributor Author

The next challenge to figure out is adding some flexibility for dictionary-encoded columns. The current approach with one schema is too rigid.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

RangePartitioning does not yield correct results with native shuffle
2 participants