feat: Add sort-based shuffle implementation #1389
Conversation
This adds an alternative shuffle implementation that writes a single consolidated file per input partition (sorted by output partition ID) along with an index file, similar to Apache Spark's sort-based shuffle. This approach reduces the file count from N × M (N input partitions × M output partitions) to 2 × N files (one data file plus one index file per input partition).

Key components:
- SortShuffleWriterExec: ExecutionPlan implementation
- PartitionBuffer: In-memory buffering per partition
- SpillManager: Spill-to-disk when memory pressure is high
- ShuffleIndex: Index file I/O (little-endian i64 offsets)
- SortShuffleConfig: Configuration (buffer size, memory limit, spill threshold)

New configuration options:
- ballista.shuffle.sort_based.enabled (default: false)
- ballista.shuffle.sort_based.buffer_size (default: 1MB)
- ballista.shuffle.sort_based.memory_limit (default: 256MB)
- ballista.shuffle.sort_based.spill_threshold (default: 0.8)

This implementation is inspired by Apache DataFusion Comet's shuffle writer: https://github.com/apache/datafusion-comet/blob/main/native/core/src/execution/shuffle/shuffle_writer.rs

Work remaining:
- Update DistributedPlanner to use SortShuffleWriterExec when enabled
- Update ShuffleReaderExec to detect and read sort shuffle format
- Add protobuf serialization for the new execution plan
- End-to-end integration tests

Co-Authored-By: Claude Opus 4.5 <[email protected]>
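The index-file format mentioned above (little-endian i64 offsets) could be read and written roughly like this. This is a std-only sketch: `write_index` and `read_index` are illustrative names, not the actual `ShuffleIndex` API, and the real on-disk layout may carry additional metadata.

```rust
use std::io::{Cursor, Read, Write};

/// Write partition byte offsets as little-endian i64 values.
/// (Sketch only; not Ballista's actual `ShuffleIndex` implementation.)
fn write_index<W: Write>(w: &mut W, offsets: &[i64]) -> std::io::Result<()> {
    for off in offsets {
        w.write_all(&off.to_le_bytes())?;
    }
    Ok(())
}

/// Read the offsets back from an index file.
fn read_index<R: Read>(r: &mut R) -> std::io::Result<Vec<i64>> {
    let mut buf = Vec::new();
    r.read_to_end(&mut buf)?;
    Ok(buf
        .chunks_exact(8)
        .map(|c| i64::from_le_bytes(c.try_into().unwrap()))
        .collect())
}

fn main() -> std::io::Result<()> {
    // With M output partitions the index holds M + 1 offsets, so that
    // partition p occupies bytes [offsets[p], offsets[p + 1]) of the data file.
    let offsets = vec![0i64, 128, 256, 1024];
    let mut file = Vec::new();
    write_index(&mut file, &offsets)?;
    let back = read_index(&mut Cursor::new(file))?;
    assert_eq!(back, offsets);
    println!("{:?}", back);
    Ok(())
}
```

The round trip above is what makes the 2 × N file layout work: one small index file is enough to address every output partition in the consolidated data file.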
This commit integrates the sort-based shuffle writer with the distributed planner and execution graph:
- Add ShuffleWriter trait to provide a common interface for both ShuffleWriterExec and SortShuffleWriterExec
- Update DistributedPlanner to return Vec<Arc<dyn ShuffleWriter>>
- Update ExecutionStageBuilder to work with the ShuffleWriter trait
- Add protobuf serialization for SortShuffleWriterExec
- Update the execution_graph, execution_stage, and diagram modules

The planner will now create SortShuffleWriterExec when:
- ballista.shuffle.sort_based.enabled is true
- The partitioning is Hash partitioning

Co-Authored-By: Claude Opus 4.5 <[email protected]>
…chmarks

- Update fetch_partition_local to detect the sort-based shuffle format by checking for the presence of an index file
- When sort shuffle is detected, read partition data using the index file to locate the correct byte range
- Add shuffle_bench binary for comparing hash vs sort shuffle performance
  - Configurable number of rows, partitions, and batch size
  - Reports timing, file count, and throughput metrics

Co-Authored-By: Claude Opus 4.5 <[email protected]>
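Locating a partition's byte range via the index can be sketched as follows. This is a std-only illustration: `partition_range` and `read_partition` are hypothetical helpers, not the actual `fetch_partition_local` code.

```rust
use std::io::{Read, Seek, SeekFrom};

/// Given the index offsets, return the byte range for `partition_id`.
/// (Hypothetical helper; the real reader also checks for the index
/// file's existence to detect the sort-based format.)
fn partition_range(offsets: &[i64], partition_id: usize) -> (u64, u64) {
    let start = offsets[partition_id] as u64;
    let end = offsets[partition_id + 1] as u64;
    (start, end)
}

/// Seek to the partition's start offset and read exactly its bytes.
fn read_partition<R: Read + Seek>(
    data: &mut R,
    offsets: &[i64],
    partition_id: usize,
) -> std::io::Result<Vec<u8>> {
    let (start, end) = partition_range(offsets, partition_id);
    data.seek(SeekFrom::Start(start))?;
    let mut buf = vec![0u8; (end - start) as usize];
    data.read_exact(&mut buf)?;
    Ok(buf)
}

fn main() -> std::io::Result<()> {
    // Fake data file: partition 0 = b"aaaa", partition 1 = b"bb", partition 2 = b"cccccc".
    let data = b"aaaabbcccccc".to_vec();
    let offsets = [0i64, 4, 6, 12];
    let mut cursor = std::io::Cursor::new(data);
    let p1 = read_partition(&mut cursor, &offsets, 1)?;
    assert_eq!(p1, b"bb");
    println!("{}", String::from_utf8(p1).unwrap());
    Ok(())
}
```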
- Update DefaultExecutionEngine to handle both ShuffleWriterExec and SortShuffleWriterExec by introducing a ShuffleWriterVariant enum
- Fix the sort shuffle writer to store cumulative batch counts in the index instead of placeholder byte offsets
- Fix the sort shuffle reader to correctly identify partition boundaries using the batch count index
- Add comprehensive end-to-end integration tests for sort-based shuffle covering aggregations, GROUP BY, ORDER BY, and comparison with hash-based shuffle

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Update the writer to use FileWriter instead of StreamWriter
- Update the reader to use FileReader with set_index() for direct batch access
- FileReader supports random access via the IPC footer, eliminating the need to scan through preceding batches to reach the target partition
- The index still stores cumulative batch counts for partition boundaries

This improves read performance for later partitions (e.g., partition 15 of 16) by seeking directly to the needed batches instead of reading and discarding all preceding data.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Use is_multiple_of() instead of manual modulo check. Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add a FinalizeResult type alias to reduce type complexity
- Add #[allow(clippy::too_many_arguments)] for the finalize_output function
- Use iter_mut().enumerate() instead of an index-based loop

Co-Authored-By: Claude Opus 4.5 <[email protected]>
When running the TPC-H benchmark, the --query parameter is now optional. If not specified, all 22 TPC-H queries will be run sequentially.

Changes:
- Make --query optional for both datafusion and ballista benchmarks
- Run all 22 queries when --query is not specified
- Only print SQL queries when the --debug flag is enabled
- Write a single JSON output file for the entire benchmark run
- Fix parquet file path resolution for datafusion benchmarks
- Simplify output when iterations=1 (no iteration number, no average)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Add CLI option to enable sort-based shuffle in the TPC-H benchmark when running against Ballista. Co-Authored-By: Claude Opus 4.5 <[email protected]>
I'll review this tomorrow morning.
milenkovicm left a comment
Thanks @andygrove,

I had a quick look and I think this approach makes sense. I'm not sure it will be faster than hash in the case of big shuffles, but a benchmark will definitely help us understand that.

Two groups of comments:
- I would suggest returning a stream of record batches rather than a vector of record batches.
- Remote shuffle read does not work at the moment. We probably need to update the flight server to support the new file format.

Having #1318 in mind, I'm not sure if/how we can avoid decompressing and decoding the result file on the mapper side. Does it have to be a valid IPC file? Just an idea for discussion.
```rust
//! This module provides an alternative to the hash-based shuffle that writes
//! a single consolidated file per input partition (sorted by output partition ID)
//! along with an index file mapping partition IDs to byte offsets.
//!
```
Maybe we could add this paragraph from the referenced Spark documentation; I personally find it a good algorithm outline: "Internally, results from individual map tasks are kept in memory until they can’t fit. Then, these are sorted based on the target partition and written to a single file. On the reduce side, tasks read the relevant sorted blocks."
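The quoted outline translates to roughly the following toy sketch: std-only Rust, with raw bytes standing in for record batches. All names are illustrative, and the real writer additionally spills sorted runs to disk under memory pressure and merges them at the end.

```rust
use std::collections::BTreeMap;

/// Toy, in-memory sketch of the quoted algorithm (illustrative names,
/// not Ballista's API): rows are buffered per target partition, then
/// written out sorted by partition id into one consolidated buffer,
/// alongside index offsets.
struct SortShuffleSketch {
    buffers: BTreeMap<usize, Vec<u8>>, // partition id -> buffered row bytes
    num_partitions: usize,
}

impl SortShuffleSketch {
    fn new(num_partitions: usize) -> Self {
        Self { buffers: BTreeMap::new(), num_partitions }
    }

    fn insert(&mut self, partition: usize, row: &[u8]) {
        self.buffers.entry(partition).or_default().extend_from_slice(row);
    }

    /// Produce (data, offsets): one consolidated byte buffer sorted by
    /// partition id, plus M + 1 offsets so that partition p is
    /// data[offsets[p]..offsets[p + 1]].
    fn finish(self) -> (Vec<u8>, Vec<usize>) {
        let mut data = Vec::new();
        let mut offsets = Vec::with_capacity(self.num_partitions + 1);
        for p in 0..self.num_partitions {
            offsets.push(data.len());
            if let Some(bytes) = self.buffers.get(&p) {
                data.extend_from_slice(bytes);
            }
        }
        offsets.push(data.len());
        (data, offsets)
    }
}

fn main() {
    let mut w = SortShuffleSketch::new(3);
    w.insert(2, b"zz");
    w.insert(0, b"aa");
    w.insert(2, b"z");
    let (data, offsets) = w.finish();
    assert_eq!(data, b"aazzz");
    assert_eq!(offsets, vec![0, 2, 2, 5]);
    println!("{:?}", offsets);
}
```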
```rust
let config = SessionConfig::new_with_ballista()
    .set_str(BALLISTA_SHUFFLE_SORT_BASED_ENABLED, "true")
    .set_str(BALLISTA_SHUFFLE_SORT_BASED_BUFFER_SIZE, "1048576") // 1MB
    .set_str(BALLISTA_SHUFFLE_SORT_BASED_MEMORY_LIMIT, "268435456"); // 256MB
```
Adding `.set_str(BALLISTA_SHUFFLE_READER_FORCE_REMOTE_READ, "true")` will force remote reads using flight. At the moment, tests will fail if it is set. I would suggest using rstest to test local and remote reads (a context with and without this option set).
```rust
let config = SessionConfig::new_with_ballista()
    .set_str(BALLISTA_SHUFFLE_SORT_BASED_ENABLED, "true")
    .set_str(BALLISTA_SHUFFLE_SORT_BASED_BUFFER_SIZE, "1048576") // 1MB
    .set_str(BALLISTA_SHUFFLE_SORT_BASED_MEMORY_LIMIT, "268435456"); // 256MB
```
Also, one note: #1318 introduces block transfer rather than flight transfer; to override that and fall back to flight transport, `.set_str(BALLISTA_SHUFFLE_READER_REMOTE_PREFER_FLIGHT, "true")` is needed.

I believe we would need to patch the flight server to support transport of sorted shuffles.
```rust
//! Sort-based shuffle implementation for Ballista.
//!
//! This module provides an alternative to the hash-based shuffle that writes
```
I would suggest changing from:
"This module provides an alternative to the hash-based shuffle that writes" to
"This module provides an alternative to the hash-based shuffle. It writes"
I personally find it easier to understand, but it's a personal opinion.
```rust
    data_path: &Path,
    index_path: &Path,
    partition_id: usize,
) -> Result<Vec<RecordBatch>> {
```
Would it make sense to return a stream instead of loading all batches into a single vector?
```rust
///
/// # Returns
/// Vector of all record batches in the file.
pub fn read_all_batches(data_path: &Path) -> Result<Vec<RecordBatch>> {
```
Would it make sense to return a stream instead of loading all batches into a single vector?
```rust
}

/// Reads all spill files for a partition and returns the batches.
pub fn read_spill_files(&self, partition_id: usize) -> Result<Vec<RecordBatch>> {
```
Would it make sense to avoid the vector and use a stream?
Do we need to keep this as a vector? Perhaps just return a StreamReader?
```rust
    reader.schema()
};

let stream = futures::stream::iter(batches.into_iter().map(Ok));
```
IMHO, getting a stream from the reader instead of all batches may make sense.
One note: maybe it would make sense to test with
…eader

Change sort shuffle partition reading to return a lazy stream that yields batches on-demand rather than collecting all batches into memory upfront. This reduces memory usage for large partitions.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
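The lazy-reading idea can be illustrated with a plain iterator, a std-only stand-in for the actual futures `Stream` of Arrow `RecordBatch`es; the type and field names here are made up.

```rust
/// Illustrative lazy reader: "batches" (byte slices here) are produced
/// one at a time as the consumer pulls them, instead of being collected
/// into a Vec upfront. Only one batch is materialized at a time.
struct LazyPartitionReader<'a> {
    remaining: &'a [u8],
    batch_size: usize,
}

impl<'a> Iterator for LazyPartitionReader<'a> {
    type Item = &'a [u8];

    fn next(&mut self) -> Option<Self::Item> {
        if self.remaining.is_empty() {
            return None;
        }
        let n = self.batch_size.min(self.remaining.len());
        let (batch, rest) = self.remaining.split_at(n);
        self.remaining = rest;
        Some(batch)
    }
}

fn main() {
    let partition_bytes = b"abcdefgh";
    let reader = LazyPartitionReader { remaining: partition_bytes, batch_size: 3 };
    let batches: Vec<_> = reader.collect();
    assert_eq!(batches, vec![&b"abc"[..], &b"def"[..], &b"gh"[..]]);
    println!("{} batches", batches.len());
}
```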
Add algorithm description from Apache Spark documentation and improve wording for clarity. Co-Authored-By: Claude Opus 4.5 <[email protected]>
Update the flight server's do_get handler to detect sort-based shuffle files and read only the relevant partition using the index file. This enables remote shuffle reads for sort-based shuffle. Also add detection in IO_BLOCK_TRANSPORT to return an informative error since block transport doesn't support sort-based shuffle (it transfers the entire file which contains all partitions). Co-Authored-By: Claude Opus 4.5 <[email protected]>
Force-pushed from e750e7b to 53b5d57
Use rstest to parameterize sort shuffle tests to run with both local reads and remote reads via the flight service. This ensures the flight service correctly handles sort-based shuffle partition requests.

Tests now run in two modes:
- Local: reads shuffle data from the local filesystem (default)
- RemoteFlight: forces remote read through the flight service

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Thanks for the review @milenkovicm! I've addressed the feedback in the following commits:

Addressed:
1. Return stream instead of
Co-Authored-By: Claude Opus 4.5 <[email protected]>
|
Thanks @andygrove, I will try to have a look later. One question, though I'm not sure if it makes sense.
That's an interesting idea. I will experiment with that in a separate PR.
milenkovicm left a comment
It looks good to me @andygrove. There is one vector which should be a stream, and a minor nitpick.
```rust
    return Ok(Response::new(
        Box::pin(flight_data_stream) as Self::DoGetStream
    ));
}
```
Nitpick: consider adding an else branch.
```rust
}

/// Reads all spill files for a partition and returns the batches.
pub fn read_spill_files(&self, partition_id: usize) -> Result<Vec<RecordBatch>> {
```
Do we need to keep this as a vector? Perhaps just return a StreamReader?
Thanks for the review @milenkovicm. I'll go ahead and merge and will keep iterating on this to address the remaining feedback.
Summary
This PR implements a sort-based shuffle mechanism for Ballista, similar to Spark's approach. Instead of creating one file per output partition (N×M files for N input partitions and M output partitions), each input partition writes to a single consolidated file with an index file mapping partition IDs to batch ranges.
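As a quick sanity check of the file-count claim, the arithmetic can be written out directly (illustrative helper names):

```rust
/// Hash-based shuffle: one file per (input partition, output partition) pair.
fn hash_shuffle_files(n_input: usize, m_output: usize) -> usize {
    n_input * m_output
}

/// Sort-based shuffle: one data file plus one index file per input
/// partition, independent of the number of output partitions.
fn sort_shuffle_files(n_input: usize, _m_output: usize) -> usize {
    2 * n_input
}

fn main() {
    // The benchmark configuration used below: 4 input and 16 output partitions.
    assert_eq!(hash_shuffle_files(4, 16), 64);
    assert_eq!(sort_shuffle_files(4, 16), 8);
    println!(
        "hash: {} files, sort: {} files",
        hash_shuffle_files(4, 16),
        sort_shuffle_files(4, 16)
    );
}
```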
Related Issues
Key Features
Architecture
Configuration Options
- `ballista.shuffle.sort_based.enabled`
- `ballista.shuffle.sort_based.buffer_size`
- `ballista.shuffle.sort_based.memory_limit`
- `ballista.shuffle.sort_based.spill_threshold`

Changes
Core
- `sort_shuffle` module with all components (buffer, spill, index, writer, reader)
- `ShuffleWriter` trait for polymorphic shuffle writer handling
- `shuffle_reader.rs` updated to detect and read the sort-based shuffle format

Executor
- `DefaultExecutionEngine` updated to handle both `ShuffleWriterExec` and `SortShuffleWriterExec`
- `ShuffleWriterVariant` enum for type-safe dispatch

Scheduler/Planner
- `DistributedPlanner` updated to create `SortShuffleWriterExec` when enabled
- `ExecutionStageBuilder` updated to use the `ShuffleWriter` trait

Benchmarks
- `shuffle_bench` binary for comparing hash vs sort shuffle performance

Benchmark Results
Configuration: 1M rows, 4 input partitions, 16 output partitions
Performance Improvements:
Test Plan
- Integration tests (`ballista/client/tests/sort_shuffle.rs`)

Running the Benchmark
References
This implementation is inspired by:
🤖 Generated with Claude Code