Commit 0445c48

andygrove and claude committed
docs: improve sort shuffle module documentation
Add algorithm description from Apache Spark documentation and improve wording for clarity.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
1 parent d4fcf1a commit 0445c48

1 file changed (+7 -2 lines changed): ballista/core/src/execution_plans/sort_shuffle

ballista/core/src/execution_plans/sort_shuffle/mod.rs

Lines changed: 7 additions & 2 deletions
@@ -17,12 +17,17 @@
 
 //! Sort-based shuffle implementation for Ballista.
 //!
-//! This module provides an alternative to the hash-based shuffle that writes
+//! This module provides an alternative to the hash-based shuffle. It writes
 //! a single consolidated file per input partition (sorted by output partition ID)
-//! along with an index file mapping partition IDs to byte offsets.
+//! along with an index file mapping partition IDs to batch ranges.
 //!
 //! This approach reduces file count from `N × M` (N input partitions × M output partitions)
 //! to `2 × N` files (one data + one index per input partition).
+//!
+//! The algorithm follows the approach used by Apache Spark: internally, results from
+//! individual map tasks are kept in memory until they can't fit. Then, these are
+//! sorted based on the target partition and written to a single file. On the reduce
+//! side, tasks read the relevant sorted blocks.
 
 mod buffer;
 mod config;
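
The doc comment above describes the idea only in prose, so a small illustration may help. The following is a minimal in-memory sketch, not Ballista's actual implementation: `TaggedBatch`, `ConsolidatedOutput`, `consolidate`, `read_partition`, and `file_counts` are hypothetical names invented for this example. It assumes everything fits in memory (no spilling) and uses a `Vec` in place of the data file and a `BTreeMap` in place of the index file.

```rust
use std::collections::BTreeMap;
use std::ops::Range;

/// Hypothetical stand-in for a record batch: a payload tagged with the
/// output partition it is destined for.
#[derive(Debug, Clone, PartialEq)]
pub struct TaggedBatch {
    pub output_partition: usize,
    pub payload: Vec<u8>,
}

/// Consolidated output for one input partition: batches ordered by output
/// partition ID, plus an index from partition ID to a range of batch positions.
#[derive(Debug, Default)]
pub struct ConsolidatedOutput {
    pub batches: Vec<TaggedBatch>,
    pub index: BTreeMap<usize, Range<usize>>,
}

/// Map side: sort buffered batches by output partition and build the index.
pub fn consolidate(mut buffered: Vec<TaggedBatch>) -> ConsolidatedOutput {
    // The sort is stable, so batches keep their arrival order within a partition.
    buffered.sort_by_key(|b| b.output_partition);

    let mut index = BTreeMap::new();
    let mut start = 0;
    while start < buffered.len() {
        let partition = buffered[start].output_partition;
        let mut end = start;
        // Advance `end` past every batch that belongs to this partition.
        while end < buffered.len() && buffered[end].output_partition == partition {
            end += 1;
        }
        index.insert(partition, start..end);
        start = end;
    }
    ConsolidatedOutput { batches: buffered, index }
}

/// Reduce side: return only the batches for one output partition.
pub fn read_partition(out: &ConsolidatedOutput, partition: usize) -> &[TaggedBatch] {
    match out.index.get(&partition) {
        Some(range) => &out.batches[range.clone()],
        None => &[], // no batches were produced for this partition
    }
}

/// File counts for `n` input and `m` output partitions,
/// returned as (hash-based, sort-based).
pub fn file_counts(n: usize, m: usize) -> (usize, usize) {
    (n * m, 2 * n)
}
```

For example, with 4 input partitions and 8 output partitions, `file_counts(4, 8)` gives 32 files for the hash-based layout and 8 for the sort-based one. A reducer for partition `p` only needs the index entry for `p` to locate its slice of the consolidated data.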
