Skip to content

Conversation

@andygrove
Copy link
Member

Summary

Two improvements to the shuffle writer:

  1. Add BufWriter for buffered file I/O

    • Wrapping File with BufWriter reduces syscalls when writing multiple small batches to shuffle files
    • Applied to both the hash-partitioned shuffle path in ShuffleWriterExec and the write_stream_to_disk utility
  2. Fix file size read before writer finish (bug fix)

    • Previously, fs::metadata() was called before writer.finish(), which could report incorrect file sizes
    • Data may not have been fully flushed to disk, especially now that BufWriter is used
    • Fixed by swapping the order: call finish() first, then read the file size

Note

This PR was created with AI assistance using Claude Code. All changes were reviewed and approved by a human maintainer.

Test plan

  • Existing shuffle_writer tests pass (cargo test -p ballista-core shuffle_writer)
  • Manual testing with distributed queries to verify shuffle performance

1. Add BufWriter for buffered file I/O in shuffle writer

   Wrapping File with BufWriter reduces syscalls when writing
   multiple small batches to shuffle files. This affects both
   the hash-partitioned shuffle path and the non-partitioned
   write_stream_to_disk utility.

2. Fix file size read before writer finish

   Previously, fs::metadata() was called before writer.finish(),
   which could report incorrect file sizes since data may not
   have been fully flushed to disk. This is especially important
   now that BufWriter is used, as buffered data would not be
   reflected in the file size until after finish() flushes it.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@andygrove andygrove force-pushed the shuffle-writer-optimizations branch from 5b3e9ea to 53c71b6 Compare January 17, 2026 16:30
if let Some(w) = w {
let num_bytes = fs::metadata(&w.path)?.len();
w.writer.finish()?;
let num_bytes = fs::metadata(&w.path)?.len();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

took me a moment to understand whats changed

Copy link
Contributor

@milenkovicm milenkovicm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @andygrove

@andygrove andygrove merged commit 7c62227 into apache:main Jan 17, 2026
27 checks passed
@andygrove andygrove deleted the shuffle-writer-optimizations branch January 17, 2026 21:43
@andygrove
Copy link
Member Author

Thanks for the review @milenkovicm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants