Skip to content

Fix memory reservation starvation in sort-merge#20642

Open
xudong963 wants to merge 4 commits intoapache:mainfrom
xudong963:fix/sort-merge-reservation-starvation
Open

Fix memory reservation starvation in sort-merge#20642
xudong963 wants to merge 4 commits intoapache:mainfrom
xudong963:fix/sort-merge-reservation-starvation

Conversation

@xudong963
Copy link
Member

@xudong963 xudong963 commented Mar 2, 2026

Which issue does this PR close?

  • Closes #.

Rationale for this change

This PR fixes memory reservation starvation in sort-merge when multiple sort partitions share a GreedyMemoryPool.

When multiple ExternalSorter instances run concurrently and share a single memory pool, the merge phase starves:

  1. Each partition pre-reserves sort_spill_reservation_bytes via merge_reservation
  2. When entering the merge phase, new_empty() was used to create a new reservation starting at 0 bytes, while the pre-reserved bytes sat idle in ExternalSorter.merge_reservation
  3. Those freed bytes were immediately consumed by other partitions racing for memory
  4. The merge could no longer allocate memory from the pool → OOM / starvation

What changes are included in this PR?

Are these changes tested?

I can't find a deterministic way to reproduce the bug, but it occurs in our production. Add an end-to-end test to verify the fix

Are there any user-facing changes?

@github-actions github-actions bot added the physical-plan Changes to the physical-plan crate label Mar 2, 2026
@xudong963 xudong963 marked this pull request as draft March 2, 2026 09:49
@xudong963 xudong963 marked this pull request as ready for review March 2, 2026 10:09
@xudong963
Copy link
Member Author

xudong963 commented Mar 2, 2026

I can't find a deterministic way to reproduce the bug now but it occurs in our production. I'd like to get more eyes for the PR!

Update: I added an end-to-end test which fails on main

@xudong963 xudong963 force-pushed the fix/sort-merge-reservation-starvation branch from 0f2140c to 651e9c9 Compare March 2, 2026 21:15
Comment on lines +358 to +364
// Transfer the pre-reserved merge memory to the streaming merge
// using `take()` instead of `new_empty()`. This ensures the merge
// stream starts with `sort_spill_reservation_bytes` already
// allocated, preventing starvation when concurrent sort partitions
// compete for pool memory. `take()` moves the bytes atomically
// without releasing them back to the pool, so other partitions
// cannot race to consume the freed memory.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pre reserved merge memory should be used as part of the sort merge stream.
I mean that if x pre reserved merge memory was reserved the sort merge stream should know about that so it wont think it starting from 0, otherwise this just reserve for unaccounted memory

coracuity added a commit to acuitymd/silk-chiffon that referenced this pull request Mar 4, 2026
DataFusion 52.1.0 has a TOCTOU race in ExternalSorter where merge
reservations are freed and re-created empty, letting other partitions
steal the memory (apache/datafusion#20642). Until the upstream fix
lands, compute a data-aware sort_spill_reservation_bytes by sampling
actual Arrow row sizes from the input, estimating spill file count,
and reserving enough for the merge phase.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

physical-plan Changes to the physical-plan crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants