-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Remove coalesce batches rule and deprecate CoalesceBatchesExec #19622
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
CoalesceBatchesExec
|
Looks great, can you fix the clippy errors @feniljain ? |
|
run benchmarks |
|
🤖 |
|
🤖: Benchmark completed Details
|
|
QQuery 23 in clickbench_partitioned just halved the time 😲 |
I think this is not correct, it is super flaky lately. @alamb do you know the source of this? Is it running OOM / swapping perhaps or slow / not predictable block storage? |
Ouuh
Did you mean to tag Andrew instead of bot? |
Yeah @alamb :D |
|
run benchmarks |
|
🤖 |
|
🤖: Benchmark completed Details
|
|
QQuery 23 still seems to be leading ahead! |
I think it is the same flakiness |
datafusion/core/tests/physical_optimizer/filter_pushdown/mod.rs
Outdated
Show resolved
Hide resolved
I suspect this has to do with timing. Basically Q23 is like This can now takes advantage of dynamic filter pushdown which can prune some of the files. However, the updates of the dynamic filters is not deterministic so I think that accounts for the variance run to run -- depending on what rows have been seen already will potentially skip entire files I suspect this accounts for the non determinsim I will see if I can reproduce the results locally |
|
run benchmark clickbench_partitioned |
|
🤖 |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @feniljain -- this looks great to me
| Boundedness::Bounded => memory_exec_sorted(&schema, ordering.clone()), | ||
| }; | ||
| let repartition_rr = repartition_exec_round_robin(source); | ||
| let repartition_hash = repartition_exec_hash(repartition_rr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not for this PR, but I think now actually we have fixed the double repartition as well, as described in https://datafusion.apache.org/blog/output/2025/12/15/avoid-consecutive-repartitions/
@gene-bordegaray is that correct? if so I will file a ticket to try and update these tests so they match what datafusion creates now (a single round robin hash)
|
🤖: Benchmark completed Details
|
|
Epic work @feniljain -- thanks again |
Which issue does this PR close?
CoalesceBatchesoptimizer rule #19591Rationale for this change
Explained in issue itself
What changes are included in this PR?
CoalesceBatchesExecAre these changes tested?
Yes
Are there any user-facing changes?
Yes, added a deprecation tag on
CoalesceBatchesExec