initial version for reverse row groups by zhuqi-lucas · Pull Request #26 · massive-com/arrow-datafusion

zhuqi-lucas · 2025-12-17T12:28:49Z

Which issue does this PR close?

Closes #.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Copilot

Pull request overview

This PR implements Phase 1 of sort pushdown optimization to improve TopK query performance. When a query requests data in reverse order of a Parquet file's natural ordering, the optimizer now enables reverse row group scanning, which allows early termination in TopK queries while keeping the Sort operator for correctness.
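To illustrate why scanning row groups in reverse helps a TopK query, here is a minimal, self-contained sketch. It is not the PR's code: `RowGroup` and `top_k_desc` are hypothetical names, and the sketch assumes each row group exposes a `max` statistic that can be compared against the current k-th best value. It models `ORDER BY x DESC LIMIT k` over data stored in ascending order.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical stand-in for a Parquet row group: its values plus a max statistic.
struct RowGroup {
    max: i64,
    values: Vec<i64>,
}

/// Returns the `k` largest values in descending order, plus how many row groups
/// actually had to be read. `reverse` selects the row group scan direction.
fn top_k_desc(row_groups: &[RowGroup], k: usize, reverse: bool) -> (Vec<i64>, usize) {
    let order: Vec<&RowGroup> = if reverse {
        row_groups.iter().rev().collect()
    } else {
        row_groups.iter().collect()
    };

    // Min-heap holding the k largest values seen so far (the TopK state).
    let mut heap: BinaryHeap<Reverse<i64>> = BinaryHeap::new();
    let mut groups_read = 0;

    for rg in order {
        // Once k values are held, a row group whose max cannot beat the
        // current k-th value is skipped without being read.
        if heap.len() == k {
            if let Some(threshold) = heap.peek().map(|r| r.0) {
                if rg.max <= threshold {
                    continue;
                }
            }
        }
        groups_read += 1;
        for &v in &rg.values {
            if heap.len() < k {
                heap.push(Reverse(v));
            } else if let Some(smallest) = heap.peek().map(|r| r.0) {
                if v > smallest {
                    heap.pop();
                    heap.push(Reverse(v));
                }
            }
        }
    }

    // The final sort is still applied, mirroring the Sort operator that is kept.
    let mut out: Vec<i64> = heap.into_iter().map(|Reverse(v)| v).collect();
    out.sort_unstable_by(|a, b| b.cmp(a));
    (out, groups_read)
}
```

With three ascending row groups holding 1..=3, 4..=6, and 7..=9 and `k = 2`, the reverse scan reads one row group while the forward scan reads all three; both return the same rows because the final sort is retained.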

Key changes:

- Adds `enable_sort_pushdown` configuration option (default: true)
- Implements reverse row group scanning for Parquet files
- Returns inexact ordering to enable TopK early termination benefits
- Adds comprehensive test coverage across multiple file formats
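The role of an inexact ordering can be sketched with a toy plan model. Everything below is a simplified illustration, not the DataFusion types: `PushdownResult`, `Plan`, and `apply_pushdown` are hypothetical names, and the behavior for `Exact` (dropping the sort) is an assumption, since the overview only states that the Sort operator is kept in this phase.

```rust
/// Simplified model of the three outcomes named in the review summary.
#[derive(Debug, PartialEq)]
enum PushdownResult {
    /// The source guarantees the requested order (assumed: sort can be dropped).
    Exact,
    /// The source only approximates the order, so the sort must stay.
    Inexact,
    /// The source cannot help; the plan is left unchanged.
    Unsupported,
}

/// Toy physical plan with just enough shape to show the rewrite.
#[derive(Debug, PartialEq)]
enum Plan {
    /// Source scan; `reversed` marks reverse row group scanning.
    Scan { reversed: bool },
    /// Sort (or TopK) applied on top of an input plan.
    Sort(Box<Plan>),
}

/// Rewrites `Sort(Scan)` according to what the source reported.
fn apply_pushdown(result: PushdownResult) -> Plan {
    match result {
        PushdownResult::Exact => Plan::Scan { reversed: true },
        PushdownResult::Inexact => Plan::Sort(Box::new(Plan::Scan { reversed: true })),
        PushdownResult::Unsupported => Plan::Sort(Box::new(Plan::Scan { reversed: false })),
    }
}
```

In this model the `Inexact` case is the one the PR describes: the scan direction changes so the best rows arrive first, while the sort above it still guarantees a correct result.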

Reviewed changes

Copilot reviewed 28 out of 29 changed files in this pull request and generated no comments.


| File | Description |
| --- | --- |
| docs/source/user-guide/configs.md | Documents new configuration options including `enable_sort_pushdown`, `force_filter_selections`, `enable_ansi_mode`, and hash join InList pushdown settings |
| datafusion/common/src/config.rs | Adds `enable_sort_pushdown` configuration option with detailed documentation |
| datafusion/physical-optimizer/src/pushdown_sort.rs | Implements the PushdownSort optimizer rule that detects SortExec nodes and attempts to push sort requirements down to data sources |
| datafusion/physical-plan/src/sort_pushdown.rs | Defines `SortOrderPushdownResult` enum for communicating sort pushdown results (Exact, Inexact, Unsupported) |
| datafusion/physical-plan/src/execution_plan.rs | Adds `try_pushdown_sort` trait method to ExecutionPlan for sort optimization |
| datafusion/datasource-parquet/src/source.rs | Implements reverse row group scanning logic in ParquetSource with `reverse_row_groups` field |
| datafusion/datasource-parquet/src/sort.rs | Implements `reverse_row_selection` function to adjust row selections for reversed row group order |
| datafusion/datasource-parquet/src/opener.rs | Integrates reverse scanning into ParquetOpener using PreparedAccessPlan |
| datafusion/physical-expr-common/src/sort_expr.rs | Adds `is_reverse` and `is_reversed_sort_options` helpers for detecting reversed orderings |
| datafusion/sqllogictest/test_files/*.slt | Comprehensive SQL logic tests validating reverse scan behavior with various scenarios |
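Detecting a reversed ordering, as the `sort_expr.rs` helpers are described as doing, can be sketched as follows. This is an assumption-laden illustration rather than the PR's implementation: `SortOpts`, `is_reversed_options`, and `is_reverse_ordering` are hypothetical names, and the sketch assumes that "reversed" means both the direction and the null placement are flipped, and that a requested ordering may be a prefix of the file's ordering.

```rust
/// Hypothetical stand-in for per-column sort options.
#[derive(Clone, Copy, Debug, PartialEq)]
struct SortOpts {
    descending: bool,
    nulls_first: bool,
}

/// True when `a` is the exact mirror of `b`: opposite direction AND opposite
/// null placement (assumed definition of "reversed").
fn is_reversed_options(a: SortOpts, b: SortOpts) -> bool {
    a.descending != b.descending && a.nulls_first != b.nulls_first
}

/// True when every requested column matches the file ordering column-by-column
/// with reversed options. The request may be a prefix of the file ordering.
fn is_reverse_ordering(file: &[(&str, SortOpts)], requested: &[(&str, SortOpts)]) -> bool {
    !requested.is_empty()
        && requested.len() <= file.len()
        && requested
            .iter()
            .zip(file)
            .all(|((rc, ro), (fc, fo))| rc == fc && is_reversed_options(*ro, *fo))
}
```

Under these assumptions, a file declared as `EventTime ASC NULLS LAST` is a reverse match for a query ordering of `EventTime DESC NULLS FIRST`, but not for `EventTime DESC NULLS LAST`, because reading backwards also moves the nulls to the front.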


…#19557)

- Closes [apache#19535](apache#19535)

Reverse row selection should respect the row group index; this PR fixes the issue.

Are these changes tested? Yes

Are there any user-facing changes? No

(cherry picked from commit 27de50d)
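One way to read "respect the row group index" is sketched below. This is an interpretation, not the upstream fix: `Run` and `reverse_selection` are hypothetical names, and the sketch assumes a row selection is a list of select/skip runs covering only the row groups that are actually scanned, so the split points must come from the row counts of those specific row groups rather than from their position in the list.

```rust
/// Hypothetical select/skip run over consecutive rows.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Run {
    rows: usize,
    skip: bool,
}

/// Re-orders `selection` for a reversed row group scan.
/// `row_counts[i]` is the row count of row group `i` in the file;
/// `scanned` lists the indexes of the row groups the selection covers, in file order.
fn reverse_selection(row_counts: &[usize], scanned: &[usize], selection: &[Run]) -> Vec<Run> {
    let mut per_group: Vec<Vec<Run>> = Vec::new();
    let mut runs = selection.iter().copied();
    let mut current: Option<Run> = runs.next();

    for &rg in scanned {
        // Look the size up by row group index, not by position in `scanned`.
        let mut remaining = row_counts[rg];
        let mut group = Vec::new();
        while remaining > 0 {
            let run = match current {
                Some(r) => r,
                None => break,
            };
            let take = run.rows.min(remaining);
            if take > 0 {
                group.push(Run { rows: take, skip: run.skip });
            }
            remaining -= take;
            // Carry the unused tail of a run over into the next row group.
            current = if run.rows > take {
                Some(Run { rows: run.rows - take, skip: run.skip })
            } else {
                runs.next()
            };
        }
        per_group.push(group);
    }

    // Reverse the order of the groups; rows inside each group keep their order.
    per_group.into_iter().rev().flatten().collect()
}
```

For a file with three row groups of 4 rows where row group 1 is pruned, the selection `select 3, skip 2, select 3` over row groups 0 and 2 becomes `skip 1, select 3, select 3, skip 1` after reversal. Adjacent runs of the same kind are left unmerged in this sketch.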

## Which issue does this PR close?

Add sorted data benchmark.

- Closes [apache#18976](apache#18976)

## Rationale for this change

## What changes are included in this PR?

## Are these changes tested?

Yes. Test results for the reverse parquet PR (apache#18817) show it is about 30X faster than the main branch on sorted data:

```text
Running `/Users/zhuqi/arrow-datafusion/target/release/dfbench clickbench --iterations 5 --path /Users/zhuqi/arrow-datafusion/benchmarks/data/hits_0_sorted.parquet --queries-path /Users/zhuqi/arrow-datafusion/benchmarks/queries/clickbench/queries/sorted_data --sorted-by EventTime --sort-order ASC -o /Users/zhuqi/arrow-datafusion/benchmarks/results/reverse_parquet/data_sorted_clickbench.json`
Running benchmarks with the following options: RunOpt { query: None, pushdown: false, common: CommonOpt { iterations: 5, partitions: None, batch_size: None, mem_pool_type: "fair", memory_limit: None, sort_spill_reservation_bytes: None, debug: false }, path: "/Users/zhuqi/arrow-datafusion/benchmarks/data/hits_0_sorted.parquet", queries_path: "/Users/zhuqi/arrow-datafusion/benchmarks/queries/clickbench/queries/sorted_data", output_path: Some("/Users/zhuqi/arrow-datafusion/benchmarks/results/reverse_parquet/data_sorted_clickbench.json"), sorted_by: Some("EventTime"), sort_order: "ASC" }
⚠️ Forcing target_partitions=1 to preserve sort order
⚠️ (Because we want to get the pure performance benefit of sorted data to compare)
📊 Session config target_partitions: 1
Registering table with sort order: EventTime ASC
Executing: CREATE EXTERNAL TABLE hits STORED AS PARQUET LOCATION '/Users/zhuqi/arrow-datafusion/benchmarks/data/hits_0_sorted.parquet' WITH ORDER ("EventTime" ASC)
Q0: -- Must set for ClickBench hits_partitioned dataset. See apache#16591
-- set datafusion.execution.parquet.binary_as_string = true
SELECT * FROM hits ORDER BY "EventTime" DESC limit 10;
Query 0 iteration 0 took 14.7 ms and returned 10 rows
Query 0 iteration 1 took 10.2 ms and returned 10 rows
Query 0 iteration 2 took 8.7 ms and returned 10 rows
Query 0 iteration 3 took 7.9 ms and returned 10 rows
Query 0 iteration 4 took 7.9 ms and returned 10 rows
Query 0 avg time: 9.85 ms
+ set +x
Done
```

And the main branch result:

```text
Running `/Users/zhuqi/arrow-datafusion/target/release/dfbench clickbench --iterations 5 --path /Users/zhuqi/arrow-datafusion/benchmarks/data/hits_0_sorted.parquet --queries-path /Users/zhuqi/arrow-datafusion/benchmarks/queries/clickbench/queries/sorted_data --sorted-by EventTime --sort-order ASC -o /Users/zhuqi/arrow-datafusion/benchmarks/results/issue_18976/data_sorted_clickbench.json`
Running benchmarks with the following options: RunOpt { query: None, pushdown: false, common: CommonOpt { iterations: 5, partitions: None, batch_size: None, mem_pool_type: "fair", memory_limit: None, sort_spill_reservation_bytes: None, debug: false }, path: "/Users/zhuqi/arrow-datafusion/benchmarks/data/hits_0_sorted.parquet", queries_path: "/Users/zhuqi/arrow-datafusion/benchmarks/queries/clickbench/queries/sorted_data", output_path: Some("/Users/zhuqi/arrow-datafusion/benchmarks/results/issue_18976/data_sorted_clickbench.json"), sorted_by: Some("EventTime"), sort_order: "ASC" }
⚠️ Forcing target_partitions=1 to preserve sort order
⚠️ (Because we want to get the pure performance benefit of sorted data to compare)
📊 Session config target_partitions: 1
Registering table with sort order: EventTime ASC
Executing: CREATE EXTERNAL TABLE hits STORED AS PARQUET LOCATION '/Users/zhuqi/arrow-datafusion/benchmarks/data/hits_0_sorted.parquet' WITH ORDER ("EventTime" ASC)
Q0: -- Must set for ClickBench hits_partitioned dataset. See apache#16591
-- set datafusion.execution.parquet.binary_as_string = true
SELECT * FROM hits ORDER BY "EventTime" DESC limit 10;
Query 0 iteration 0 took 331.1 ms and returned 10 rows
Query 0 iteration 1 took 286.0 ms and returned 10 rows
Query 0 iteration 2 took 283.3 ms and returned 10 rows
Query 0 iteration 3 took 283.8 ms and returned 10 rows
Query 0 iteration 4 took 286.5 ms and returned 10 rows
Query 0 avg time: 294.13 ms
+ set +x
Done
```

## Are there any user-facing changes?

---------

Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Co-authored-by: Yongting You <2010youy01@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

(cherry picked from commit cde6dfa)
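The "30X" figure can be checked against the two benchmark logs above. The sketch below copies the per-iteration timings and the reported averages from those logs; the function names are only illustrative.

```rust
/// Per-iteration timings (ms) copied from the reverse parquet run.
const REVERSE_MS: [f64; 5] = [14.7, 10.2, 8.7, 7.9, 7.9];
/// Per-iteration timings (ms) copied from the main branch run.
const MAIN_MS: [f64; 5] = [331.1, 286.0, 283.3, 283.8, 286.5];

/// Arithmetic mean of a slice of timings.
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

/// Speedup recomputed from the printed per-iteration timings.
fn speedup_from_iterations() -> f64 {
    mean(&MAIN_MS) / mean(&REVERSE_MS)
}

/// Speedup computed from the "avg time" lines printed by the benchmark.
fn speedup_from_reported_averages() -> f64 {
    294.13 / 9.85
}
```

Both ratios come out just under 30 (roughly 29.8 and 29.9). The averages recomputed from the printed iterations (9.88 ms and 294.14 ms) differ slightly from the reported ones, presumably because the log rounds each iteration to one decimal place.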

Copilot AI review requested due to automatic review settings December 17, 2025 12:28

github-actions bot added the documentation, physical-expr, optimizer, core, sqllogictest, physical-plan, datasource, proto, common, and execution labels Dec 17, 2025

Copilot started reviewing on behalf of zhuqi-lucas December 17, 2025 12:29

Copilot AI reviewed Dec 17, 2025


github-actions bot removed the proto label Dec 17, 2025

support reverse files and row groups for dynamic topk

4ed0668

zhuqi-lucas force-pushed the branch-51-reverse-row-group branch from fd45ae8 to 4ed0668 Compare December 18, 2025 03:45

zhuqi-lucas and others added 10 commits December 23, 2025 15:40

use new design

1ae6efa

fix

a4b686c

fix

97a7174

fix

667c3a2

fix

419fea5

fix

36682cb

fix

f5f96f7

Merge branch 'branch-51' into branch-51-reverse-row-group

8b61380


zhuqi-lucas wants to merge 11 commits into branch-51 from branch-51-reverse-row-group
