Conversation

@adriangb (Contributor) commented Dec 21, 2025

The idea here is to use the metadata in Parquet files to infer sort orders, so users are not required to specify them manually; the Parquet metadata involved is sketched below.

This should probably be split into multiple PRs:

  • Record sort order when writing into a table created as WITH ORDER
  • Refactor PartitionedFile construction
  • Collect ordering during statistics collection
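
For reference, the metadata in question is the Parquet row group's `sorting_columns`. A minimal sketch of writing it with the arrow/parquet crates (assuming their current builder APIs; the file name and data are illustrative):

```rust
use std::{fs::File, sync::Arc};

use arrow::array::{ArrayRef, Int32Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::file::properties::WriterProperties;
use parquet::format::SortingColumn;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, false)]));
    let batch = RecordBatch::try_new(
        Arc::clone(&schema),
        vec![Arc::new(Int32Array::from(vec![1, 2, 3])) as ArrayRef],
    )?;

    // Declare that the data is sorted by leaf column 0, ascending, NULLS FIRST.
    let props = WriterProperties::builder()
        .set_sorting_columns(Some(vec![SortingColumn {
            column_idx: 0,
            descending: false,
            nulls_first: true,
        }]))
        .build();

    let mut writer = ArrowWriter::try_new(File::create("sorted.parquet")?, schema, Some(props))?;
    writer.write(&batch)?;
    writer.close()?; // sorting_columns lands in each row group's metadata
    Ok(())
}
```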

@github-actions bot added the physical-expr, core, sqllogictest, substrait, catalog, execution, proto, and datasource labels Dec 21, 2025
@adriangb (Contributor Author)

@zhuqi-lucas you may be interested in this

@zhuqi-lucas (Contributor)

Great work @adriangb, I will review it next week!

@zhuqi-lucas self-requested a review December 21, 2025 08:35
```rust
// Check that all orderings are identical
let first = all_orderings[0];
for ordering in &all_orderings[1..] {
    if *ordering != first {
```
Contributor:

Do we want to support more relaxed cases, for example something similar to:

```sql
SELECT 1372708800 + value AS t
FROM generate_series(0, 99999)
ORDER BY t
```
```sql
ORDER BY t + 1
```
Contributor:

Why does this change to `t + 1`?

Contributor Author:

Because it was picking up the ordering and causing even less data to be read 😆

Contributor Author:

I'll note I had to go with `ORDER BY t * t` in the end; even `ORDER BY t + 1` is optimized away now!


```
statement ok
DROP TABLE t;
```

Contributor:

It seems this only tests ASC NULLS FIRST. Better to add test cases for:
- DESC ordering
- Multi-column ordering
- Files with no ordering metadata

```rust
}

/// Checks if `other` is a prefix of this `LexOrdering`.
pub fn is_prefix(&self, other: &LexOrdering) -> bool {
```
Contributor:

I ran into a similar issue with sort pushdown and submitted a PR. I prefer to use eq_properties for this kind of thing; in my internal project it handled more cases when I tried it.

`ordering_satisfy` will handle more cases than a prefix check.
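
For illustration, a minimal standalone sketch of what the plain prefix check covers (simplified types, not DataFusion's actual API): it only accepts a required ordering that is literally a leading subsequence of the provided one, whereas equivalence-based checks such as `ordering_satisfy` can additionally exploit constant columns and equivalence classes:

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct SortKey {
    column: &'static str,
    descending: bool,
    nulls_first: bool,
}

/// Returns true if `required` is a leading prefix of `provided`.
/// Note what this cannot see: e.g. under a filter `b = 5`, data sorted
/// on [a] is also sorted on [a, b], but a prefix check rejects that.
fn is_prefix(provided: &[SortKey], required: &[SortKey]) -> bool {
    required.len() <= provided.len()
        && provided.iter().zip(required).all(|(p, r)| p == r)
}

fn main() {
    let asc = |column| SortKey { column, descending: false, nulls_first: true };
    let provided = [asc("a"), asc("b"), asc("c")];
    assert!(is_prefix(&provided, &[asc("a"), asc("b")]));
    assert!(!is_prefix(&provided, &[asc("b")])); // not a *leading* prefix
}
```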

Contributor Author:

I gave it a review - looks nice! I can sequence this behind your PR if needed.

@zhuqi-lucas (Contributor)

Great work @adriangb, I left some initial review comments; I will review again in more detail.

@adriangb marked this pull request as ready for review December 29, 2025 06:44
@adriangb (Contributor Author)

@zhuqi-lucas ping to take a look at this when you get a chance. Thanks!

@zhuqi-lucas (Contributor) left a comment

LGTM, thank you @adriangb for the great work; left some comments and questions.

```rust
if let Some(ordering) = &file.ordering {
    all_orderings.push(ordering);
} else {
    // If any file has no ordering, we can't derive a common ordering
```
Contributor:

We may add some debug or tracing log here:

Suggested change:
```diff
 // If any file has no ordering, we can't derive a common ordering
+tracing::debug!(
+    "Cannot derive common ordering: file {} has different ordering ({:?}) than first file ({:?})",
+    file.object_meta.location, ordering, first
+);
```

```rust
CacheAccessor<Path, Arc<Statistics>, Extra = ObjectMeta>
{
    /// Retrieves the information about the entries currently cached.
    fn list_entries(&self) -> HashMap<Path, FileStatisticsCacheEntry>;
```
Contributor:

Would it be useful to add the has_ordering info to `FileStatisticsCacheEntry`?

```rust
FileStatisticsCacheEntry {
    object_meta: object_meta.clone(),
    num_rows: stats.num_rows,
    num_columns: stats.column_statistics.len(),
    table_size_bytes: stats.total_byte_size,
    statistics_size_bytes: 0,
    has_ordering,  // NEW: indicates if ordering is cached
},
```

```rust
// Check that all orderings are identical
let first = all_orderings[0];
for ordering in &all_orderings[1..] {
    if *ordering != first {
```
Contributor:

Is it possible to use ordering_satisfy?

```rust
// may not be compatible with Parquet sorting columns (e.g. ordering on `random()`).
// So if we cannot create a Parquet sorting column from the ordering requirement,
// we skip setting sorting columns on the Parquet sink.
lex_ordering_to_sorting_columns(&ordering).ok()
```
Contributor:

Great point!
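
For context, a sketch of the target representation: Parquet's `SortingColumn` carries a leaf column index plus direction and null placement, so an ordering like `a ASC NULLS FIRST, b DESC NULLS LAST` maps roughly as below (assuming the parquet crate's `format::SortingColumn`; the column indices are hypothetical):

```rust
use parquet::format::SortingColumn;

fn main() {
    // Hypothetical leaf indices: 0 = `a`, 1 = `b`.
    let sorting_columns = vec![
        SortingColumn { column_idx: 0, descending: false, nulls_first: true }, // a ASC NULLS FIRST
        SortingColumn { column_idx: 1, descending: true, nulls_first: false }, // b DESC NULLS LAST
    ];
    assert_eq!(sorting_columns.len(), 2);
}
```

An expression such as `random()` has no leaf column index at all, which is why the conversion is fallible.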

```rust
/// Cached file orderings, keyed by file path.
/// Stored separately from statistics to maintain backwards compatibility
/// with the FileStatisticsCache trait interface.
orderings: DashMap<Path, (ObjectMeta, Option<LexOrdering>)>,
```
Contributor:

Is it possible to just use one map?

```rust
pub struct DefaultFileStatisticsCache {
    // Store both stats and ordering together
    cache: DashMap<Path, CacheEntry>,
}

struct CacheEntry {
    meta: ObjectMeta,
    statistics: Arc<Statistics>,
    ordering: Option<LexOrdering>,  // Initially None until fetched
}
```

Contributor Author:

I went down a rabbit hole of reworking the cache... there was a lot of weird stuff going on (made worse by this PR). Each value in the cache essentially comprised 3 parts:

  • Statistics
  • Ordering
  • "Extra" i.e. cache invalidation metadata

I tried to unify it all into 1 value... but it's a decent amount of code churn. I'll split this PR up.

```rust
// [a ASC, b DESC] is NOT a prefix of [a ASC, b ASC, c ASC] (mismatch in middle)
assert!(!ordering_abc.is_prefix(&ordering_ab));
}
```

Contributor:

Maybe add some tests for file modification:

```rust
#[tokio::test]
async fn test_ordering_cache_invalidation_on_file_modification() {
    let cache = DefaultFileStatisticsCache::default();
    let path = Path::from("test.parquet");

    // Cache with original metadata
    let meta_v1 = ObjectMeta {
        location: path.clone(),
        last_modified: Utc.timestamp_nanos(1000),
        size: 100,
        e_tag: None,
        version: None,
    };
    let ordering_v1 = Some(LexOrdering::new(vec![...]).unwrap());
    cache.put_ordering(&path, ordering_v1.clone(), &meta_v1);

    // Verify cached
    assert_eq!(cache.get_ordering(&path, &meta_v1), Some(ordering_v1.clone()));

    // File modified (size changed)
    let meta_v2 = ObjectMeta {
        last_modified: Utc.timestamp_nanos(2000),
        size: 200,  // ← changed
        ..meta_v1.clone()
    };

    // Should return None (cache miss due to size mismatch)
    assert_eq!(cache.get_ordering(&path, &meta_v2), None);

    // Cache new version
    let ordering_v2 = Some(LexOrdering::new(vec![...]).unwrap());
    cache.put_ordering(&path, ordering_v2.clone(), &meta_v2);

    // Old metadata should still be invalid
    assert_eq!(cache.get_ordering(&path, &meta_v1), None);
    // New metadata should work
    assert_eq!(cache.get_ordering(&path, &meta_v2), Some(ordering_v2));
}
```

Contributor Author:

added in 25a5aa0

```rust
let column_idx = sorting_col.column_idx as usize;

// Get the column path from the Parquet schema
// The column_idx in SortingColumn refers to leaf columns
```
Contributor:

Should we add a test for nested types? They can also be sorted.

I am not sure if it's possible?
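
On the nested-type question, one relevant fact: Parquet numbers leaf (primitive) columns depth-first through the schema, so a sorting column can in principle point at a leaf inside a struct. A hypothetical illustration:

```rust
// For a schema like:
//   message schema {
//     required group s { required int32 a; required int32 b; }
//     required int32 c;
//   }
// leaf indices are assigned depth-first: 0 -> s.a, 1 -> s.b, 2 -> c,
// so SortingColumn { column_idx: 1, .. } would refer to s.b.
```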

```rust
let ordering = self
    .options
    .format
    .infer_ordering(ctx, store, Arc::clone(&self.file_schema), meta)
```
Contributor:

Here we get the cached statistics, but we still need to compute the ordering info.

If possible, we should have just one cache entry per file, with the computed ordering info included in it.

@zhuqi-lucas (Contributor) commented Dec 31, 2025

FYI @adriangb, I will be out for the next 3 days, feel free to merge.

@alamb (Contributor) left a comment

Thanks @zhuqi-lucas and @adriangb -- I took a quick skim of this and it looks like a great idea. Let me know if there is anything else that needs a review -- otherwise 🚀

```rust
}
}

/// Creates a test Parquet file with sorting_columns metadata.
```
Contributor:

TIL: https://github.com/apache/parquet-format/blob/4b1c72c837bec5b792b2514f0057533030fcedf8/src/main/thrift/parquet.thrift#L1018-L1021

I didn't realize this was part of the parquet metadata, but it appears to have been that way for 13 years 👍

@adriangb (Contributor Author) commented Jan 1, 2026

github-merge-queue bot pushed a commit that referenced this pull request Jan 6, 2026
…uctor (#19596)

## Which issue does this PR close?

Part of #19433

## Rationale for this change

In preparation for ordering inference from Parquet metadata, we need to
be able to store per-file ordering information on `PartitionedFile`.
This PR adds the necessary infrastructure.

## What changes are included in this PR?

- Add `ordering: Option<LexOrdering>` field to `PartitionedFile` struct
- Add `new_from_meta(ObjectMeta)` constructor for creating files from
metadata (cleaner than manually constructing)
- Add `with_ordering()` builder method to set ordering information
- Add `with_partition_values()` builder method for consistency
- Update all `PartitionedFile` constructors to initialize `ordering:
None`
- Update callsites in test_util, proto, and substrait to use
`new_from_meta`

## Are these changes tested?

Yes, existing tests pass. The new field is currently always `None` so no
new tests are needed yet. Tests for ordering inference will come in a
follow-up PR.

## Are there any user-facing changes?

No user-facing API changes. The `ordering` field is public but users
typically construct `PartitionedFile` via the provided constructors
which handle this automatically.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.5 <[email protected]>
Co-authored-by: Copilot <[email protected]>
github-merge-queue bot pushed a commit that referenced this pull request Jan 8, 2026
## Which issue does this PR close?

Part of #19433

## Rationale for this change

In preparation for ordering inference from Parquet metadata, the cache
system needs refactoring to:
1. Support storing ordering information alongside statistics
2. Simplify the `CacheAccessor` trait by removing the `Extra` associated
type and `*_with_extra` methods
3. Move validation logic into typed wrapper structs with explicit
`is_valid_for()` methods

## What changes are included in this PR?

### Simplify `CacheAccessor` trait

**Before:**
```rust
pub trait CacheAccessor<K, V>: Send + Sync {
    type Extra: Clone;
    fn get(&self, k: &K) -> Option<V>;
    fn get_with_extra(&self, k: &K, e: &Self::Extra) -> Option<V>;
    fn put(&self, key: &K, value: V) -> Option<V>;
    fn put_with_extra(&self, key: &K, value: V, e: &Self::Extra) -> Option<V>;
    // ... other methods
}
```

**After:**
```rust
pub trait CacheAccessor<K, V>: Send + Sync {
    fn get(&self, key: &K) -> Option<V>;
    fn put(&self, key: &K, value: V) -> Option<V>;
    // ... other methods (no Extra type, no *_with_extra methods)
}
```

### Introduce typed wrapper structs for cached values

Instead of passing validation metadata separately via `Extra`, embed it
in the cached value type:

- **`CachedFileMetadata`** - contains `meta: ObjectMeta`, `statistics:
Arc<Statistics>`, `ordering: Option<LexOrdering>`
- **`CachedFileList`** - contains `files: Arc<Vec<ObjectMeta>>` with
`filter_by_prefix()` helper
- **`CachedFileMetadataEntry`** - contains `meta: ObjectMeta`,
`file_metadata: Arc<dyn FileMetadata>`

Each wrapper has an `is_valid_for(&ObjectMeta)` method that checks if
the cached entry is still valid (size and last_modified match).
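
A minimal sketch of that validation (illustrative, based on the size/last_modified description above, not the exact implementation):

```rust
impl CachedFileMetadata {
    /// The cached entry is usable only if the file on object storage
    /// still looks like the one we cached.
    pub fn is_valid_for(&self, meta: &ObjectMeta) -> bool {
        self.meta.size == meta.size && self.meta.last_modified == meta.last_modified
    }
}
```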

### New validation pattern

The typical usage pattern changes from:
```rust
// Before: validation hidden in get_with_extra
if let Some(stats) = cache.get_with_extra(&path, &object_meta) {
    // use stats
}
```

To:
```rust
// After: explicit validation
if let Some(cached) = cache.get(&path) {
    if cached.is_valid_for(&object_meta) {
        // use cached.statistics
    }
}
```

### Add ordering support

- `CachedFileMetadata` has new `ordering: Option<LexOrdering>` field
- `FileStatisticsCacheEntry` has new `has_ordering: bool` field for
introspection

## Are these changes tested?

Yes, existing cache tests pass plus new tests for ordering support.

## Are there any user-facing changes?

Breaking change to cache traits. Users with custom cache implementations
will need to:
1. Update `CacheAccessor` impl to remove `Extra` type and `*_with_extra`
methods
2. Update cached value types to the new wrappers (`CachedFileMetadata`,
etc.)
3. Update callsites to use the new validation pattern with
`is_valid_for()`

🤖 Generated with [Claude Code](https://claude.com/claude-code)
github-merge-queue bot pushed a commit that referenced this pull request Jan 8, 2026
## Which issue does this PR close?

Part of #19433

## Rationale for this change

When writing data to a table created with `CREATE EXTERNAL TABLE ...
WITH ORDER`, the sorting columns should be recorded in the Parquet
file's row group metadata. This allows downstream readers to know the
data is sorted and potentially skip sorting operations.

## What changes are included in this PR?

- Add `sort_expr_to_sorting_column()` and
`lex_ordering_to_sorting_columns()` functions in `metadata.rs` to
convert DataFusion ordering to Parquet `SortingColumn`
- Add `sorting_columns` field to `ParquetSink` with
`with_sorting_columns()` builder method
- Update `create_writer_physical_plan()` to pass order requirements to
`ParquetSink`
- Update `create_writer_props()` to set sorting columns on
`WriterProperties`
- Add test verifying `sorting_columns` metadata is written correctly

## Are these changes tested?

Yes, added `test_create_table_with_order_writes_sorting_columns` that:
1. Creates an external table with `WITH ORDER (a ASC NULLS FIRST, b DESC
NULLS LAST)`
2. Inserts data
3. Reads the Parquet file and verifies the `sorting_columns` metadata
matches the expected order

## Are there any user-facing changes?

No user-facing API changes. Parquet files written via `INSERT INTO` or
`COPY` for tables with `WITH ORDER` will now contain `sorting_columns`
metadata in the row group.
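
A minimal sketch of inspecting that metadata after a write (assuming a recent `parquet` crate where `RowGroupMetaData` exposes `sorting_columns()`; the file name is illustrative):

```rust
use std::fs::File;

use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let reader = SerializedFileReader::new(File::open("sorted.parquet")?)?;
    // Each row group records its own sorting columns, if any were written.
    for rg in reader.metadata().row_groups() {
        println!("sorting_columns = {:?}", rg.sorting_columns());
    }
    Ok(())
}
```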

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.5 <[email protected]>
Co-authored-by: Copilot <[email protected]>
@github-actions bot removed the core and sqllogictest labels Jan 8, 2026
@github-actions bot removed the substrait, execution, and proto labels Jan 8, 2026
@adriangb changed the title from "infer parquet file order from metadata, write order in ParquetSink" to "infer parquet file order from metadata and use it to optimize scans" Jan 8, 2026
@github-actions bot added the execution label Jan 8, 2026
@adriangb requested a review from zhuqi-lucas January 8, 2026 18:56
@adriangb (Contributor Author) commented Jan 8, 2026

@zhuqi-lucas I think this is ready for another review now; I've incorporated the changes from the broken-out PRs, rebased on main, and addressed your previous feedback.

@zhuqi-lucas (Contributor) left a comment

Great work @adriangb, LGTM now!

adriangb and others added 6 commits January 9, 2026 13:08
This PR adds the ability for DataFusion to automatically infer the sort
ordering of Parquet files from their embedded `sorting_columns` metadata.

Key changes:
- Add `FileMeta` struct and `infer_ordering`/`infer_stats_and_ordering`
  methods to `FileFormat` trait
- Add `ordering_from_parquet_metadata` function to convert Parquet
  sorting_columns to LexOrdering
- Implement `infer_ordering` in `ParquetFormat`
- Add `derive_common_ordering_from_files` to find common ordering prefix
  across all files in a scan
- Update `ListingTable` to derive output ordering from file orderings
  when user doesn't specify `file_sort_order`
- Add `is_prefix` method to `LexOrdering`

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@github-actions bot added the core label Jan 9, 2026
@github-actions bot removed the physical-expr label Jan 9, 2026
@adriangb added this pull request to the merge queue Jan 9, 2026
Merged via the queue into apache:main with commit 20870da Jan 9, 2026
28 checks passed