@alamb alamb commented Oct 2, 2025

Which issue does this PR close?

TODO:

Rationale for this change

Upgrade to the latest arrow

Also, there are several new features in arrow-57 that I want to be able to test including Variant, arrow-avro, and a new metadata reader.

What changes are included in this PR?

  1. Update arrow/parquet
  2. Update prost
  3. Update substrait
  4. Update pbjson
  5. Make API changes to avoid deprecated APIs

Are these changes tested?

By CI

Are there any user-facing changes?

New arrow

@github-actions github-actions bot added the common Related to common crate label Oct 2, 2025
@github-actions github-actions bot added substrait Changes to the substrait crate proto Related to proto crate labels Oct 2, 2025

alamb commented Oct 2, 2025

Many of the current failures occur because this used to work:

select arrow_cast('2021-01-01T00:00:00', 'Timestamp(Nanosecond, Some("-05:00"))')

or

SELECT arrow_cast(secs, 'Timestamp(Millisecond, None)') FROM t

After the arrow 57 upgrade, it fails with errors like:

statement error DataFusion error: Execution error: Unsupported type 'Timestamp\(Nanosecond, None\)'\. Must be a supported arrow type name such as 'Int32' or 'Timestamp\(ns\)'\. Error expected double quoted string for Timezone, got 'None'
# arrow_typeof_timestamp
query T
SELECT arrow_typeof(now()::timestamp)
----
Timestamp(ns)

I believe the problem is that the timestamp type format has changed to Timestamp(ns), and the FromStr implementation doesn't handle the old format.

I think what we need to do is support both formats for backwards compatibility. I will work on filing an upstream issue.
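
To illustrate the backwards-compatibility idea, here is a minimal sketch of a parser that accepts both the old spelling (Timestamp(Nanosecond, None)) and the new one (Timestamp(ns)). It is not arrow's actual FromStr implementation: the Unit and TimestampType types, the function names, the abbreviated unit names other than ns, and the new-style quoted timezone form are all assumptions made for illustration.

```rust
/// Time unit of a timestamp type (a local stand-in, not arrow's `TimeUnit`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Unit {
    Second,
    Millisecond,
    Microsecond,
    Nanosecond,
}

/// Parsed form of a timestamp type name: a unit plus an optional timezone.
#[derive(Debug, Clone, PartialEq, Eq)]
struct TimestampType {
    unit: Unit,
    tz: Option<String>,
}

/// Accepts the long (old) unit names and short (new) abbreviations.
/// Only "ns" appears in the PR; the other abbreviations are assumed.
fn parse_unit(s: &str) -> Option<Unit> {
    match s {
        "Second" | "s" => Some(Unit::Second),
        "Millisecond" | "ms" => Some(Unit::Millisecond),
        "Microsecond" | "us" | "µs" => Some(Unit::Microsecond),
        "Nanosecond" | "ns" => Some(Unit::Nanosecond),
        _ => None,
    }
}

/// Parses `None`, `Some("tz")` (old style) or `"tz"` (assumed new style).
/// The outer Option signals a parse failure, the inner one a missing timezone.
fn parse_tz(s: &str) -> Option<Option<String>> {
    if s == "None" {
        return Some(None);
    }
    // Unwrap the legacy `Some(...)` wrapper when present.
    let inner = s
        .strip_prefix("Some(")
        .and_then(|rest| rest.strip_suffix(')'))
        .unwrap_or(s);
    // The timezone itself must be a double-quoted string.
    let tz = inner.strip_prefix('"')?.strip_suffix('"')?;
    Some(Some(tz.to_string()))
}

/// Parses a timestamp type name in either format; returns None otherwise.
fn parse_timestamp_type(s: &str) -> Option<TimestampType> {
    let body = s.trim().strip_prefix("Timestamp(")?.strip_suffix(')')?;
    match body.split_once(',') {
        Some((unit, tz)) => Some(TimestampType {
            unit: parse_unit(unit.trim())?,
            tz: parse_tz(tz.trim())?,
        }),
        // No second argument: the new short form without a timezone.
        None => Some(TimestampType {
            unit: parse_unit(body.trim())?,
            tz: None,
        }),
    }
}
```

With this sketch, parse_timestamp_type("Timestamp(Nanosecond, None)") and parse_timestamp_type("Timestamp(ns)") yield the same value, which is the property the two arrow_cast queries above rely on.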


// Create Flight client
let mut client = FlightServiceClient::connect("http://localhost:50051").await?;
let endpoint = Endpoint::new("http://localhost:50051")?;

This is due to the new version of tonic.


// add an initial FlightData message that sends schema
let options = arrow::ipc::writer::IpcWriteOptions::default();
let mut compression_context = CompressionContext::default();

let validate =
T::validate_decimal_precision(new_value, self.target_precision);
let validate = T::validate_decimal_precision(

List(Field { name: "item", data_type: List(Field { name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) List(Field { name: "item", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) List(Field { name: "item", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })
List(Field { name: "item", data_type: List(Field { name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) List(Field { name: "item", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) List(Field { name: "item", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })
List(Field { name: "item", data_type: List(Field { name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) List(Field { name: "item", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) List(Field { name: "item", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })
List(nullable List(nullable Int64)) List(nullable Float64) List(nullable Utf8)

Many of the diffs in this file are related to improvements in DataType display, tracked in this ticket

I will try to call out individual changes when I see them. Lists are much nicer now:

05)--------ProjectionExec: expr=[]
06)----------CoalesceBatchesExec: target_batch_size=8192
07)------------FilterExec: substr(md5(CAST(value@0 AS Utf8View)), 1, 32) IN ([Literal { value: Utf8View("7f4b18de3cfeb9b4ac78c381ee2ad278"), field: Field { name: "lit", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }, Literal { value: Utf8View("a"), field: Field { name: "lit", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }, Literal { value: Utf8View("b"), field: Field { name: "lit", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }, Literal { value: Utf8View("c"), field: Field { name: "lit", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }])
07)------------FilterExec: substr(md5(CAST(value@0 AS Utf8View)), 1, 32) IN ([Literal { value: Utf8View("7f4b18de3cfeb9b4ac78c381ee2ad278"), field: Field { name: "lit", data_type: Utf8View } }, Literal { value: Utf8View("a"), field: Field { name: "lit", data_type: Utf8View } }, Literal { value: Utf8View("b"), field: Field { name: "lit", data_type: Utf8View } }, Literal { value: Utf8View("c"), field: Field { name: "lit", data_type: Utf8View } }])

SELECT arrow_typeof(now()::timestamp)
----
Timestamp(Nanosecond, None)
Timestamp(ns)

## Timestamps: Create a table

statement ok

The timestamp format has changed (improved!) so let's also add tests for the new format

pbjson-types = { workspace = true }
prost = { workspace = true }
substrait = { version = "0.58", features = ["serde"] }
substrait = { version = "0.59", features = ["serde"] }

Since prost is updated, we must also update substrait.

@github-actions github-actions bot added the core Core DataFusion crate label Oct 2, 2025
@alamb alamb force-pushed the alamb/upgrade_arrow_57 branch from 9d06200 to 1b7b559 Compare October 2, 2025 18:56
@github-actions github-actions bot added the logical-expr Logical plan and expressions label Oct 2, 2025
@alamb alamb force-pushed the alamb/upgrade_arrow_57 branch from 8ecbbed to d3b328b Compare October 3, 2025 15:48
@alamb alamb force-pushed the alamb/upgrade_arrow_57 branch from f61623e to 9f6a390 Compare October 3, 2025 16:04
@github-actions github-actions bot added sql SQL Planner physical-expr Changes to the physical-expr crates optimizer Optimizer rules functions Changes to functions implementation physical-plan Changes to the physical-plan crate labels Oct 3, 2025
@alamb alamb force-pushed the alamb/upgrade_arrow_57 branch from d5bd26e to 7709acc Compare October 3, 2025 20:26
| alltypes_plain.parquet | 1851 | 10181 | 2 | page_index=false |
| alltypes_tiny_pages.parquet | 454233 | 881418 | 2 | page_index=true |
| lz4_raw_compressed_larger.parquet | 380836 | 2939 | 2 | page_index=false |
| alltypes_plain.parquet | 1851 | 10309 | 2 | page_index=false |

I don't know why the metadata size has increased. I will investigate.

let expected = "Field { name: \"c0\", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, \
Field { name: \"c1\", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }";
assert_eq!(expected, arrow_schema.to_string());
insta::assert_snapshot!(arrow_schema.to_string(), @r#"Field { "c0": nullable Boolean }, Field { "c1": nullable Boolean }"#);

Many of the diffs are due to the changes in the formatting of Fields and DataTypes (see below).

+----------------------+
| arrow_typeof(test.l) |
+----------------------+
| List(nullable Int32) |

The new display is much easier to read, in my opinion.

alamb commented Oct 3, 2025

Ok, the tests are now looking good enough to test with the new thrift decoder

alamb commented Oct 4, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.14.0-1016-gcp #17~24.04.1-Ubuntu SMP Wed Sep 3 01:55:36 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/upgrade_arrow_57 (0cfb693) to 0f3cf27 diff using: tpch_mem
Results will be posted here when complete

alamb commented Oct 4, 2025

🤖: Benchmark completed

Details

Comparing HEAD and alamb_upgrade_arrow_57
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ alamb_upgrade_arrow_57 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 172.82 ms │              170.20 ms │     no change │
│ QQuery 2     │  25.32 ms │               26.80 ms │  1.06x slower │
│ QQuery 3     │  40.67 ms │               35.01 ms │ +1.16x faster │
│ QQuery 4     │  28.11 ms │               27.70 ms │     no change │
│ QQuery 5     │  75.67 ms │               74.89 ms │     no change │
│ QQuery 6     │  19.71 ms │               19.40 ms │     no change │
│ QQuery 7     │ 211.76 ms │              211.24 ms │     no change │
│ QQuery 8     │  33.37 ms │               30.89 ms │ +1.08x faster │
│ QQuery 9     │ 102.87 ms │               93.38 ms │ +1.10x faster │
│ QQuery 10    │  58.60 ms │               57.87 ms │     no change │
│ QQuery 11    │  16.66 ms │               17.12 ms │     no change │
│ QQuery 12    │  51.03 ms │               50.17 ms │     no change │
│ QQuery 13    │  45.56 ms │               46.20 ms │     no change │
│ QQuery 14    │  13.72 ms │               13.95 ms │     no change │
│ QQuery 15    │  24.12 ms │               23.72 ms │     no change │
│ QQuery 16    │  24.35 ms │               24.28 ms │     no change │
│ QQuery 17    │ 147.33 ms │              145.37 ms │     no change │
│ QQuery 18    │ 316.29 ms │              315.38 ms │     no change │
│ QQuery 19    │  36.14 ms │               36.46 ms │     no change │
│ QQuery 20    │  47.59 ms │               48.14 ms │     no change │
│ QQuery 21    │ 327.74 ms │              294.32 ms │ +1.11x faster │
│ QQuery 22    │  20.95 ms │               19.61 ms │ +1.07x faster │
└──────────────┴───────────┴────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                     ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                     │ 1840.37ms │
│ Total Time (alamb_upgrade_arrow_57)   │ 1782.09ms │
│ Average Time (HEAD)                   │   83.65ms │
│ Average Time (alamb_upgrade_arrow_57) │   81.00ms │
│ Queries Faster                        │         5 │
│ Queries Slower                        │         1 │
│ Queries with No Change                │        16 │
│ Queries with Failure                  │         0 │
└───────────────────────────────────────┴───────────┘
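
For reading the Change column: the values in the table above are consistent with dividing the slower time by the faster one and treating anything within roughly 5% as noise. A minimal sketch of that rule follows; the change_label function and the exact 5% threshold are assumptions inferred from the numbers, not taken from the benchmark script.

```rust
/// Formats a "Change" cell from the baseline (HEAD) and branch times in ms.
/// The 5% noise threshold is an assumption inferred from the posted tables.
fn change_label(baseline_ms: f64, branch_ms: f64) -> String {
    const NOISE: f64 = 0.05;
    if branch_ms < baseline_ms {
        // Branch is quicker: report how many times faster it ran.
        let ratio = baseline_ms / branch_ms;
        if ratio > 1.0 + NOISE {
            format!("+{:.2}x faster", ratio)
        } else {
            "no change".to_string()
        }
    } else {
        // Branch is equal or slower: report how many times slower it ran.
        let ratio = branch_ms / baseline_ms;
        if ratio > 1.0 + NOISE {
            format!("{:.2}x slower", ratio)
        } else {
            "no change".to_string()
        }
    }
}
```

For example, change_label(40.67, 35.01) reproduces the Query 3 entry and change_label(25.32, 26.80) reproduces the Query 2 entry of the tpch_mem table.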

alamb commented Oct 4, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.14.0-1016-gcp #17~24.04.1-Ubuntu SMP Wed Sep 3 01:55:36 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/upgrade_arrow_57 (0cfb693) to 0f3cf27 diff using: clickbench_partitioned
Results will be posted here when complete

alamb commented Oct 4, 2025

🤖: Benchmark completed

Details

Comparing HEAD and alamb_upgrade_arrow_57
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ alamb_upgrade_arrow_57 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.22 ms │                2.14 ms │     no change │
│ QQuery 1     │    49.54 ms │               49.79 ms │     no change │
│ QQuery 2     │   133.80 ms │              141.22 ms │  1.06x slower │
│ QQuery 3     │   154.13 ms │              158.96 ms │     no change │
│ QQuery 4     │   999.02 ms │             1029.05 ms │     no change │
│ QQuery 5     │  1457.78 ms │             1484.94 ms │     no change │
│ QQuery 6     │     2.10 ms │                2.23 ms │  1.06x slower │
│ QQuery 7     │    53.99 ms │               54.98 ms │     no change │
│ QQuery 8     │  1378.95 ms │             1413.04 ms │     no change │
│ QQuery 9     │  1743.92 ms │             1760.12 ms │     no change │
│ QQuery 10    │   378.81 ms │              391.56 ms │     no change │
│ QQuery 11    │   435.88 ms │              445.91 ms │     no change │
│ QQuery 12    │  1318.89 ms │             1368.96 ms │     no change │
│ QQuery 13    │  2094.11 ms │             2109.04 ms │     no change │
│ QQuery 14    │  1255.34 ms │             1275.76 ms │     no change │
│ QQuery 15    │  1162.03 ms │             1224.07 ms │  1.05x slower │
│ QQuery 16    │  2635.18 ms │             2656.01 ms │     no change │
│ QQuery 17    │  2586.47 ms │             2645.02 ms │     no change │
│ QQuery 18    │  5483.36 ms │             4868.30 ms │ +1.13x faster │
│ QQuery 19    │   124.80 ms │              126.32 ms │     no change │
│ QQuery 20    │  2104.33 ms │             1991.58 ms │ +1.06x faster │
│ QQuery 21    │  2473.34 ms │             2322.82 ms │ +1.06x faster │
│ QQuery 22    │  4514.85 ms │             3935.77 ms │ +1.15x faster │
│ QQuery 23    │ 14978.72 ms │            12767.28 ms │ +1.17x faster │
│ QQuery 24    │   214.97 ms │              214.15 ms │     no change │
│ QQuery 25    │   518.82 ms │              516.03 ms │     no change │
│ QQuery 26    │   223.15 ms │              214.72 ms │     no change │
│ QQuery 27    │  2948.86 ms │             2848.18 ms │     no change │
│ QQuery 28    │ 23563.88 ms │            24515.81 ms │     no change │
│ QQuery 29    │   986.38 ms │              966.06 ms │     no change │
│ QQuery 30    │  1301.06 ms │             1316.39 ms │     no change │
│ QQuery 31    │  1344.98 ms │             1342.06 ms │     no change │
│ QQuery 32    │  4990.45 ms │             4519.35 ms │ +1.10x faster │
│ QQuery 33    │  5926.40 ms │             5679.44 ms │     no change │
│ QQuery 34    │  5836.64 ms │             6371.95 ms │  1.09x slower │
│ QQuery 35    │  1978.57 ms │             1959.59 ms │     no change │
│ QQuery 36    │   124.02 ms │              120.34 ms │     no change │
│ QQuery 37    │    51.09 ms │               54.29 ms │  1.06x slower │
│ QQuery 38    │   120.80 ms │              121.73 ms │     no change │
│ QQuery 39    │   198.05 ms │              198.87 ms │     no change │
│ QQuery 40    │    43.84 ms │               43.57 ms │     no change │
│ QQuery 41    │    36.81 ms │               38.73 ms │  1.05x slower │
│ QQuery 42    │    31.34 ms │               31.72 ms │     no change │
└──────────────┴─────────────┴────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                     ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                     │ 97961.63ms │
│ Total Time (alamb_upgrade_arrow_57)   │ 95297.87ms │
│ Average Time (HEAD)                   │  2278.18ms │
│ Average Time (alamb_upgrade_arrow_57) │  2216.23ms │
│ Queries Faster                        │          6 │
│ Queries Slower                        │          6 │
│ Queries with No Change                │         31 │
│ Queries with Failure                  │          0 │
└───────────────────────────────────────┴────────────┘

alamb commented Oct 4, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.14.0-1016-gcp #17~24.04.1-Ubuntu SMP Wed Sep 3 01:55:36 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/upgrade_arrow_57 (0cfb693) to 0f3cf27 diff using: clickbench_extended
Results will be posted here when complete

alamb commented Oct 4, 2025

🤖: Benchmark completed

Details

Comparing HEAD and alamb_upgrade_arrow_57
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ alamb_upgrade_arrow_57 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │  2748.60 ms │             2599.96 ms │ +1.06x faster │
│ QQuery 1     │  1344.48 ms │             1212.98 ms │ +1.11x faster │
│ QQuery 2     │  2453.12 ms │             2329.05 ms │ +1.05x faster │
│ QQuery 3     │  1166.94 ms │             1181.82 ms │     no change │
│ QQuery 4     │  2255.61 ms │             2220.57 ms │     no change │
│ QQuery 5     │ 27120.42 ms │            27691.74 ms │     no change │
│ QQuery 6     │  4325.56 ms │             4126.85 ms │     no change │
│ QQuery 7     │  3581.20 ms │             3545.46 ms │     no change │
└──────────────┴─────────────┴────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                     ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                     │ 44995.92ms │
│ Total Time (alamb_upgrade_arrow_57)   │ 44908.43ms │
│ Average Time (HEAD)                   │  5624.49ms │
│ Average Time (alamb_upgrade_arrow_57) │  5613.55ms │
│ Queries Faster                        │          3 │
│ Queries Slower                        │          0 │
│ Queries with No Change                │          5 │
│ Queries with Failure                  │          0 │
└───────────────────────────────────────┴────────────┘

@alamb alamb force-pushed the alamb/upgrade_arrow_57 branch 3 times, most recently from 56c012a to 3d43be9 Compare October 6, 2025 20:11
@github-actions github-actions bot removed the sqllogictest SQL Logic Tests (.slt) label Oct 6, 2025
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Oct 6, 2025
caused by
Error during planning: Cannot automatically convert Null to Float16

NULL

alamb commented Oct 7, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.14.0-1016-gcp #17~24.04.1-Ubuntu SMP Wed Sep 3 01:55:36 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/upgrade_arrow_57 (d7a186f) to 307f5c3 diff using: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

alamb commented Oct 7, 2025

🤖: Benchmark completed

Details

Comparing HEAD and alamb_upgrade_arrow_57
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ alamb_upgrade_arrow_57 ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 0     │  2669.13 ms │             2687.33 ms │ no change │
│ QQuery 1     │  1258.20 ms │             1250.07 ms │ no change │
│ QQuery 2     │  2477.29 ms │             2409.65 ms │ no change │
│ QQuery 3     │  1177.64 ms │             1199.86 ms │ no change │
│ QQuery 4     │  2210.53 ms │             2245.16 ms │ no change │
│ QQuery 5     │ 27733.82 ms │            27501.40 ms │ no change │
│ QQuery 6     │  4164.63 ms │             4195.87 ms │ no change │
│ QQuery 7     │  3560.02 ms │             3450.10 ms │ no change │
└──────────────┴─────────────┴────────────────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                     ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                     │ 45251.25ms │
│ Total Time (alamb_upgrade_arrow_57)   │ 44939.44ms │
│ Average Time (HEAD)                   │  5656.41ms │
│ Average Time (alamb_upgrade_arrow_57) │  5617.43ms │
│ Queries Faster                        │          0 │
│ Queries Slower                        │          0 │
│ Queries with No Change                │          8 │
│ Queries with Failure                  │          0 │
└───────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ alamb_upgrade_arrow_57 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.10 ms │                2.37 ms │  1.13x slower │
│ QQuery 1     │    49.64 ms │               49.66 ms │     no change │
│ QQuery 2     │   135.52 ms │              137.49 ms │     no change │
│ QQuery 3     │   160.84 ms │              163.22 ms │     no change │
│ QQuery 4     │  1058.13 ms │             1085.05 ms │     no change │
│ QQuery 5     │  1475.02 ms │             1499.20 ms │     no change │
│ QQuery 6     │     2.16 ms │                2.17 ms │     no change │
│ QQuery 7     │    52.13 ms │               54.70 ms │     no change │
│ QQuery 8     │  1427.60 ms │             1463.22 ms │     no change │
│ QQuery 9     │  1786.17 ms │             1818.72 ms │     no change │
│ QQuery 10    │   380.63 ms │              384.52 ms │     no change │
│ QQuery 11    │   433.44 ms │              434.55 ms │     no change │
│ QQuery 12    │  1355.66 ms │             1388.73 ms │     no change │
│ QQuery 13    │  2061.13 ms │             2171.21 ms │  1.05x slower │
│ QQuery 14    │  1249.10 ms │             1291.37 ms │     no change │
│ QQuery 15    │  1205.78 ms │             1220.37 ms │     no change │
│ QQuery 16    │  2614.80 ms │             2660.41 ms │     no change │
│ QQuery 17    │  2600.88 ms │             2646.45 ms │     no change │
│ QQuery 18    │  5285.14 ms │             4964.66 ms │ +1.06x faster │
│ QQuery 19    │   129.13 ms │              126.89 ms │     no change │
│ QQuery 20    │  2040.41 ms │             1964.74 ms │     no change │
│ QQuery 21    │  2281.08 ms │             2292.30 ms │     no change │
│ QQuery 22    │  3952.45 ms │             3915.38 ms │     no change │
│ QQuery 23    │ 24772.72 ms │            12618.26 ms │ +1.96x faster │
│ QQuery 24    │   226.35 ms │              207.46 ms │ +1.09x faster │
│ QQuery 25    │   520.81 ms │              498.66 ms │     no change │
│ QQuery 26    │   226.23 ms │              204.89 ms │ +1.10x faster │
│ QQuery 27    │  2920.81 ms │             2881.47 ms │     no change │
│ QQuery 28    │ 25525.20 ms │            22817.55 ms │ +1.12x faster │
│ QQuery 29    │   996.15 ms │              978.96 ms │     no change │
│ QQuery 30    │  1341.89 ms │             1334.67 ms │     no change │
│ QQuery 31    │  1342.96 ms │             1316.97 ms │     no change │
│ QQuery 32    │  4637.08 ms │             4741.83 ms │     no change │
│ QQuery 33    │  5851.44 ms │             5808.19 ms │     no change │
│ QQuery 34    │  6208.66 ms │             5924.13 ms │     no change │
│ QQuery 35    │  2020.55 ms │             1995.64 ms │     no change │
│ QQuery 36    │   119.01 ms │              120.12 ms │     no change │
│ QQuery 37    │    54.05 ms │               51.68 ms │     no change │
│ QQuery 38    │   121.10 ms │              118.81 ms │     no change │
│ QQuery 39    │   197.51 ms │              196.09 ms │     no change │
│ QQuery 40    │    43.38 ms │               43.27 ms │     no change │
│ QQuery 41    │    40.53 ms │               39.70 ms │     no change │
│ QQuery 42    │    33.75 ms │               32.67 ms │     no change │
└──────────────┴─────────────┴────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Benchmark Summary                     ┃             ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ Total Time (HEAD)                     │ 108939.15ms │
│ Total Time (alamb_upgrade_arrow_57)   │  93668.37ms │
│ Average Time (HEAD)                   │   2533.47ms │
│ Average Time (alamb_upgrade_arrow_57) │   2178.33ms │
│ Queries Faster                        │           5 │
│ Queries Slower                        │           2 │
│ Queries with No Change                │          36 │
│ Queries with Failure                  │           0 │
└───────────────────────────────────────┴─────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ alamb_upgrade_arrow_57 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 165.05 ms │              169.01 ms │     no change │
│ QQuery 2     │  25.01 ms │               26.85 ms │  1.07x slower │
│ QQuery 3     │  40.78 ms │               36.17 ms │ +1.13x faster │
│ QQuery 4     │  27.94 ms │               27.81 ms │     no change │
│ QQuery 5     │  76.40 ms │               77.21 ms │     no change │
│ QQuery 6     │  19.50 ms │               19.44 ms │     no change │
│ QQuery 7     │ 241.61 ms │              222.48 ms │ +1.09x faster │
│ QQuery 8     │  34.10 ms │               34.01 ms │     no change │
│ QQuery 9     │ 100.36 ms │              104.10 ms │     no change │
│ QQuery 10    │  58.92 ms │               58.72 ms │     no change │
│ QQuery 11    │  16.73 ms │               17.22 ms │     no change │
│ QQuery 12    │  50.20 ms │               50.96 ms │     no change │
│ QQuery 13    │  46.09 ms │               46.53 ms │     no change │
│ QQuery 14    │  13.60 ms │               13.99 ms │     no change │
│ QQuery 15    │  24.39 ms │               23.93 ms │     no change │
│ QQuery 16    │  24.50 ms │               24.24 ms │     no change │
│ QQuery 17    │ 149.75 ms │              145.15 ms │     no change │
│ QQuery 18    │ 321.50 ms │              313.32 ms │     no change │
│ QQuery 19    │  36.50 ms │               35.59 ms │     no change │
│ QQuery 20    │  48.09 ms │               48.47 ms │     no change │
│ QQuery 21    │ 307.66 ms │              321.73 ms │     no change │
│ QQuery 22    │  20.78 ms │               20.08 ms │     no change │
└──────────────┴───────────┴────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                     ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                     │ 1849.45ms │
│ Total Time (alamb_upgrade_arrow_57)   │ 1837.01ms │
│ Average Time (HEAD)                   │   84.07ms │
│ Average Time (alamb_upgrade_arrow_57) │   83.50ms │
│ Queries Faster                        │         2 │
│ Queries Slower                        │         1 │
│ Queries with No Change                │        19 │
│ Queries with Failure                  │         0 │
└───────────────────────────────────────┴───────────┘

alamb commented Oct 7, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.14.0-1016-gcp #17~24.04.1-Ubuntu SMP Wed Sep 3 01:55:36 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/upgrade_arrow_57 (d7a186f) to 307f5c3 diff using: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

alamb commented Oct 7, 2025

🤖: Benchmark completed

Details

Comparing HEAD and alamb_upgrade_arrow_57
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ alamb_upgrade_arrow_57 ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 0     │  2670.07 ms │             2697.13 ms │ no change │
│ QQuery 1     │  1284.61 ms │             1252.93 ms │ no change │
│ QQuery 2     │  2397.99 ms │             2408.65 ms │ no change │
│ QQuery 3     │  1205.33 ms │             1214.71 ms │ no change │
│ QQuery 4     │  2216.63 ms │             2272.23 ms │ no change │
│ QQuery 5     │ 27756.43 ms │            27174.97 ms │ no change │
│ QQuery 6     │  4135.43 ms │             4244.39 ms │ no change │
│ QQuery 7     │  3472.98 ms │             3594.62 ms │ no change │
└──────────────┴─────────────┴────────────────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                     ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                     │ 45139.46ms │
│ Total Time (alamb_upgrade_arrow_57)   │ 44859.62ms │
│ Average Time (HEAD)                   │  5642.43ms │
│ Average Time (alamb_upgrade_arrow_57) │  5607.45ms │
│ Queries Faster                        │          0 │
│ Queries Slower                        │          0 │
│ Queries with No Change                │          8 │
│ Queries with Failure                  │          0 │
└───────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ alamb_upgrade_arrow_57 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.16 ms │                2.14 ms │     no change │
│ QQuery 1     │    49.05 ms │               49.74 ms │     no change │
│ QQuery 2     │   135.44 ms │              140.61 ms │     no change │
│ QQuery 3     │   158.47 ms │              165.84 ms │     no change │
│ QQuery 4     │  1047.81 ms │             1046.97 ms │     no change │
│ QQuery 5     │  1478.24 ms │             1475.24 ms │     no change │
│ QQuery 6     │     2.18 ms │                2.16 ms │     no change │
│ QQuery 7     │    53.97 ms │               54.12 ms │     no change │
│ QQuery 8     │  1425.73 ms │             1452.60 ms │     no change │
│ QQuery 9     │  1831.81 ms │             1856.92 ms │     no change │
│ QQuery 10    │   382.21 ms │              382.51 ms │     no change │
│ QQuery 11    │   430.49 ms │              435.82 ms │     no change │
│ QQuery 12    │  1355.30 ms │             1393.68 ms │     no change │
│ QQuery 13    │  2094.49 ms │             2155.73 ms │     no change │
│ QQuery 14    │  1251.69 ms │             1285.60 ms │     no change │
│ QQuery 15    │  1193.19 ms │             1242.28 ms │     no change │
│ QQuery 16    │  2622.89 ms │             2641.10 ms │     no change │
│ QQuery 17    │  2624.03 ms │             2638.83 ms │     no change │
│ QQuery 18    │  5559.81 ms │             4941.39 ms │ +1.13x faster │
│ QQuery 19    │   128.38 ms │              124.63 ms │     no change │
│ QQuery 20    │  2004.39 ms │             1966.20 ms │     no change │
│ QQuery 21    │  2294.48 ms │             2269.21 ms │     no change │
│ QQuery 22    │  6004.15 ms │             3885.59 ms │ +1.55x faster │
│ QQuery 23    │ 13706.81 ms │            12636.30 ms │ +1.08x faster │
│ QQuery 24    │   214.09 ms │              218.53 ms │     no change │
│ QQuery 25    │   505.37 ms │              496.53 ms │     no change │
│ QQuery 26    │   227.46 ms │              220.74 ms │     no change │
│ QQuery 27    │  2868.85 ms │             2893.26 ms │     no change │
│ QQuery 28    │ 24966.41 ms │            23003.75 ms │ +1.09x faster │
│ QQuery 29    │   993.30 ms │              964.20 ms │     no change │
│ QQuery 30    │  1327.98 ms │             1331.77 ms │     no change │
│ QQuery 31    │  1364.72 ms │             1309.12 ms │     no change │
│ QQuery 32    │  4709.57 ms │             4928.12 ms │     no change │
│ QQuery 33    │  6046.47 ms │             5858.61 ms │     no change │
│ QQuery 34    │  6357.72 ms │             5985.97 ms │ +1.06x faster │
│ QQuery 35    │  2024.95 ms │             2026.97 ms │     no change │
│ QQuery 36    │   122.74 ms │              120.16 ms │     no change │
│ QQuery 37    │    52.90 ms │               51.51 ms │     no change │
│ QQuery 38    │   119.54 ms │              120.18 ms │     no change │
│ QQuery 39    │   199.65 ms │              196.12 ms │     no change │
│ QQuery 40    │    44.38 ms │               42.45 ms │     no change │
│ QQuery 41    │    39.57 ms │               38.39 ms │     no change │
│ QQuery 42    │    33.09 ms │               33.13 ms │     no change │
└──────────────┴─────────────┴────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Benchmark Summary                     ┃             ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ Total Time (HEAD)                     │ 100055.95ms │
│ Total Time (alamb_upgrade_arrow_57)   │  94084.72ms │
│ Average Time (HEAD)                   │   2326.88ms │
│ Average Time (alamb_upgrade_arrow_57) │   2188.02ms │
│ Queries Faster                        │           5 │
│ Queries Slower                        │           0 │
│ Queries with No Change                │          38 │
│ Queries with Failure                  │           0 │
└───────────────────────────────────────┴─────────────┘
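For readers checking the "Change" column by hand, here is a minimal sketch of how a row could be classified. The 5% noise band and the `classify` helper are assumptions inferred from the rows above (for example, QQuery 32 is about 4.6% slower yet reported as "no change"); they are not taken from the benchmark script itself.

```rust
/// Classify one row of a comparison table from its two timings.
/// `NOISE` (5%) is an assumption inferred from the table, not a confirmed setting.
fn classify(head_ms: f64, branch_ms: f64) -> String {
    const NOISE: f64 = 0.05;
    // Ratio of branch time to baseline time: below 1.0 means the branch is faster.
    let change = branch_ms / head_ms;
    if change > 1.0 + NOISE {
        format!("{:.2}x slower", change)
    } else if change < 1.0 - NOISE {
        format!("+{:.2}x faster", head_ms / branch_ms)
    } else {
        "no change".to_string()
    }
}

fn main() {
    // QQuery 22 and QQuery 32 from clickbench_partitioned above.
    println!("{}", classify(6004.15, 3885.59));
    println!("{}", classify(4709.57, 4928.12));
}
```

Under that assumption, the five "faster" rows in the table are exactly the ones whose branch time is more than 5% below the HEAD time.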
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ alamb_upgrade_arrow_57 ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 1     │ 165.36 ms │              170.60 ms │ no change │
│ QQuery 2     │  27.35 ms │               26.80 ms │ no change │
│ QQuery 3     │  42.37 ms │               40.95 ms │ no change │
│ QQuery 4     │  27.71 ms │               28.10 ms │ no change │
│ QQuery 5     │  76.29 ms │               77.02 ms │ no change │
│ QQuery 6     │  19.61 ms │               19.44 ms │ no change │
│ QQuery 7     │ 211.77 ms │              216.05 ms │ no change │
│ QQuery 8     │  34.42 ms │               34.39 ms │ no change │
│ QQuery 9     │ 106.04 ms │              105.53 ms │ no change │
│ QQuery 10    │  57.94 ms │               59.55 ms │ no change │
│ QQuery 11    │  16.56 ms │               17.36 ms │ no change │
│ QQuery 12    │  51.62 ms │               52.31 ms │ no change │
│ QQuery 13    │  46.21 ms │               45.74 ms │ no change │
│ QQuery 14    │  14.38 ms │               13.75 ms │ no change │
│ QQuery 15    │  24.33 ms │               24.02 ms │ no change │
│ QQuery 16    │  24.49 ms │               24.18 ms │ no change │
│ QQuery 17    │ 150.59 ms │              149.42 ms │ no change │
│ QQuery 18    │ 322.35 ms │              321.58 ms │ no change │
│ QQuery 19    │  36.84 ms │               35.63 ms │ no change │
│ QQuery 20    │  48.90 ms │               48.81 ms │ no change │
│ QQuery 21    │ 348.68 ms │              337.22 ms │ no change │
│ QQuery 22    │  20.96 ms │               20.64 ms │ no change │
└──────────────┴───────────┴────────────────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                     ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                     │ 1874.75ms │
│ Total Time (alamb_upgrade_arrow_57)   │ 1869.09ms │
│ Average Time (HEAD)                   │   85.22ms │
│ Average Time (alamb_upgrade_arrow_57) │   84.96ms │
│ Queries Faster                        │         0 │
│ Queries Slower                        │         0 │
│ Queries with No Change                │        22 │
│ Queries with Failure                  │         0 │
└───────────────────────────────────────┴───────────┘

@alamb alamb force-pushed the alamb/upgrade_arrow_57 branch from d7a186f to 28452cf Compare October 10, 2025 18:19
@github-actions github-actions bot removed the logical-expr Logical plan and expressions label Oct 10, 2025
@github-actions github-actions bot added the datasource Changes to the datasource crate label Oct 10, 2025
async fn predicate_cache_pushdown_disable() -> datafusion_common::Result<()> {
// Can disable the cache even with filter pushdown by setting the size to 0. In this case we
// expect the inner records are reported but no records are read from the cache
// no records are read from the cache and no metrics are reported
Contributor Author

This is due to @nuno-faria's work to close ❤️

I was somewhat surprised that there are no metrics at all reported, but I think it makes sense as the reporting is currently only done by the cache

Contributor

Thanks @alamb for handling the upgrade.
I think we could add a test to confirm that datafusion.execution.parquet.max_predicate_cache_size now works as expected, by analyzing the explain output.

Here is a potential example:

#[tokio::test]
async fn test_disable_predicate_cache() {
    let mut parquet_options = TableParquetOptions::new();
    parquet_options.global.data_page_row_count_limit = 1;
    parquet_options.global.write_batch_size = 1;

    let tempdir = TempDir::new_in(Path::new(".")).unwrap();
    let path = tempdir.path().to_str().unwrap();

    let ctx = SessionContext::new();
    ctx.sql("select i from generate_series(1, 1000) t(i)")
        .await
        .unwrap()
        .write_parquet(path, DataFrameWriteOptions::new(), Some(parquet_options))
        .await
        .unwrap();

    let regex = Regex::new(r"bytes_scanned=(\d+)").unwrap();

    let config = SessionConfig::new()
        .set_bool("datafusion.execution.parquet.pushdown_filters", true);
    let ctx = SessionContext::new_with_config(config);

    // default: predicate cache is enabled
    ctx.register_parquet("t", path, ParquetReadOptions::new())
        .await
        .unwrap();
    let plan = ctx
        .sql("select * from t where i = 123")
        .await
        .unwrap()
        .explain(false, true)
        .unwrap()
        .to_string()
        .await
        .unwrap();
    let captures = regex.captures(&plan).unwrap();
    let bytes_scanned_default =
        captures.get(1).unwrap().as_str().parse::<usize>().unwrap();

    // disabling the predicate cache by setting the limit to 0
    ctx.sql("set datafusion.execution.parquet.max_predicate_cache_size = 0")
        .await
        .unwrap()
        .collect()
        .await
        .unwrap();
    ctx.deregister_table("t").unwrap();
    ctx.register_parquet("t", path, ParquetReadOptions::new())
        .await
        .unwrap();
    let plan = ctx
        .sql("select * from t where i = 123")
        .await
        .unwrap()
        .explain(false, true)
        .unwrap()
        .to_string()
        .await
        .unwrap();
    let captures = regex.captures(&plan).unwrap();
    let bytes_scanned_cache_disabled =
        captures.get(1).unwrap().as_str().parse::<usize>().unwrap();

    // with the cache disabled, fewer data pages should be retrieved (the predicate cache can
    // retrieve multiple data pages when their size is less than batch_size)
    assert_eq!(bytes_scanned_default, 31405);
    assert_eq!(bytes_scanned_cache_disabled, 1691);
}

| alltypes_plain.parquet | 1851 | 10181 | 2 | page_index=false |
| alltypes_tiny_pages.parquet | 454233 | 881418 | 2 | page_index=true |
| lz4_raw_compressed_larger.parquet | 380836 | 2939 | 2 | page_index=false |
| alltypes_plain.parquet | 1851 | 7166 | 2 | page_index=false |
Contributor Author

I am not sure why the heap size of the metadata is reported to be so much smaller. I don't expect our thrift decoding work to have reduced the in-memory size of the parquet metadata 🤔

@etseidl any ideas? I can perhaps go audit the heap_size implementations

Contributor

I did box some of the encryption structures...but maybe the HeapSize impl for Box is still wrong?

Contributor

Oh, and FileDecryptor in ParquetMetaData was boxed but still not included in memory_size.
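To illustrate the two concerns raised here (a `Box` whose pointee is not counted, and a boxed field left out of the total), the following is a rough sketch only. The `HeapSize` trait, `Keys`, and `Meta` below are hypothetical stand-ins defined locally for illustration; they are not the parquet crate's actual types or implementation.

```rust
use std::mem::size_of;

/// Hypothetical trait: bytes owned on the heap, excluding the value itself.
trait HeapSize {
    fn heap_size(&self) -> usize;
}

impl HeapSize for u64 {
    fn heap_size(&self) -> usize {
        0
    }
}

impl<T: HeapSize> HeapSize for Vec<T> {
    fn heap_size(&self) -> usize {
        // Allocated buffer plus whatever the elements own themselves.
        self.capacity() * size_of::<T>() + self.iter().map(|t| t.heap_size()).sum::<usize>()
    }
}

impl<T: HeapSize> HeapSize for Box<T> {
    fn heap_size(&self) -> usize {
        // The pointee lives on the heap, so count it plus anything it owns.
        size_of::<T>() + (**self).heap_size()
    }
}

impl<T: HeapSize> HeapSize for Option<T> {
    fn heap_size(&self) -> usize {
        match self {
            Some(t) => t.heap_size(),
            None => 0,
        }
    }
}

/// Hypothetical stand-in for a boxed encryption structure.
struct Keys {
    key_ids: Vec<u64>,
}

impl HeapSize for Keys {
    fn heap_size(&self) -> usize {
        self.key_ids.heap_size()
    }
}

/// Hypothetical stand-in for metadata holding an optional boxed field.
struct Meta {
    decryptor: Option<Box<Keys>>,
}

impl Meta {
    fn memory_size(&self) -> usize {
        // Leaving `decryptor` out of this sum would under-report the total.
        size_of::<Self>() + self.decryptor.heap_size()
    }
}

fn main() {
    let meta = Meta { decryptor: Some(Box::new(Keys { key_ids: vec![1, 2, 3] })) };
    println!("memory_size = {}", meta.memory_size());
}
```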

Contributor

Oh, and the column index should also be much smaller since we transposed it from an array of structs to a struct of arrays. alltypes_tiny_pages.parquet has an insane number of pages, so any savings in the column index will be magnified.
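As a rough illustration of why transposing from an array of structs to a struct of arrays can shrink the per-page footprint, here is a minimal sketch. `PageEntry` and `ColumnIndexSoa` are hypothetical layouts chosen for illustration and are assumptions, not the parquet crate's actual definitions; the byte counts only cover element payloads, not allocator overhead.

```rust
use std::mem::size_of;

/// Hypothetical array-of-structs entry: one per page, with padded Options.
struct PageEntry {
    min: Option<i32>,
    max: Option<i32>,
    null_count: Option<i64>,
}

/// Hypothetical struct-of-arrays layout: one vector per field.
struct ColumnIndexSoa {
    min_values: Vec<i32>,
    max_values: Vec<i32>,
    null_pages: Vec<bool>,
    null_counts: Vec<i64>,
}

/// Transpose per-page entries into per-field vectors.
fn to_soa(pages: &[PageEntry]) -> ColumnIndexSoa {
    ColumnIndexSoa {
        min_values: pages.iter().map(|p| p.min.unwrap_or_default()).collect(),
        max_values: pages.iter().map(|p| p.max.unwrap_or_default()).collect(),
        // A page with no min value is treated as an all-null page here.
        null_pages: pages.iter().map(|p| p.min.is_none()).collect(),
        null_counts: pages.iter().map(|p| p.null_count.unwrap_or(0)).collect(),
    }
}

/// Payload bytes for the array-of-structs form (includes Option tags and padding).
fn aos_payload_bytes(pages: &[PageEntry]) -> usize {
    pages.len() * size_of::<PageEntry>()
}

/// Payload bytes for the struct-of-arrays form.
fn soa_payload_bytes(idx: &ColumnIndexSoa) -> usize {
    idx.min_values.len() * size_of::<i32>()
        + idx.max_values.len() * size_of::<i32>()
        + idx.null_pages.len() * size_of::<bool>()
        + idx.null_counts.len() * size_of::<i64>()
}

fn main() {
    let pages: Vec<PageEntry> = (0..1000)
        .map(|i| PageEntry { min: Some(i), max: Some(i + 1), null_count: Some(0) })
        .collect();
    let soa = to_soa(&pages);
    println!("AoS: {} bytes", aos_payload_bytes(&pages));
    println!("SoA: {} bytes", soa_payload_bytes(&soa));
}
```

In this sketch the saving is a fixed number of bytes per page, so it grows with the page count, which would fit the larger drop reported for a file with many pages.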

Labels

common: Related to common crate
core: Core DataFusion crate
datasource: Changes to the datasource crate
functions: Changes to functions implementation
optimizer: Optimizer rules
physical-expr: Changes to the physical-expr crates
physical-plan: Changes to the physical-plan crate
proto: Related to proto crate
sql: SQL Planner
sqllogictest: SQL Logic Tests (.slt)
substrait: Changes to the substrait crate

3 participants