| 1 | +--- |
| 2 | +layout: post |
| 3 | +title: Apache DataFusion 51.0.0 Released |
| 4 | +date: 2025-11-25 |
| 5 | +author: pmc |
| 6 | +categories: [release] |
| 7 | +--- |
| 8 | + |
| 9 | +<!-- |
| 10 | +{% comment %} |
| 11 | +Licensed to the Apache Software Foundation (ASF) under one or more |
| 12 | +contributor license agreements. See the NOTICE file distributed with |
| 13 | +this work for additional information regarding copyright ownership. |
| 14 | +The ASF licenses this file to you under the Apache License, Version 2.0 |
| 15 | +(the "License"); you may not use this file except in compliance with |
| 16 | +the License. You may obtain a copy of the License at |
| 17 | + |
| 18 | +http://www.apache.org/licenses/LICENSE-2.0 |
| 19 | + |
| 20 | +Unless required by applicable law or agreed to in writing, software |
| 21 | +distributed under the License is distributed on an "AS IS" BASIS, |
| 22 | +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| 23 | +See the License for the specific language governing permissions and |
| 24 | +limitations under the License. |
| 25 | +{% endcomment %} |
| 26 | +--> |
| 27 | + |
| 28 | +[TOC] |
| 29 | + |
| 30 | +## Introduction |
| 31 | + |
| 32 | +We are proud to announce the release of [DataFusion 51.0.0]. This post highlights |
| 33 | +some of the major improvements since [DataFusion 50.0.0]. The complete list of |
| 34 | +changes is available in the [changelog]. Thanks to the [128 contributors] for |
| 35 | +making this release possible. |
| 36 | + |
| 37 | +[DataFusion 51.0.0]: https://crates.io/crates/datafusion/51.0.0 |
| 38 | +[DataFusion 50.0.0]: https://datafusion.apache.org/blog/2025/09/29/datafusion-50.0.0/ |
| 39 | +[changelog]: https://github.com/apache/datafusion/blob/branch-51/dev/changelog/51.0.0.md |
| 40 | +[128 contributors]: https://github.com/apache/datafusion/blob/branch-51/dev/changelog/51.0.0.md#credits |
| 41 | + |
| 42 | +## Performance Improvements 🚀 |
| 43 | +We continue to make significant performance improvements in DataFusion, both in |
| 44 | +the core engine and in the Parquet reader. |
| 45 | + |
| 46 | +<img |
| 47 | +src="/blog/images/datafusion-51.0.0/performance_over_time_clickbench.png" |
| 48 | +width="100%" |
| 49 | +class="img-responsive" |
| 50 | +alt="Performance over time" |
| 51 | +/> |
| 52 | + |
| 53 | +**Figure 1**: Average and median normalized query execution times for ClickBench queries for DataFusion 51.0.0 compared to previous releases. |
| 54 | +Query times are normalized using the ClickBench definition. See the |
| 55 | +[DataFusion Benchmarking Page](https://alamb.github.io/datafusion-benchmarking/) |
| 56 | +for more details. |
| 57 | + |
| 58 | +### Faster `CASE` expression evaluation |
| 59 | + |
| 60 | +This release builds on the [CASE performance epic] with significant improvements. |
| 61 | +Expressions short‑circuit earlier, reuse partial results, and avoid unnecessary |
| 62 | +scattering, speeding up common ETL patterns. Thanks to [pepijnve], [chenkovsky], |
| 63 | +and [petern48] for leading this effort. We hope to share more details on our |
| 64 | +implementation in a future post. |
| 65 | + |
| 66 | +[pepijnve]: https://github.com/pepijnve |
| 67 | +[chenkovsky]: https://github.com/chenkovsky |
| 68 | +[petern48]: https://github.com/petern48 |
| 69 | + |
| 70 | +### Better Defaults for Remote Parquet Reads |
| 71 | + |
| 72 | +By default, DataFusion now always fetches the last 512KB (configurable) of [Apache Parquet] files |
| 73 | +which usually includes the footer and metadata ([#18118]). This |
| 74 | +change typically avoids 2 I/O requests for each Parquet file. While this |
| 75 | +setting has existed in DataFusion for many years, it was not previously enabled |
| 76 | +by default. Users can tune the number of bytes fetched in the initial I/O |
| 77 | +request via the `datafusion.execution.parquet.metadata_size_hint` [config setting]. Thanks to |
| 78 | +[zhuqi-lucas] for leading this effort. |
| 79 | + |
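As a rough sketch of how this setting might be tuned from `datafusion-cli`, the hint can be raised for datasets whose footers and metadata are larger than the 512KB default. The 1 MiB value below is an arbitrary illustration, not a recommendation from the release:

```sql
-- Illustrative value only: fetch the last 1 MiB (1048576 bytes) of each
-- Parquet file in the initial I/O request instead of the default 512KB.
SET datafusion.execution.parquet.metadata_size_hint = 1048576;
```

A larger hint reads more bytes up front in exchange for fewer follow-up requests when the footer does not fit in the default window.
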
| 80 | +[config setting]: https://datafusion.apache.org/user-guide/configs.html |
| 81 | +[apache parquet]: https://parquet.apache.org/ |
| 82 | + |
| 83 | +### Faster Parquet metadata parsing |
| 84 | + |
| 85 | +DataFusion 51 also includes the latest Parquet reader from |
| 86 | +[Arrow Rust 57.0.0], which parses Parquet metadata significantly faster. This is |
| 87 | +especially beneficial for workloads with many small Parquet files and scenarios |
| 88 | +where startup time or low latency is important. You can read more about the upstream work by |
| 89 | +[etseidl] and [jhorstmann] that enabled these improvements in the [Faster Apache Parquet Footer Metadata Using a Custom Thrift Parser] blog post. |
| 90 | + |
| 91 | +<img |
| 92 | + src="/blog/images/datafusion-51.0.0/arrow-57-metadata-parsing.png" |
| 93 | + width="100%" |
| 94 | + class="img-responsive" |
| 95 | + alt="Metadata Parsing Performance Improvements in Arrow/Parquet 57" |
| 96 | +/> |
| 97 | + |
| 98 | +**Figure 2**: Metadata parsing performance improvements in Arrow/Parquet 57.0.0. |
| 99 | + |
| 100 | +[Arrow Rust 57.0.0]: https://arrow.apache.org/blog/2025/10/30/arrow-rs-57.0.0/ |
| 101 | +[Faster Apache Parquet Footer Metadata Using a Custom Thrift Parser]: https://arrow.apache.org/blog/2025/10/23/rust-parquet-metadata/ |
| 102 | + |
| 103 | + |
| 104 | + |
| 105 | +## New Features ✨ |
| 106 | + |
| 107 | +### Decimal32/Decimal64 support |
| 108 | + |
| 109 | +The new Arrow types `Decimal32` and `Decimal64` are now supported in DataFusion |
| 110 | +([#17501]), including aggregations such as `SUM`, `AVG`, `MIN/MAX`, and window |
| 111 | +functions. Thanks to [AdamGS] for leading this effort. |
| 112 | + |
| 113 | + |
| 114 | +### SQL Pipe Operators |
| 115 | + |
| 116 | +DataFusion now supports the SQL pipe operator syntax |
| 117 | +([#17278]), enabling inline transforms such as: |
| 118 | + |
| 119 | +```sql |
| 120 | +SELECT * FROM t |
| 121 | +|> WHERE a > 10 |
| 122 | +|> ORDER BY b |
| 123 | +|> LIMIT 5; |
| 124 | +``` |
| 125 | + |
| 126 | +This syntax, [popularized by Google BigQuery], keeps multi-step transformations concise while preserving regular |
| 127 | +SQL semantics. Thanks to [simonvandel] for leading this effort. |
| 128 | + |
| 129 | +[popularized by Google BigQuery]: https://docs.cloud.google.com/bigquery/docs/reference/standard-sql/pipe-syntax |
| 130 | + |
| 131 | +### I/O Profiling in `datafusion-cli` |
| 132 | + |
| 133 | +[datafusion-cli] now has built-in instrumentation to trace object store calls |
| 134 | +([#17207]). Toggle profiling |
| 135 | +with the [\object_store_profiling command] and inspect the exact `GET`/`LIST` requests issued during |
| 136 | +query execution: |
| 137 | + |
| 138 | +[datafusion-cli]: https://datafusion.apache.org/user-guide/cli/ |
| 139 | +[\object_store_profiling command]: https://datafusion.apache.org/user-guide/cli/usage.html#commands |
| 140 | + |
| 141 | +```sql |
| 142 | +DataFusion CLI v51.0.0 |
| 143 | +> \object_store_profiling trace |
| 144 | +ObjectStore Profile mode set to Trace |
| 145 | +> select count(*) from 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet'; |
| 146 | ++----------+ |
| 147 | +| count(*) | |
| 148 | ++----------+ |
| 149 | +| 1000000 | |
| 150 | ++----------+ |
| 151 | +1 row(s) fetched. |
| 152 | +Elapsed 0.367 seconds. |
| 153 | + |
| 154 | +Object Store Profiling |
| 155 | +Instrumented Object Store: instrument_mode: Trace, inner: HttpStore |
| 156 | +2025-11-19T21:10:43.476121+00:00 operation=Head duration=0.069763s path=hits_compatible/athena_partitioned/hits_1.parquet |
| 157 | +2025-11-19T21:10:43.545903+00:00 operation=Head duration=0.025859s path=hits_compatible/athena_partitioned/hits_1.parquet |
| 158 | +2025-11-19T21:10:43.571768+00:00 operation=Head duration=0.025684s path=hits_compatible/athena_partitioned/hits_1.parquet |
| 159 | +2025-11-19T21:10:43.597463+00:00 operation=Get duration=0.034194s size=524288 range: bytes=174440756-174965043 path=hits_compatible/athena_partitioned/hits_1.parquet |
| 160 | +2025-11-19T21:10:43.705821+00:00 operation=Head duration=0.022029s path=hits_compatible/athena_partitioned/hits_1.parquet |
| 161 | + |
| 162 | +Summaries: |
| 163 | ++-----------+----------+-----------+-----------+-----------+-----------+-------+ |
| 164 | +| Operation | Metric | min | max | avg | sum | count | |
| 165 | ++-----------+----------+-----------+-----------+-----------+-----------+-------+ |
| 166 | +| Get | duration | 0.034194s | 0.034194s | 0.034194s | 0.034194s | 1 | |
| 167 | +| Get | size | 524288 B | 524288 B | 524288 B | 524288 B | 1 | |
| 168 | +| Head | duration | 0.022029s | 0.069763s | 0.035834s | 0.143335s | 4 | |
| 169 | +| Head | size | | | | | 4 | |
| 170 | ++-----------+----------+-----------+-----------+-----------+-----------+-------+ |
| 171 | +``` |
| 172 | + |
| 173 | +This makes it far easier to diagnose slow remote scans and validate caching |
| 174 | +strategies. Thanks to [BlakeOrth] for leading this effort. |
| 175 | + |
| 176 | +### `DESCRIBE <query>` |
| 177 | + |
| 178 | +`DESCRIBE` now works on arbitrary queries, returning the schema instead |
| 179 | +of being an alias for `EXPLAIN` ([#18234](https://github.com/apache/datafusion/issues/18234)). This brings DataFusion in line with engines |
| 180 | +like DuckDB and makes it easy to inspect the output schema of queries |
| 181 | +without executing them. Thanks to [djanderson] for leading this effort. |
| 182 | + |
| 183 | +[djanderson]: https://github.com/djanderson |
| 184 | + |
| 185 | +For example: |
| 186 | + |
| 187 | +```sql |
| 188 | +DataFusion CLI v51.0.0 |
| 189 | +> create table t(a int, b varchar, c float) as values (1, 'a', 2.0); |
| 190 | +0 row(s) fetched. |
| 191 | +Elapsed 0.002 seconds. |
| 192 | + |
| 193 | +> DESCRIBE SELECT a, b, SUM(c) FROM t GROUP BY a, b; |
| 194 | + |
| 195 | ++-------------+-----------+-------------+ |
| 196 | +| column_name | data_type | is_nullable | |
| 197 | ++-------------+-----------+-------------+ |
| 198 | +| a | Int32 | YES | |
| 199 | +| b | Utf8View | YES | |
| 200 | +| sum(t.c) | Float64 | YES | |
| 201 | ++-------------+-----------+-------------+ |
| 202 | +3 row(s) fetched. |
| 203 | +``` |
| 204 | + |
| 205 | + |
| 206 | +### Named arguments in SQL functions |
| 207 | + |
| 208 | +DataFusion now understands [PostgreSQL-style named arguments] (`param => value`) |
| 209 | +for scalar, aggregate, and window functions ([#17379](https://github.com/apache/datafusion/issues/17379)). You can mix positional and named |
| 210 | +arguments in any order, and error messages now list parameter names to make |
| 211 | +diagnostics clearer. UDF authors can also expose parameter names so their |
| 212 | +functions benefit from the same syntax. Thanks to [timsaucer] and [bubulalabu] for leading this effort. |
| 213 | + |
| 214 | +[PostgreSQL-style named arguments]: https://www.postgresql.org/docs/current/sql-syntax-calling-funcs.html |
| 215 | + |
| 216 | +For example, you can pass arguments to functions like this: |
| 217 | +```sql |
| 218 | +SELECT power(exponent => 3.0, base => 2.0); |
| 219 | +``` |
| 220 | + |
| 221 | +[timsaucer]: https://github.com/timsaucer |
| 222 | +[bubulalabu]: https://github.com/bubulalabu |
| 223 | + |
| 224 | +### Metrics improvements |
| 225 | + |
| 226 | +The output of [EXPLAIN ANALYZE] has been improved to include more metrics |
| 227 | +about execution time and memory usage of each operator ([#18217]). |
| 228 | +You can learn more about these new metrics in the [metrics user guide]. Thanks to |
| 229 | +[2010YOUY01] for leading this effort. |
| 230 | + |
| 231 | + |
| 232 | +[#18217]: https://github.com/apache/datafusion/issues/18217 |
| 233 | +[2010YOUY01]: https://github.com/2010YOUY01 |
| 234 | + |
| 235 | +The `51.0.0` release adds: |
| 236 | + |
| 237 | +- **Configuration**: adds a new option `datafusion.explain.analyze_level`, which can be set to `summary` for a concise output or `dev` for the full set of metrics (the previous default). |
| 238 | +- **For all major operators**: adds `output_bytes`, reporting how many bytes of data each operator produces. |
| 239 | +- **FilterExec**: adds a `selectivity` metric (`output_rows / input_rows`) to show how effective the filter is. |
| 240 | +- **AggregateExec**: |
| 241 | + - adds detailed timing metrics for group-ID computation, aggregate argument evaluation, aggregation work, and emitting final results. |
| 242 | + - adds a `reduction_factor` metric (`output_rows / input_rows`) to show how much grouping reduces the data. |
| 243 | +- **NestedLoopJoinExec**: adds a `selectivity` metric (`output_rows / (left_rows * right_rows)`) to show how many combinations actually pass the join condition. |
| 244 | +- Several display formatting improvements were added to make `EXPLAIN ANALYZE` output easier to read. |
| 245 | + |
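To make the ratio metrics above concrete, here is a small worked example written as plain SQL arithmetic. The row counts are hypothetical and chosen only for illustration; they do not come from a real plan, and the way `EXPLAIN ANALYZE` formats these values may differ:

```sql
-- Hypothetical row counts, used only to illustrate the formulas.
SELECT
  250.0 / 1000.0        AS filter_selectivity,    -- FilterExec: output_rows / input_rows
  40.0 / 1000.0         AS reduction_factor,      -- AggregateExec: output_rows / input_rows
  30.0 / (100.0 * 50.0) AS nested_loop_selectivity; -- output_rows / (left_rows * right_rows)
```

With these numbers the ratios work out to 0.25, 0.04, and 0.006: the filter keeps a quarter of its input, grouping collapses 1000 rows into 40, and 30 of the 5000 possible row pairs pass the join condition.
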
| 246 | +[EXPLAIN ANALYZE]: https://datafusion.apache.org/user-guide/sql/explain.html#explain-analyze |
| 247 | +[metrics user guide]: https://datafusion.apache.org/user-guide/metrics.html |
| 248 | + |
| 249 | +For example, the following query: |
| 250 | +```sql |
| 251 | +set datafusion.explain.analyze_level = summary; |
| 252 | + |
| 253 | +explain analyze |
| 254 | +select count(*) |
| 255 | +from 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet' |
| 256 | +where "URL" <> ''; |
| 257 | +``` |
| 258 | + |
| 259 | +Now shows easier-to-understand metrics such as: |
| 260 | + |
| 261 | +```text |
| 262 | + metrics=[ |
| 263 | + output_rows=1000000, |
| 264 | + elapsed_compute=16ns, |
| 265 | + output_bytes=222.5 MB, |
| 266 | + files_ranges_pruned_statistics=16 total → 16 matched, |
| 267 | + row_groups_pruned_statistics=3 total → 3 matched, |
| 268 | + row_groups_pruned_bloom_filter=3 total → 3 matched, |
| 269 | + page_index_rows_pruned=0 total → 0 matched, |
| 270 | + bytes_scanned=33661364, |
| 271 | + metadata_load_time=4.243098ms, |
| 272 | +] |
| 273 | +``` |
| 274 | + |
| 275 | +## Upgrade Guide and Changelog |
| 276 | + |
| 277 | +Upgrading to 51.0.0 should be straightforward for most users. Please review the |
| 278 | +[Upgrade Guide] |
| 279 | +for details on breaking changes and code snippets to help with the transition. |
| 280 | +For a comprehensive list of all changes, please refer to the [changelog]. |
| 281 | + |
| 282 | +## About DataFusion |
| 283 | + |
| 284 | +[Apache DataFusion] is an extensible query engine, written in [Rust], that uses |
| 285 | +[Apache Arrow] as its in-memory format. DataFusion is used by developers to |
| 286 | +create new, fast, data-centric systems such as databases, dataframe libraries, |
| 287 | +and machine learning and streaming applications. While [DataFusion’s primary |
| 288 | +design goal] is to accelerate the creation of other data-centric systems, it |
| 289 | +provides a reasonable experience directly out of the box as a [dataframe |
| 290 | +library], [Python library], and [command-line SQL tool]. |
| 291 | + |
| 292 | +[apache datafusion]: https://datafusion.apache.org/ |
| 293 | +[rust]: https://www.rust-lang.org/ |
| 294 | +[apache arrow]: https://arrow.apache.org |
| 295 | +[DataFusion’s primary design goal]: https://datafusion.apache.org/user-guide/introduction.html#project-goals |
| 296 | +[dataframe library]: https://datafusion.apache.org/user-guide/dataframe.html |
| 297 | +[python library]: https://datafusion.apache.org/python/ |
| 298 | +[command-line SQL tool]: https://datafusion.apache.org/user-guide/cli/ |
| 299 | +[Upgrade Guide]: https://datafusion.apache.org/library-user-guide/upgrading.html |
| 300 | +[zhuqi-lucas]: https://github.com/zhuqi-lucas |
| 301 | +[AdamGS]: https://github.com/AdamGS |
| 302 | +[simonvandel]: https://github.com/simonvandel |
| 303 | +[BlakeOrth]: https://github.com/BlakeOrth |
| 304 | +[CASE performance epic]: https://github.com/apache/datafusion/issues/18075 |
| 305 | +[#18118]: https://github.com/apache/datafusion/issues/18118 |
| 306 | +[#17501]: https://github.com/apache/datafusion/pull/17501 |
| 307 | +[#17278]: https://github.com/apache/datafusion/pull/17278 |
| 308 | +[#17207]: https://github.com/apache/datafusion/issues/17207 |
| 309 | +[#17379]: https://github.com/apache/datafusion/issues/17379 |
| 310 | +[etseidl]: https://github.com/etseidl |
| 311 | +[jhorstmann]: https://github.com/jhorstmann |
| 312 | + |
| 313 | +DataFusion's core thesis is that, as a community, together we can build much |
| 314 | +more advanced technology than any of us as individuals or companies could build |
| 315 | +alone. Without DataFusion, highly performant vectorized query engines would |
| 316 | +remain the domain of a few large companies and world-class research |
| 317 | +institutions. With DataFusion, we can all build on top of a shared foundation |
| 318 | +and focus on what makes our projects unique. |
| 319 | + |
| 320 | +## How to Get Involved |
| 321 | + |
| 322 | +DataFusion is not a project built or driven by a single person, company, or |
| 323 | +foundation. Rather, our community of users and contributors works together to |
| 324 | +build a shared technology that none of us could have built alone. |
| 325 | + |
| 326 | +If you are interested in joining us, we would love to have you. You can try out |
| 327 | +DataFusion on some of your own data and projects and let us know how it goes, |
| 328 | +contribute suggestions, documentation, or bug reports, or open a PR with documentation, |
| 329 | +tests, or code. A list of open issues suitable for beginners is [here], and you |
| 330 | +can find out how to reach us on the [communication doc]. |
| 331 | + |
| 332 | +[here]: https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22 |
| 333 | +[communication doc]: https://datafusion.apache.org/contributor-guide/communication.html |