apache
diff --git a/‎content/blog/2025-11-25-datafusion-51.0.0.md‎
Lines changed: 333 additions & 0 deletions b/‎content/blog/2025-11-25-datafusion-51.0.0.md‎
Lines changed: 333 additions & 0 deletions
diff --git a/‎content/images/datafusion-51.0.0/arrow-57-metadata-parsing.png‎
76.6 KB b/‎content/images/datafusion-51.0.0/arrow-57-metadata-parsing.png‎
76.6 KB
diff --git a/‎content/images/datafusion-51.0.0/performance_over_time_clickbench.png‎
60.5 KB b/‎content/images/datafusion-51.0.0/performance_over_time_clickbench.png‎
60.5 KB
@@ -0,0 +1,333 @@
+---
+layout: post
+title: Apache DataFusion 51.0.0 Released
+date: 2025-11-25
+author: pmc
+categories: [release]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+[TOC]
+
+## Introduction
+
+We are proud to announce the release of [DataFusion 51.0.0]. This post highlights
+some of the major improvements since [DataFusion 50.0.0]. The complete list of
+changes is available in the [changelog]. Thanks to the [128 contributors] for
+making this release possible.
+
+[DataFusion 51.0.0]: https://crates.io/crates/datafusion/51.0.0
+[DataFusion 50.0.0]: https://datafusion.apache.org/blog/2025/09/29/datafusion-50.0.0/
+[changelog]: https://github.com/apache/datafusion/blob/branch-51/dev/changelog/51.0.0.md
+[128 contributors]: https://github.com/apache/datafusion/blob/branch-51/dev/changelog/51.0.0.md#credits
+
+## Performance Improvements 🚀
+We continue to make significant performance improvements in DataFusion, both in
+the core engine and in the Parquet reader.
+
+<img
+src="/blog/images/datafusion-51.0.0/performance_over_time_clickbench.png"
+width="100%"
+class="img-responsive"
+alt="Performance over time"
+/>
+
+**Figure 1**: Average and median normalized query execution times for ClickBench queries for DataFusion 51.0.0 compared to previous releases.
+Query times are normalized using the ClickBench definition. See the
+[DataFusion Benchmarking Page](https://alamb.github.io/datafusion-benchmarking/)
+for more details.
+
+### Faster `CASE` expression evaluation
+
+This release builds on the [CASE performance epic] with significant improvements.
+Expressions short‑circuit earlier, reuse partial results, and avoid unnecessary
+scattering, speeding up common ETL patterns. Thanks to [pepijnve], [chenkovsky],
+and [petern48] for leading this effort. We hope to share more details on our
+implementation in a future post.
+
+[pepijnve]: https://github.com/pepijnve
+[chenkovsky]: https://github.com/chenkovsky
+[petern48]: https://github.com/petern48
+
+### Better Defaults for Remote Parquet Reads
+
+By default, DataFusion now always fetches the last 512KB (configurable) of [Apache Parquet] files
+which usually includes the footer and metadata ([#18118]). This 
+change typically avoids 2 I/O requests for each Parquet. While this
+setting has existed in DataFusion for many years, it was not previously enabled
+by default. Users can tune the number of bytes fetched in the initial I/O
+request via the `datafusion.execution.parquet.metadata_size_hint` [config setting]. Thanks to
+[zhuqi-lucas] for leading this effort.
+
+[config setting]: https://datafusion.apache.org/user-guide/configs.html
+[apache parquet]: https://parquet.apache.org/
+
+### Faster Parquet metadata parsing
+
+DataFusion 51 also includes the latest Parquet reader from
+[Arrow Rust 57.0.0], which parses Parquet metadata significantly faster. This is
+especially beneficial for workloads with many small Parquet files and scenarios
+where startup time or low latency is important. You can read more about the upstream work by
+[etseidl] and [jhorstmann] that enabled these improvements in the [Faster Apache Parquet Footer Metadata Using a Custom Thrift Parser] blog.
+
+<img 
+  src="/blog/images/datafusion-51.0.0/arrow-57-metadata-parsing.png"
+  width="100%" 
+  class="img-responsive" 
+  alt="Metadata Parsing Performance Improvements in Arrow/Parquet 57" 
+/>
+
+**Figure 2**: Metadata parsing performance improvements in Arrow/Parquet 57.0.0. 
+
+[Arrow Rust 57.0.0]: https://arrow.apache.org/blog/2025/10/30/arrow-rs-57.0.0/
+[Faster Apache Parquet Footer Metadata Using a Custom Thrift Parser]: https://arrow.apache.org/blog/2025/10/23/rust-parquet-metadata/
+
+
+
+## New Features ✨
+
+### Decimal32/Decimal64 support
+
+The new Arrow types `Decimal32` and `Decimal64` are now supported in DataFusion
+([#17501]), including aggregations such as `SUM`, `AVG`, `MIN/MAX`, and window
+functions. Thanks to [AdamGS] for leading this effort.
+
+
+### SQL Pipe Operators
+
+DataFusion now supports the SQL pipe operator syntax
+([#17278]), enabling inline transforms such as:
+
+```sql
+SELECT * FROM t
+|> WHERE a > 10
+|> ORDER BY b
+|> LIMIT 5;
+```
+
+This syntax, [popularized by Google BigQuery], keeps multi-step transformations concise while preserving regular
+SQL semantics. Thanks to [simonvandel] for leading this effort.
+
+[popularized by Google BigQuery]: https://docs.cloud.google.com/bigquery/docs/reference/standard-sql/pipe-syntax
+
+### I/O Profiling in `datafusion-cli`
+
+[datafusion-cli] now has built-in instrumentation to trace object store calls
+([#17207]). Toggle profiling
+with the [\object_store_profiling command] and inspect the exact `GET`/`LIST` requests issued during
+query execution:
+
+[datafusion-cli]: https://datafusion.apache.org/user-guide/cli/
+[\object_store_profiling command]: https://datafusion.apache.org/user-guide/cli/usage.html#commands
+
+```sql
+DataFusion CLI v51.0.0
+> \object_store_profiling trace
+ObjectStore Profile mode set to Trace
+> select count(*) from 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet';
++----------+
+| count(*) |
++----------+
+| 1000000  |
++----------+
+1 row(s) fetched.
+Elapsed 0.367 seconds.
+
+Object Store Profiling
+Instrumented Object Store: instrument_mode: Trace, inner: HttpStore
+2025-11-19T21:10:43.476121+00:00 operation=Head duration=0.069763s path=hits_compatible/athena_partitioned/hits_1.parquet
+2025-11-19T21:10:43.545903+00:00 operation=Head duration=0.025859s path=hits_compatible/athena_partitioned/hits_1.parquet
+2025-11-19T21:10:43.571768+00:00 operation=Head duration=0.025684s path=hits_compatible/athena_partitioned/hits_1.parquet
+2025-11-19T21:10:43.597463+00:00 operation=Get duration=0.034194s size=524288 range: bytes=174440756-174965043 path=hits_compatible/athena_partitioned/hits_1.parquet
+2025-11-19T21:10:43.705821+00:00 operation=Head duration=0.022029s path=hits_compatible/athena_partitioned/hits_1.parquet
+
+Summaries:
++-----------+----------+-----------+-----------+-----------+-----------+-------+
+| Operation | Metric   | min       | max       | avg       | sum       | count |
++-----------+----------+-----------+-----------+-----------+-----------+-------+
+| Get       | duration | 0.034194s | 0.034194s | 0.034194s | 0.034194s | 1     |
+| Get       | size     | 524288 B  | 524288 B  | 524288 B  | 524288 B  | 1     |
+| Head      | duration | 0.022029s | 0.069763s | 0.035834s | 0.143335s | 4     |
+| Head      | size     |           |           |           |           | 4     |
++-----------+----------+-----------+-----------+-----------+-----------+-------+
+```
+
+This makes it far easier to diagnose slow remote scans and validate caching
+strategies. Thanks to [BlakeOrth] for leading this effort.
+
+### `DESCRIBE <query>`
+
+`DESCRIBE` now works on arbitrary queries, returning the schema instead
+of being an alias for `EXPLAIN` ([#18234](https://github.com/apache/datafusion/issues/18234)). This brings DataFusion in line with engines
+like DuckDB and makes it easy to inspect the output schema of queries
+without executing them. Thanks to [djanderson] for leading this effort.
+
+[djanderson]: https://github.com/djanderson
+
+For example:
+
+```sql
+DataFusion CLI v51.0.0
+> create table t(a int, b varchar, c float) as values (1, 'a', 2.0);
+0 row(s) fetched.
+Elapsed 0.002 seconds.
+
+> DESCRIBE SELECT a, b, SUM(c) FROM t GROUP BY a, b;
+
++-------------+-----------+-------------+
+| column_name | data_type | is_nullable |
++-------------+-----------+-------------+
+| a           | Int32     | YES         |
+| b           | Utf8View  | YES         |
+| sum(t.c)    | Float64   | YES         |
++-------------+-----------+-------------+
+3 row(s) fetched.
+```
+
+
+### Named arguments in SQL functions
+
+DataFusion now understands [PostgreSQL-style named arguments] (`param => value`)
+for scalar, aggregate, and window functions ([#17379](https://github.com/apache/datafusion/issues/17379)). You can mix positional and named
+arguments in any order, and error messages now list parameter names to make
+diagnostics clearer. UDF authors can also expose parameter names so their
+functions benefit from the same syntax. Thanks to [timsaucer] and [bubulalabu] for leading this effort.
+
+[PostgreSQL-style named arguments]: https://www.postgresql.org/docs/current/sql-syntax-calling-funcs.html
+
+For example, you can pass arguments to functions like this:
+```sql
+SELECT power(exponent => 3.0, base => 2.0);
+```
+
+[timsaucer]: https://github.com/timsaucer
+[bubulalabu]: https://github.com/bubulalabu
+
+### Metrics improvements
+
+The output of [EXPLAIN ANALYZE] has been improved to include more metrics
+about execution time and memory usage of each operator ([#18217]).
+You can learn more about these new metrics in the [metrics user guide]. Thanks to
+[2010YOUY01] for leading this effort.
+
+
+[#18217]: https://github.com/apache/datafusion/issues/18217
+[2010YOUY01]: https://github.com/2010YOUY01
+
+The `51.0.0` release adds:
+
+- **Configuration**: adds a new option `datafusion.explain.analyze_level`, which can be set to `summary` for a concise output or `dev` for the full set of metrics (the previous default).
+- **For all major operators**: adds `output_bytes`, reporting how many bytes of data each operator produces.
+- **FilterExec**: adds a `selectivity` metric (`output_rows / input_rows`) to show how effective the filter is.
+- **AggregateExec**: 
+  - adds detailed timing metrics for group-ID computation, aggregate argument evaluation, aggregation work, and emitting final results.
+  - adds a `reduction_factor` metric (`output_rows / input_rows`) to show how much grouping reduces the data.
+- **NestedLoopJoinExec**: adds a `selectivity` metric (`output_rows / (left_rows * right_rows)`) to show how many combinations actually pass the join condition.
+- Several display formatting improvements were added to make `EXPLAIN ANALYZE` output easier to read.
+
+[EXPLAIN ANALYZE]: https://datafusion.apache.org/user-guide/sql/explain.html#explain-analyze
+[metrics user guide]: https://datafusion.apache.org/user-guide/metrics.html
+
+For example, the following query:
+```sql
+set datafusion.explain.analyze_level = summary
+
+explain analyze 
+select count(*) 
+from 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet' 
+where "URL" <> '';
+```
+
+Now shows easier-to-understand metrics such as:
+
+```text
+ metrics=[
+   output_rows=1000000, 
+   elapsed_compute=16ns, 
+   output_bytes=222.5 MB, 
+   files_ranges_pruned_statistics=16 total → 16 matched, 
+   row_groups_pruned_statistics=3 total → 3 matched, 
+   row_groups_pruned_bloom_filter=3 total → 3 matched, 
+   page_index_rows_pruned=0 total → 0 matched,
+   bytes_scanned=33661364,
+   metadata_load_time=4.243098ms, 
+]
+```
+
+## Upgrade Guide and Changelog
+
+Upgrading to 51.0.0 should be straightforward for most users. Please review the
+[Upgrade Guide]
+for details on breaking changes and code snippets to help with the transition.
+For a comprehensive list of all changes, please refer to the [changelog].
+
+## About DataFusion
+
+[Apache DataFusion] is an extensible query engine, written in [Rust], that uses
+[Apache Arrow] as its in-memory format. DataFusion is used by developers to
+create new, fast, data-centric systems such as databases, dataframe libraries,
+and machine learning and streaming applications. While [DataFusion’s primary
+design goal] is to accelerate the creation of other data-centric systems, it
+provides a reasonable experience directly out of the box as a [dataframe
+library], [Python library], and [command-line SQL tool].
+
+[apache datafusion]: https://datafusion.apache.org/
+[rust]: https://www.rust-lang.org/
+[apache arrow]: https://arrow.apache.org
+[DataFusion’s primary design goal]: https://datafusion.apache.org/user-guide/introduction.html#project-goals
+[dataframe library]: https://datafusion.apache.org/user-guide/dataframe.html
+[python library]: https://datafusion.apache.org/python/
+[command-line SQL tool]: https://datafusion.apache.org/user-guide/cli/
+[Upgrade Guide]: https://datafusion.apache.org/library-user-guide/upgrading.html
+[zhuqi-lucas]: https://github.com/zhuqi-lucas
+[AdamGS]: https://github.com/AdamGS
+[simonvandel]: https://github.com/simonvandel
+[BlakeOrth]: https://github.com/BlakeOrth
+[CASE performance epic]: https://github.com/apache/datafusion/issues/18075
+[#18118]: https://github.com/apache/datafusion/issues/18118
+[#17501]: https://github.com/apache/datafusion/pull/17501
+[#17278]: https://github.com/apache/datafusion/pull/17278
+[#17207]: https://github.com/apache/datafusion/issues/17207
+[#17379]: https://github.com/apache/datafusion/issues/17379
+[etseidl]: https://github.com/etseidl
+[jhorstmann]: https://github.com/jhorstmann
+
+DataFusion's core thesis is that, as a community, together we can build much
+more advanced technology than any of us as individuals or companies could build
+alone. Without DataFusion, highly performant vectorized query engines would
+remain the domain of a few large companies and world-class research
+institutions. With DataFusion, we can all build on top of a shared foundation
+and focus on what makes our projects unique.
+
+## How to Get Involved
+
+DataFusion is not a project built or driven by a single person, company, or
+foundation. Rather, our community of users and contributors works together to
+build a shared technology that none of us could have built alone.
+
+If you are interested in joining us, we would love to have you. You can try out
+DataFusion on some of your own data and projects and let us know how it goes,
+contribute suggestions, documentation, bug reports, or a PR with documentation,
+tests, or code. A list of open issues suitable for beginners is [here], and you
+can find out how to reach us on the [communication doc].
+
+[here]: https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22
+[communication doc]: https://datafusion.apache.org/contributor-guide/communication.html