Commit e7a3738

alamb and 2010YOUY01 authored
Blog post for DataFusion 51.0.0 (#124)
* Add blog post for DataFusion 51.0.0
* Rough draft from codex
* add credits
* Updates
* update
* updates
* update
* update
* updates
* more
* comments
* Apply suggestions from code review

Co-authored-by: Yongting You <2010youy01@gmail.com>

* Update performance chart
* another pass
* update
* tweaks
* Consolidate redundant sections

---------

Co-authored-by: Yongting You <2010youy01@gmail.com>
1 parent 912c4ab commit e7a3738

3 files changed, 333 additions and 0 deletions
@@ -0,0 +1,333 @@
1+
---
2+
layout: post
3+
title: Apache DataFusion 51.0.0 Released
4+
date: 2025-11-25
5+
author: pmc
6+
categories: [release]
7+
---
8+
9+
<!--
10+
{% comment %}
11+
Licensed to the Apache Software Foundation (ASF) under one or more
12+
contributor license agreements. See the NOTICE file distributed with
13+
this work for additional information regarding copyright ownership.
14+
The ASF licenses this file to you under the Apache License, Version 2.0
15+
(the "License"); you may not use this file except in compliance with
16+
the License. You may obtain a copy of the License at
17+
18+
http://www.apache.org/licenses/LICENSE-2.0
19+
20+
Unless required by applicable law or agreed to in writing, software
21+
distributed under the License is distributed on an "AS IS" BASIS,
22+
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
23+
See the License for the specific language governing permissions and
24+
limitations under the License.
25+
{% endcomment %}
26+
-->
27+
28+
[TOC]
29+
30+
## Introduction
31+
32+
We are proud to announce the release of [DataFusion 51.0.0]. This post highlights
33+
some of the major improvements since [DataFusion 50.0.0]. The complete list of
34+
changes is available in the [changelog]. Thanks to the [128 contributors] for
35+
making this release possible.
36+
37+
[DataFusion 51.0.0]: https://crates.io/crates/datafusion/51.0.0
38+
[DataFusion 50.0.0]: https://datafusion.apache.org/blog/2025/09/29/datafusion-50.0.0/
39+
[changelog]: https://github.com/apache/datafusion/blob/branch-51/dev/changelog/51.0.0.md
40+
[128 contributors]: https://github.com/apache/datafusion/blob/branch-51/dev/changelog/51.0.0.md#credits
41+
42+
## Performance Improvements 🚀
43+
We continue to make significant performance improvements in DataFusion, both in
44+
the core engine and in the Parquet reader.
45+
46+
<img
47+
src="/blog/images/datafusion-51.0.0/performance_over_time_clickbench.png"
48+
width="100%"
49+
class="img-responsive"
50+
alt="Performance over time"
51+
/>
52+
53+
**Figure 1**: Average and median normalized query execution times for ClickBench queries for DataFusion 51.0.0 compared to previous releases.
54+
Query times are normalized using the ClickBench definition. See the
55+
[DataFusion Benchmarking Page](https://alamb.github.io/datafusion-benchmarking/)
56+
for more details.
57+
58+
### Faster `CASE` expression evaluation
59+
60+
This release builds on the [CASE performance epic] with significant improvements.
61+
Expressions short‑circuit earlier, reuse partial results, and avoid unnecessary
62+
scattering, speeding up common ETL patterns. Thanks to [pepijnve], [chenkovsky],
63+
and [petern48] for leading this effort. We hope to share more details on our
64+
implementation in a future post.
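
To make the short-circuiting idea concrete, here is a toy Python sketch. It is not DataFusion's implementation, which is vectorized Rust operating on Arrow arrays; `eval_case` and its evaluation counter are hypothetical names used only for illustration. Each `WHEN` branch is evaluated only for rows that no earlier branch has matched, and evaluation stops as soon as every row has a result.

```python
def eval_case(rows, branches, default=None):
    """Evaluate CASE WHEN ... THEN ... over `rows`, one branch at a time.

    `branches` is a list of (predicate, value_fn) pairs, checked in order.
    Returns the per-row results and the number of predicate evaluations.
    """
    result = [default] * len(rows)
    remaining = list(range(len(rows)))  # row indices with no match yet
    evaluations = 0
    for predicate, value_fn in branches:
        if not remaining:
            break  # short-circuit: every row already has a result
        unmatched = []
        for i in remaining:
            evaluations += 1
            if predicate(rows[i]):
                result[i] = value_fn(rows[i])
            else:
                unmatched.append(i)
        remaining = unmatched
    return result, evaluations


# Hypothetical data: three rows, three branches.
rows = [5, 50, 500]
branches = [
    (lambda x: x < 10, lambda x: "small"),
    (lambda x: x < 100, lambda x: "medium"),
    (lambda x: True, lambda x: "large"),
]
labels, evaluations = eval_case(rows, branches)
```

In this toy run only 6 predicate evaluations are needed instead of the 9 that evaluating every branch for every row would require.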
65+
66+
[pepijnve]: https://github.com/pepijnve
67+
[chenkovsky]: https://github.com/chenkovsky
68+
[petern48]: https://github.com/petern48
69+
70+
### Better Defaults for Remote Parquet Reads
71+
72+
By default, DataFusion now always fetches the last 512KB (configurable) of [Apache Parquet] files,
73+
which usually includes the footer and metadata ([#18118]). This
74+
change typically avoids 2 I/O requests for each Parquet file. While this
75+
setting has existed in DataFusion for many years, it was not previously enabled
76+
by default. Users can tune the number of bytes fetched in the initial I/O
77+
request via the `datafusion.execution.parquet.metadata_size_hint` [config setting]. Thanks to
78+
[zhuqi-lucas] for leading this effort.
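
As a rough illustration of why a size hint helps, the Python sketch below computes the byte range of such a suffix read and checks whether a prefetched tail already contains the whole footer. It relies on the Parquet file layout, in which an unencrypted file ends with a 4-byte little-endian metadata length followed by the `PAR1` magic. The function names are hypothetical, and the file size is inferred from the `Get` request shown in the I/O profiling trace later in this post (last byte offset plus one).

```python
import struct

FOOTER_LEN = 8  # 4-byte little-endian metadata length + b"PAR1" magic


def suffix_range(file_size, size_hint):
    """Inclusive (start, end) byte range covering the last `size_hint` bytes."""
    start = max(0, file_size - size_hint)
    return start, file_size - 1


def metadata_fits(tail):
    """True if the prefetched `tail` bytes already hold the full metadata."""
    metadata_len, magic = struct.unpack("<I4s", tail[-FOOTER_LEN:])
    if magic != b"PAR1":
        raise ValueError("not an unencrypted Parquet file")
    return metadata_len + FOOTER_LEN <= len(tail)


# Assumption: file size inferred from the trace in this post.
file_size = 174_965_044
byte_range = suffix_range(file_size, 512 * 1024)
```

With the default 512KB hint this yields the range `174440756-174965043`, the same range as the single `Get` request in the trace. When `metadata_fits` is true for the fetched tail, no further request is needed to read the footer.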
79+
80+
[config setting]: https://datafusion.apache.org/user-guide/configs.html
81+
[apache parquet]: https://parquet.apache.org/
82+
83+
### Faster Parquet metadata parsing
84+
85+
DataFusion 51 also includes the latest Parquet reader from
86+
[Arrow Rust 57.0.0], which parses Parquet metadata significantly faster. This is
87+
especially beneficial for workloads with many small Parquet files and scenarios
88+
where startup time or low latency is important. You can read more about the upstream work by
89+
[etseidl] and [jhorstmann] that enabled these improvements in the [Faster Apache Parquet Footer Metadata Using a Custom Thrift Parser] blog post.
90+
91+
<img
92+
src="/blog/images/datafusion-51.0.0/arrow-57-metadata-parsing.png"
93+
width="100%"
94+
class="img-responsive"
95+
alt="Metadata Parsing Performance Improvements in Arrow/Parquet 57"
96+
/>
97+
98+
**Figure 2**: Metadata parsing performance improvements in Arrow/Parquet 57.0.0.
99+
100+
[Arrow Rust 57.0.0]: https://arrow.apache.org/blog/2025/10/30/arrow-rs-57.0.0/
101+
[Faster Apache Parquet Footer Metadata Using a Custom Thrift Parser]: https://arrow.apache.org/blog/2025/10/23/rust-parquet-metadata/
102+
103+
104+
105+
## New Features ✨
106+
107+
### Decimal32/Decimal64 support
108+
109+
The new Arrow types `Decimal32` and `Decimal64` are now supported in DataFusion
110+
([#17501]), including aggregations such as `SUM`, `AVG`, `MIN/MAX`, and window
111+
functions. Thanks to [AdamGS] for leading this effort.
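
For context, the narrower types trade precision for storage: in Arrow, `Decimal32` holds up to 9 decimal digits and `Decimal64` up to 18, compared with 38 for `Decimal128` and 76 for `Decimal256`. The sketch below picks the narrowest type for a given precision. It is only an illustration, `smallest_decimal_type` is a hypothetical helper, and it does not describe how DataFusion itself chooses types.

```python
# Maximum precision (number of decimal digits) per Arrow decimal type,
# listed from narrowest to widest.
MAX_PRECISION = {
    "Decimal32": 9,
    "Decimal64": 18,
    "Decimal128": 38,
    "Decimal256": 76,
}


def smallest_decimal_type(precision):
    """Return the narrowest Arrow decimal type that can hold `precision` digits."""
    if precision < 1:
        raise ValueError("precision must be at least 1")
    for name, max_precision in MAX_PRECISION.items():
        if precision <= max_precision:
            return name
    raise ValueError(f"precision {precision} exceeds Decimal256")
```

For example, a price column declared with precision 7 would fit in `Decimal32` under this rule, using 4 bytes per value rather than 16.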
112+
113+
114+
### SQL Pipe Operators
115+
116+
DataFusion now supports the SQL pipe operator syntax
117+
([#17278]), enabling inline transforms such as:
118+
119+
```sql
120+
SELECT * FROM t
121+
|> WHERE a > 10
122+
|> ORDER BY b
123+
|> LIMIT 5;
124+
```
125+
126+
This syntax, [popularized by Google BigQuery], keeps multi-step transformations concise while preserving regular
127+
SQL semantics. Thanks to [simonvandel] for leading this effort.
128+
129+
[popularized by Google BigQuery]: https://docs.cloud.google.com/bigquery/docs/reference/standard-sql/pipe-syntax
130+
131+
### I/O Profiling in `datafusion-cli`
132+
133+
[datafusion-cli] now has built-in instrumentation to trace object store calls
134+
([#17207]). Toggle profiling
135+
with the [\object_store_profiling command] and inspect the exact `GET`/`LIST` requests issued during
136+
query execution:
137+
138+
[datafusion-cli]: https://datafusion.apache.org/user-guide/cli/
139+
[\object_store_profiling command]: https://datafusion.apache.org/user-guide/cli/usage.html#commands
140+
141+
```sql
142+
DataFusion CLI v51.0.0
143+
> \object_store_profiling trace
144+
ObjectStore Profile mode set to Trace
145+
> select count(*) from 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet';
146+
+----------+
147+
| count(*) |
148+
+----------+
149+
| 1000000 |
150+
+----------+
151+
1 row(s) fetched.
152+
Elapsed 0.367 seconds.
153+
154+
Object Store Profiling
155+
Instrumented Object Store: instrument_mode: Trace, inner: HttpStore
156+
2025-11-19T21:10:43.476121+00:00 operation=Head duration=0.069763s path=hits_compatible/athena_partitioned/hits_1.parquet
157+
2025-11-19T21:10:43.545903+00:00 operation=Head duration=0.025859s path=hits_compatible/athena_partitioned/hits_1.parquet
158+
2025-11-19T21:10:43.571768+00:00 operation=Head duration=0.025684s path=hits_compatible/athena_partitioned/hits_1.parquet
159+
2025-11-19T21:10:43.597463+00:00 operation=Get duration=0.034194s size=524288 range: bytes=174440756-174965043 path=hits_compatible/athena_partitioned/hits_1.parquet
160+
2025-11-19T21:10:43.705821+00:00 operation=Head duration=0.022029s path=hits_compatible/athena_partitioned/hits_1.parquet
161+
162+
Summaries:
163+
+-----------+----------+-----------+-----------+-----------+-----------+-------+
164+
| Operation | Metric | min | max | avg | sum | count |
165+
+-----------+----------+-----------+-----------+-----------+-----------+-------+
166+
| Get | duration | 0.034194s | 0.034194s | 0.034194s | 0.034194s | 1 |
167+
| Get | size | 524288 B | 524288 B | 524288 B | 524288 B | 1 |
168+
| Head | duration | 0.022029s | 0.069763s | 0.035834s | 0.143335s | 4 |
169+
| Head | size | | | | | 4 |
170+
+-----------+----------+-----------+-----------+-----------+-----------+-------+
171+
```
172+
173+
This makes it far easier to diagnose slow remote scans and validate caching
174+
strategies. Thanks to [BlakeOrth] for leading this effort.
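
The summary table can be cross-checked against the individual trace lines. The short Python sketch below recomputes the `Head` row from the four `Head` durations in the trace above, using integer microseconds to avoid floating-point rounding. The variable names are illustrative and not part of `datafusion-cli`.

```python
# Durations of the four Head requests from the trace, in microseconds.
head_durations_us = [69_763, 25_859, 25_684, 22_029]

summary = {
    "min": min(head_durations_us),
    "max": max(head_durations_us),
    "sum": sum(head_durations_us),
    "count": len(head_durations_us),
}
summary["avg"] = summary["sum"] / summary["count"]
```

The sum is 143335 microseconds (0.143335s) and the average is 35833.75 microseconds, which the table reports rounded as 0.035834s.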
175+
176+
### `DESCRIBE <query>`
177+
178+
`DESCRIBE` now works on arbitrary queries, returning the schema instead
179+
of being an alias for `EXPLAIN` ([#18234](https://github.com/apache/datafusion/issues/18234)). This brings DataFusion in line with engines
180+
like DuckDB and makes it easy to inspect the output schema of queries
181+
without executing them. Thanks to [djanderson] for leading this effort.
182+
183+
[djanderson]: https://github.com/djanderson
184+
185+
For example:
186+
187+
```sql
188+
DataFusion CLI v51.0.0
189+
> create table t(a int, b varchar, c float) as values (1, 'a', 2.0);
190+
0 row(s) fetched.
191+
Elapsed 0.002 seconds.
192+
193+
> DESCRIBE SELECT a, b, SUM(c) FROM t GROUP BY a, b;
194+
195+
+-------------+-----------+-------------+
196+
| column_name | data_type | is_nullable |
197+
+-------------+-----------+-------------+
198+
| a | Int32 | YES |
199+
| b | Utf8View | YES |
200+
| sum(t.c) | Float64 | YES |
201+
+-------------+-----------+-------------+
202+
3 row(s) fetched.
203+
```
204+
205+
206+
### Named arguments in SQL functions
207+
208+
DataFusion now understands [PostgreSQL-style named arguments] (`param => value`)
209+
for scalar, aggregate, and window functions ([#17379](https://github.com/apache/datafusion/issues/17379)). You can mix positional and named
210+
arguments (named arguments may appear in any order), and error messages now list parameter names to make
211+
diagnostics clearer. UDF authors can also expose parameter names so their
212+
functions benefit from the same syntax. Thanks to [timsaucer] and [bubulalabu] for leading this effort.
213+
214+
[PostgreSQL-style named arguments]: https://www.postgresql.org/docs/current/sql-syntax-calling-funcs.html
215+
216+
For example, you can pass arguments to functions like this:
217+
```sql
218+
SELECT power(exponent => 3.0, base => 2.0);
219+
```
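
Conceptually, the engine has to map each named argument back to its declared position before the function runs. The Python sketch below shows one way that resolution could work. It is a hypothetical illustration rather than DataFusion's code: `resolve_args` is an invented name, and it assumes positional arguments fill the leading parameters. The parameter names `base` and `exponent` are taken from the `power` example above.

```python
def resolve_args(param_names, positional, named):
    """Return argument values in declared order, or raise TypeError."""
    if len(positional) > len(param_names):
        raise TypeError(f"too many arguments; parameters are {param_names}")
    # Positional arguments fill the leading parameters.
    resolved = dict(zip(param_names, positional))
    for name, value in named.items():
        if name not in param_names:
            raise TypeError(f"unknown parameter {name!r}; expected one of {param_names}")
        if name in resolved:
            raise TypeError(f"parameter {name!r} specified more than once")
        resolved[name] = value
    missing = [p for p in param_names if p not in resolved]
    if missing:
        raise TypeError(f"missing parameters: {missing}")
    return [resolved[p] for p in param_names]


# Mirrors power(exponent => 3.0, base => 2.0) from the example above.
args = resolve_args(["base", "exponent"], [], {"exponent": 3.0, "base": 2.0})
```

Here `args` comes back in declared order, `[2.0, 3.0]`, regardless of the order in which the named arguments were written. Listing the declared names in the error text is what makes a mistyped parameter easy to spot.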
220+
221+
[timsaucer]: https://github.com/timsaucer
222+
[bubulalabu]: https://github.com/bubulalabu
223+
224+
### Metrics improvements
225+
226+
The output of [EXPLAIN ANALYZE] has been improved to include more metrics
227+
about execution time and memory usage of each operator ([#18217]).
228+
You can learn more about these new metrics in the [metrics user guide]. Thanks to
229+
[2010YOUY01] for leading this effort.
230+
231+
232+
[#18217]: https://github.com/apache/datafusion/issues/18217
233+
[2010YOUY01]: https://github.com/2010YOUY01
234+
235+
The `51.0.0` release adds:
236+
237+
- **Configuration**: adds a new option `datafusion.explain.analyze_level`, which can be set to `summary` for a concise output or `dev` for the full set of metrics (the previous default).
238+
- **For all major operators**: adds `output_bytes`, reporting how many bytes of data each operator produces.
239+
- **FilterExec**: adds a `selectivity` metric (`output_rows / input_rows`) to show how effective the filter is.
240+
- **AggregateExec**:
241+
- adds detailed timing metrics for group-ID computation, aggregate argument evaluation, aggregation work, and emitting final results.
242+
- adds a `reduction_factor` metric (`output_rows / input_rows`) to show how much grouping reduces the data.
243+
- **NestedLoopJoinExec**: adds a `selectivity` metric (`output_rows / (left_rows * right_rows)`) to show how many combinations actually pass the join condition.
244+
- Several display formatting improvements were added to make `EXPLAIN ANALYZE` output easier to read.
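
The ratio metrics in the list above are simple to reproduce by hand. The Python sketch below applies the three formulas to made-up row counts; the numbers and function names are hypothetical and are not taken from a real query plan.

```python
def filter_selectivity(output_rows, input_rows):
    """FilterExec: fraction of input rows that pass the filter."""
    return output_rows / input_rows


def reduction_factor(output_rows, input_rows):
    """AggregateExec: how much grouping shrinks the data."""
    return output_rows / input_rows


def nested_loop_join_selectivity(output_rows, left_rows, right_rows):
    """NestedLoopJoinExec: fraction of row combinations that pass the join condition."""
    return output_rows / (left_rows * right_rows)


# Hypothetical row counts, for illustration only.
filter_sel = filter_selectivity(250_000, 1_000_000)
agg_reduction = reduction_factor(1_000, 1_000_000)
join_sel = nested_loop_join_selectivity(500, 100, 200)
```

With these inputs, a quarter of the rows pass the filter (0.25), the aggregation keeps one output row per thousand input rows (0.001), and 500 of the 20,000 possible join combinations match (0.025). Lower values mean the operator discards or collapses more of its input.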
245+
246+
[EXPLAIN ANALYZE]: https://datafusion.apache.org/user-guide/sql/explain.html#explain-analyze
247+
[metrics user guide]: https://datafusion.apache.org/user-guide/metrics.html
248+
249+
For example, the following query:
250+
```sql
251+
set datafusion.explain.analyze_level = summary;
252+
253+
explain analyze
254+
select count(*)
255+
from 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet'
256+
where "URL" <> '';
257+
```
258+
259+
now shows easier-to-understand metrics such as:
260+
261+
```text
262+
metrics=[
263+
output_rows=1000000,
264+
elapsed_compute=16ns,
265+
output_bytes=222.5 MB,
266+
files_ranges_pruned_statistics=16 total → 16 matched,
267+
row_groups_pruned_statistics=3 total → 3 matched,
268+
row_groups_pruned_bloom_filter=3 total → 3 matched,
269+
page_index_rows_pruned=0 total → 0 matched,
270+
bytes_scanned=33661364,
271+
metadata_load_time=4.243098ms,
272+
]
273+
```
274+
275+
## Upgrade Guide and Changelog
276+
277+
Upgrading to 51.0.0 should be straightforward for most users. Please review the
278+
[Upgrade Guide]
279+
for details on breaking changes and code snippets to help with the transition.
280+
For a comprehensive list of all changes, please refer to the [changelog].
281+
282+
## About DataFusion
283+
284+
[Apache DataFusion] is an extensible query engine, written in [Rust], that uses
285+
[Apache Arrow] as its in-memory format. DataFusion is used by developers to
286+
create new, fast, data-centric systems such as databases, dataframe libraries,
287+
and machine learning and streaming applications. While [DataFusion’s primary
288+
design goal] is to accelerate the creation of other data-centric systems, it
289+
provides a reasonable experience directly out of the box as a [dataframe
290+
library], [Python library], and [command-line SQL tool].
291+
292+
[apache datafusion]: https://datafusion.apache.org/
293+
[rust]: https://www.rust-lang.org/
294+
[apache arrow]: https://arrow.apache.org
295+
[DataFusion’s primary design goal]: https://datafusion.apache.org/user-guide/introduction.html#project-goals
296+
[dataframe library]: https://datafusion.apache.org/user-guide/dataframe.html
297+
[python library]: https://datafusion.apache.org/python/
298+
[command-line SQL tool]: https://datafusion.apache.org/user-guide/cli/
299+
[Upgrade Guide]: https://datafusion.apache.org/library-user-guide/upgrading.html
300+
[zhuqi-lucas]: https://github.com/zhuqi-lucas
301+
[AdamGS]: https://github.com/AdamGS
302+
[simonvandel]: https://github.com/simonvandel
303+
[BlakeOrth]: https://github.com/BlakeOrth
304+
[CASE performance epic]: https://github.com/apache/datafusion/issues/18075
305+
[#18118]: https://github.com/apache/datafusion/issues/18118
306+
[#17501]: https://github.com/apache/datafusion/pull/17501
307+
[#17278]: https://github.com/apache/datafusion/pull/17278
308+
[#17207]: https://github.com/apache/datafusion/issues/17207
309+
[#17379]: https://github.com/apache/datafusion/issues/17379
310+
[etseidl]: https://github.com/etseidl
311+
[jhorstmann]: https://github.com/jhorstmann
312+
313+
DataFusion's core thesis is that, as a community, together we can build much
314+
more advanced technology than any of us as individuals or companies could build
315+
alone. Without DataFusion, highly performant vectorized query engines would
316+
remain the domain of a few large companies and world-class research
317+
institutions. With DataFusion, we can all build on top of a shared foundation
318+
and focus on what makes our projects unique.
319+
320+
## How to Get Involved
321+
322+
DataFusion is not a project built or driven by a single person, company, or
323+
foundation. Rather, our community of users and contributors works together to
324+
build a shared technology that none of us could have built alone.
325+
326+
If you are interested in joining us, we would love to have you. You can try out
327+
DataFusion on some of your own data and projects and let us know how it goes,
328+
contribute suggestions, documentation, bug reports, or a PR with documentation,
329+
tests, or code. A list of open issues suitable for beginners is [here], and you
330+
can find out how to reach us on the [communication doc].
331+
332+
[here]: https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22
333+
[communication doc]: https://datafusion.apache.org/contributor-guide/communication.html