
Commit a5520d5

alamb, akurmustafa, and kevinjqliu authored
Blog: Optimizing SQL and DataFrames (#74)
* Blog: Optimizing SQL and DataFrames
* Fix links, minor changes
* Fix links
* Clarify how the query planner / optimizer / dataframes are related
* Apply suggestions from code review

  Co-authored-by: Kevin Liu <kevinjqliu@users.noreply.github.com>

---------

Co-authored-by: Mustafa Akur <akurmustafa@gmail.com>
Co-authored-by: Kevin Liu <kevinjqliu@users.noreply.github.com>
1 parent b7d6a21 commit a5520d5

15 files changed

+783
-0
lines changed
Lines changed: 250 additions & 0 deletions
@@ -0,0 +1,250 @@
1+
---
2+
layout: post
3+
title: Optimizing SQL (and DataFrames) in DataFusion, Part 1: Query Optimization Overview
4+
date: 2025-06-15
5+
author: alamb, akurmustafa
6+
categories: [core]
7+
---
8+
9+
<!--
10+
{% comment %}
11+
Licensed to the Apache Software Foundation (ASF) under one or more
12+
contributor license agreements. See the NOTICE file distributed with
13+
this work for additional information regarding copyright ownership.
14+
The ASF licenses this file to you under the Apache License, Version 2.0
15+
(the "License"); you may not use this file except in compliance with
16+
the License. You may obtain a copy of the License at
17+
18+
http://www.apache.org/licenses/LICENSE-2.0
19+
20+
Unless required by applicable law or agreed to in writing, software
21+
distributed under the License is distributed on an "AS IS" BASIS,
22+
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
23+
See the License for the specific language governing permissions and
24+
limitations under the License.
25+
{% endcomment %}
26+
-->
27+
28+
29+
30+
*Note: this blog post was originally published [on the InfluxData blog](https://www.influxdata.com/blog/optimizing-sql-dataframes-part-one/).*
31+
32+
33+
## Introduction
34+
35+
Sometimes Query Optimizers are seen as a sort of black magic, [“the most
36+
challenging problem in computer
37+
science,”](https://15799.courses.cs.cmu.edu/spring2025/) according to Father
38+
Pavlo, or some behind-the-scenes player. We believe this perception is because:
39+
40+
41+
1. One must implement the rest of a database system (data storage, transactions,
42+
SQL parser, expression evaluation, plan execution, etc.) **before** the
43+
optimizer becomes critical<sup id="fn5">[5](#footnote5)</sup>.
44+
45+
2. Some parts of the optimizer are tightly tied to the rest of the system (e.g.,
46+
storage or indexes), so many classic optimizers are described with
47+
system-specific terminology.
48+
49+
3. Some optimizer tasks, such as access path selection and join ordering, are known
50+
challenges and not yet solved (practically)—maybe they really do require
51+
black magic 🤔.
52+
53+
However, Query Optimizers are no more complicated in theory or practice than other parts of a database system, as we will argue in a series of posts:
54+
55+
**Part 1 (this post):**
56+
57+
* Review what a Query Optimizer is, what it does, and why you need one for SQL and DataFrames.
58+
* Describe how industrial Query Optimizers are structured and the standard classes of optimizations they apply.
59+
60+
**Part 2:**
61+
62+
* Describe the optimization categories with examples and pointers to implementations.
63+
* Describe [Apache DataFusion](https://datafusion.apache.org/)’s rationale and approach to query optimization, specifically for access path selection and join ordering.
64+
65+
After reading these blog posts, we hope people will use DataFusion to:
66+
67+
1. Build their own system-specific optimizers.
68+
2. Perform practical academic research on optimization (especially researchers
69+
working on new optimizations / join ordering—looking at you [CMU
70+
15-799](https://15799.courses.cs.cmu.edu/spring2025/), next year).
71+
72+
73+
## Query Optimizer Background
74+
75+
The key pitch for querying databases, and likely the key to the longevity of SQL
76+
(despite people’s love/hate relationship—see [SQL or Death? Seminar Series –
77+
Spring 2025](https://db.cs.cmu.edu/seminar2025/)), is that it disconnects the
78+
`WHAT` you want to compute from the `HOW` to do it. SQL is a *declarative*
79+
language: it describes what answers are desired, unlike an *imperative*
80+
language such as Python, where you describe how to do the computation as shown
81+
in Figure 1.
82+
83+
<img src="/blog/images/optimizing-sql-dataframes/query-execution.png" width="80%" class="img-responsive" alt="Fig 1: Query Execution."/>
84+
85+
**Figure 1**: Query Execution: Users describe the answer they want using either
86+
SQL or a DataFrame. For SQL, a Query Planner translates the parsed query
87+
into an *initial plan*. The DataFrame API creates an initial plan directly.
88+
The initial plan is correct, but slow. Then, the Query
89+
Optimizer rewrites the initial plan into an *optimized plan*, which computes
90+
the same results but faster and more efficiently. Finally, the Execution Engine
91+
executes the optimized plan, producing results.
92+
93+
## SQL, DataFrames, LogicalPlan Equivalence
94+
95+
Given their name, it is not surprising that Query Optimizers can improve the
96+
performance of SQL queries. However, it is under-appreciated that this also
97+
applies to DataFrame style APIs.
98+
99+
Classic DataFrame systems such as [pandas] and [Polars] (by default) execute
100+
eagerly and thus have limited opportunities for optimization. However, more
101+
modern APIs such as [Polars' lazy API], [Apache Spark's DataFrame], and
102+
[DataFusion's DataFrame] are much faster as they use the design shown in Figure
103+
1 and apply many query optimization techniques.
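To make the eager versus lazy distinction concrete, here is a minimal sketch in plain Rust. It is not how pandas, Polars, Spark, or DataFusion are implemented: the `Op` enum and both functions are hypothetical names invented for this illustration. The eager version runs each operation over its whole input as soon as it is requested, while the lazy version sees the whole pipeline first and can stop scanning once a limit is satisfied.

```rust
/// Operations a user can request. Illustrative names, not a real API.
#[derive(Debug, Clone, Copy)]
enum Op {
    GreaterThan(i64),
    Limit(usize),
}

/// Eager execution: every operation runs immediately over its whole input.
/// Returns the result and the number of rows examined across all steps.
fn run_eager(data: &[i64], ops: &[Op]) -> (Vec<i64>, usize) {
    let mut rows: Vec<i64> = data.to_vec();
    let mut examined = 0;
    for op in ops {
        examined += rows.len();
        rows = match *op {
            Op::GreaterThan(min) => rows.into_iter().filter(|v| *v > min).collect(),
            Op::Limit(n) => rows.into_iter().take(n).collect(),
        };
    }
    (rows, examined)
}

/// Lazy execution: the whole pipeline is known before anything runs, so each
/// row flows through all operations in one pass and the scan can stop early.
fn run_lazy(data: &[i64], ops: &[Op]) -> (Vec<i64>, usize) {
    // For each Limit, how many rows have reached it so far.
    let mut passed = vec![0usize; ops.len()];
    let mut out = Vec::new();
    let mut examined = 0;
    'rows: for &v in data {
        // Once any Limit is full, no later row can get past it: stop scanning.
        for (i, op) in ops.iter().enumerate() {
            if let Op::Limit(n) = *op {
                if passed[i] >= n {
                    break 'rows;
                }
            }
        }
        examined += 1;
        let mut keep = true;
        for (i, op) in ops.iter().enumerate() {
            match *op {
                Op::GreaterThan(min) => {
                    if v <= min {
                        keep = false;
                        break;
                    }
                }
                Op::Limit(_) => passed[i] += 1,
            }
        }
        if keep {
            out.push(v);
        }
    }
    (out, examined)
}
```

For the values 1 through 1000 with the pipeline `[GreaterThan(10), Limit(3)]`, both functions return the same three rows, but the lazy version examines far fewer rows because it never looks past the third match.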
104+
105+
[pandas]: https://pandas.pydata.org/
106+
[Polars]: https://pola.rs/
107+
[Polars' lazy API]: https://docs.pola.rs/user-guide/lazy/using/
108+
[Apache Spark's DataFrame]: https://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes
109+
[DataFusion's DataFrame]: https://datafusion.apache.org/user-guide/dataframe.html
110+
111+
## Example of Query Optimizer
112+
113+
This section motivates the value of a Query Optimizer with an example. Let’s say
114+
you have some observations of animal behavior, as illustrated in Table 1.
115+
116+
<img src="/blog/images/optimizing-sql-dataframes/table1.png" width="75%" class="img-responsive" alt="Table 1: Observational Data."/>
117+
118+
**Table 1**: Example observational data.
119+
120+
If the user wants to know the average population for some species in the last
121+
month, they can write a SQL query or a DataFrame program such as the following:
122+
123+
SQL:
124+
125+
```sql
SELECT location, AVG(population)
FROM observations
WHERE species = 'contrarian spider' AND
      observation_time >= now() - interval '1 month'
GROUP BY location
```
132+
133+
DataFrame:
134+
135+
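```rust
use datafusion::{functions_aggregate::expr_fn::avg, prelude::*};
// Sketch only: assumes `ctx` is a SessionContext with `observations` registered
// and `one_month` is an interval literal `Expr` built elsewhere.
let df = ctx.table("observations").await?
    .filter(col("species").eq(lit("contrarian spider")))?
    .filter(col("observation_time").gt_eq(now() - one_month))?
    .aggregate(vec![col("location")], vec![avg(col("population"))])?;
```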
141+
142+
Within DataFusion, both the SQL and DataFrame are translated into the same
143+
[LogicalPlan], a tree of relational operators. This is a fancy way of
144+
saying data flow graphs where the edges represent tabular data (rows + columns)
145+
and the nodes represent a transformation (see [this DataFusion overview video]
146+
for more details). The initial `LogicalPlan` for the queries above is shown in
147+
Figure 2.
148+
149+
[LogicalPlan]: https://docs.rs/datafusion/latest/datafusion/logical_expr/enum.LogicalPlan.html
150+
[this DataFusion overview video]: https://youtu.be/EzZTLiSJnhY
151+
152+
<img src="/blog/images/optimizing-sql-dataframes/initial-logical-plan.png" width="72%" class="img-responsive" alt="Fig 2: Initial Logical Plan."/>
153+
154+
**Figure 2**: Example initial `LogicalPlan` for SQL and DataFrame query. The
155+
plan is read from bottom to top, computing the results in each step.
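The tree structure described above can be sketched with a small Rust enum. This is a toy stand-in, far simpler than DataFusion's actual `LogicalPlan`: the `Plan` type, its variants, and the three-node shape (scan, then filter, then aggregate) are assumptions made for illustration, with predicates kept as plain strings.

```rust
/// A toy plan tree. Each node owns its input, so data flows from the leaf
/// (the scan) up to the root (the aggregate).
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan { table: String },
    Filter { predicate: String, input: Box<Plan> },
    Aggregate { group_by: Vec<String>, aggregates: Vec<String>, input: Box<Plan> },
}

/// Build an initial plan for the example query: no rewrites applied yet.
fn initial_plan() -> Plan {
    let scan = Plan::Scan { table: "observations".to_string() };
    let filter = Plan::Filter {
        predicate: "species = 'contrarian spider' AND \
                    observation_time >= now() - interval '1 month'"
            .to_string(),
        input: Box::new(scan),
    };
    Plan::Aggregate {
        group_by: vec!["location".to_string()],
        aggregates: vec!["AVG(population)".to_string()],
        input: Box::new(filter),
    }
}

/// Render the plan with the root first and each input indented below it,
/// so execution order reads from the last line to the first.
fn explain(plan: &Plan) -> Vec<String> {
    fn walk(plan: &Plan, depth: usize, lines: &mut Vec<String>) {
        let pad = "  ".repeat(depth);
        match plan {
            Plan::Scan { table } => lines.push(format!("{}Scan: {}", pad, table)),
            Plan::Filter { predicate, input } => {
                lines.push(format!("{}Filter: {}", pad, predicate));
                walk(input, depth + 1, lines);
            }
            Plan::Aggregate { group_by, aggregates, input } => {
                lines.push(format!(
                    "{}Aggregate: group_by=[{}], aggr=[{}]",
                    pad,
                    group_by.join(", "),
                    aggregates.join(", ")
                ));
                walk(input, depth + 1, lines);
            }
        }
    }
    let mut lines = Vec::new();
    walk(plan, 0, &mut lines);
    lines
}
```

Because both the SQL and DataFrame front ends would produce the same tree, any rewrite written against `Plan` applies to both.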
156+
157+
The optimizer's job is to take this query plan and rewrite it into an alternate
158+
plan that computes the same results but faster, such as the one shown in Figure
159+
3.
160+
161+
<img src="/blog/images/optimizing-sql-dataframes/optimized-logical-plan.png" width="80%" class="img-responsive" alt="Fig 3: Optimized Logical Plan."/>
162+
163+
**Figure 3**: An example optimized plan that computes the same result as the
164+
plan in Figure 2 more efficiently. The diagram highlights where the optimizer
165+
has applied *Projection Pushdown*, *Filter Pushdown*, and *Constant Evaluation*.
166+
Note that this is a simplified example for explanatory purposes, and actual
167+
optimizers such as the one in DataFusion perform additional tasks such as
168+
choosing specific aggregation algorithms.
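As one concrete instance of the rewrites named in the caption, the following is a minimal sketch of *Constant Evaluation* over a toy expression tree. The `Expr` enum and `fold_constants` function are hypothetical and unrelated to DataFusion's real expression types. Timestamps are plain integers in seconds, and "1 month" is approximated as 30 days, purely to keep the arithmetic simple.

```rust
/// A toy expression tree. Timestamps and intervals are integers in seconds.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(i64),
    Now,
    Sub(Box<Expr>, Box<Expr>),
    GtEq(Box<Expr>, Box<Expr>),
}

/// Evaluate every sub-expression that does not depend on a column, once, at
/// planning time. `now` is the timestamp substituted for `Expr::Now`.
fn fold_constants(expr: Expr, now: i64) -> Expr {
    match expr {
        Expr::Now => Expr::Literal(now),
        Expr::Sub(left, right) => {
            match (fold_constants(*left, now), fold_constants(*right, now)) {
                // Both sides are known: compute the result immediately.
                (Expr::Literal(a), Expr::Literal(b)) => Expr::Literal(a - b),
                // Otherwise keep the subtraction with simplified children.
                (l, r) => Expr::Sub(Box::new(l), Box::new(r)),
            }
        }
        Expr::GtEq(left, right) => Expr::GtEq(
            Box::new(fold_constants(*left, now)),
            Box::new(fold_constants(*right, now)),
        ),
        // Columns and literals are already as simple as they can be.
        other => other,
    }
}

/// The time predicate from the example query: observation_time >= now() - 30 days.
fn example_predicate() -> Expr {
    let thirty_days = 30 * 24 * 60 * 60; // assumption: one month = 30 days
    Expr::GtEq(
        Box::new(Expr::Column("observation_time".to_string())),
        Box::new(Expr::Sub(Box::new(Expr::Now), Box::new(Expr::Literal(thirty_days)))),
    )
}
```

After folding, the predicate compares `observation_time` against a single literal, so the subtraction is performed once during planning rather than once per row during execution.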
169+
170+
171+
## Query Optimizer Implementation
172+
173+
Industrial optimizers, such as
174+
DataFusion ([source](https://github.com/apache/datafusion/tree/334d6ec50f36659403c96e1bffef4228be7c458e/datafusion/optimizer/src)),
175+
ClickHouse ([source](https://github.com/ClickHouse/ClickHouse/tree/master/src/Analyzer/Passes), [source](https://github.com/ClickHouse/ClickHouse/tree/master/src/Processors/QueryPlan/Optimizations)),
176+
DuckDB ([source](https://github.com/duckdb/duckdb/tree/4afa85c6a4dacc39524d1649fd8eb8c19c28ad14/src/optimizer)),
177+
and Apache Spark ([source](https://github.com/apache/spark/tree/7bc8e99cde424c59b98fe915e3fdaaa30beadb76/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer)),
178+
are implemented as a series of passes or rules that rewrite a query plan. The
179+
overall optimizer is composed of a sequence of these rules,<sup id="fn6">[6](#footnote6)</sup> as shown in
180+
Figure 4. The specific order of the rules also often matters, but we will not
181+
discuss this detail in this post.
182+
183+
A multi-pass design is standard because it helps developers:
184+
185+
1. Understand, implement, and test each pass in isolation
186+
2. Easily extend the optimizer by adding new passes
187+
188+
<img src="/blog/images/optimizing-sql-dataframes/optimizer-passes.png" width="80%" class="img-responsive" alt="Fig 4: Query Optimizer Passes."/>
189+
190+
**Figure 4**: Query Optimizers are implemented as a series of rules that each
191+
rewrite the query plan. Each rule's algorithm is expressed as a transformation
192+
of a previous plan.
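The multi-pass structure can be sketched in a few lines of Rust. The trait, rule names, and `Plan` type below are hypothetical and deliberately minimal. Real optimizers, including DataFusion's, carry extra machinery such as configuration, error handling, and change tracking that is omitted here.

```rust
/// A toy plan: a scan, optionally wrapped in filters that each hold a
/// conjunction of predicates (kept as strings for simplicity).
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan(String),
    Filter(Vec<String>, Box<Plan>),
}

/// Each pass is a self-contained plan-to-plan rewrite.
trait OptimizerRule {
    fn name(&self) -> &str;
    fn rewrite(&self, plan: Plan) -> Plan;
}

/// Drops predicates that are literally `true`, and filters left with none.
struct RemoveTrivialPredicates;

impl OptimizerRule for RemoveTrivialPredicates {
    fn name(&self) -> &str {
        "remove_trivial_predicates"
    }
    fn rewrite(&self, plan: Plan) -> Plan {
        match plan {
            Plan::Filter(predicates, input) => {
                let input = self.rewrite(*input);
                let kept: Vec<String> =
                    predicates.into_iter().filter(|p| p.as_str() != "true").collect();
                if kept.is_empty() {
                    input
                } else {
                    Plan::Filter(kept, Box::new(input))
                }
            }
            scan => scan,
        }
    }
}

/// Collapses directly nested filters into a single filter.
struct MergeFilters;

impl OptimizerRule for MergeFilters {
    fn name(&self) -> &str {
        "merge_filters"
    }
    fn rewrite(&self, plan: Plan) -> Plan {
        match plan {
            Plan::Filter(mut predicates, input) => match self.rewrite(*input) {
                Plan::Filter(inner, child) => {
                    predicates.extend(inner);
                    Plan::Filter(predicates, child)
                }
                other => Plan::Filter(predicates, Box::new(other)),
            },
            scan => scan,
        }
    }
}

/// The optimizer is just the rules applied in sequence. Returns the final
/// plan and the names of the rules that ran, in order.
fn optimize(rules: &[Box<dyn OptimizerRule>], mut plan: Plan) -> (Plan, Vec<String>) {
    let mut applied = Vec::new();
    for rule in rules {
        plan = rule.rewrite(plan);
        applied.push(rule.name().to_string());
    }
    (plan, applied)
}
```

Each rule can be unit tested on its own with a small input plan, and adding a new optimization means appending one more boxed rule to the list passed to `optimize`.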
193+
194+
There are three major classes of optimizations in industrial optimizers:
195+
196+
1. **Always Optimizations**: These are always good to do and thus are always
197+
applied. This class of optimization includes expression simplification,
198+
predicate pushdown, and limit pushdown. These optimizations are typically
199+
simple in theory, though they require nontrivial amounts of code and tests to
200+
implement in practice.
201+
202+
2. **Engine Specific Optimizations**: These optimizations take advantage of
203+
specific engine features, such as how expressions are evaluated or what
204+
particular hash or join implementations are available.
205+
206+
3. **Access Path and Join Order Selection**: These passes choose one access
207+
method per table and a join order for execution, typically using heuristics
208+
and a cost model to make tradeoffs between the options. Databases often have
209+
multiple ways to access the data (e.g., index scan or full-table scan), as
210+
well as many potential orders to combine (join) multiple tables. These
211+
methods compute the same result but can vary drastically in performance.
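To illustrate why join order matters, here is a toy cost-based selection in Rust. Everything in it is an assumption made for the sketch: the table names and row counts are invented, every join is assumed to keep a fixed fraction of the cross product (one row in `selectivity_denominator`), only left-deep orders are considered, and the cost of a plan is the sum of its estimated intermediate result sizes. Production optimizers use much richer statistics and search strategies.

```rust
/// A table with an estimated row count (hypothetical statistics).
#[derive(Debug, Clone, Copy)]
struct Table {
    name: &'static str,
    rows: u64,
}

/// Cost of joining the tables left to right: the sum of the estimated sizes
/// of every intermediate result. Each join is assumed to keep one row out of
/// every `selectivity_denominator` rows of the cross product.
fn plan_cost(order: &[Table], selectivity_denominator: u64) -> u64 {
    let mut current = order[0].rows;
    let mut total = 0;
    for table in &order[1..] {
        current = current * table.rows / selectivity_denominator;
        total += current;
    }
    total
}

/// All orderings of the input tables. This grows factorially, so exhaustive
/// enumeration is only practical for a handful of tables.
fn permutations(tables: &[Table]) -> Vec<Vec<Table>> {
    if tables.len() <= 1 {
        return vec![tables.to_vec()];
    }
    let mut result = Vec::new();
    for i in 0..tables.len() {
        let mut rest = tables.to_vec();
        let first = rest.remove(i);
        for mut tail in permutations(&rest) {
            tail.insert(0, first);
            result.push(tail);
        }
    }
    result
}

/// Pick the cheapest ordering under the toy cost model, or `None` if there
/// are no tables to join.
fn best_join_order(tables: &[Table], selectivity_denominator: u64) -> Option<(Vec<Table>, u64)> {
    permutations(tables)
        .into_iter()
        .filter(|order| !order.is_empty())
        .map(|order| {
            let cost = plan_cost(&order, selectivity_denominator);
            (order, cost)
        })
        .min_by_key(|(_, cost)| *cost)
}
```

With three tables of 1,000,000, 1,000, and 10 rows and a denominator of 1,000, joining the two small tables first costs roughly one hundredth of starting with the largest table, even though every ordering produces the same final result.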
212+
213+
This brings us to the end of Part 1. In Part 2, we will explain these classes of
214+
optimizations in more detail and provide examples of how they are implemented in
215+
DataFusion and other systems.
216+
217+
## About the Authors
218+
219+
[Andrew Lamb](https://www.linkedin.com/in/andrewalamb/) is a Staff Engineer at
220+
[InfluxData](https://www.influxdata.com/) and an [Apache
221+
DataFusion](https://datafusion.apache.org/) PMC member. A Database Optimizer
222+
connoisseur, he worked on the [Vertica Analytic
223+
Database](https://vldb.org/pvldb/vol5/p1790_andrewlamb_vldb2012.pdf) Query
224+
Optimizer for six years, has several granted US patents related to query
225+
optimization<sup id="fn1">[1](#footnote1)</sup>, co-authored several papers<sup id="fn2">[2](#footnote2)</sup> about the topic (including in
226+
VLDB 2024<sup id="fn3">[3](#footnote3)</sup>), and spent several weeks<sup id="fn4">[4](#footnote4)</sup> deeply geeking out about this topic
227+
with other experts (thank you Dagstuhl).
228+
229+
[Mustafa Akur](https://www.linkedin.com/in/akurmustafa/) is a PhD Student at
230+
[OHSU](https://www.ohsu.edu/) Knight Cancer Institute and an [Apache
231+
DataFusion](https://datafusion.apache.org/) PMC member. He was previously a
232+
Software Developer at [Synnada](https://www.synnada.ai/) where he contributed
233+
significant features to the DataFusion optimizer, including many [sort-based
234+
optimizations](https://datafusion.apache.org/blog/2025/03/11/ordering-analysis/).
235+
236+
237+
## Notes
238+
239+
<a id="footnote1"></a><sup>[1]</sup> *Modular Query Optimizer*, US 8,312,027 · Issued Nov 13, 2012; *Query Optimizer with schema conversion*, US 8,086,598 · Issued Dec 27, 2011
240+
241+
<a id="footnote2"></a><sup>[2]</sup> [The Vertica Query Optimizer: The case for specialized Query Optimizers](https://www.researchgate.net/publication/269306314_The_Vertica_Query_Optimizer_The_case_for_specialized_query_optimizers)
242+
243+
<a id="footnote3"></a><sup>[3]</sup> [https://www.vldb.org/pvldb/vol17/p1350-justen.pdf](https://www.vldb.org/pvldb/vol17/p1350-justen.pdf)
244+
245+
<a id="footnote4"></a><sup>[4]</sup> [https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/24101](https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/24101), [https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/22111](https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/22111), [https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/12321](https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/12321)
246+
247+
<a id="footnote5"></a><sup>[5]</sup> And thus in academic classes, by the time you get around to the optimizer, the semester is nearly over and everyone is ready to be done. Once industrial systems mature to the point where the optimizer is a bottleneck, the shiny newness of the [hype cycle](https://en.wikipedia.org/wiki/Gartner_hype_cycle) has worn off and the system is likely in the trough of disillusionment.
248+
249+
<a id="footnote6"></a><sup>[6]</sup> Systems often classify these passes into different categories, but we are simplifying here.
250+
