---
layout: post
title: "Turning LIMIT into an I/O Optimization: Inside DataFusion's Limit Pruning"
date: 2026-03-10
author: xudong
categories: [features]
---
<!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
-->

[TOC]

<style>
figure {
  margin: 20px 0;
}

figure img {
  display: block;
  max-width: 80%;
  margin: auto;
}

figcaption {
  font-style: italic;
  color: #555;
  font-size: 0.9em;
  max-width: 80%;
  margin: auto;
  text-align: center;
}
</style>

*Xudong Wang, [Massive](https://www.massive.com/)*

Reading data efficiently means touching as little data as possible. The fastest I/O is the I/O you never make. This sounds obvious, but making it happen in practice requires careful engineering at every layer of the query engine. [Apache DataFusion] achieves this through a multi-layer **pruning pipeline** — a series of stages that progressively narrow down the data before decoding a single row.

In this post, we describe a new optimization called **limit pruning** that makes this pipeline aware of SQL `LIMIT` clauses. By identifying row groups where *every* row is guaranteed to match the predicate, DataFusion can satisfy a `LIMIT` query without ever touching partially matching row groups — eliminating wasted I/O entirely.

This work was inspired by the "Pruning for LIMIT Queries" section of Snowflake's paper [*Pruning in Snowflake: Working Smarter, Not Harder*](https://arxiv.org/pdf/2504.11540).

## DataFusion's Pruning Pipeline

Before diving into limit pruning, let's understand the full pruning pipeline. DataFusion scans Parquet data through a series of increasingly fine-grained filters, each one eliminating data so the next stage processes less:

<figure>
<img src="/blog/images/limit-pruning/pruning-phases.svg" width="80%" alt="Three phases of DataFusion's pruning pipeline"/>
<figcaption>Figure 1: The three phases of DataFusion's pruning pipeline — from directories down to individual rows.</figcaption>
</figure>

### Phase 1: High-Level Discovery

- **Partition Pruning**: The `ListingTable` component evaluates filters that depend only on partition columns — things like `year`, `month`, or `region` encoded in directory paths (e.g., `s3://data/year=2024/month=01/`). Irrelevant directories are eliminated before we even open a file.
- **File Stats Pruning**: The `FilePruner` checks file-level min/max and null-count statistics. If these statistics prove that a file cannot satisfy the predicate, we drop it entirely — no need to read row group metadata.

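As a concrete illustration of the file-stats check, here is a minimal sketch for simple range predicates; the struct and function names are invented for this example and are not DataFusion's API:

```rust
/// Illustrative file-level statistics for a single column (not DataFusion's API).
struct ColumnStats {
    min: i64,
    max: i64,
}

/// For a predicate like `s >= threshold`, a file whose maximum value is below
/// the threshold provably contains no matching row, so it can be dropped
/// without even reading its row group metadata.
fn can_skip_file(stats: &ColumnStats, threshold: i64) -> bool {
    stats.max < threshold
}

/// Symmetrically, for `s <= threshold` a file can be skipped when even its
/// minimum value exceeds the threshold.
fn can_skip_file_le(stats: &ColumnStats, threshold: i64) -> bool {
    stats.min > threshold
}

fn main() {
    // A file where `s` spans 7..=42 cannot contain a row with s >= 50.
    assert!(can_skip_file(&ColumnStats { min: 7, max: 42 }, 50));
    // A file where `s` spans 7..=133 might, so it must be kept.
    assert!(!can_skip_file(&ColumnStats { min: 7, max: 133 }, 50));
}
```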
### Phase 2: Row Group Statistics

For each surviving file, DataFusion reads row group metadata and classifies each row group into one of three states:

<figure>
<img src="/blog/images/limit-pruning/row-group-states.svg" width="80%" alt="Row group classification: not matching, partially matching, fully matching"/>
<figcaption>Figure 2: Row groups are classified into three states based on their statistics.</figcaption>
</figure>

- **Not Matching (Skipped)**: Statistics prove no rows can match. The row group is ignored completely.
- **Partially Matching**: Statistics cannot rule out matching rows, but also cannot guarantee them. These groups might be scanned and verified row by row later.
- **Fully Matching**: Statistics prove that *every single row* in the group satisfies the predicate. This state is key to making limit pruning possible.

Additionally, **bloom filters** can eliminate row groups for equality and `IN`-list predicates at this stage.

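For a simple range predicate such as `s >= 50`, the three states fall directly out of a min/max comparison. A minimal sketch (the enum and helper below are illustrative, not DataFusion's actual types):

```rust
/// The three pruning states from Figure 2 (illustrative, not DataFusion's types).
#[derive(Debug, PartialEq)]
enum MatchState {
    NotMatching,
    PartiallyMatching,
    FullyMatching,
}

/// Classify a row group for the predicate `col >= threshold` using only
/// its min/max statistics.
fn classify(min: i64, max: i64, threshold: i64) -> MatchState {
    if max < threshold {
        MatchState::NotMatching // even the largest value fails the predicate
    } else if min >= threshold {
        MatchState::FullyMatching // even the smallest value passes
    } else {
        MatchState::PartiallyMatching // statistics alone cannot decide
    }
}

fn main() {
    assert_eq!(classify(7, 42, 50), MatchState::NotMatching);
    assert_eq!(classify(6, 71, 50), MatchState::PartiallyMatching);
    assert_eq!(classify(76, 101, 50), MatchState::FullyMatching);
}
```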
### Phase 3: Granular Pruning

The final phase goes even deeper:

- **Page Index Pruning**: Parquet pages have their own min/max statistics. DataFusion uses these to skip individual data pages within a surviving row group.
- **Late Materialization (Row Filtering)**: Instead of decoding all columns at once, DataFusion decodes the cheapest, most selective columns first. It filters rows using those columns, then only decodes the remaining columns for surviving rows.

## The Problem: LIMIT Was Ignored

Before limit pruning, all of these stages worked well — but the pruning pipeline had **no awareness of `LIMIT`**. Consider a query like:

```sql
SELECT * FROM tracking_data
WHERE species LIKE 'Alpine%' AND s >= 50
LIMIT 3
```

Even when fully matched row groups alone contain enough rows to satisfy the `LIMIT`, the scan would still visit partially matching groups — decoding data that might contribute zero qualifying rows.

<figure>
<img src="/blog/images/limit-pruning/wasted-io.svg" width="80%" alt="Traditional pruning decodes partially matching groups with no LIMIT awareness"/>
<figcaption>Figure 3: Without limit awareness, partially matching groups are scanned even when fully matched groups already have enough rows.</figcaption>
</figure>

If a fully matched group already contains the five rows needed for `LIMIT 5`, why bother decoding groups where we're not even sure any rows qualify?

## The Solution: Limit-Aware Pruning

The solution adds a new step in the pruning pipeline — right after row group pruning and before page index pruning:

<figure>
<img src="/blog/images/limit-pruning/pruning-pipeline.svg" width="80%" alt="Pruning pipeline with limit pruning highlighted"/>
<figcaption>Figure 4: Limit pruning is inserted between row group and page index pruning.</figcaption>
</figure>

The idea is simple: **if fully matched row groups already contain enough rows to satisfy the `LIMIT`, rewrite the access plan to scan only those groups and skip everything else.**

This optimization is applied only when the query is a pure limit query with no `ORDER BY`: when any `limit` matching rows are an acceptable answer, skipping row groups is safe, but an `ORDER BY` would require examining all matching rows to decide which ones belong in the result. In the implementation, this check is expressed as:

```rust
// Prune by limit if a limit is set and ordering is not sensitive
if let (Some(limit), false) = (limit, preserve_order) {
    row_groups.prune_by_limit(limit, rg_metadata, &file_metrics);
}
```

## Mechanism: Detecting Fully Matched Row Groups

The core insight is **predicate negation**. To determine if every row in a row group satisfies the predicate, we:

1. Negate the original predicate
2. Simplify the negated expression
3. Evaluate the negation against the row group's statistics
4. If the negation is *pruned* (proven impossible), then the original predicate holds for every row

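To make the steps above concrete for one predicate shape: for `s >= t` the simplified negation is `s < t`, and a row group's statistics can prove that negation impossible. A toy sketch with hypothetical helpers (not DataFusion's API):

```rust
/// Steps 1–2: the simplified negation of `s >= t` is `s < t`.
/// Step 3: the negation can only match a row group if some value below `t`
/// could exist there, i.e. if the group's minimum is below `t`.
fn negation_can_match(min: i64, t: i64) -> bool {
    min < t
}

/// Step 4: if the negation is pruned (cannot match), every row in the
/// group must satisfy the original predicate `s >= t`.
fn is_fully_matched(min: i64, t: i64) -> bool {
    !negation_can_match(min, t)
}

fn main() {
    // Row group with s in 76..=101: NOT(s >= 50) is impossible, so fully matched.
    assert!(is_fully_matched(76, 50));
    // Row group with s in 6..=71: some rows may fail s >= 50, not fully matched.
    assert!(!is_fully_matched(6, 50));
}
```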
<figure>
<img src="/blog/images/limit-pruning/fully-matched-detection.svg" width="80%" alt="Fully matched detection via predicate negation"/>
<figcaption>Figure 5: If the negated predicate is impossible according to row group stats, all rows must match the original predicate.</figcaption>
</figure>

In DataFusion's codebase, this logic lives in `identify_fully_matched_row_groups` ([row_group_filter.rs]):

```rust
fn identify_fully_matched_row_groups(
    &mut self,
    candidate_row_group_indices: &[usize],
    arrow_schema: &Schema,
    parquet_schema: &SchemaDescriptor,
    groups: &[RowGroupMetaData],
    predicate: &PruningPredicate,
    metrics: &ParquetFileMetrics,
) {
    // Create the inverted predicate: NOT(original)
    let inverted_expr = Arc::new(NotExpr::new(
        Arc::clone(predicate.orig_expr()),
    ));

    // Simplify: e.g., NOT(c1 = 0) → c1 != 0
    let simplifier = PhysicalExprSimplifier::new(arrow_schema);
    let Ok(inverted_expr) = simplifier.simplify(inverted_expr) else {
        return;
    };

    let Ok(inverted_predicate) = PruningPredicate::try_new(
        inverted_expr,
        Arc::clone(predicate.schema()),
    ) else {
        return;
    };

    // Evaluate the inverted predicate against row group stats
    // (construction of `inverted_pruning_stats` from `groups` elided here)
    let Ok(inverted_values) =
        inverted_predicate.prune(&inverted_pruning_stats)
    else {
        return;
    };

    for (i, &original_idx) in
        candidate_row_group_indices.iter().enumerate()
    {
        // If the negation is pruned (false), all rows match the original
        if !inverted_values[i] {
            self.is_fully_matched[original_idx] = true;
        }
    }
}
```

## Mechanism: Rewriting the Access Plan

Once we know which row groups are fully matched, the limit pruning algorithm is straightforward:

<figure>
<img src="/blog/images/limit-pruning/limit-rewrite-algorithm.svg" width="80%" alt="Limit pruning access plan rewrite algorithm"/>
<figcaption>Figure 6: The algorithm iterates over fully matched groups, accumulating row counts until the limit is satisfied.</figcaption>
</figure>

The implementation in `prune_by_limit` ([row_group_filter.rs]):

```rust
pub fn prune_by_limit(
    &mut self,
    limit: usize,
    rg_metadata: &[RowGroupMetaData],
    metrics: &ParquetFileMetrics,
) {
    let mut fully_matched_indexes: Vec<usize> = Vec::new();
    let mut fully_matched_rows: usize = 0;

    for &idx in self.access_plan.row_group_indexes().iter() {
        if self.is_fully_matched[idx] {
            fully_matched_indexes.push(idx);
            fully_matched_rows += rg_metadata[idx].num_rows() as usize;
            if fully_matched_rows >= limit {
                break;
            }
        }
    }

    // Rewrite the plan only if we have enough rows
    if fully_matched_rows >= limit {
        let mut new_plan = ParquetAccessPlan::new_none(rg_metadata.len());
        for &idx in &fully_matched_indexes {
            new_plan.scan(idx);
        }
        self.access_plan = new_plan;
    }
}
```

Key properties of this algorithm:

- It preserves the original row group ordering
- If fully matched groups don't have enough rows, the plan is left unchanged — no harm done
- The cost is minimal: a single pass over the row group list

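The accumulate-then-rewrite logic can be exercised in isolation. Below is a simplified, self-contained variant that operates on plain vectors instead of `ParquetAccessPlan`; it is a sketch of the algorithm, not the actual implementation:

```rust
/// Simplified limit pruning: given the current scan plan (row group indexes),
/// per-group row counts, and per-group fully-matched flags, return the new plan.
fn prune_by_limit(
    plan: &[usize],
    row_counts: &[usize],
    fully_matched: &[bool],
    limit: usize,
) -> Vec<usize> {
    let mut picked = Vec::new();
    let mut rows = 0;

    // Accumulate fully matched groups in plan order until the limit is covered.
    for &idx in plan {
        if fully_matched[idx] {
            picked.push(idx);
            rows += row_counts[idx];
            if rows >= limit {
                break;
            }
        }
    }

    // Rewrite only when the fully matched groups alone can satisfy the limit;
    // otherwise the original plan is kept unchanged.
    if rows >= limit {
        picked
    } else {
        plan.to_vec()
    }
}

fn main() {
    let row_counts = [100, 100, 3, 100];
    let fully_matched = [false, false, true, false];
    // Group 0 was already pruned by statistics; group 2 alone covers LIMIT 3.
    assert_eq!(prune_by_limit(&[1, 2, 3], &row_counts, &fully_matched, 3), vec![2]);
    // LIMIT 10 cannot be covered by fully matched groups: plan unchanged.
    assert_eq!(prune_by_limit(&[1, 2, 3], &row_counts, &fully_matched, 10), vec![1, 2, 3]);
}
```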
## Case Study: Alpine Wildlife Query

Let's walk through a concrete example. Given a wildlife tracking dataset with four row groups:

```sql
SELECT * FROM tracking_data
WHERE species LIKE 'Alpine%' AND s >= 50
LIMIT 3
```

| Row Group | Species Range | S Range | State |
|-----------|--------------|---------|-------|
| RG1 | Snow Vole, Brown Bear, Gray Wolf | 7–133 | **Not Matching** (no 'Alpine%') |
| RG2 | Lynx, Red Fox, Alpine Bat | 6–71 | **Partially Matching** |
| RG3 | Alpine Ibex, Alpine Goat, Alpine Sheep | 76–101 | **Fully Matching** |
| RG4 | Mixed species | Mixed | **Partially Matching** |

<figure>
<img src="/blog/images/limit-pruning/before-after.svg" width="80%" alt="Before and after limit pruning comparison"/>
<figcaption>Figure 7: Before limit pruning, RG2 is scanned for zero hits. After limit pruning, only RG3 is scanned.</figcaption>
</figure>

**Before limit pruning**: DataFusion scans RG2 (0 hits — wasted I/O), then RG3 (3 hits, early return). RG2 was decoded entirely for nothing.

**With limit pruning**: The system detects that RG3 has 3 fully matched rows, which satisfies `LIMIT 3`. It rewrites the access plan to scan only RG3, skipping RG2 and RG4 entirely. One row group scanned. Zero waste.

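The RG3 classification can be checked mechanically. For string statistics, any value lying between two bounds that share a prefix must itself carry that prefix, so `species LIKE 'Alpine%'` is provably true for a whole group when both of its string bounds start with `Alpine`. A sketch using the table's values (hypothetical helper, not DataFusion code):

```rust
/// A row group is fully matched for `species LIKE 'Alpine%' AND s >= 50` when
/// both string bounds carry the prefix (any value between two strings sharing
/// a prefix also shares it) and even the smallest `s` passes the threshold.
fn fully_matched(
    species_min: &str,
    species_max: &str,
    s_min: i64,
    prefix: &str,
    s_threshold: i64,
) -> bool {
    species_min.starts_with(prefix) && species_max.starts_with(prefix) && s_min >= s_threshold
}

fn main() {
    // RG3: species bounds ["Alpine Goat", "Alpine Sheep"], s in 76..=101 — fully matched.
    assert!(fully_matched("Alpine Goat", "Alpine Sheep", 76, "Alpine", 50));
    // RG2: species bounds ["Alpine Bat", "Red Fox"], s in 6..=71 — not fully matched.
    assert!(!fully_matched("Alpine Bat", "Red Fox", 6, "Alpine", 50));
}
```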
## Observing Limit Pruning via Metrics

DataFusion exposes limit pruning activity through query metrics. When running a query with `EXPLAIN ANALYZE`, you will see entries like:

```
row_groups_pruned_statistics=4 total → 3 matched → 1 fully matched
limit_pruned_row_groups=2 total → 0 matched
```

This tells us:

- 4 row groups were evaluated, 3 survived statistics pruning, and 1 was identified as fully matching
- 2 additional row groups were pruned by the limit optimization

## Future Directions

There are two natural extensions of this work:

**Page-Level Limit Pruning**: Today, "fully matched" detection operates at the row group level. If we extend this to use page index statistics, we could stop decoding pages *within* a row group once the limit is met. This would pay dividends for wide row groups where only a few pages hold matching data.

**Row Filter Hints**: Even when a row group is fully matched, the current row filter still evaluates predicates row by row. If we pass the fully matched group information into the row filter builder, we can skip predicate evaluation entirely for guaranteed groups — saving CPU cycles.

## Summary

DataFusion's pruning pipeline trims redundant I/O from the partition level all the way down to individual rows. Limit pruning adds a new step that creates an early exit when fully matched row groups already satisfy the `LIMIT`. The result is fewer row groups scanned, less data decoded, and faster queries.

The key insights are:

1. **Predicate negation** can identify row groups where *all* rows match — not just "some might match"
2. **Row count accumulation** across fully matched groups enables early termination

[Apache DataFusion]: https://datafusion.apache.org/
[row_group_filter.rs]: https://github.com/apache/datafusion/blob/main/datafusion/datasource-parquet/src/row_group_filter.rs