Commit 6c515b2

update doc
1 parent d2b84d4 commit 6c515b2

1 file changed

+24
-21
lines changed

datafusion/datasource-parquet/src/row_group_filter.rs

Lines changed: 24 additions & 21 deletions
@@ -132,32 +132,35 @@ impl RowGroupAccessPlanFilter {
  /// | +-----------------------------------+-----------------------------+ |
  /// +-----------------------------------------------------------------------+
  ///
- /// # Example with Statistics Truncation and NOT Inversion
+ /// ### Identification of Fully Matching Row Groups
  ///
- /// When statistics are truncated to length 6 (e.g., `statistics_truncate_length = 6`),
- /// the min/max values become:
+ /// DataFusion identifies row groups where ALL rows satisfy the filter by inverting the
+ /// predicate and checking if statistics prove the inverted version is false for the group.
  ///
- /// ```text
- /// Row group 3: species_min="Alpine", species_max="Alpine" (truncated from "Alpine Ibex"/"Alpine Sheep")
- /// s_min=76, s_max=101
- /// ```
+ /// For example, prefix matches like `species LIKE 'Alpine%'` are pruned using ranges:
+ /// 1. Candidate Range: `species >= 'Alpine' AND species < 'Alpinf'`
+ /// 2. Inverted Condition (to prove full match): `species < 'Alpine' OR species >= 'Alpinf'`
+ /// 3. Statistical Evaluation (check if any row *could* satisfy the inverted condition):
+ /// `min < 'Alpine' OR max >= 'Alpinf'`
  ///
- /// To identify this as fully matching, the system uses NOT inversion:
- /// 1. Original predicate: `species LIKE 'Alpine%' AND s >= 50`
- /// 2. Inverted predicate: `NOT (species LIKE 'Alpine%' AND s >= 50)`
- /// Simplified to: `species NOT LIKE 'Alpine%' OR s < 50`
- /// 3. Pruning predicate generated:
- /// `(species_min NOT LIKE 'Alpine%' OR species_max NOT LIKE 'Alpine%') OR s_min < 50`
+ /// If this evaluation is **false**, it proves no row can fail the original filter,
+ /// so the row group is **FULLY MATCHING**.
  ///
- /// For row group 3 with truncated stats:
- /// - Evaluating `species_min NOT LIKE 'Alpine%'`: `"A" NOT LIKE 'Alpine%'` = `false`
- /// - Evaluating `species_max NOT LIKE 'Alpine%'`: `"A" NOT LIKE 'Alpine%'` = `false`
- /// - Evaluating `s_min < 50`: `76 < 50` = `false`
- /// - Final result: `(false OR false) OR false` = `false`
+ /// ### Impact of Statistics Truncation
  ///
- /// Since the inverted predicate would prune this row group (returns false), it means
- /// no rows in this group could possibly satisfy the inverted predicate.
- /// Therefore, all rows in this group must match the original predicate, making it fully matched
+ /// The precision of pruning depends on the metadata quality. Truncated statistics
+ /// may prevent the system from proving a full match.
+ ///
+ /// **Example**: `WHERE species LIKE 'Alpine%'` (Target range: `['Alpine', 'Alpinf')`)
+ ///
+ /// | Truncation Length | min / max | Inverted Evaluation | Status |
+ /// |-------------------|---------------------|---------------------------------------------------------------------|------------------------|
+ /// | **Length 6** | `Alpine` / `Alpine` | `"Alpine" < "Alpine" (F) OR "Alpine" >= "Alpinf" (F)` -> **false** | **FULLY MATCHING** |
+ /// | **Length 3** | `Alp` / `Alq` | `"Alp" < "Alpine" (T) OR "Alq" >= "Alpinf" (T)` -> **true** | **PARTIALLY MATCHING** |
+ ///
+ /// Even though Row Group 3 only contains matching rows, truncation to length 3 makes
+ /// the statistics `[Alp, Alq]` too broad to prove it (they could include "Alpha").
+ /// The system must conservatively scan the group.
  ///
  /// Without limit pruning: Scan Partition 2 → Partition 3 → Partition 4 (until limit reached)
  /// With limit pruning: If Partition 3 contains enough rows to satisfy the limit,

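The range-based check described in the updated doc comment can be sketched in a few lines of Rust. This is a minimal illustration, not DataFusion's implementation: `prefix_upper_bound`, `classify_like_prefix`, and `MatchStatus` are hypothetical names, and the sketch assumes an ASCII prefix and byte-wise string ordering (which is how Rust compares `str` values).

```rust
/// Hypothetical result type for this sketch.
/// `PartiallyMatching` means only "a full match could not be proven".
#[derive(Debug, PartialEq)]
enum MatchStatus {
    FullyMatching,
    PartiallyMatching,
}

/// Hypothetical helper: exclusive upper bound of the range covered by
/// `prefix%`, built by incrementing the last byte ("Alpine" -> "Alpinf").
/// Returns `None` when the prefix is empty or its last byte is not an
/// ASCII byte below 0x7F, since this sketch does not handle those cases.
fn prefix_upper_bound(prefix: &str) -> Option<String> {
    let mut bytes = prefix.as_bytes().to_vec();
    let last = bytes.last_mut()?;
    if !last.is_ascii() || *last == 0x7F {
        return None;
    }
    *last += 1;
    String::from_utf8(bytes).ok()
}

/// Evaluates the inverted condition `min < prefix OR max >= upper` against
/// the row group's min/max statistics. If it is false, no row can fail
/// `col LIKE 'prefix%'`, so the group is fully matching.
fn classify_like_prefix(min: &str, max: &str, prefix: &str) -> MatchStatus {
    let upper = match prefix_upper_bound(prefix) {
        Some(u) => u,
        // No usable bound: stay conservative.
        None => return MatchStatus::PartiallyMatching,
    };
    let inverted_may_hold = min < prefix || max >= upper.as_str();
    if inverted_may_hold {
        MatchStatus::PartiallyMatching
    } else {
        MatchStatus::FullyMatching
    }
}
```

With these definitions, `classify_like_prefix("Alpine", "Alpine", "Alpine")` yields `FullyMatching`, while `classify_like_prefix("Alp", "Alq", "Alpine")` yields `PartiallyMatching`, mirroring the two rows of the truncation table.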