@@ -132,32 +132,35 @@ impl RowGroupAccessPlanFilter {
132132 /// | +-----------------------------------+-----------------------------+ |
133133 /// +-----------------------------------------------------------------------+
134134 ///
135- /// # Example with Statistics Truncation and NOT Inversion
135+ /// ### Identification of Fully Matching Row Groups
136136 ///
137- /// When statistics are truncated to length 6 (e.g., `statistics_truncate_length = 6`),
138- /// the min/max values become:
137+ /// DataFusion identifies row groups where ALL rows satisfy the filter by inverting the
138+ /// predicate and checking whether the statistics prove that the inverted predicate is false for every row in the group.
139139 ///
140- /// ```text
141- /// Row group 3: species_min="Alpine", species_max="Alpine" (truncated from "Alpine Ibex"/"Alpine Sheep")
142- /// s_min=76, s_max=101
143- /// ```
140+ /// For example, prefix matches like `species LIKE 'Alpine%'` are pruned using ranges:
141+ /// 1. Candidate Range: `species >= 'Alpine' AND species < 'Alpinf'`
142+ /// 2. Inverted Condition (to prove full match): `species < 'Alpine' OR species >= 'Alpinf'`
143+ /// 3. Statistical Evaluation (check if any row *could* satisfy the inverted condition):
144+ /// `min < 'Alpine' OR max >= 'Alpinf'`
144145 ///
145- /// To identify this as fully matching, the system uses NOT inversion:
146- /// 1. Original predicate: `species LIKE 'Alpine%' AND s >= 50`
147- /// 2. Inverted predicate: `NOT (species LIKE 'Alpine%' AND s >= 50)`
148- /// Simplified to: `species NOT LIKE 'Alpine%' OR s < 50`
149- /// 3. Pruning predicate generated:
150- /// `(species_min NOT LIKE 'Alpine%' OR species_max NOT LIKE 'Alpine%') OR s_min < 50`
146+ /// If this evaluation is **false**, it proves no row can fail the original filter,
147+ /// so the row group is **FULLY MATCHING**.
151148 ///
152- /// For row group 3 with truncated stats:
153- /// - Evaluating `species_min NOT LIKE 'Alpine%'`: `"A" NOT LIKE 'Alpine%'` = `false`
154- /// - Evaluating `species_max NOT LIKE 'Alpine%'`: `"A" NOT LIKE 'Alpine%'` = `false`
155- /// - Evaluating `s_min < 50`: `76 < 50` = `false`
156- /// - Final result: `(false OR false) OR false` = `false`
149+ /// ### Impact of Statistics Truncation
157150 ///
158- /// Since the inverted predicate would prune this row group (returns false), it means
159- /// no rows in this group could possibly satisfy the inverted predicate.
160- /// Therefore, all rows in this group must match the original predicate, making it fully matched
151+ /// The precision of pruning depends on how precise the min/max statistics are. Truncated
152+ /// statistics may be too coarse to prove a full match.
153+ ///
154+ /// **Example**: `WHERE species LIKE 'Alpine%'` (Target range: `['Alpine', 'Alpinf')`)
155+ ///
156+ /// | Truncation Length | min / max | Inverted Evaluation | Status |
157+ /// |-------------------|---------------------|---------------------------------------------------------------------|------------------------|
158+ /// | **Length 6** | `Alpine` / `Alpine` | `"Alpine" < "Alpine" (F) OR "Alpine" >= "Alpinf" (F)` -> **false** | **FULLY MATCHING** |
159+ /// | **Length 3** | `Alp` / `Alq` | `"Alp" < "Alpine" (T) OR "Alq" >= "Alpinf" (T)` -> **true** | **PARTIALLY MATCHING** |
160+ ///
161+ /// Even though Row Group 3 contains only matching rows, truncation to length 3 makes
162+ /// the statistics `[Alp, Alq]` too broad to prove it (the range also admits "Alpha").
163+ /// The group is therefore conservatively treated as partially matching.
161164 ///
162165 /// Without limit pruning: Scan Partition 2 → Partition 3 → Partition 4 (until limit reached)
163166 /// With limit pruning: If Partition 3 contains enough rows to satisfy the limit,
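The inverted-range check described in the new doc comment can be sketched in a few lines of standalone Rust. This is an illustration only, not DataFusion's implementation: `prefix_upper_bound`, `classify_prefix_match`, and `MatchStatus` are hypothetical names, the sketch assumes an ASCII prefix, and it only tests whether a full match can be proven (it does not decide whether the group can be pruned outright).

```rust
/// Outcome of the full-match check (hypothetical type, named after the table's "Status" column).
#[derive(Debug, PartialEq)]
enum MatchStatus {
    FullyMatching,
    PartiallyMatching,
}

/// Hypothetical helper: exclusive upper bound of the range of strings that start with `prefix`,
/// obtained by incrementing the last byte ("Alpine" becomes "Alpinf").
/// Assumption: the last byte is ASCII and below 0x7f; otherwise no bound is returned.
fn prefix_upper_bound(prefix: &str) -> Option<String> {
    let mut bytes = prefix.as_bytes().to_vec();
    let last = bytes.last_mut()?; // empty prefix: no bound
    if !last.is_ascii() || *last == 0x7f {
        return None;
    }
    *last += 1;
    String::from_utf8(bytes).ok()
}

/// Evaluates the inverted condition `min < lower OR max >= upper` against the statistics.
/// If it is false, no row can fail `col LIKE 'prefix%'`, so the group is fully matching.
fn classify_prefix_match(min: &str, max: &str, prefix: &str) -> MatchStatus {
    let upper = match prefix_upper_bound(prefix) {
        Some(u) => u,
        // Without an upper bound nothing can be proven, so stay conservative.
        None => return MatchStatus::PartiallyMatching,
    };
    let inverted_may_hold = min < prefix || max >= upper.as_str();
    if inverted_may_hold {
        MatchStatus::PartiallyMatching
    } else {
        MatchStatus::FullyMatching
    }
}
```

With the statistics from the table, `classify_prefix_match("Alpine", "Alpine", "Alpine")` yields `FullyMatching`, while `classify_prefix_match("Alp", "Alq", "Alpine")` yields `PartiallyMatching`, because `"Alp" < "Alpine"` already satisfies the inverted condition.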