HIVE-29473: prevent combining stats between SELECT and LV fields #6331
konstantinb wants to merge 3 commits into apache:master
…d corrected.out files
| expressions: _col0 (type: string), _col1 (type: string) | ||
| outputColumnNames: _col0, _col1 | ||
| Statistics: Num rows: 500 Data size: 115500 Basic stats: COMPLETE Column stats: COMPLETE | ||
| Statistics: Num rows: 500 Data size: 89000 Basic stats: COMPLETE Column stats: COMPLETE |
This is a typical example of LV column stats impacting the data size estimations of SELECT columns:
Column Naming
| Context | Column Name | Represents | avgColLen |
|---|---|---|---|
| LVJ output schema | _col0 | SELECT's key | 2.812 |
| LVJ output schema | _col1 | SELECT's value | 6.812 |
| LVJ output schema | _col8 | UDTF's exploded element | — |
| UDTF internal stats | _col0 | array expression input | 56.0 |
The UDTF branch's column generator restarts at 0, so its internal stats use _col0 for the array expression — colliding with SELECT's _col0.
Processing Comparison
| Step | Original Code | Proposed Fix |
|---|---|---|
| Expression Map | Shared: {_col0, _col1, _col8} | Split: SELECT {_col0, _col1}, UDTF {_col8} |
| Schema | Full: [_col0, _col1, _col8] | Split by numSelColumns |
| UDTF lookup for _col0 | Looks up _col0 in udtfStats → finds array's _col0 (56.0) | _col0 not in udtfExprMap → skipped |
| UDTF lookup for _col8 | _col8 → Column[col], not found in udtfStats | _col8 → Column[col], not found in udtfStats |
| Merge _col0 | MAX(2.812, 56.0) = 56.0 | No collision → 2.812 |
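The comparison above can be reproduced with a small standalone sketch. It is only an illustration of the name collision: `LvjStatsCollisionSketch`, `merge`, and the plain `Map<String, Double>` standing in for per-column avgColLen are hypothetical and are not Hive's actual classes or the code in this PR. The only thing that changes between the two calls is the set of output names the UDTF lookup is allowed to see.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the stats merge; not Hive's real API.
class LvjStatsCollisionSketch {

    // Start from the SELECT branch's avgColLen values, then for every output
    // name visible to the UDTF lookup, take MAX with a UDTF stat of that name.
    static Map<String, Double> merge(Map<String, Double> selectStats,
                                     Map<String, Double> udtfStats,
                                     List<String> udtfVisibleNames) {
        Map<String, Double> merged = new LinkedHashMap<>(selectStats);
        for (String name : udtfVisibleNames) {
            Double udtfLen = udtfStats.get(name);
            if (udtfLen != null) {
                merged.merge(name, udtfLen, Double::max);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Double> selectStats = new LinkedHashMap<>();
        selectStats.put("_col0", 2.812); // SELECT's key
        selectStats.put("_col1", 6.812); // SELECT's value
        // UDTF internal stats: _col0 is the array expression input.
        Map<String, Double> udtfStats = Map.of("_col0", 56.0);

        // Shared expression map: the UDTF lookup also sees _col0 and _col1.
        System.out.println(merge(selectStats, udtfStats, List.of("_col0", "_col1", "_col8")));
        // Split expression map: the UDTF lookup only sees _col8.
        System.out.println(merge(selectStats, udtfStats, List.of("_col8")));
    }
}
```

With the shared name list, `_col0` picks up 56.0; with the split list it keeps 2.812, matching the "Merge _col0" row.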
Final Column Statistics
| Column | Original Code | Proposed Fix |
|---|---|---|
| _col0 avgColLen | 56.0 ✗ | 2.812 ✓ |
| _col1 avgColLen | 6.812 | 6.812 |
| Per-row total | 62.812 bytes | 9.624 bytes |
Data Size — LVJ Debug Output (500 rows)
| | Original Code | Proposed Fix |
|---|---|---|
| Calculation | 62.812 × 500 | 9.624 × 500 |
| Total | 31,406 bytes | 4,812 bytes |
Data Size — EXPLAIN Output (500 rows)
| Column | Original Code | Proposed Fix |
|---|---|---|
| key avgColLen | 140 ✗ | 87 ✓ |
| value avgColLen | 91 | 91 |
| Per-row total | 231 bytes | 178 bytes |
| | Original Code | Proposed Fix |
|---|---|---|
| Calculation | 231 × 500 | 178 × 500 |
| Total | 115,500 bytes | 89,000 bytes |
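The totals in the tables above follow from multiplying the per-row size (the sum of the avgColLen values) by the row count. A minimal sketch of that arithmetic, assuming simple rounding to whole bytes (the helper name `dataSize` is hypothetical, and Hive's own computation may round or add overhead differently):

```java
// Hypothetical helper that re-derives the totals quoted in the tables.
class LvjDataSizeSketch {

    // Data size = numRows * sum of per-column average lengths, rounded to bytes.
    static long dataSize(long numRows, double... avgColLens) {
        double perRow = 0.0;
        for (double len : avgColLens) {
            perRow += len;
        }
        return Math.round(perRow * numRows);
    }

    public static void main(String[] args) {
        // LVJ debug output, 500 rows
        System.out.println(dataSize(500, 56.0, 6.812));  // original code
        System.out.println(dataSize(500, 2.812, 6.812)); // proposed fix
        // EXPLAIN output, 500 rows
        System.out.println(dataSize(500, 140, 91));      // original code
        System.out.println(dataSize(500, 87, 91));       // proposed fix
    }
}
```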
| minReductionHashAggr: 0.4 | ||
| mode: hash | ||
| outputColumnNames: _col0, _col1, _col2 | ||
| Statistics: Num rows: 6 Data size: 1074 Basic stats: COMPLETE Column stats: COMPLETE |
original code output:
Statistics: Num rows: 6 Data size: 12588 Basic stats: COMPLETE Column stats: COMPLETE
| null sort order: zz | ||
| sort order: ++ | ||
| Map-reduce partition columns: _col0 (type: string), _col1 (type: string) | ||
| Statistics: Num rows: 6 Data size: 1074 Basic stats: COMPLETE Column stats: COMPLETE |
original code output:
Statistics: Num rows: 6 Data size: 12588 Basic stats: COMPLETE Column stats: COMPLETE
| minReductionHashAggr: 0.4 | ||
| mode: hash | ||
| outputColumnNames: _col0, _col1, _col2 | ||
| Statistics: Num rows: 6 Data size: 1074 Basic stats: COMPLETE Column stats: COMPLETE |
original code output:
Statistics: Num rows: 6 Data size: 12588 Basic stats: COMPLETE Column stats: COMPLETE
| null sort order: zz | ||
| sort order: ++ | ||
| Map-reduce partition columns: _col0 (type: string), _col1 (type: string) | ||
| Statistics: Num rows: 6 Data size: 1074 Basic stats: COMPLETE Column stats: COMPLETE |
original code output:
Statistics: Num rows: 6 Data size: 12588 Basic stats: COMPLETE Column stats: COMPLETE
| keys: KEY._col0 (type: string), KEY._col1 (type: string) | ||
| mode: mergepartial | ||
| outputColumnNames: _col0, _col1, _col2 | ||
| Statistics: Num rows: 6 Data size: 1074 Basic stats: COMPLETE Column stats: COMPLETE |
original code output:
Statistics: Num rows: 6 Data size: 12588 Basic stats: COMPLETE Column stats: COMPLETE
| Statistics: Num rows: 6 Data size: 1074 Basic stats: COMPLETE Column stats: COMPLETE | ||
| File Output Operator | ||
| compressed: false | ||
| Statistics: Num rows: 6 Data size: 1074 Basic stats: COMPLETE Column stats: COMPLETE |
original code output:
Statistics: Num rows: 6 Data size: 12588 Basic stats: COMPLETE Column stats: COMPLETE
| Group By Operator | ||
| aggregations: count() | ||
| keys: _col0 (type: string) | ||
| minReductionHashAggr: 0.99 |
original code output:
minReductionHashAggr: 0.4
| minReductionHashAggr: 0.99 | ||
| mode: hash | ||
| outputColumnNames: _col0, _col1 | ||
| Statistics: Num rows: 3 Data size: 279 Basic stats: COMPLETE Column stats: COMPLETE |
original code output:
Statistics: Num rows: 6 Data size: 12072 Basic stats: COMPLETE Column stats: COMPLETE
| null sort order: z | ||
| sort order: + | ||
| Map-reduce partition columns: _col0 (type: string) | ||
| Statistics: Num rows: 3 Data size: 279 Basic stats: COMPLETE Column stats: COMPLETE |
original code output:
Statistics: Num rows: 6 Data size: 12072 Basic stats: COMPLETE Column stats: COMPLETE
| Group By Operator | ||
| aggregations: count() | ||
| keys: _col0 (type: string) | ||
| minReductionHashAggr: 0.99 |
original code output:
minReductionHashAggr: 0.4
| minReductionHashAggr: 0.99 | ||
| mode: hash | ||
| outputColumnNames: _col0, _col1 | ||
| Statistics: Num rows: 3 Data size: 279 Basic stats: COMPLETE Column stats: COMPLETE |
original code output:
Statistics: Num rows: 6 Data size: 12072 Basic stats: COMPLETE Column stats: COMPLETE
| null sort order: z | ||
| sort order: + | ||
| Map-reduce partition columns: _col0 (type: string) | ||
| Statistics: Num rows: 3 Data size: 279 Basic stats: COMPLETE Column stats: COMPLETE |
original code output:
Statistics: Num rows: 6 Data size: 12072 Basic stats: COMPLETE Column stats: COMPLETE
| keys: KEY._col0 (type: string) | ||
| mode: mergepartial | ||
| outputColumnNames: _col0, _col1 | ||
| Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE |
original code output:
Statistics: Num rows: 6 Data size: 12072 Basic stats: COMPLETE Column stats: COMPLETE
| Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE | ||
| File Output Operator | ||
| compressed: false | ||
| Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE |
original code output:
Statistics: Num rows: 6 Data size: 12072 Basic stats: COMPLETE Column stats: COMPLETE
Original output for the new .q file: lvj_stats_isolation.q.out.txt



What changes were proposed in this pull request?
HIVE-29473: preventing stats override of select columns with 2+ LVs
This PR fixes a column-name namespace collision in LateralViewJoinStatsRule.process() by enforcing strict parent operator boundaries when computing column statistics.
The problem in the existing code:
LateralViewJoinStatsRule passes identical columnExprMap and RowSchema references to StatsUtils.getColStatisticsFromExprMap() for both the SELECT and UDTF branches. Since both branches can have identically-named columns (_col0, _col1, etc.), the utility method incorrectly matches UDTF statistics against SELECT columns.
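One way to picture keeping the two branches apart is to partition the join's output column names at the SELECT column count before any stats lookup. The sketch below assumes the SELECT columns come first in the output list, followed by the UDTF columns; the class and method names are hypothetical and do not reproduce the code in this PR.

```java
import java.util.List;

// Hypothetical illustration of partitioning LVJ output columns by branch.
class LvjSchemaSplitSketch {

    // Assumption: the first numSelColumns output names belong to the SELECT branch.
    static List<String> selectColumns(List<String> outputCols, int numSelColumns) {
        return outputCols.subList(0, numSelColumns);
    }

    // Assumption: everything after the SELECT columns belongs to the UDTF branch.
    static List<String> udtfColumns(List<String> outputCols, int numSelColumns) {
        return outputCols.subList(numSelColumns, outputCols.size());
    }

    public static void main(String[] args) {
        List<String> outputCols = List.of("_col0", "_col1", "_col8");
        System.out.println(selectColumns(outputCols, 2)); // names resolved against SELECT stats only
        System.out.println(udtfColumns(outputCols, 2));   // names resolved against UDTF stats only
    }
}
```

Each sub-list would then be resolved only against its own parent's statistics, so a UDTF-internal `_col0` can no longer be matched to the SELECT branch's `_col0`.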
The fix:
Additional changes:
Why are the changes needed?
The bug causes the CBO to combine statistics of completely unrelated columns, leading to incorrect cardinality and data size estimates for downstream operators (Group By, Join, etc.).
When the collision occurs:
The UDTF branch always generates output columns starting from _col0, _col1, etc. The SELECT branch uses original column names in simple cases, but internal names (_col0, _col1) are assigned by:
When both branches have identically-named columns (e.g., both have _col0), StatsUtils.getColStatisticsFromExprMap() matches them incorrectly, combining statistics of unrelated columns.
Impact examples:
These incorrect estimates cause the optimizer to choose suboptimal execution plans.
Does this PR introduce any user-facing change?
No
How was this patch tested?