HIVE-29473: prevent combining stats between SELECT and LV fields by konstantinb · Pull Request #6331 · apache/hive

konstantinb · 2026-02-21T00:13:07Z

What changes were proposed in this pull request?

HIVE-29473: preventing stats override of select columns with 2+ LVs

This PR fixes a namespace collision in LateralViewJoinStatsRule.process() by enforcing strict parent operator boundaries when computing column statistics.

The problem in the existing code:

LateralViewJoinStatsRule passes identical columnExprMap and RowSchema references to StatsUtils.getColStatisticsFromExprMap() for both the SELECT and UDTF branches. Since both branches can have identically-named columns (_col0, _col1, etc.), the utility method incorrectly matches UDTF statistics against SELECT columns.
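The collision can be illustrated with a toy model. This is a minimal sketch, not Hive code: plain maps stand in for columnExprMap and the per-branch column statistics, `lookupAvgColLen` is a hypothetical stand-in for a by-name lookup, and the `_col0`/`_col1` mappings are assumptions. The 56.0 value is taken from the union26 example in the review comment on this PR.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model only: plain maps stand in for Hive's columnExprMap and branch stats.
class SharedExprMapCollision {

    // Hypothetical by-name lookup: each output column is resolved through exprMap
    // to a parent column name, which is then looked up in parentStats.
    static Map<String, Double> lookupAvgColLen(Map<String, String> exprMap,
                                               Map<String, Double> parentStats) {
        Map<String, Double> found = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : exprMap.entrySet()) {
            Double avgColLen = parentStats.get(e.getValue());
            if (avgColLen != null) {
                found.put(e.getKey(), avgColLen);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // One map shared by both branches (output name -> referenced parent column).
        Map<String, String> sharedExprMap = new LinkedHashMap<>();
        sharedExprMap.put("_col0", "_col0"); // SELECT column (assumed mapping)
        sharedExprMap.put("_col1", "_col1"); // SELECT column (assumed mapping)
        sharedExprMap.put("_col8", "col");   // UDTF output column

        // UDTF branch stats: its own _col0 is the array expression input.
        Map<String, Double> udtfStats = Map.of("_col0", 56.0);

        // The UDTF lookup yields an entry for _col0, although _col0 in the
        // join's output schema is a SELECT column.
        System.out.println(lookupAvgColLen(sharedExprMap, udtfStats)); // {_col0=56.0}
    }
}
```

In this model the UDTF lookup never asks which branch a column belongs to, so any name that exists on both sides is matched.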

The fix:

Split RowSchema into selectSchema and udtfSchema using SELECT_TAG/UDTF_TAG boundaries from the LateralViewJoinOperator's column internal names
Build separate selectExprMap and udtfExprMap by filtering the parent's columnExprMap to only include columns present in the respective schema
Pass isolated collections to getColStatisticsFromExprMap() for each branch, ensuring each branch only sees its own columns
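The three steps above can be sketched roughly as follows. This is a simplified illustration under stated assumptions rather than the code in this PR: the schema is a plain list of column names, the SELECT/UDTF boundary is given as a hypothetical `numSelColumns` index instead of being derived from SELECT_TAG/UDTF_TAG, and `filterBySchema` is an invented helper name.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified sketch: lists and maps of strings stand in for RowSchema and columnExprMap.
class LateralViewSchemaSplit {

    // Keep only the exprMap entries whose output column is in the given branch schema.
    static Map<String, String> filterBySchema(Map<String, String> exprMap,
                                              List<String> branchSchema) {
        Map<String, String> filtered = new LinkedHashMap<>();
        for (String col : branchSchema) {
            String expr = exprMap.get(col);
            if (expr != null) {
                filtered.put(col, expr);
            }
        }
        return filtered;
    }

    public static void main(String[] args) {
        List<String> fullSchema = List.of("_col0", "_col1", "_col8");
        int numSelColumns = 2; // assumed boundary between SELECT and UDTF columns

        // Step 1: split the schema at the branch boundary.
        List<String> selectSchema = fullSchema.subList(0, numSelColumns);
        List<String> udtfSchema = fullSchema.subList(numSelColumns, fullSchema.size());

        Map<String, String> exprMap = new LinkedHashMap<>();
        exprMap.put("_col0", "_col0");
        exprMap.put("_col1", "_col1");
        exprMap.put("_col8", "col");

        // Step 2: build one expression map per branch.
        Map<String, String> selectExprMap = filterBySchema(exprMap, selectSchema);
        Map<String, String> udtfExprMap = filterBySchema(exprMap, udtfSchema);

        // Step 3: each branch lookup would now receive only its own columns.
        System.out.println(selectExprMap.keySet()); // [_col0, _col1]
        System.out.println(udtfExprMap.keySet());   // [_col8]
    }
}
```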

Additional changes:

Added unit tests in TestLateralViewJoinStatsRule.java to verify namespace isolation
Added lvj_stats_isolation.q test file demonstrating the bug with a single lateral view
Updated .q.out files reflecting corrected statistics estimates

Why are the changes needed?

The bug causes the CBO to combine statistics of completely unrelated columns, leading to incorrect cardinality and data size estimates for downstream operators (Group By, Join, etc.).

When the collision occurs:

The UDTF branch always generates output columns starting from _col0, _col1, etc. The SELECT branch uses original column names in simple cases, but internal names (_col0, _col1) are assigned by:

ReduceSinkOperator (normalizes output columns)
GroupByOperator (outputs aggregated columns with internal names)
genInputSelectForUnion() in SemanticAnalyzer (forces column renaming for UNION queries)

When both branches have identically-named columns (e.g., both have _col0), StatsUtils.getColStatisticsFromExprMap() matches them incorrectly, combining statistics of unrelated columns.

Impact examples:

A Group By that should estimate 2 rows instead estimates 6, because _col0 resolves to the UDTF's expression (NDV=6) rather than the base table's column
Data size estimates can be inflated by orders of magnitude when UDTF's avgColLen overwrites SELECT's smaller values

These incorrect estimates cause the optimizer to choose suboptimal execution plans.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added a new .q file confirming the new logic
Verified by hand calculation that the updated estimates in preexisting .q.out files are more accurate than before
Ran extensive volume testing in a private Hive/Hadoop environment

new test ql/src/test/results/clientpositive/llap/lvj_stats_isolation.q.out
new unit test ql/src/test/org/apache/hadoop/hive/ql/optimizer/stats/annotation/TestLateralViewJoinStatsRule.java

…d corrected.out files

konstantinb · 2026-02-23T18:31:46Z

ql/src/test/results/clientpositive/llap/union26.q.out

                            expressions: _col0 (type: string), _col1 (type: string)
                            outputColumnNames: _col0, _col1
-                            Statistics: Num rows: 500 Data size: 115500 Basic stats: COMPLETE Column stats: COMPLETE
+                            Statistics: Num rows: 500 Data size: 89000 Basic stats: COMPLETE Column stats: COMPLETE


This is a typical example of LV column stats impacting the data size estimations of SELECT columns:

Column Naming

| Context | Column Name | Represents | avgColLen |
|---|---|---|---|
| LVJ output schema | _col0 | SELECT's key | 2.812 |
| LVJ output schema | _col1 | SELECT's value | 6.812 |
| LVJ output schema | _col8 | UDTF's exploded element | n/a |
| UDTF internal stats | _col0 | array expression input | 56.0 |

The UDTF branch's column generator restarts at 0, so its internal stats use _col0 for the array expression — colliding with SELECT's _col0.

Processing Comparison

| Step | Original Code | Proposed Fix |
|---|---|---|
| Expression Map | Shared: {_col0, _col1, _col8} | Split: SELECT {_col0, _col1}, UDTF {_col8} |
| Schema | Full: [_col0, _col1, _col8] | Split by numSelColumns |
| UDTF lookup for _col0 | Looks up _col0 in udtfStats → finds array's _col0 (56.0) | _col0 not in udtfExprMap → skipped |
| UDTF lookup for _col8 | _col8 → Column[col], not found in udtfStats | _col8 → Column[col], not found in udtfStats |
| Merge _col0 | MAX(2.812, 56.0) = 56.0 | No collision → 2.812 |
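The merge step in the comparison can be reproduced with a short sketch. It assumes, following the MAX(2.812, 56.0) entry in the comparison, that a name clash is resolved by keeping the larger avgColLen; `mergeMax` is a hypothetical helper and the maps are stand-ins for the real statistics objects.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the merge step only; values are the avgColLen figures from this comment.
class AvgColLenMerge {

    // Merge UDTF-side results into SELECT-side results, keeping the larger
    // avgColLen when both sides report the same column name (assumed behaviour).
    static Map<String, Double> mergeMax(Map<String, Double> selectStats,
                                        Map<String, Double> udtfStats) {
        Map<String, Double> merged = new LinkedHashMap<>(selectStats);
        udtfStats.forEach((name, len) -> merged.merge(name, len, Math::max));
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Double> selectStats = new LinkedHashMap<>();
        selectStats.put("_col0", 2.812);
        selectStats.put("_col1", 6.812);

        // Original code: the UDTF lookup wrongly returned an entry for _col0.
        System.out.println(mergeMax(selectStats, Map.of("_col0", 56.0))); // {_col0=56.0, _col1=6.812}

        // Proposed fix: the UDTF lookup returns nothing for _col0, so nothing clashes.
        System.out.println(mergeMax(selectStats, Map.of())); // {_col0=2.812, _col1=6.812}
    }
}
```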

Final Column Statistics

| Column | Original Code | Proposed Fix |
|---|---|---|
| _col0 avgColLen | 56.0 ✗ | 2.812 ✓ |
| _col1 avgColLen | 6.812 | 6.812 |
| Per-row total | 62.812 bytes | 9.624 bytes |

Data Size — LVJ Debug Output (500 rows)

| | Original Code | Proposed Fix |
|---|---|---|
| Calculation | 62.812 × 500 | 9.624 × 500 |
| Total | 31,406 bytes | 4,812 bytes |

Data Size — EXPLAIN Output (500 rows)

| Column | Original Code | Proposed Fix |
|---|---|---|
| key avgColLen | 140 ✗ | 87 ✓ |
| value avgColLen | 91 | 91 |
| Per-row total | 231 bytes | 178 bytes |

| | Original Code | Proposed Fix |
|---|---|---|
| Calculation | 231 × 500 | 178 × 500 |
| Total | 115,500 bytes | 89,000 bytes |
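The arithmetic behind both data size comparisons can be checked with a few lines. This sketch models data size as the sum of per-column average lengths multiplied by the row count, which is how the figures in this comment are derived; `dataSize` is a hypothetical helper, not a Hive method.

```java
// Arithmetic check only: data size = (sum of per-column avgColLen) x number of rows.
class DataSizeEstimate {

    static long dataSize(long numRows, double... avgColLens) {
        double perRow = 0.0;
        for (double len : avgColLens) {
            perRow += len;
        }
        // Round to whole bytes to absorb floating-point noise in the fractional lengths.
        return Math.round(perRow * numRows);
    }

    public static void main(String[] args) {
        // LVJ debug output, 500 rows
        System.out.println(dataSize(500, 56.0, 6.812));  // original code: 31406
        System.out.println(dataSize(500, 2.812, 6.812)); // proposed fix: 4812

        // EXPLAIN output, 500 rows
        System.out.println(dataSize(500, 140, 91)); // original code: 115500
        System.out.println(dataSize(500, 87, 91));  // proposed fix: 89000
    }
}
```

The last two results match the Data size values on the removed and added lines of the union26.q.out diff.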

konstantinb · 2026-02-23T21:06:41Z