HIVE-29637: Incorrect Results for NULL Predicate on Partition Column#6515
HIVE-29637: Incorrect Results for NULL Predicate on Partition Column#6515deniskuzZ wants to merge 3 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
Fixes incorrect results for WHERE <partition_col> IS NULL on Iceberg tables by ensuring null partition values are represented as actual Java null during partition predicate evaluation, and adds a regression qtest to cover the scenario.
Changes:
- Add
IcebergTableUtil.makeSpecFromName(...)to build a partition spec map from an Iceberg partition path while translating Iceberg null partition values to Javanull. - Update
HiveIcebergStorageHandler#getPartitionsByExprto use the new helper when constructingDummyPartitionspecs for expression evaluation. - Add a new positive qtest (
iceberg_isnull_partition_pruning.q+ expected output) validatingIS NULLpartition pruning behavior.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/IcebergTableUtil.java |
Adds a helper for parsing Iceberg partition paths into a spec map with correct null handling. |
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java |
Switches partition spec construction in getPartitionsByExpr to the new helper to fix IS NULL results. |
iceberg/iceberg-handler/src/test/queries/positive/iceberg_isnull_partition_pruning.q |
Adds a regression query file covering WHERE str_col IS NULL on an Iceberg-partitioned table. |
iceberg/iceberg-handler/src/test/results/positive/iceberg_isnull_partition_pruning.q.out |
Adds the expected output for the new regression test. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
zabetak
left a comment
There was a problem hiding this comment.
Left some high level comments/questions cause I am not very familiar with the modified APIs.
| String partName = spec.partitionToPath(partitionData); | ||
|
|
||
| Map<String, String> partSpecMap = Maps.newLinkedHashMap(); | ||
| Warehouse.makeSpecFromName(partSpecMap, new Path(partName), null); |
There was a problem hiding this comment.
There are more calls to Warehouse#makeSpecFromName methods in IcebergTableUtil#convertNameToMetastorePartition, IcebergQueryCompactor, and potentially other places as well. Do we need to update them as well?
There was a problem hiding this comment.
The other call sites (e.g. IcebergTableUtil#convertNameToMetastorePartition) don't have access to partitionData — they only receive the partition path string. Without the underlying partition data, there's no way to disambiguate whether "null" in the path originated from a true NULL value or from a literal string "null". The fix here works precisely because we have partitionData available to check the actual value. If those other paths turn out to be affected, they'd need a different approach to obtain that context.
Note: IcebergQueryCompactor should be fine — it deals with compaction where NULL filtering semantics aren't in play
| } | ||
|
|
||
| /** | ||
| * Parses an Iceberg partition path into a Hive-compatible spec map. |
There was a problem hiding this comment.
The method somewhat "overrides" the default behavior of Warehouse.makeSpecFromName but it's unclear why this specificity is required for Iceberg especially since we want to create a "Hive-compatible" spec map.
There was a problem hiding this comment.
The issue is in how the partition path string is generated and then parsed back. spec.partitionToPath(partitionData) uses Iceberg's toHumanString() to convert partition values to path segments. For NULL values, toHumanString() produces the string "null". So the path looks like col=null. When Warehouse.makeSpecFromName parses this path back into a spec map, it just sees the literal string "null" as the value — it has no way to know whether this came from an actual NULL or from a legitimate string value "null". It puts "null" (the string) into the spec map rather than NULL.
The fix uses partitionData to check whether the value is actually NULL at the Iceberg level, and if so, puts NULL into the spec map directly. We need this because toHumanString() is lossy — it conflates NULL and the string "null" into the same path representation.
There was a problem hiding this comment.
Thanks for the additional details. I have two small follow-up questions:
- How do we avoid this problem in non-Iceberg tables?
- Do we have test coverage for partitioning where data contain both the
"null"string and actualNULL?
There was a problem hiding this comment.
- this is an Iceberg-specific problem.
- added
There was a problem hiding this comment.
For non-Iceberg tables I assume that actual NULL values go into the default partition (hive.exec.default.partition.name). I am trying to understand if when we built the spec we should have the value of hive.exec.default.partition.name or null and if it makes any difference.
There was a problem hiding this comment.
Mapping null to HIVE_DEFAULT_PARTITION didn't work in PCR and filter was replaced with false.
unlike prunePartitionNames
if (partitionValue.equals(defaultPartitionName)) {
convertedValues.add(null); // Null for default partition.
}
evalExprWithPart doesn't have null-awareness
There was a problem hiding this comment.
I assume that you refer to org.apache.hadoop.hive.ql.optimizer.ppr.PartExprEvalUtils#evalExprWithPart. If the latter does not have null-awareness then it means that the wrong result issue affects all partitioned tables and not only Iceberg.
I tried the following test case on master and it seems to confirm my hypothesis.
create table pcr_t1 (key string, value string) partitioned by (ds string);
INSERT INTO pcr_t1 PARTITION (ds) SELECT 'A', 'V1', '2000-04-08' ;
INSERT INTO pcr_t1 PARTITION (ds) SELECT 'B', 'V2', 'null';
INSERT INTO pcr_t1 PARTITION (ds) SELECT 'C', 'V3', null;
select key, value, ds from pcr_t1 where ds is null;Should we treat this as apart?
|



…n Column
What changes were proposed in this pull request?
IcebergTableUtil.makeSpecFromName() — a new helper that builds a Hive-compatible partition spec map from Iceberg's PartitionData. When a partition field value is null, it maps it to an actual NULL instead of the literal string "null".
Why are the changes needed?
Queries with WHERE <partition_col> IS NULL on Iceberg tables returned incorrect results (empty result set).
Does this PR introduce any user-facing change?
No
How was this patch tested?
mvn test -Dtest=TestIcebergCliDriver -Dqfile=iceberg_isnull_partition_pruning.q