Skip to content

HIVE-29637: Incorrect Results for NULL Predicate on Partition Column#6515

Open
deniskuzZ wants to merge 3 commits into
apache:masterfrom
deniskuzZ:HIVE-29637
Open

HIVE-29637: Incorrect Results for NULL Predicate on Partition Column#6515
deniskuzZ wants to merge 3 commits into
apache:masterfrom
deniskuzZ:HIVE-29637

Conversation

@deniskuzZ
Copy link
Copy Markdown
Member

@deniskuzZ deniskuzZ commented May 29, 2026

…n Column

What changes were proposed in this pull request?

IcebergTableUtil.makeSpecFromName() — a new helper that builds a Hive-compatible partition spec map from Iceberg's PartitionData. When a partition field value is null, it maps it to an actual NULL instead of the literal string "null".

Why are the changes needed?

Queries with WHERE <partition_col> IS NULL on Iceberg tables returned incorrect results (empty result set).

Does this PR introduce any user-facing change?

No

How was this patch tested?

mvn test -Dtest=TestIcebergCliDriver -Dqfile=iceberg_isnull_partition_pruning.q

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes incorrect results for WHERE <partition_col> IS NULL on Iceberg tables by ensuring null partition values are represented as actual Java null during partition predicate evaluation, and adds a regression qtest to cover the scenario.

Changes:

  • Add IcebergTableUtil.makeSpecFromName(...) to build a partition spec map from an Iceberg partition path while translating Iceberg null partition values to Java null.
  • Update HiveIcebergStorageHandler#getPartitionsByExpr to use the new helper when constructing DummyPartition specs for expression evaluation.
  • Add a new positive qtest (iceberg_isnull_partition_pruning.q + expected output) validating IS NULL partition pruning behavior.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/IcebergTableUtil.java Adds a helper for parsing Iceberg partition paths into a spec map with correct null handling.
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java Switches partition spec construction in getPartitionsByExpr to the new helper to fix IS NULL results.
iceberg/iceberg-handler/src/test/queries/positive/iceberg_isnull_partition_pruning.q Adds a regression query file covering WHERE str_col IS NULL on an Iceberg-partitioned table.
iceberg/iceberg-handler/src/test/results/positive/iceberg_isnull_partition_pruning.q.out Adds the expected output for the new regression test.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Member

@zabetak zabetak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some high level comments/questions cause I am not very familiar with the modified APIs.

String partName = spec.partitionToPath(partitionData);

Map<String, String> partSpecMap = Maps.newLinkedHashMap();
Warehouse.makeSpecFromName(partSpecMap, new Path(partName), null);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are more calls to Warehouse#makeSpecFromName methods in IcebergTableUtil#convertNameToMetastorePartition, IcebergQueryCompactor, and potentially other places as well. Do we need to update them as well?

Copy link
Copy Markdown
Member Author

@deniskuzZ deniskuzZ Jun 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other call sites (e.g. IcebergTableUtil#convertNameToMetastorePartition) don't have access to partitionData — they only receive the partition path string. Without the underlying partition data, there's no way to disambiguate whether "null" in the path originated from a true NULL value or from a literal string "null". The fix here works precisely because we have partitionData available to check the actual value. If those other paths turn out to be affected, they'd need a different approach to obtain that context.

Note: IcebergQueryCompactor should be fine — it deals with compaction where NULL filtering semantics aren't in play

}

/**
* Parses an Iceberg partition path into a Hive-compatible spec map.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method somewhat "overrides" the default behavior of Warehouse.makeSpecFromName but it's unclear why this specificity is required for Iceberg especially since we want to create a "Hive-compatible" spec map.

Copy link
Copy Markdown
Member Author

@deniskuzZ deniskuzZ Jun 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is in how the partition path string is generated and then parsed back. spec.partitionToPath(partitionData) uses Iceberg's toHumanString() to convert partition values to path segments. For NULL values, toHumanString() produces the string "null". So the path looks like col=null. When Warehouse.makeSpecFromName parses this path back into a spec map, it just sees the literal string "null" as the value — it has no way to know whether this came from an actual NULL or from a legitimate string value "null". It puts "null" (the string) into the spec map rather than NULL.

The fix uses partitionData to check whether the value is actually NULL at the Iceberg level, and if so, puts NULL into the spec map directly. We need this because toHumanString() is lossy — it conflates NULL and the string "null" into the same path representation.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the additional details. I have two small follow-up questions:

  1. How do we avoid this problem in non-Iceberg tables?
  2. Do we have test coverage for partitioning where data contain both the "null" string and actual NULL?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. this is an Iceberg-specific problem.
  2. added

Copy link
Copy Markdown
Member

@zabetak zabetak Jun 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For non-Iceberg tables I assume that actual NULL values go into the default partition (hive.exec.default.partition.name). I am trying to understand if when we built the spec we should have the value of hive.exec.default.partition.name or null and if it makes any difference.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mapping null to HIVE_DEFAULT_PARTITION didn't work in PCR and filter was replaced with false.
unlike prunePartitionNames

  if (partitionValue.equals(defaultPartitionName)) {
    convertedValues.add(null); // Null for default partition.
  }  

evalExprWithPart doesn't have null-awareness

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume that you refer to org.apache.hadoop.hive.ql.optimizer.ppr.PartExprEvalUtils#evalExprWithPart. If the latter does not have null-awareness then it means that the wrong result issue affects all partitioned tables and not only Iceberg.

I tried the following test case on master and it seems to confirm my hypothesis.

create table pcr_t1 (key string, value string) partitioned by (ds string);
INSERT INTO pcr_t1 PARTITION (ds) SELECT 'A', 'V1', '2000-04-08' ;
INSERT INTO pcr_t1 PARTITION (ds) SELECT 'B', 'V2', 'null';
INSERT INTO pcr_t1 PARTITION (ds) SELECT 'C', 'V3', null;
select key, value, ds from pcr_t1 where ds is null;

Should we treat this as apart?

Copy link
Copy Markdown
Member Author

@deniskuzZ deniskuzZ Jun 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in 45662e4, @zabetak please re-check

@deniskuzZ deniskuzZ changed the title HIVE-29637: Iceberg: Incorrect Results for NULL Predicate on Partitio… HIVE-29637: Incorrect Results for NULL Predicate on Partitio… Jun 2, 2026
@deniskuzZ deniskuzZ changed the title HIVE-29637: Incorrect Results for NULL Predicate on Partitio… HIVE-29637: Incorrect Results for NULL Predicate on Partition Column Jun 2, 2026
@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud Bot commented Jun 3, 2026

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants