Commit 05ba2d3
feat(reader): position-based column projection for Parquet files without field IDs (migrated tables) (#1777)
## What issue does this PR close?

Partially addresses #1749.

## Rationale for this change

**Background**: This issue was discovered when running Iceberg Java's test suite against our [experimental DataFusion Comet branch that uses iceberg-rust](apache/datafusion-comet#2528). Many failures occurred in `TestMigrateTableAction.java`, which tests reading Parquet files from migrated tables (_e.g.,_ from Hive or Spark) that lack embedded field ID metadata.

**Problem**: The Rust ArrowReader was unable to read these files, while Iceberg Java handles them with a position-based fallback: top-level field ID N maps to top-level Parquet column position N-1, and entire columns (including nested content) are projected.

## What changes are included in this PR?

This PR implements position-based column projection for Parquet files without field IDs, enabling iceberg-rust to read migrated tables.

**Solution**: Implemented fallback projection in `ArrowReader::get_arrow_projection_mask_fallback()` that matches Java's `ParquetSchemaUtil.pruneColumnsFallback()` behavior:

- Detects Parquet files without field IDs by checking Arrow schema metadata
- Maps top-level field IDs to top-level column positions (field IDs are 1-indexed, positions are 0-indexed)
- Uses `ProjectionMask::roots()` to project entire columns, including nested content (structs, lists, maps)
- Adds field ID metadata to the projected schema for `RecordBatchTransformer`
- Supports schema evolution by allowing missing columns (filled with default values by `RecordBatchTransformer`)

This implementation now matches Iceberg Java's behavior for reading migrated tables, enabling interoperability with Java-based tooling and workflows.

## Are these changes tested?
Yes, comprehensive unit tests were added to verify that the fallback path works correctly:

- `test_read_parquet_file_without_field_ids` - Basic projection over primitive columns using position-based mapping
- `test_read_parquet_without_field_ids_partial_projection` - Project a subset of columns
- `test_read_parquet_without_field_ids_schema_evolution` - Handle missing columns with NULL values
- `test_read_parquet_without_field_ids_multiple_row_groups` - Verify behavior across row group boundaries
- `test_read_parquet_without_field_ids_with_struct` - Project structs with nested fields (the entire top-level column)
- `test_read_parquet_without_field_ids_filter_eliminates_all_rows` - Reproduces a panic Comet hit when all row groups were filtered out
- `test_read_parquet_without_field_ids_schema_evolution_add_column_in_middle` - Schema evolution that adds a column in the middle of the schema, which previously caused a panic

All tests verify that behavior matches Iceberg Java's `pruneColumnsFallback()` implementation in `/parquet/src/main/java/org/apache/iceberg/parquet/ParquetSchemaUtil.java`.

---------

Co-authored-by: Renjie Liu <[email protected]>
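The core of the fallback described above is the ID-to-position mapping: top-level field ID N selects top-level Parquet column N-1, and IDs with no corresponding column are treated as missing (to be filled with defaults later). A minimal standalone sketch of that mapping, not the actual `get_arrow_projection_mask_fallback()` code (`fallback_positions` is a hypothetical helper name):

```rust
/// Hypothetical sketch of the position-based fallback mapping:
/// field IDs are 1-indexed, Parquet column positions are 0-indexed.
fn fallback_positions(selected_field_ids: &[i32], num_parquet_columns: usize) -> Vec<usize> {
    selected_field_ids
        .iter()
        .filter_map(|&id| {
            // Field ID N -> column position N-1; non-positive IDs are skipped.
            let pos = usize::try_from(id).ok()?.checked_sub(1)?;
            // IDs beyond the file's column count have no physical column
            // (schema evolution); they are dropped here and filled with
            // default values downstream.
            (pos < num_parquet_columns).then_some(pos)
        })
        .collect()
}

fn main() {
    // Project field IDs 1 and 3 from a file with 3 top-level columns.
    println!("{:?}", fallback_positions(&[1, 3], 3)); // [0, 2]

    // Field ID 4 has no corresponding column: the schema-evolution case.
    println!("{:?}", fallback_positions(&[2, 4], 3)); // [1]
}
```

In the real reader, the resulting positions would be handed to `ProjectionMask::roots()`, so each selected top-level column is projected whole, nested content included.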
Parent: ea51429

File tree

2 files changed: +915 −93 lines

crates/iceberg/src/arrow/delete_file_loader.rs

Lines changed: 1 addition & 0 deletions

```diff
@@ -63,6 +63,7 @@ impl BasicDeleteFileLoader {
             data_file_path,
             self.file_io.clone(),
             false,
+            None,
         )
         .await?
         .build()?
```