Commit 05ba2d3
feat(reader): position-based column projection for Parquet files without field IDs (migrated tables) (#1777)
## What issue does this PR close?
Partially addresses #1749.
## Rationale for this change
**Background**: This issue was discovered when running Iceberg Java's
test suite against our [experimental DataFusion Comet branch that uses
iceberg-rust](apache/datafusion-comet#2528).
Many failures occurred in `TestMigrateTableAction.java`, which tests
reading Parquet files from migrated tables (_e.g.,_ from Hive or Spark)
that lack embedded field ID metadata.
**Problem**: The Rust ArrowReader was unable to read these files, while
Iceberg Java handles them using a position-based fallback where
top-level field ID N maps to top-level Parquet column position N-1, and
entire columns (including nested content) are projected.
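The mapping described above can be sketched in a few lines of Rust. This is only an illustration of the rule "field ID N maps to position N-1". The function name `fallback_position` and its signature are hypothetical and are not taken from the iceberg-rust code.

```rust
/// Hypothetical helper (not part of iceberg-rust): maps a top-level Iceberg
/// field ID to a top-level Parquet column position under the position-based
/// fallback, where field ID N corresponds to position N - 1.
///
/// Returns `None` for IDs below 1 and for IDs past the file's column count.
fn fallback_position(field_id: i32, num_top_level_columns: usize) -> Option<usize> {
    // Field IDs are 1-indexed, so subtract one; negative results are rejected
    // by the conversion to `usize`.
    let position = usize::try_from(field_id.checked_sub(1)?).ok()?;
    // A position beyond the file's columns means the column is missing.
    (position < num_top_level_columns).then_some(position)
}
```

For a file with three top-level columns, `fallback_position(1, 3)` yields `Some(0)` and `fallback_position(4, 3)` yields `None`, which corresponds to a column that was added to the table after the file was written.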
## What changes are included in this PR?
This PR implements position-based column projection for Parquet files
without field IDs, enabling iceberg-rust to read migrated tables.
**Solution**: Implemented fallback projection in
`ArrowReader::get_arrow_projection_mask_fallback()` that matches Java's
`ParquetSchemaUtil.pruneColumnsFallback()` behavior:
- Detects Parquet files without field IDs by checking Arrow schema
metadata
- Maps top-level field IDs to top-level column positions (field IDs are
1-indexed, positions are 0-indexed)
- Uses `ProjectionMask::roots()` to project entire columns including
nested content (structs, lists, maps)
- Adds field ID metadata to the projected schema for
`RecordBatchTransformer`
- Supports schema evolution by allowing missing columns (filled with
default values by `RecordBatchTransformer`)
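A rough, dependency-free sketch of the steps listed above may help. It assumes a simplified `Field` type standing in for an Arrow field, and the function names `lacks_field_ids` and `project_by_position` are hypothetical. The only detail borrowed from the real ecosystem is the metadata key `PARQUET:field_id`, which the Rust `parquet` crate uses for field IDs on Arrow fields. In the actual reader, the resulting root indices would be what is handed to `ProjectionMask::roots()`.

```rust
use std::collections::HashMap;

/// Metadata key the `parquet` crate uses to carry Parquet field IDs on Arrow fields.
const PARQUET_FIELD_ID_META_KEY: &str = "PARQUET:field_id";

/// Simplified stand-in for a top-level Arrow field (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    metadata: HashMap<String, String>,
}

/// True when no top-level field carries a field ID, so the fallback applies.
/// Assumption: checking top-level fields is enough for this illustration.
fn lacks_field_ids(fields: &[Field]) -> bool {
    !fields
        .iter()
        .any(|f| f.metadata.contains_key(PARQUET_FIELD_ID_META_KEY))
}

/// Returns the sorted root column indices to project, plus the projected
/// fields, each stamped with the field ID implied by its position.
fn project_by_position(fields: &[Field], field_ids: &[i32]) -> (Vec<usize>, Vec<Field>) {
    let mut roots: Vec<usize> = field_ids
        .iter()
        // Field IDs are 1-indexed, positions are 0-indexed.
        .filter_map(|id| id.checked_sub(1))
        .filter_map(|pos| usize::try_from(pos).ok())
        // Columns missing from the file are skipped here and left for a
        // later step to fill in.
        .filter(|&pos| pos < fields.len())
        .collect();
    roots.sort_unstable();
    roots.dedup();

    let projected = roots
        .iter()
        .map(|&pos| {
            let mut field = fields[pos].clone();
            field
                .metadata
                .insert(PARQUET_FIELD_ID_META_KEY.to_string(), (pos + 1).to_string());
            field
        })
        .collect();
    (roots, projected)
}
```

Selecting whole root columns keeps nested content (structs, lists, maps) intact without having to resolve IDs for nested fields, which a file without field IDs cannot provide.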
This implementation now matches Iceberg Java's behavior for reading
migrated tables, enabling interoperability with Java-based tooling and
workflows.
## Are these changes tested?
Yes, comprehensive unit tests were added to verify the fallback path
works correctly:
- `test_read_parquet_file_without_field_ids` - Basic projection with
primitive columns using position-based mapping
- `test_read_parquet_without_field_ids_partial_projection` - Project
subset of columns
- `test_read_parquet_without_field_ids_schema_evolution` - Handle
missing columns with NULL values
- `test_read_parquet_without_field_ids_multiple_row_groups` - Verify
behavior across row group boundaries
- `test_read_parquet_without_field_ids_with_struct` - Project structs
with nested fields (entire top-level column)
- `test_read_parquet_without_field_ids_filter_eliminates_all_rows` -
Reproduces a panic that Comet hit when all row groups were filtered out
- `test_read_parquet_without_field_ids_schema_evolution_add_column_in_middle` -
Schema evolution with a column added in the middle, which caused a panic
at one point
All tests verify that behavior matches Iceberg Java's
`pruneColumnsFallback()` implementation in
`/parquet/src/main/java/org/apache/iceberg/parquet/ParquetSchemaUtil.java`.
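The schema-evolution cases covered by these tests can be pictured with a small, self-contained sketch. It is not the test code from this PR. The function `align_row` is hypothetical, rows are modeled as plain `i64` values rather than Arrow arrays, and it assumes that a column added in the middle of the table schema carries a field ID higher than any column present in the older file.

```rust
/// Hypothetical illustration: for each requested field ID, in table-schema
/// order, return the value from the file column at position `id - 1`, or
/// `None` (NULL) when the file has no such column.
fn align_row(file_row: &[i64], requested_field_ids: &[i32]) -> Vec<Option<i64>> {
    requested_field_ids
        .iter()
        .map(|id| {
            id.checked_sub(1)
                // Negative positions cannot exist, so they map to NULL.
                .and_then(|pos| usize::try_from(pos).ok())
                // Positions past the end of the file row also map to NULL.
                .and_then(|pos| file_row.get(pos).copied())
        })
        .collect()
}
```

With a two-column file row `[10, 20]` and a table schema whose field IDs are ordered `[1, 3, 2]`, the result is `[Some(10), None, Some(20)]`: the new column sits in the middle of the output and is NULL for rows from the old file.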
---------
Co-authored-by: Renjie Liu <[email protected]>

1 parent ea51429 commit 05ba2d3
2 files changed: +915 -93 lines changed