feat(reader): position-based column projection for Parquet files without field IDs (migrated tables) (#1777)
## What issue does this PR close?
Partially addresses #1749.
## Rationale for this change
**Background**: This issue was discovered when running Iceberg Java's
test suite against our [experimental DataFusion Comet branch that uses
iceberg-rust](apache/datafusion-comet#2528).
Many failures occurred in `TestMigrateTableAction.java`, which tests
reading Parquet files from migrated tables (_e.g.,_ from Hive or Spark)
that lack embedded field ID metadata.
**Problem**: The Rust `ArrowReader` could not read these files. Iceberg
Java handles them with a position-based fallback: top-level field ID N
maps to top-level Parquet column position N-1, and entire columns
(including nested content) are projected.
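For context, a minimal sketch of the detection step, assuming the
`PARQUET:field_id` metadata key that parquet-rs attaches to Arrow fields
when the file embeds field IDs (the helper name here is hypothetical; the
PR's actual check lives inside `ArrowReader`):

```rust
use arrow_schema::Schema as ArrowSchema;
use parquet::arrow::PARQUET_FIELD_ID_META_KEY;

/// Hypothetical helper: returns true if any top-level field carries the
/// `PARQUET:field_id` metadata key that parquet-rs attaches when the
/// Parquet schema embeds field IDs. Migrated files have none, so they
/// must take the position-based fallback path.
fn schema_has_field_ids(schema: &ArrowSchema) -> bool {
    schema
        .fields()
        .iter()
        .any(|field| field.metadata().contains_key(PARQUET_FIELD_ID_META_KEY))
}
```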
## What changes are included in this PR?
This PR implements position-based column projection for Parquet files
without field IDs, enabling iceberg-rust to read migrated tables.
**Solution**: Implemented a fallback projection in
`ArrowReader::get_arrow_projection_mask_fallback()` that matches the
behavior of Java's `ParquetSchemaUtil.pruneColumnsFallback()`:
- Detects Parquet files without field IDs by checking Arrow schema
metadata
- Maps top-level field IDs to top-level column positions (field IDs are
1-indexed, positions are 0-indexed)
- Uses `ProjectionMask::roots()` to project entire columns including
nested content (structs, lists, maps)
- Adds field ID metadata to the projected schema for
`RecordBatchTransformer`
- Supports schema evolution by allowing missing columns (filled with
default values by `RecordBatchTransformer`)
This implementation now matches Iceberg Java's behavior for reading
migrated tables, enabling interoperability with Java-based tooling and
workflows.
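A condensed sketch of that fallback, simplified from
`get_arrow_projection_mask_fallback()` (the function signature and helper
shape here are illustrative, and error handling is omitted):

```rust
use std::sync::Arc;

use arrow_schema::{Field, Schema as ArrowSchema};
use parquet::arrow::{ProjectionMask, PARQUET_FIELD_ID_META_KEY};
use parquet::schema::types::SchemaDescriptor;

/// Illustrative signature: `field_ids` are the top-level Iceberg field
/// IDs (1-indexed) requested by the scan projection.
fn fallback_projection_mask(
    field_ids: &[i32],
    parquet_schema: &SchemaDescriptor,
    arrow_schema: &ArrowSchema,
) -> (ProjectionMask, ArrowSchema) {
    // Field ID N (1-indexed) maps to top-level column position N - 1
    // (0-indexed). Positions beyond the file's columns are columns added
    // by schema evolution; RecordBatchTransformer later fills them with
    // default/NULL values, so they are simply skipped here.
    let positions: Vec<usize> = field_ids
        .iter()
        .map(|id| (*id - 1) as usize)
        .filter(|pos| *pos < arrow_schema.fields().len())
        .collect();

    // Project whole root columns so nested content (structs, lists,
    // maps) comes along with its parent.
    let mask = ProjectionMask::roots(parquet_schema, positions.clone());

    // Re-attach field ID metadata to the projected schema so the
    // RecordBatchTransformer can match columns by ID downstream;
    // position P corresponds to field ID P + 1.
    let fields: Vec<Arc<Field>> = positions
        .iter()
        .map(|&pos| {
            let field = arrow_schema.field(pos).clone();
            let mut metadata = field.metadata().clone();
            metadata.insert(
                PARQUET_FIELD_ID_META_KEY.to_string(),
                (pos + 1).to_string(),
            );
            Arc::new(field.with_metadata(metadata))
        })
        .collect();

    (mask, ArrowSchema::new(fields))
}
```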
## Are these changes tested?
Yes, comprehensive unit tests were added to verify the fallback path
works correctly:
- `test_read_parquet_file_without_field_ids` - Basic projection with
primitive columns using position-based mapping
- `test_read_parquet_without_field_ids_partial_projection` - Project
subset of columns
- `test_read_parquet_without_field_ids_schema_evolution` - Handle
missing columns with NULL values
- `test_read_parquet_without_field_ids_multiple_row_groups` - Verify
behavior across row group boundaries
- `test_read_parquet_without_field_ids_with_struct` - Project structs
with nested fields (entire top-level column)
- `test_read_parquet_without_field_ids_filter_eliminates_all_rows` -
Reproduces a panic Comet hit when every row group was filtered out
- `test_read_parquet_without_field_ids_schema_evolution_add_column_in_middle`
- Schema evolution that adds a column in the middle of the schema (this
previously caused a panic)
All tests verify that behavior matches Iceberg Java's
`pruneColumnsFallback()` implementation in
`/parquet/src/main/java/org/apache/iceberg/parquet/ParquetSchemaUtil.java`.
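For reference, the test setup amounts to writing Parquet data whose Arrow
schema carries no field ID metadata, the way a migrated Hive/Spark file
would look; a self-contained sketch with illustrative names (not copied
from the test code):

```rust
use std::fs::File;
use std::sync::Arc;

use arrow_array::{Int32Array, RecordBatch, StringArray};
use arrow_schema::{DataType, Field, Schema};
use parquet::arrow::arrow_writer::ArrowWriter;

// Writes a Parquet file with plain Arrow fields, i.e. no
// `PARQUET:field_id` metadata, so the reader must take the
// position-based fallback path (field ID 1 -> column 0, etc.).
fn write_migrated_style_file(path: &str) -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int32, false), // expected field ID 1
        Field::new("name", DataType::Utf8, true), // expected field ID 2
    ]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![
            Arc::new(Int32Array::from(vec![1, 2, 3])),
            Arc::new(StringArray::from(vec![Some("a"), None, Some("c")])),
        ],
    )?;
    let mut writer = ArrowWriter::try_new(File::create(path)?, schema, None)?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}
```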
---------
Co-authored-by: Renjie Liu <[email protected]>