Commit 68af86e
feat(reader): Add PartitionSpec support to FileScanTask and RecordBatchTransformer (apache#1821)
Partially address apache#1749.
This PR adds partition spec handling to `FileScanTask` and
`RecordBatchTransformer` to correctly implement the Iceberg spec's
"Column Projection" rules for fields "not present" in data files.
Prior to this PR, `iceberg-rust`'s `FileScanTask` had no mechanism to
pass partition information to `RecordBatchTransformer`, causing two
issues:
1. **Incorrect handling of bucket partitioning**: Couldn't distinguish
identity transforms (which should use partition metadata constants) from
non-identity transforms like bucket/truncate/year/month (which must read
from data file). For example, `bucket(4, id)` stores
`id_bucket = 2` (bucket number) in partition metadata, but actual `id`
values (100, 200, 300) are only in the data file. iceberg-rust was
incorrectly treating bucket-partitioned source columns as constants,
breaking runtime filtering and returning incorrect query results.
2. **Field ID conflicts in add_files scenarios**: When importing Hive
tables via `add_files`, partition columns could have field IDs
conflicting with Parquet data columns. Example: Parquet has
field_id=1→"name", but Iceberg expects field_id=1→"id" (partition). Per
spec, the
correct field is "not present" and requires name mapping fallback.
Per the Iceberg spec
(https://iceberg.apache.org/spec/#column-projection), when a field ID is
"not present" in a data file, it must be resolved using these rules:
1. Return the value from partition metadata if an **Identity Transform**
exists
2. Use `schema.name-mapping.default` metadata to map field id to columns
without field id
3. Return the default value if it has a defined `initial-default`
4. Return null in all other cases
**Why this matters:**
- **Identity transforms** (e.g., `identity(dept)`) store actual column
values in partition metadata that can be used as constants without
reading the data file
- **Non-identity transforms** (e.g., `bucket(4, id)`, `day(timestamp)`)
store transformed values in partition metadata (e.g., bucket number 2,
not the actual `id` values 100, 200, 300) and must read source columns
from the data file
1. **Added partition fields to `FileScanTask`** (`scan/task.rs`):
- `partition: Option<Struct>` - Partition data from manifest entry
- `partition_spec: Option<Arc<PartitionSpec>>` - For transform-aware
constant detection
- `name_mapping: Option<Arc<NameMapping>>` - Name mapping from table
metadata
2. **Implemented `constants_map()` function**
(`arrow/record_batch_transformer.rs`):
- Replicates Java's `PartitionUtil.constantsMap()` behavior
- Only includes fields where transform is `Transform::Identity`
- Used to determine which fields use partition metadata constants vs.
reading from data files
3. **Enhanced `RecordBatchTransformer`**
(`arrow/record_batch_transformer.rs`):
- Added `build_with_partition_data()` method to accept partition spec,
partition data, and name mapping
- Implements all 4 spec rules for column resolution with
identity-transform awareness
- Detects field ID conflicts by verifying both field ID AND name match
- Falls back to name mapping when field IDs are missing/conflicting
(spec rule risingwavelabs#2)
4. **Updated `ArrowReader`** (`arrow/reader.rs`):
- Uses `build_with_partition_data()` when partition information is
available
- Falls back to `build()` when not available
5. **Updated manifest entry processing** (`scan/context.rs`):
- Populates partition fields in `FileScanTask` from manifest entry data
1. **`bucket_partitioning_reads_source_column_from_file`** - Verifies
that bucket-partitioned source columns are read from data files (not
treated as constants from partition metadata)
2. **`identity_partition_uses_constant_from_metadata`** - Verifies that
identity-transformed fields correctly use partition metadata constants
3. **`test_bucket_partitioning_with_renamed_source_column`** - Verifies
field-ID-based mapping works despite column rename
4. **`add_files_partition_columns_without_field_ids`** - Verifies name
mapping resolution for Hive table imports without field IDs (spec rule
5. **`add_files_with_true_field_id_conflict`** - Verifies correct field
ID conflict detection with name mapping fallback (spec rule risingwavelabs#2)
6. **`test_all_four_spec_rules`** - Integration test verifying all 4
spec rules work together
Yes, there are 6 new unit tests covering all 4 Iceberg spec rules. This
also resolved approximately 50 Iceberg Java tests when running with
DataFusion Comet's experimental
apache/datafusion-comet#2528 PR.
---------
Co-authored-by: Renjie Liu <[email protected]>1 parent 4bf15b9 commit 68af86e
File tree
9 files changed
+1250
-94
lines changed- crates/iceberg/src
- arrow
- scan
- spec/manifest
9 files changed
+1250
-94
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
886 | 886 | | |
887 | 887 | | |
888 | 888 | | |
| 889 | + | |
| 890 | + | |
| 891 | + | |
889 | 892 | | |
890 | 893 | | |
891 | 894 | | |
| |||
902 | 905 | | |
903 | 906 | | |
904 | 907 | | |
| 908 | + | |
| 909 | + | |
| 910 | + | |
905 | 911 | | |
906 | 912 | | |
907 | 913 | | |
| |||
918 | 924 | | |
919 | 925 | | |
920 | 926 | | |
| 927 | + | |
| 928 | + | |
| 929 | + | |
921 | 930 | | |
922 | 931 | | |
923 | 932 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
20 | 20 | | |
21 | 21 | | |
22 | 22 | | |
23 | | - | |
| 23 | + | |
24 | 24 | | |
25 | 25 | | |
26 | 26 | | |
| |||
82 | 82 | | |
83 | 83 | | |
84 | 84 | | |
85 | | - | |
| 85 | + | |
86 | 86 | | |
87 | 87 | | |
88 | 88 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
324 | 324 | | |
325 | 325 | | |
326 | 326 | | |
| 327 | + | |
| 328 | + | |
| 329 | + | |
327 | 330 | | |
328 | 331 | | |
329 | 332 | | |
| |||
341 | 344 | | |
342 | 345 | | |
343 | 346 | | |
| 347 | + | |
| 348 | + | |
| 349 | + | |
344 | 350 | | |
345 | 351 | | |
346 | 352 | | |
| |||
356 | 362 | | |
357 | 363 | | |
358 | 364 | | |
| 365 | + | |
| 366 | + | |
| 367 | + | |
359 | 368 | | |
360 | 369 | | |
361 | 370 | | |
| |||
0 commit comments