
@Tishj Tishj commented Feb 3, 2026

Instead of reading one manifest at a time, we use the MultiFileReader to read multiple files.

The manifest schema evolves across versions, and is also influenced by the partition-spec-id of the manifest_file entry describing it.
To deal with this, we construct a set of global_columns for the MultiFileReader, which maps the columns of every read file by field id (including, in the case of nested types, the field ids of nested fields).

This simplifies the transformation logic for the IcebergManifestFile and IcebergManifestEntry structs: since we are now in full control of the order and appearance of columns in the result, we can do away with the hand-rolled mapping we had in place before.
(We do still keep the mapping for partition_fields for now; it can be removed, but that requires more work.)

It should also simplify the GetDataFile path: we no longer need to manually track which manifest we are currently scanning, and can instead pull a DataChunk from the reader whenever the current one is depleted. This is the same lazy-loading approach as before: we process STANDARD_VECTOR_SIZE (2048) tuples and only fetch a new chunk once the current one is fully read.

NOTE:
This also requires changes to duckdb-avro, because it wasn't producing MultiFileColumnDefinitions that the ColumnMapper accepted (for LIST and MAP types).

