Parallelize Manifest scanning #683
Draft
Instead of reading one manifest at a time, we make use of the MultiFileReader to read multiple files.
The schema of manifests evolves across versions, and is also influenced by the `partition-spec-id` of the `manifest_file` entry describing it. To deal with this, we construct a set of `global_columns` for the MultiFileReader, which maps the columns of every read file by field id (also for nested types).

This simplifies the transformation logic for the `IcebergManifestFile` and `IcebergManifestEntry` structs: since we are now in full control of the order and appearance of columns in the result, we can do away with the hand-rolled mapping we had in place before. (We do still keep the mapping for `partition_fields` for now; it can be removed, but that requires more work.)
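To make the field-id keying concrete, here is a minimal, self-contained sketch. It is not the actual DuckDB/duckdb-iceberg API: `ColumnDef` and `MapByFieldId` are hypothetical stand-ins. It only illustrates the idea that each file's local columns are matched to the global column set by Iceberg field id rather than by name or position, which is what lets manifests written under different schema versions / partition specs be read through one reader:

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a column definition carrying an Iceberg field id.
struct ColumnDef {
	int32_t field_id;
	std::string name;
	std::vector<ColumnDef> children; // nested types (STRUCT/LIST/MAP) carry their own field ids
};

// For every global column, find the matching column in a concrete file's schema by
// field id; nullopt means the file predates the column (read as NULL / default value).
static std::vector<std::optional<size_t>> MapByFieldId(const std::vector<ColumnDef> &global_columns,
                                                       const std::vector<ColumnDef> &file_columns) {
	std::unordered_map<int32_t, size_t> file_index;
	for (size_t i = 0; i < file_columns.size(); i++) {
		file_index.emplace(file_columns[i].field_id, i);
	}
	std::vector<std::optional<size_t>> mapping;
	mapping.reserve(global_columns.size());
	for (auto &global : global_columns) {
		auto it = file_index.find(global.field_id);
		mapping.push_back(it == file_index.end() ? std::nullopt : std::optional<size_t>(it->second));
	}
	return mapping;
}
```

In the PR itself the per-file matching is done by the MultiFileReader's column mapper once it is handed the `global_columns`; the sketch only shows the field-id keying that makes that mapping schema-evolution-proof.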
It should also simplify the `GetDataFile` path, as we won't need to manually keep track of the manifest we're currently scanning; instead, we can pull a DataChunk from the reader whenever the current one is depleted (the same lazy-loading approach: we process 2048 (`STANDARD_VECTOR_SIZE`) tuples and only fetch a new chunk once the current one is fully read).
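A rough, self-contained sketch of that lazy-loading pattern is below. The `EntryReader`, `ManifestEntry`, and `ManifestScanState` names are hypothetical, not the PR's actual types; in the PR the batches come from the MultiFileReader as DataChunks of up to `STANDARD_VECTOR_SIZE` tuples:

```cpp
#include <cstddef>
#include <vector>

// Tuples are consumed one by one from the current chunk, and the next chunk is only
// fetched from the (multi-file) reader once the current one is fully read.
static constexpr std::size_t STANDARD_VECTOR_SIZE = 2048;

struct ManifestEntry { /* fields of an Iceberg manifest entry would live here */ };

// Stand-in for the MultiFileReader: fills 'chunk' with the next batch of entries,
// leaving it empty once every manifest has been scanned.
struct EntryReader {
	virtual ~EntryReader() = default;
	virtual void Read(std::vector<ManifestEntry> &chunk) = 0;
};

struct ManifestScanState {
	EntryReader &reader;
	std::vector<ManifestEntry> chunk; // holds at most STANDARD_VECTOR_SIZE entries
	std::size_t offset = 0;           // next unread tuple in 'chunk'

	explicit ManifestScanState(EntryReader &reader_p) : reader(reader_p) {}

	// Returns false only when the reader is exhausted; otherwise yields the next entry.
	// Note that there is no "current manifest" bookkeeping here at all.
	bool NextEntry(ManifestEntry &result) {
		if (offset >= chunk.size()) {
			chunk.clear();
			reader.Read(chunk); // lazily pull the next batch
			offset = 0;
			if (chunk.empty()) {
				return false;
			}
		}
		result = chunk[offset++];
		return true;
	}
};
```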
NOTE: This also requires changes to duckdb-avro, because it wasn't producing `MultiFileColumnDefinition`s that the ColumnMapper was happy with (for LIST and MAP).