Parallelize Manifest scanning #683
Draft
Instead of reading one manifest at a time, we make use of the MultiFileReader to read multiple files.
The schema of manifests evolves across versions, and is also influenced by the `partition-spec-id` of the `manifest_file` entry describing it. To deal with this, we construct a set of `global_columns` for the MultiFileReader, which maps the columns of every read file by field id (also for nested types).

This simplifies the transformation logic for the `IcebergManifestFile` and `IcebergManifestEntry` structs: since we are now in full control of the order and appearance of columns in the result, we can do away with the hand-rolled mapping we had in place before. (We do still keep the mapping for `partition_fields` for now; it can be removed, but that requires more work.)
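To make the field-id keying concrete, here is a minimal, self-contained sketch. It is not the actual DuckDB/duckdb-iceberg API: `ColumnDef` and `MapByFieldId` are hypothetical stand-ins. It only illustrates the idea that each file's local columns are matched to the global column set by Iceberg field id rather than by name or position, which is what lets manifests written under different schema versions / partition specs be read through one reader:

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a column definition carrying an Iceberg field id.
struct ColumnDef {
	int32_t field_id;
	std::string name;
	std::vector<ColumnDef> children; // nested types (STRUCT/LIST/MAP) carry their own field ids
};

// For every global column, find the matching column in a concrete file's schema by
// field id; nullopt means the file predates the column (read as NULL / default value).
static std::vector<std::optional<size_t>> MapByFieldId(const std::vector<ColumnDef> &global_columns,
                                                       const std::vector<ColumnDef> &file_columns) {
	std::unordered_map<int32_t, size_t> file_index;
	for (size_t i = 0; i < file_columns.size(); i++) {
		file_index.emplace(file_columns[i].field_id, i);
	}
	std::vector<std::optional<size_t>> mapping;
	mapping.reserve(global_columns.size());
	for (auto &global : global_columns) {
		auto it = file_index.find(global.field_id);
		mapping.push_back(it == file_index.end() ? std::nullopt : std::optional<size_t>(it->second));
	}
	return mapping;
}
```

In the PR itself the per-file matching is done by the MultiFileReader's column mapper once it is handed the `global_columns`; the sketch only shows the field-id keying that makes that mapping schema-evolution-proof.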
It should also simplify the `GetDataFile` path, as we won't need to manually keep track of the manifest we're currently scanning; instead, we can pull a DataChunk from the reader whenever the current one is depleted (the same lazy-loading approach: we process 2048 (`STANDARD_VECTOR_SIZE`) tuples and only fetch a new chunk once the current one is fully read).
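A rough, self-contained sketch of that lazy-loading pattern is below. The `EntryReader`, `ManifestEntry`, and `ManifestScanState` names are hypothetical, not the PR's actual types; in the PR the batches come from the MultiFileReader as DataChunks of up to `STANDARD_VECTOR_SIZE` tuples:

```cpp
#include <cstddef>
#include <vector>

// Tuples are consumed one by one from the current chunk, and the next chunk is only
// fetched from the (multi-file) reader once the current one is fully read.
static constexpr std::size_t STANDARD_VECTOR_SIZE = 2048;

struct ManifestEntry { /* fields of an Iceberg manifest entry would live here */ };

// Stand-in for the MultiFileReader: fills 'chunk' with the next batch of entries,
// leaving it empty once every manifest has been scanned.
struct EntryReader {
	virtual ~EntryReader() = default;
	virtual void Read(std::vector<ManifestEntry> &chunk) = 0;
};

struct ManifestScanState {
	EntryReader &reader;
	std::vector<ManifestEntry> chunk; // holds at most STANDARD_VECTOR_SIZE entries
	std::size_t offset = 0;           // next unread tuple in 'chunk'

	explicit ManifestScanState(EntryReader &reader_p) : reader(reader_p) {}

	// Returns false only when the reader is exhausted; otherwise yields the next entry.
	// Note that there is no "current manifest" bookkeeping here at all.
	bool NextEntry(ManifestEntry &result) {
		if (offset >= chunk.size()) {
			chunk.clear();
			reader.Read(chunk); // lazily pull the next batch
			offset = 0;
			if (chunk.empty()) {
				return false;
			}
		}
		result = chunk[offset++];
		return true;
	}
};
```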
NOTE: This also requires changes to duckdb-avro, because it wasn't producing `MultiFileColumnDefinition`s that the ColumnMapper was happy with (for LIST and MAP).