Skip to content

fix(table): AddFiles: close file after usage and parallelize#799

Open
starpact wants to merge 1 commit intoapache:mainfrom
starpact:main
Open

fix(table): AddFiles: close file after usage and parallelize#799
starpact wants to merge 1 commit intoapache:mainfrom
starpact:main

Conversation

@starpact
Copy link
Contributor

@starpact starpact commented Mar 18, 2026

Previously files are closed only after the whole iteration finishes:

defer rdr.Close()

causing resource leak within a single filesToDataFiles invocation as the file holds an open S3 read stream:
*blob.Reader

A sample trace can demonstrate it clearly:
image

Besides, I also added parallelism to AddFiles.

Future improvements:
Parallelism partially "fixes" the performance issue of AddFiles but a more fundamental improvement is related to IO, currently for extracting Parquet metadata, the IO layer needs to:

  1. do one GetObject and hold the read stream. The size is used by Seek, but the reader is never used(but still consumes quite some memory due to underlying buffering)
  2. do one range GetObject for the last 8 bytes
  3. do one range GetObject for the footer

A common trick is optimistically prefetching the footer(e.g. 512KB) to coleasce the above 3 requests. However, this requires leaking the IO abstraction, e.g. by determining the file format from the path and returning different implementations accordingly, which may require more discussions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant