read avro files without knowing their size upfront#67
read avro files without knowing their size upfront#67achille-roussel wants to merge 2 commits intoduckdb:mainfrom
Conversation
Signed-off-by: Achille Roussel <achille.roussel@gmail.com>
Signed-off-by: Achille Roussel <achille.roussel@gmail.com>
|
It's not clear who I'm supposed to reach out to for feedback on this change, it looks like @Tmonster or @samansmink might have merge rights on this repository. Let me know if there's anything blocking a merge of this change. |
|
|
||
| data_ptr_t chunk_data = nullptr; | ||
| idx_t chunk_size = file_buffer->GetCapacity() - file_size; | ||
| auto chunk_handle = caching_file_handle->Read(chunk_data, chunk_size); |
There was a problem hiding this comment.
I don't know how this behaves with the caching file handle
It feels like this will degrade performance for larger files, as it will be doing a lot more small individual requests, as opposed to one big request
So I don't want to merge this
There was a problem hiding this comment.
I have used this code successfully on production datasets and observed the positive impact which is why I'm opening this PR (this improvement is better homed in the community repository than in my forks).
I understand the concern on larger files, however, let's make a few considerations:
- The code uses an exponential growth strategy, so the overall cost amortizes to O(1), for large files the size of reads quickly converges towards reading very large chunks
- Avro can seek through files but as far as I can tell this is not used in duckdb-avro, the extension always loads the entire file in memory, so in that regard we don't introduce any regressions on the total memory footprint nor the ability to partially load the files
Would a change that gates this behavior behind a variable be a better middle ground? I can revisit the code to keep the historical behavior as default but add a variable to enable reading files as a stream instead of calling GetFileSize. What do you think?
Remove HEAD requests when reading Avro files
Summary
This PR modifies the Avro file reading logic to avoid calling GetFileSize() before reading files. For HTTP-based filesystems, GetFileSize() triggers a HEAD request, adding an extra round-trip before the actual GET request. This change eliminates that overhead by reading files in chunks until EOF.
This is especially effective to reduce the number of roundtrips to object store in iceberg queries when the tables can contain hundreds of avro files.
Changes
Implementation
Before:
After:
The file data is accumulated into a contiguous buffer which is then passed to avro_reader_memory().