
read avro files without knowing their size upfront #67

Open
achille-roussel wants to merge 2 commits into duckdb:main from achille-roussel:avro-read-files-progressively

Conversation

@achille-roussel

Remove HEAD requests when reading Avro files

Summary

This PR modifies the Avro file reading logic to avoid calling GetFileSize() before reading files. For HTTP-based filesystems, GetFileSize() triggers a HEAD request, adding an extra round-trip before the actual GET request. This change eliminates that overhead by reading files in chunks until EOF.

This is especially effective at reducing the number of round-trips to the object store in Iceberg queries, where tables can contain hundreds of Avro files.

Changes

  • Read files progressively: Instead of pre-allocating a buffer based on file size, read in 32KB chunks until EOF
  • Preserve caching: Continue using CachingFileSystem so files are still cached for repeated access
  • Use existing buffer pattern: Reuse AvroInMemoryBuffer from avro_copy.hpp for consistent memory management

Implementation

Before:

  GetFileSize() → HEAD request
  Read(total_size) → GET request

After:

  Read(32KB chunk) → GET request (repeats until EOF)

The file data is accumulated into a contiguous buffer which is then passed to avro_reader_memory().
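As a rough sketch of the loop described above (the stream interface is a stand-in for the extension's caching file handle, not the PR's actual code):

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Sketch of the progressive-read approach: read fixed 32KB chunks until
// EOF and accumulate them into one contiguous buffer, instead of sizing
// the buffer with a GetFileSize() call up front.
static std::vector<char> ReadAllChunks(std::istream &in, size_t chunk_size) {
	std::vector<char> buffer;
	std::vector<char> chunk(chunk_size);
	for (;;) {
		in.read(chunk.data(), static_cast<std::streamsize>(chunk.size()));
		std::streamsize n = in.gcount(); // bytes actually read (short at EOF)
		if (n <= 0) {
			break;
		}
		buffer.insert(buffer.end(), chunk.data(), chunk.data() + n);
	}
	return buffer; // contiguous bytes, ready to hand to avro_reader_memory()
}
```

The resulting buffer is used exactly as before; only how it gets filled changes.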

Signed-off-by: Achille Roussel <achille.roussel@gmail.com>
@achille-roussel
Author

It's not clear who I should reach out to for feedback on this change; it looks like @Tmonster or @samansmink might have merge rights on this repository.

Let me know if there's anything blocking a merge of this change.


data_ptr_t chunk_data = nullptr;
idx_t chunk_size = file_buffer->GetCapacity() - file_size;
auto chunk_handle = caching_file_handle->Read(chunk_data, chunk_size);
Collaborator

@Tishj Feb 15, 2026

I don't know how this behaves with the caching file handle.
It feels like this will degrade performance for larger files, since it will make many more small individual requests instead of one big request.

So I don't want to merge this.

Author

I have used this code successfully on production datasets and observed the positive impact, which is why I'm opening this PR (this improvement is better housed in the community repository than in my fork).

I understand the concern about larger files; however, let's consider a few points:

  • The code uses an exponential growth strategy, so the overall cost amortizes to O(1); for large files the read size quickly converges toward very large chunks
  • Avro supports seeking within files, but as far as I can tell duckdb-avro doesn't use that: the extension always loads the entire file into memory, so this change introduces no regression in total memory footprint or in the ability to partially load files
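To make the amortization argument concrete, here is an illustrative, self-contained sketch (the 32KiB starting size and doubling factor are assumptions for illustration, not necessarily the PR's exact constants) counting how many requests a doubling strategy issues:

```cpp
#include <cassert>
#include <cstddef>

// Illustrative model of an exponential growth strategy: each read request
// doubles in size, so consuming a file of `file_size` bytes takes only
// O(log(file_size / initial_chunk)) requests, and the per-byte cost
// amortizes to O(1).
static int CountReads(size_t file_size, size_t initial_chunk) {
	int reads = 0;
	size_t offset = 0;
	size_t chunk = initial_chunk;
	while (offset < file_size) {
		offset += chunk; // one read request of `chunk` bytes
		chunk *= 2;      // next request is twice as large
		++reads;
	}
	return reads;
}
```

Under this model, a 1GiB file starting from 32KiB chunks needs only 16 requests, whereas a fixed 32KiB chunk size would need 32768.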

Would gating this behavior behind a setting be a better middle ground? I can revisit the code to keep the historical behavior as the default and add a variable that enables reading files as a stream instead of calling GetFileSize(). What do you think?
