read avro files without knowing their size upfront by achille-roussel · Pull Request #67 · duckdb/duckdb-avro

achille-roussel · 2026-01-02T08:17:35Z

Remove HEAD requests when reading Avro files

Summary

This PR modifies the Avro file reading logic to avoid calling GetFileSize() before reading files. For HTTP-based filesystems, GetFileSize() triggers a HEAD request, adding an extra round-trip before the actual GET request. This change eliminates that overhead by reading files in chunks until EOF.

This is especially effective at reducing the number of round-trips to the object store in Iceberg queries, where a table can contain hundreds of Avro files.

Changes

Read files progressively: Instead of pre-allocating a buffer based on file size, read in 32KB chunks until EOF
Preserve caching: Continue using CachingFileSystem so files are still cached for repeated access
Use existing buffer pattern: Reuse AvroInMemoryBuffer from avro_copy.hpp for consistent memory management

Implementation

Before:

  GetFileSize() → HEAD request
  Read(total_size) → GET request

After:

  Read(32KB chunk) → GET request (repeats until EOF)

The file data is accumulated into a contiguous buffer which is then passed to avro_reader_memory().
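For readers who want to see the shape of the loop, here is a minimal sketch of reading a source of unknown length into one contiguous buffer. It is not the code from this PR: it uses `std::istream` as a stand-in for the caching file handle, the helper name `ReadAllUnknownSize` is hypothetical, and the doubling growth factor is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of duckdb-avro): reads a stream whose length
// is unknown into a single contiguous buffer, without asking for its size.
std::vector<char> ReadAllUnknownSize(std::istream &in, std::size_t initial_capacity = 32 * 1024) {
	// Start with one chunk worth of capacity (32KB by default).
	std::vector<char> buffer(std::max<std::size_t>(initial_capacity, 1));
	std::size_t size = 0; // number of valid bytes accumulated so far

	while (true) {
		// Assumption: double the capacity whenever the buffer is full.
		if (size == buffer.size()) {
			buffer.resize(buffer.size() * 2);
		}
		// Try to fill all of the remaining capacity in one read.
		in.read(buffer.data() + size, static_cast<std::streamsize>(buffer.size() - size));
		size += static_cast<std::size_t>(in.gcount());
		assert(size <= buffer.size());
		// A short or empty read leaves the stream in a failed/EOF state.
		if (!in) {
			break;
		}
	}

	buffer.resize(size); // trim unused capacity
	return buffer;
}
```

Wrapping a payload in a `std::istringstream` and passing it to the helper returns the same bytes back, whatever the payload length. The resulting buffer is contiguous, which is what a memory-based reader needs.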

Signed-off-by: Achille Roussel <achille.roussel@gmail.com>

achille-roussel · 2026-01-19T05:52:40Z

It's not clear who I should reach out to for feedback on this change; it looks like @Tmonster or @samansmink might have merge rights on this repository.

Let me know if there's anything blocking a merge of this change.

Tishj · 2026-02-15T15:53:05Z

src/avro_reader.cpp

+
+		data_ptr_t chunk_data = nullptr;
+		idx_t chunk_size = file_buffer->GetCapacity() - file_size;
+		auto chunk_handle = caching_file_handle->Read(chunk_data, chunk_size);


I don't know how this behaves with the caching file handle.
It feels like this will degrade performance for larger files, as it will issue many more small individual requests instead of one big request.

So I don't want to merge this

achille-roussel · 2026-02-15

I have used this code successfully on production datasets and observed a positive impact, which is why I'm opening this PR (this improvement is better housed in the community repository than in my forks).

I understand the concern about larger files. However, consider the following:

The code uses an exponential growth strategy, so the cost of growing the buffer amortizes to O(1) per byte and the number of reads grows only logarithmically with file size. For large files, the read size quickly converges towards very large chunks.

Avro can seek through files, but as far as I can tell this is not used in duckdb-avro: the extension always loads the entire file into memory. In that regard, this change introduces no regression in the total memory footprint or in the ability to partially load files.
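To put rough numbers on the exponential growth argument, here is a small model that counts reads. It is only a sketch under stated assumptions: the buffer starts at 32KB, doubles when full, every read fills the remaining capacity, and a short read signals EOF. `CountReads` is a hypothetical name, and how reads map to actual HTTP requests depends on the filesystem and cache layers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical model (not code from the PR): counts how many reads are needed
// to consume `file_size` bytes when the size is not known upfront.
std::uint64_t CountReads(std::uint64_t file_size, std::uint64_t initial_capacity = 32 * 1024) {
	std::uint64_t capacity = std::max<std::uint64_t>(initial_capacity, 1);
	std::uint64_t consumed = 0;
	std::uint64_t reads = 0;

	while (true) {
		// Assumption: capacity doubles each time the buffer fills up.
		if (consumed == capacity) {
			capacity *= 2;
		}
		const std::uint64_t want = capacity - consumed;
		const std::uint64_t got = std::min(want, file_size - consumed);
		++reads;
		consumed += got;
		assert(consumed <= file_size);
		// A short read (including an empty one) means EOF was reached.
		if (got < want) {
			break;
		}
	}
	return reads;
}
```

Under this model a 100,000 byte file takes 3 reads, a 1 MiB file takes 7, and a 1 GiB file takes 17, with each read after the second roughly twice the size of the previous one. In the worst case the last read is an empty one that only discovers EOF.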

Would a change that gates this behavior behind a variable be a better middle ground? I can revisit the code to keep the historical behavior as default but add a variable to enable reading files as a stream instead of calling GetFileSize. What do you think?
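A rough sketch of what such a gate could look like, to make the proposal concrete. Everything here is hypothetical: `FakeRemoteFile` is an in-memory stand-in that counts size lookups (modelling HEAD requests) and reads (modelling GET requests), and `stream_reads` stands for the proposed opt-in variable. None of these names exist in duckdb-avro.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for a remote file, used only to compare the two modes.
struct FakeRemoteFile {
	std::string contents;
	std::size_t position = 0;
	int size_lookups = 0; // models HEAD requests
	int reads = 0;        // models GET requests

	std::size_t GetSize() {
		++size_lookups;
		return contents.size();
	}
	// Copies up to `n` bytes into `dst`; returning 0 means EOF.
	std::size_t Read(char *dst, std::size_t n) {
		++reads;
		const std::size_t got = std::min(n, contents.size() - position);
		std::memcpy(dst, contents.data() + position, got);
		position += got;
		assert(position <= contents.size());
		return got;
	}
};

// `stream_reads` is the hypothetical opt-in switch; false keeps the
// historical size-first behavior as the default.
std::vector<char> LoadFile(FakeRemoteFile &file, bool stream_reads = false) {
	// Size-first mode allocates exactly; streaming mode starts at 32KB.
	std::vector<char> buffer(stream_reads ? 32 * 1024 : file.GetSize());
	std::size_t filled = 0;
	while (stream_reads || filled < buffer.size()) {
		// Only streaming mode can run out of room; double the capacity.
		if (filled == buffer.size()) {
			buffer.resize(buffer.size() * 2);
		}
		const std::size_t got = file.Read(buffer.data() + filled, buffer.size() - filled);
		if (got == 0) {
			break; // EOF
		}
		filled += got;
	}
	buffer.resize(filled);
	return buffer;
}
```

With the switch off, the fake file records one size lookup followed by a single full-size read. With it on, no size lookup happens at all, at the cost of a few more reads.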

achille-roussel added 2 commits January 1, 2026 23:21

read avro files without knowing their size upfront

46dd070

Signed-off-by: Achille Roussel <achille.roussel@gmail.com>

try to fill as much of the available buffer space as possible

b073972

Signed-off-by: Achille Roussel <achille.roussel@gmail.com>

achille-roussel mentioned this pull request Feb 2, 2026

Optimize manifest loading: parallel reads with file_size to avoid HEAD requests duckdb/duckdb-iceberg#679

Merged

Tishj reviewed Feb 15, 2026

achille-roussel wants to merge 2 commits into duckdb:main from achille-roussel:avro-read-files-progressively

2 participants
