Skip to content

Conversation

@carlopi
Copy link
Collaborator

@carlopi carlopi commented Feb 2, 2026

Currently HTTPClientCaches are per FileHandle, after this PR they each FileHandle has a shared_ptr<HTTPClientCache> that allows to share the clients across the same base-url.

An LruCache is kept to reachable HTTPClientCache's, that allows repeated network call, even across FileHandle's, to avoid unnecessary network handshakes.

This builds on top of LruCache made available in duckdb/duckdb in duckdb/duckdb#20157 (and generalized in duckdb/duckdb#20757 and the infrastructure to re-initialized clients, recently touched on via #232.

Ownership model is fully RAII, thanks to shared_ptr's, note that this means that there might be still alive HTTPClientCache, say a very long lived FileHandle, but those are not reachable anymore.
This can be improved on, but for now PR to improve current status and add infrastructure.
Cache size is fixed at 256, arguably it should be configurable.

This should somewhat visibly improve performances of repeated operations to same BaseUrl(), but apart from timing this should not bring observable differences.
CI-wise, main goal here is checking whether it's a pass.

@carlopi carlopi requested a review from lnkuiper February 2, 2026 15:59
@carlopi carlopi force-pushed the get_or_create_client_cache branch from 23f1960 to eaf9ca6 Compare February 2, 2026 16:39
@carlopi carlopi force-pushed the get_or_create_client_cache branch from eaf9ca6 to 39c35ce Compare February 2, 2026 16:39
@carlopi carlopi marked this pull request as draft February 2, 2026 20:33
Copy link
Collaborator

@lnkuiper lnkuiper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I have left two minor comments below, and I have a question about testing. Our current HTTP stats (and logs) currently don't show # of handshakes:

┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││         HTTPFS HTTP Stats         ││
││                                   ││
││           in: 131.2 MiB           ││
││            out: 0 bytes           ││
││              #HEAD: 1             ││
││              #GET: 9              ││
││              #PUT: 0              ││
││              #POST: 0             ││
││             #DELETE: 0            ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘

Would it be possible to measure this and add a test so we can guarantee that this continues to work?

shared_ptr<HTTPClientCache> HTTPFileSystem::GetOrCreateClientCache(const string &path) {
lock_guard<mutex> lock(client_cache_map_lock);

if (auto existing = lru_client_cache.Get(path)) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get what you're doing here, and it looks correct, but we try to avoid assignment in if conditions because it's error-prone. Could you split this up?

SELECT count(*) FROM duckdb_logs_parsed('HTTP') WHERE request.type = 'GET' GROUP BY request.type;
----
9
8
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this change is already in a patch in the duckdb repo on v1.5-variegata. Not a problem if you add the change here too as long as CI is happy

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants