-
Notifications
You must be signed in to change notification settings - Fork 57
HTTPClient caching across FileHandles #239
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
23f1960 to
eaf9ca6
Compare
eaf9ca6 to
39c35ce
Compare
lnkuiper
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! I have left two minor comments below, and I have a question about testing. Our current HTTP stats (and logs) currently don't show # of handshakes:
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││ HTTPFS HTTP Stats ││
││ ││
││ in: 131.2 MiB ││
││ out: 0 bytes ││
││ #HEAD: 1 ││
││ #GET: 9 ││
││ #PUT: 0 ││
││ #POST: 0 ││
││ #DELETE: 0 ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
Would it be possible to measure this and add a test so we can guarantee that this continues to work?
| shared_ptr<HTTPClientCache> HTTPFileSystem::GetOrCreateClientCache(const string &path) { | ||
| lock_guard<mutex> lock(client_cache_map_lock); | ||
|
|
||
| if (auto existing = lru_client_cache.Get(path)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get what you're doing here, and it looks correct, but we try to avoid assignment in if conditions because it's error-prone. Could you split this up?
| SELECT count(*) FROM duckdb_logs_parsed('HTTP') WHERE request.type = 'GET' GROUP BY request.type; | ||
| ---- | ||
| 9 | ||
| 8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this change is already in a patch in the duckdb repo on v1.5-variegata. Not a problem if you add the change here too as long as CI is happy
Currently
HTTPClientCaches are per FileHandle, after this PR they each FileHandle has ashared_ptr<HTTPClientCache>that allows to share the clients across the same base-url.An LruCache is kept to reachable
HTTPClientCache's, that allows repeated network call, even acrossFileHandle's, to avoid unnecessary network handshakes.This builds on top of LruCache made available in
duckdb/duckdbin duckdb/duckdb#20157 (and generalized in duckdb/duckdb#20757 and the infrastructure to re-initialized clients, recently touched on via #232.Ownership model is fully RAII, thanks to
shared_ptr's, note that this means that there might be still aliveHTTPClientCache, say a very long livedFileHandle, but those are not reachable anymore.This can be improved on, but for now PR to improve current status and add infrastructure.
Cache size is fixed at 256, arguably it should be configurable.
This should somewhat visibly improve performances of repeated operations to same
BaseUrl(), but apart from timing this should not bring observable differences.CI-wise, main goal here is checking whether it's a pass.