Conversation

@carlopi
Collaborator

@carlopi carlopi commented Jan 22, 2026

Test is:

CALL enable_logging('HTTP');
.timer on
FROM glob('s3://coiled-datasets/timeseries/*-years/parquet/*.parquet');
FROM duckdb_logs_parsed('HTTP') SELECT count(*);

which now takes about 2s and 4 requests (vs. 42s and 46 requests on both main and v1.4-andium).

Builds on #216, attempting to solve the comment at #216 (comment).

I think this might be slightly worse if a single glob is present in the middle of the path, but not by much, and the gains are unbounded in the case of multiple globs.

@carlopi carlopi force-pushed the faster_multi_globbing branch from 6906e6b to 269606a Compare January 22, 2026 14:49
@carlopi carlopi requested a review from Mytherin January 22, 2026 14:52
@Mytherin
Contributor

Mytherin commented Jan 22, 2026

Thanks - bringing back the delimiter is fine by me (although I don't fully understand what this is doing that is causing this speed-up).

One thing - when I run this locally I get an error:

set s3_region='us-east-2';
FROM glob('s3://coiled-datasets/timeseries/*-years/parquet/*.parquet');
HTTP Error:
HTTP GET error reading 's3://coiled-datasets/timeserieshttps://coiled-datasets.s3.us-east-2.amazonaws.com/?encoding-type=url&list-type=2&prefix=timeseries%2F&delimiter=%2F' in region 'us-east-2' (HTTP 403 Forbidden)

Authentication Failure - this is usually caused by invalid or missing credentials.
* Credentials are provided, but they did not work.
* See https://duckdb.org/docs/stable/extensions/httpfs/s3api.html

LINE 1: FROM glob('s3://coiled-datasets/timeseries/*-years/parquet/*...
             ^

There seems to be something going wrong with concatenation of URLs (s3://coiled-datasets/timeserieshttps://...). Are you seeing this locally as well?

@carlopi
Collaborator Author

carlopi commented Jan 22, 2026

I can't reproduce this. It smells like you might have credentials or a region set? That is a longstanding (and weird) issue.

Can you check SELECT current_setting('s3_access_key_id'); (and family)?

@carlopi
Collaborator Author

carlopi commented Jan 22, 2026

As to why this is faster, basically in the case of multiple folders, like:

my-dir/file1.parquet
my-dir/file2.parquet
other-dir/a/file.parquet
other-dir/a/file2.parquet
zz-dir/z/file2.parquet
other-folder/a/file2.parquet

Say you glob on *-dir/a/*.parquet. Before the PR, it would list all files in the bucket and then select the 3rd and 4th.
After the PR, it would:

  1. list everything up to the delimiter (so getting back my-dir/, other-dir/, zz-dir/ and other-folder/)
  2. iterate on my-dir, which matches; listing my-dir/a/*.parquet gets 0 files.
  3. iterate on other-dir, which matches; listing other-dir/a/*.parquet gets 2 files.
  4. iterate on zz-dir, which matches; listing zz-dir/a/*.parquet gets 0 files.
  5. iterate on other-folder, which does not match, so it is skipped.

Now that I write it down, this can be more expensive, since every level gets an extra listing. The advantage should be that those listings are parallelizable, while the single full folder scan is not.
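
To make the steps above concrete, here is a minimal Python sketch of level-by-level globbing over an in-memory list of keys. It is only an illustration, not the code in this PR: `KEYS`, `list_level` and `glob_by_level` are hypothetical names, `list_level` merely imitates a single delimiter-based listing request, and pagination and parallelism are ignored.

```python
from fnmatch import fnmatchcase

# Hypothetical stand-in for the bucket contents, taken from the example above.
KEYS = [
    "my-dir/file1.parquet",
    "my-dir/file2.parquet",
    "other-dir/a/file.parquet",
    "other-dir/a/file2.parquet",
    "zz-dir/z/file2.parquet",
    "other-folder/a/file2.parquet",
]


def list_level(keys, prefix, counter):
    """Imitate one listing with delimiter '/': files and sub-prefixes directly under prefix."""
    counter[0] += 1  # count one simulated request
    files, dirs = [], set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        head, sep, _ = key[len(prefix):].partition("/")
        if sep:
            dirs.add(prefix + head + "/")  # deeper keys collapse into a common prefix
        else:
            files.append(key)
    return files, sorted(dirs)


def glob_by_level(keys, pattern):
    """Expand the pattern one path segment at a time; return (matches, request count)."""
    counter = [0]
    *dir_segments, file_segment = pattern.split("/")
    prefixes = [""]
    for segment in dir_segments:
        if not any(c in segment for c in "*?["):
            # Literal segment: extend the prefix without issuing a listing.
            prefixes = [p + segment + "/" for p in prefixes]
            continue
        expanded = []
        for prefix in prefixes:
            _, dirs = list_level(keys, prefix, counter)
            expanded += [d for d in dirs if fnmatchcase(d[len(prefix):-1], segment)]
        prefixes = expanded
    matches = []
    for prefix in prefixes:
        files, _ = list_level(keys, prefix, counter)
        matches += [f for f in files if fnmatchcase(f[len(prefix):], file_segment)]
    return matches, counter[0]


matches, requests = glob_by_level(KEYS, "*-dir/a/*.parquet")
print(matches, requests)
```

On this toy bucket the sketch issues 4 simulated listings (one at the root, then one each for my-dir/a/, other-dir/a/ and zz-dir/a/) and returns the two files under other-dir/a/. A single full listing would be cheaper here, which is the extra cost mentioned above. The per-level approach should pay off when the non-matching prefixes hold many keys, since those keys are never listed at all.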

@carlopi carlopi marked this pull request as draft January 22, 2026 15:18
@carlopi carlopi force-pushed the faster_multi_globbing branch 4 times, most recently from 1651893 to 336ae6f Compare January 27, 2026 11:52
@carlopi carlopi force-pushed the faster_multi_globbing branch from ee13c6b to cd8c3ab Compare February 1, 2026 23:10