Add back handling for use_delimiter, making globs with multiple * faster #218

carlopi · 2026-01-22T14:43:27Z

Test is:

CALL enable_logging('HTTP');
.timer on
FROM glob('s3://coiled-datasets/timeseries/*-years/parquet/*.parquet');
FROM duckdb_logs_parsed('HTTP') SELECT count(*);

This now takes about 2s and 4 requests (vs. 42s and 46 requests on both main and v1.4-andium).

Builds on #216, attempting to solve the comment at #216 (comment).

I think this might be slightly worse if there is a single glob in the middle of the path, but not by much, and the gains are unbounded when there are multiple globs.

Mytherin · 2026-01-22T14:57:46Z

Thanks - bringing back the delimiter is fine by me (although I don't fully understand what this is doing that is causing this speed-up).

One thing - when I run this locally I get an error:

set s3_region='us-east-2';
FROM glob('s3://coiled-datasets/timeseries/*-years/parquet/*.parquet');

HTTP Error:
HTTP GET error reading 's3://coiled-datasets/timeserieshttps://coiled-datasets.s3.us-east-2.amazonaws.com/?encoding-type=url&list-type=2&prefix=timeseries%2F&delimiter=%2F' in region 'us-east-2' (HTTP 403 Forbidden)

Authentication Failure - this is usually caused by invalid or missing credentials.
* Credentials are provided, but they did not work.
* See https://duckdb.org/docs/stable/extensions/httpfs/s3api.html

LINE 1: FROM glob('s3://coiled-datasets/timeseries/*-years/parquet/*...
             ^

There seems to be something going wrong with concatenation of URLs (s3://coiled-datasets/timeserieshttps://...). Are you seeing this locally as well?

carlopi · 2026-01-22T15:02:29Z

I can't reproduce this. It smells like you might have credentials or a region set? That is a longstanding (and weird) issue.

Can you check SELECT current_setting('s3_access_key_id'); (and family)?

carlopi · 2026-01-22T15:09:02Z

As to why this is faster, basically in the case of multiple folders, like:

my-dir/file1.parquet
my-dir/file2.parquet
other-dir/a/file.parquet
other-dir/a/file2.parquet
zz-dir/z/file2.parquet
other-folder/a/file2.parquet

Say that you glob on *-dir/a/*.parquet. Before the PR, it would list all files in the bucket and then select the 3rd and 4th.
After the PR, it would:

list everything up to the delimiter (so getting back the *-dir/ prefixes + other-folder/)
iterate on my-dir, which matches; listing my-dir/a/*.parquet gets 0 files
iterate on other-dir, which matches; listing other-dir/a/*.parquet gets 2 files
iterate on zz-dir, which matches; listing zz-dir/a/*.parquet gets 0 files
iterate on other-folder, which does not match, so it is skipped

Now that I write it down, it can be more expensive, since every level gets an extra listing. The advantage should be that those listings are parallelizable, while the single full folder scan is not.
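
The walk-through above can be sketched as a small in-memory simulation. This is only an illustration under stated assumptions, not the extension's code: `list_prefix` is a hypothetical stand-in for a single S3 listing request with `delimiter=/` (which returns the keys directly under a prefix plus the common sub-prefixes), and `glob_with_delimiter` and `glob_full_listing` are hypothetical helper names. Literal path segments such as `a` are assumed to be appended to the prefix without an extra listing, as in the steps above.

```python
from fnmatch import fnmatchcase

# Example keys from the walk-through above.
KEYS = [
    "my-dir/file1.parquet",
    "my-dir/file2.parquet",
    "other-dir/a/file.parquet",
    "other-dir/a/file2.parquet",
    "zz-dir/z/file2.parquet",
    "other-folder/a/file2.parquet",
]


def list_prefix(keys, prefix, delimiter="/"):
    """Hypothetical stand-in for ONE delimiter-based listing request.

    Returns (files, dirs): keys directly under `prefix`, and the distinct
    sub-prefixes up to the next delimiter.
    """
    files, dirs = [], set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        head, sep, _ = key[len(prefix):].partition(delimiter)
        if sep:
            dirs.add(prefix + head + delimiter)
        else:
            files.append(key)
    return files, sorted(dirs)


def glob_with_delimiter(keys, pattern):
    """Expand the glob one path segment at a time, counting listings."""
    *dir_segments, file_segment = pattern.split("/")
    prefixes, requests = [""], 0
    for segment in dir_segments:
        if not any(c in segment for c in "*?["):
            # Literal segment: extend the prefix, no listing needed.
            prefixes = [p + segment + "/" for p in prefixes]
            continue
        matched = []
        for prefix in prefixes:
            _, dirs = list_prefix(keys, prefix)
            requests += 1
            matched += [d for d in dirs
                        if fnmatchcase(d[len(prefix):-1], segment)]
        prefixes = matched
    matches = []
    for prefix in prefixes:
        # These per-prefix listings are independent of each other.
        files, _ = list_prefix(keys, prefix)
        requests += 1
        matches += [f for f in files
                    if fnmatchcase(f[len(prefix):], file_segment)]
    return sorted(matches), requests


def glob_full_listing(keys, pattern):
    """Pre-PR behaviour in this model: scan every key, then filter."""
    segments = pattern.split("/")
    return sorted(
        key for key in keys
        if len(key.split("/")) == len(segments)
        and all(fnmatchcase(p, s) for p, s in zip(key.split("/"), segments))
    )


matches, requests = glob_with_delimiter(KEYS, "*-dir/a/*.parquet")
print(matches, requests)
```

In this toy model the delimiter-based walk issues one listing at the top level and one per matching `*-dir` prefix, and it returns the same two files as the full scan. On a real bucket the full scan is paginated, so its cost grows with the total number of keys, while the per-prefix listings could in principle be issued concurrently.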

carlopi force-pushed the faster_multi_globbing branch from 6906e6b to 269606a Compare January 22, 2026 14:49

carlopi requested a review from Mytherin January 22, 2026 14:52

carlopi marked this pull request as draft January 22, 2026 15:18

carlopi force-pushed the faster_multi_globbing branch 4 times, most recently from 1651893 to 336ae6f Compare January 27, 2026 11:52

carlopi added 8 commits February 1, 2026 23:53

Restore support for optional use_delimiter, sort req_params

5b15ceb

Faster globbing

89c1d20

Add completed param to Match

9cd4cbf

More fixes

05ee78e

Enforce ordering of URL paramters via map

42630d4

More fixes

238c7c6

Adapt glob tests, via

1c08038

Decode path

cd8c3ab

carlopi force-pushed the faster_multi_globbing branch from ee13c6b to cd8c3ab Compare February 1, 2026 23:10
