
_glob performs full directory/bucket scans - unnecessarily #1995

@dhrp

Description


TL;DR: Globbing with fsspec in combination with s3fs or gcsfs is unnecessarily slow. The proposed fix is to pass the filename stem as a prefix= hint to _find, enabling server-side filtering.

Background

When _glob is called with a pattern like gs://my-bucket/data/2024/somefile.* or gs://my-bucket/somefile.*, it calls _find(root, ...) with no filtering hint. The backend must list every object under root and return the full result set to Python, where the glob pattern is then applied in-memory. For a bucket with hundreds of thousands of objects in one "folder" this means transferring a full listing across many page requests and loading every result into memory, even when only a tiny fraction of the results is relevant.

We noticed this degradation as our bucket on GCS grew in size.

Proposed fix

Extract the literal stem between the last / and the first wildcard character (*, ?, [), and pass it as prefix= in the kwargs forwarded to _find. Backends that understand prefix= (gcsfs, s3fs, adlfs) use it to filter the listing server-side via the storage API's ?prefix= parameter. Backends that don't understand it receive it in **kwargs and silently ignore it — no behaviour change for them.
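The stem extraction described above could be sketched like this (a hypothetical helper, not the actual patch; the function name is my own):

```python
import re

def glob_prefix(pattern: str) -> str:
    """Extract the literal stem between the last '/' and the first
    wildcard character ('*', '?', '['), usable as a prefix= hint."""
    match = re.search(r"[*?\[]", pattern)
    if match is None:
        return ""  # no wildcard: nothing to hint
    # Everything before the first wildcard, then keep only the
    # final path segment as the prefix candidate.
    head = pattern[: match.start()]
    return head.rsplit("/", 1)[-1]
```

For gs://my-bucket/data/2024/somefile.* this yields "somefile.", while a pattern whose wildcard starts a path segment (e.g. data/*/x) yields an empty prefix, correctly falling back to an unfiltered listing.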

gcsfs already concatenates the path and the prefix like so:

async def _do_list_objects(self, path, ..., prefix="", ...):
    bucket, _path, generation = self.split_path(path)
    _path = "" if not _path else _path.rstrip("/") + "/"
    prefix = f"{_path}{prefix}" or None
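A simplified, self-contained restatement of that concatenation (not gcsfs's actual code) shows the resulting object-listing prefix:

```python
def combined_prefix(_path: str, prefix: str):
    """Mirror gcsfs's path + prefix concatenation (simplified sketch)."""
    _path = "" if not _path else _path.rstrip("/") + "/"
    # Empty path and empty prefix collapse to None (no filtering).
    return f"{_path}{prefix}" or None
```

So a glob rooted at my-bucket/data/2024 with stem "somefile." would list objects under the API prefix "data/2024/somefile.".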

Compatibility

I reviewed all the drivers (with AI assistance). The built-in drivers (http, reference, dirfs, memory, ftp, sftp, git, etc.) and the external drivers sshfs and dropboxdrivefs absorb the argument silently via **kwargs, while gcsfs (Google Cloud Storage) and adlfs (Azure Data Lake Storage) actively support this prefix usage.

The important exception is s3fs. Its _find override explicitly raises ValueError when prefix is combined with withdirs or maxdepth:

# s3fs/core.py
if (withdirs or maxdepth) and prefix:
    raise ValueError(
        "Can not specify 'prefix' option alongside 'withdirs'/'maxdepth' options."
    )

Since _glob always calls _find(..., withdirs=True, ...) internally, this guard would fire on every prefixed glob against s3fs, so a companion fix in s3fs would be needed. Interestingly, there is already a # TODO: perhaps propagate these to a glob(f"path/{prefix}*") call comment at that exact line in s3fs, suggesting the guard was always known to be provisional.
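Until s3fs is patched, the caller could be made defensive. A hypothetical sketch (not the actual fsspec patch; find_with_prefix and s3fs_like_find are illustrative stand-ins):

```python
def find_with_prefix(find, path, prefix, **kwargs):
    """Try the server-side prefix= hint first; fall back to an
    unfiltered listing plus client-side filtering if the backend
    rejects the combination, as s3fs currently does."""
    try:
        return find(path, prefix=prefix, **kwargs)
    except ValueError:
        names = find(path, **kwargs)
        return [n for n in names if n.rsplit("/", 1)[-1].startswith(prefix)]

def s3fs_like_find(path, prefix=None, withdirs=False):
    """Stand-in mimicking s3fs's guard against prefix + withdirs."""
    if withdirs and prefix:
        raise ValueError(
            "Can not specify 'prefix' option alongside 'withdirs'/'maxdepth' options."
        )
    names = ["data/a.txt", "data/somefile.csv", "data/somefile.json"]
    if prefix:
        names = [n for n in names if n.rsplit("/", 1)[-1].startswith(prefix)]
    return names
```

This keeps correctness on every backend, at the cost of one wasted round trip when the guard fires; fixing s3fs directly would be the cleaner route.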

Review

I could not find any existing issues or PRs on either repo that describe this problem, and I think implementing this fix could save a lot of resources.

Our (modest) bucket holds 27,000 files; a single glob triggers 27 GET requests and takes about 5 seconds to find one file.
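A toy model of the page arithmetic (the 27,000-object count is from our bucket; the 1,000-per-page size is assumed from the observed 27 requests):

```python
def list_pages(listing, page_size=1_000):
    """Number of paged list requests needed to return `listing`."""
    return -(-len(listing) // page_size)  # ceiling division

# 26,999 irrelevant objects plus the one file the glob is after.
objects = [f"data/file{i:05d}.csv" for i in range(26_999)] + ["data/target.txt"]

full_scan = list_pages(objects)  # 27 pages, matching the observed GETs
filtered = list_pages([o for o in objects if o.startswith("data/target")])  # 1 page
```

With a server-side prefix hint the same glob would need a single list request instead of 27.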

Contributing this

I've already patched asyn.py locally; it's actually a small fix. It resolves our issue on gcsfs and all tests still pass, but I'd need some guidance to understand whether other (external) integrations might break because of this. And of course a companion patch to s3fs would also be needed.
