potential performance improvement for GSPath globbing capabilities #513

@fafnirZ

Hey, I've been using GSPath's globbing capabilities to glob over a fairly large GCS bucket (a couple of GB) and have noticed that it takes a lot longer than an equivalent google-cloud-storage implementation:

list_blobs(match_glob="**/version_1/**")

Furthermore, with Task Manager open while globbing the bucket, I observe a significantly higher network footprint with GSPath than with the list_blobs implementation.

My guess is that cloudpathlib may be sending more network requests than necessary (correct me if I'm wrong).

Any reasons why we don't just leverage the match_glob arg for GSPath's glob capabilities?

GCloud SDK list_blobs(match_glob="") reference below:
https://github.com/googleapis/python-storage/blob/main/google/cloud/storage/bucket.py#L1407

I believe GSPath("/path/to/folder/").glob("**/version_1/**") can be translated to list_blobs as follows:
list_blobs(prefix="/path/to/folder", match_glob="**/version_1/**")
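A minimal sketch of that translation, not cloudpathlib's actual implementation. The helper names `build_list_blobs_kwargs` and `glob_blob_names` are hypothetical. It differs from the one-liner above in one respect: it strips the leading slash and prepends the folder to the pattern, on the assumption that `match_glob` is matched against the full object name rather than relative to `prefix`. That assumption, and whether GCS glob semantics line up with pathlib-style `glob`, would need checking before relying on this.

```python
def build_list_blobs_kwargs(folder: str, pattern: str) -> dict:
    """Translate a folder plus a glob pattern into list_blobs keyword arguments.

    Assumption: match_glob is evaluated against the full object name,
    so the folder prefix is prepended to the pattern.
    """
    # Object names are treated as having no leading slash; normalise the
    # folder to "path/to/folder/" (or "" for the bucket root).
    prefix = folder.strip("/")
    prefix = f"{prefix}/" if prefix else ""
    return {"prefix": prefix, "match_glob": f"{prefix}{pattern}"}


def glob_blob_names(client, bucket_name: str, folder: str, pattern: str) -> list:
    """Return the names of blobs matching the pattern, filtered server-side.

    `client` is expected to behave like google.cloud.storage.Client,
    i.e. expose list_blobs(bucket_or_name, prefix=..., match_glob=...).
    """
    kwargs = build_list_blobs_kwargs(folder, pattern)
    return [blob.name for blob in client.list_blobs(bucket_name, **kwargs)]


# Usage (requires google-cloud-storage and credentials; bucket name is a placeholder):
#   from google.cloud import storage
#   names = glob_blob_names(storage.Client(), "my-bucket", "/path/to/folder/", "**/version_1/**")
```

The point of going through `match_glob` is that filtering happens on the server, so only matching objects should come back over the network.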

Happy to submit something if you would like this change incorporated :)
