CachingFileSystem combined with S3FileSystem redundantly fetches file size from S3 #1832

@terraputix

Description

When wrapping an S3FileSystem with a CachingFileSystem, I am seeing quite poor performance when reading small random chunks from files on S3.
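
To make the setup concrete, here is a minimal sketch of what I mean ("blockcache" is the registered name for CachingFileSystem; bucket name, object key, and offsets are placeholders):

import fsspec

# Hypothetical setup: CachingFileSystem ("blockcache") wrapping an S3FileSystem.
fs = fsspec.filesystem(
    "blockcache",
    target_protocol="s3",
    cache_storage="/tmp/fsspec-cache",
)

# Read a small chunk; each open() ends up constructing an S3File,
# which fetches the file size from S3 again even if it is already known.
with fs.open("my-bucket/large-object.bin", "rb") as f:
    f.seek(123_456_789)
    chunk = f.read(4096)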

After some digging, I found that this is due to redundant fetches of the file size in the _open method (which in my case is called from S3File, which in turn is called from the upstream CachingFileSystem).

Performance could be improved significantly in such cases if the CachingFileSystem stored the file size in its cached metadata and passed it on to the _open method, like this:

# In CachingFileSystem._open, store size when first retrieved:
detail = {
    # existing fields...
    "size": f.size  # Add file size to cache metadata
}

# Then when opening again, use the cached size:
f = self.fs._open(
    path,
    # other params...
    size=detail.get("size"),  # Pass size to avoid redundant lookup
)
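
For context: as far as I can tell, fsspec's AbstractBufferedFile (which S3File subclasses) already accepts an optional size argument and skips the size lookup when it is given, so the change above would mostly consist of recording the size in the cache metadata and having the wrapped filesystem's _open forward the keyword.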

In my case, avoiding the repeated file-size lookup improves a simple test case from 15 seconds to 1.15 seconds.
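
A timing harness along these lines reproduces the effect (bucket name, object key, and chunk parameters are placeholders, not my actual test):

import random
import time

import fsspec

fs = fsspec.filesystem(
    "blockcache",
    target_protocol="s3",
    cache_storage="/tmp/fsspec-cache",
)
path = "my-bucket/large-object.bin"
file_size = fs.size(path)

start = time.perf_counter()
for _ in range(100):
    offset = random.randrange(file_size - 4096)
    # Every open() goes through CachingFileSystem._open -> S3FileSystem._open,
    # which re-fetches the file size unless it is passed in.
    with fs.open(path, "rb") as f:
        f.seek(offset)
        f.read(4096)
print(f"{time.perf_counter() - start:.2f}s for 100 random 4 KiB reads")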

If desired, I can submit a PR implementing this change.
