When wrapping an `S3FileSystem` with a `CachingFileSystem`, I am seeing quite poor performance when reading small random chunks from files on S3. After some digging, I found that this is caused by redundant fetches of the file size in the `_open` method (which in my case is called from `S3File`, and that in turn from the upstream `CachingFileSystem`).
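For reference, here is a minimal sketch of the access pattern where I see this; the bucket name, key, and offsets are placeholders:

```python
import s3fs
from fsspec.implementations.cached import CachingFileSystem

# Placeholder bucket/key, for illustration only.
s3 = s3fs.S3FileSystem(anon=True)
fs = CachingFileSystem(fs=s3, cache_storage="/tmp/fsspec-cache")

for offset in (0, 2**20, 5 * 2**20, 9 * 2**20):
    # Every open goes through CachingFileSystem._open -> S3File, and the
    # file size is fetched from S3 again each time, even though the cache
    # metadata already knows it after the first open.
    with fs.open("my-bucket/large-file.bin", "rb") as f:
        f.seek(offset)
        f.read(4096)
```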
Performance could be improved significantly in such cases if the `CachingFileSystem` knew the file size from its cached metadata and passed it on to the `_open` method, along these lines:
```python
# In CachingFileSystem._open, store the size when it is first retrieved:
detail = {
    # existing fields...
    "size": f.size,  # add the file size to the cache metadata
}

# On subsequent opens, reuse the cached size:
f = self.fs._open(
    path,
    # other params...
    size=detail.get("size"),  # pass the size to avoid a redundant lookup
)
```
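As far as I can tell, fsspec's `AbstractBufferedFile` (which `S3File` builds on) already accepts a `size` argument and skips the `info()` lookup when it is provided, so the underlying `_open` would mainly need to forward it.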
In my case, avoiding the repeated file-size lookup brings a simple test case down from 15 seconds to 1.15 seconds.
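A rough timing harness along these lines should show the difference; the path and offsets are again placeholders:

```python
import time

def time_random_reads(fs, path, offsets, nbytes=4096):
    # Time repeated open/seek/read cycles through the given filesystem.
    start = time.perf_counter()
    for off in offsets:
        with fs.open(path, "rb") as f:
            f.seek(off)
            f.read(nbytes)
    return time.perf_counter() - start
```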
If desired, I can submit a PR implementing this change.