CachingFileSystem combined with S3FileSystem redundantly fetches file size from S3 #1832

@terraputix

Description

When wrapping an S3FileSystem with a CachingFileSystem, I am seeing quite poor performance when reading small random chunks from files on S3.
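
To make the setup concrete, here is a minimal sketch of what I mean ("blockcache" is the registered name for CachingFileSystem; bucket name, object key, and offsets are placeholders):

import fsspec

# Hypothetical setup: CachingFileSystem ("blockcache") wrapping an S3FileSystem.
fs = fsspec.filesystem(
    "blockcache",
    target_protocol="s3",
    cache_storage="/tmp/fsspec-cache",
)

# Read a small chunk; each open() ends up constructing an S3File,
# which fetches the file size from S3 again even if it is already known.
with fs.open("my-bucket/large-object.bin", "rb") as f:
    f.seek(123_456_789)
    chunk = f.read(4096)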

After some digging, I found that this is due to redundant fetches of the file size in the _open method (which in my case is called from S3File, which in turn is called from the upstream CachingFileSystem).

Performance could be improved significantly in such cases if the CachingFileSystem stored the file size in its cached metadata and passed it on to the _open method, like this:

# In CachingFileSystem._open, store size when first retrieved:
detail = {
    # existing fields...
    "size": f.size  # Add file size to cache metadata
}

# Then when opening again, use the cached size:
f = self.fs._open(
    path,
    # other params...
    size=detail.get("size"),  # Pass size to avoid redundant lookup
)
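
For context: as far as I can tell, fsspec's AbstractBufferedFile (which S3File subclasses) already accepts an optional size argument and skips the size lookup when it is given, so the change above would mostly consist of recording the size in the cache metadata and having the wrapped filesystem's _open forward the keyword.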

In my case, avoiding the repeated file-size lookup improves a simple test case from 15 seconds to 1.15 seconds.
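
A timing harness along these lines reproduces the effect (bucket name, object key, and chunk parameters are placeholders, not my actual test):

import random
import time

import fsspec

fs = fsspec.filesystem(
    "blockcache",
    target_protocol="s3",
    cache_storage="/tmp/fsspec-cache",
)
path = "my-bucket/large-object.bin"
file_size = fs.size(path)

start = time.perf_counter()
for _ in range(100):
    offset = random.randrange(file_size - 4096)
    # Every open() goes through CachingFileSystem._open -> S3FileSystem._open,
    # which re-fetches the file size unless it is passed in.
    with fs.open(path, "rb") as f:
        f.seek(offset)
        f.read(4096)
print(f"{time.perf_counter() - start:.2f}s for 100 random 4 KiB reads")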

If desired, I can submit a PR implementing this change.
