# Rationale for this change
`FsspecFileIO.get_fs` can be called by multiple threads when
`ExecutorFactory` is used (for example by `DataScan.plan_files`).
The base class of `fsspec` filesystem objects,
`fsspec.spec.AbstractFileSystem`, internally caches instances through
the `fsspec.spec._Cached` metaclass. The caching key used includes
`threading.get_ident()`, making entries thread-local:
https://github.com/fsspec/filesystem_spec/blob/f84b99f0d1f079f990db1a219b74df66ab3e7160/fsspec/spec.py#L71
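To illustrate what a thread-keyed instance cache looks like, here is a simplified sketch. It is not `fsspec`'s actual implementation: `CachedPerThread` and `DemoFileSystem` are hypothetical names, and the real cache key contains more than the constructor arguments and the thread identifier.

```python
import threading


class CachedPerThread(type):
    """Hypothetical metaclass: caches instances keyed by arguments and thread."""

    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        cls._instances = {}
        cls._instances_lock = threading.Lock()

    def __call__(cls, *args, **kwargs):
        # Including threading.get_ident() in the key makes entries thread-local.
        key = (args, tuple(sorted(kwargs.items())), threading.get_ident())
        with cls._instances_lock:
            if key not in cls._instances:
                cls._instances[key] = super().__call__(*args, **kwargs)
            return cls._instances[key]


class DemoFileSystem(metaclass=CachedPerThread):
    """Hypothetical stand-in for a filesystem class."""

    def __init__(self, bucket):
        self.bucket = bucket


# Two constructions in the same thread return the cached instance.
main_a = DemoFileSystem("bucket")
main_b = DemoFileSystem("bucket")

# A construction in another (concurrently existing) thread gets its own instance.
worker_result = []
worker = threading.Thread(target=lambda: worker_result.append(DemoFileSystem("bucket")))
worker.start()
worker.join()
```

Here `main_a` and `main_b` are the same object, while the instance created in the worker thread is a different one.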
The `FsspecFileIO.get_fs` LRU cache (around `FsspecFileIO._get_fs`)
breaks the thread-locality of the filesystem instances, because it returns
the same cached instance to every thread that calls it.
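The effect can be reproduced with the standard library alone. In this minimal sketch, `FakeFileSystem` and the module-level `get_fs` are hypothetical stand-ins, not the PyIceberg code; the cache is primed in the main thread so that every worker call is a cache hit.

```python
import threading
from functools import lru_cache


class FakeFileSystem:
    """Hypothetical stand-in for an fsspec filesystem object."""

    def __init__(self, scheme):
        self.scheme = scheme


@lru_cache
def get_fs(scheme):
    # A process-wide cache: the key does not include the calling thread.
    return FakeFileSystem(scheme)


def collect_instances(n_threads=4):
    """Call get_fs from several threads and return the instances they saw."""
    seen = []
    lock = threading.Lock()

    def worker():
        fs = get_fs("s3")
        with lock:
            seen.append(fs)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen


# Prime the cache so every worker call below is a cache hit.
main_fs = get_fs("s3")
shared = collect_instances()
```

Every entry in `shared` is the very object created in the main thread, so any per-instance state (such as a client session) ends up shared across all threads.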
One consequence of this is that for `s3fs.S3FileSystem`, HTTP connection
pooling no longer occurs per thread (as is normal with `aiobotocore`),
as the `aiobotocore` client object (containing the
`aiohttp.ClientSession`) is stored on the `s3fs.S3FileSystem`.
This PR fixes that by making the `FsspecFileIO.get_fs` cache
thread-local.
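One way to make such a cache thread-local is to store a separate `lru_cache` wrapper on a `threading.local` object. The sketch below shows that general technique under stated assumptions; it is not necessarily how this PR implements it, and `ThreadLocalFsCache`, `FakeFileSystem`, and the attribute names are hypothetical.

```python
import threading
from functools import lru_cache


class FakeFileSystem:
    """Hypothetical stand-in for an fsspec filesystem object."""

    def __init__(self, scheme):
        self.scheme = scheme


class ThreadLocalFsCache:
    """Hypothetical holder of a per-thread LRU cache around _get_fs."""

    def __init__(self):
        self._thread_locals = threading.local()

    def _get_fs(self, scheme):
        return FakeFileSystem(scheme)

    def get_fs(self, scheme):
        # Lazily create one lru_cache wrapper per thread.
        if not hasattr(self._thread_locals, "cached_get_fs"):
            self._thread_locals.cached_get_fs = lru_cache(self._get_fs)
        return self._thread_locals.cached_get_fs(scheme)


def instances_per_thread(cache, n_threads=4):
    """Return (first, second) lookups made by each worker thread."""
    results = []
    lock = threading.Lock()

    def worker():
        first = cache.get_fs("s3")
        second = cache.get_fs("s3")
        with lock:
            results.append((first, second))

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


results = instances_per_thread(ThreadLocalFsCache())
```

Within a thread, repeated lookups return the same instance, while each thread receives an instance of its own.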
## Are these changes tested?
Tested locally. Unit test included covering the caching behaviour.
## Are there any user-facing changes?
Yes - S3 HTTP connection pooling now occurs per thread, matching the
behaviour of `aiobotocore` when it is used in the recommended way with an
event loop per thread.