Add prefetcher reader for standard buckets. #795
googlyrahman wants to merge 1 commit into fsspec:main
Conversation
Codecov Report: ❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #795 +/- ##
==========================================
+ Coverage 76.64% 79.78% +3.14%
==========================================
Files 14 15 +1
Lines 2663 3028 +365
==========================================
+ Hits 2041 2416 +375
+ Misses 622 612 -10
gcsfs/core.py (outdated)

    ) or os.environ.get("use_prefetch_reader", False)
    if use_prefetch_reader:
        max_prefetch_size = kwargs.get("max_prefetch_size", None)
        concurrency = kwargs.get("concurrency", 4)
let's call this prefetcher_concurrency
I think concurrency is the better term here: _cat_file also uses this parameter when the prefetcher engine is disabled, so it is not specific to the prefetcher.
gcsfs/core.py (outdated)

        await asyncio.gather(*tasks, return_exceptions=True)
        raise e

    async def _cat_file(self, path, start=None, end=None, concurrency=4, **kwargs):
let's also move this to a constant, namely DEFAULT_PREFETCHER_CONCURRENCY
Aren't we mixing concerns here? _cat_file does not necessarily use the prefetcher at all. Indeed, why is prefetcher an option, when this is a single blob read, not sequential?
let's also move this to a constant, namely DEFAULT_PREFETCHER_CONCURRENCY
Introduced a default in zb_hns_util.py, named DEFAULT_CONCURRENCY
Aren't we mixing concerns here? _cat_file does not necessarily use the prefetcher at all. Indeed, why is prefetcher an option, when this is a single blob read, not sequential?
_cat_file doesn't actually do any prefetching. We are just completing the existing call path (GCSFile -> cache -> GCSFile._fetch -> GCSFileSystem._cat_file). While it does not prefetch, it will now fetch chunks concurrently if the requested size is under 5MB.
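The concurrent chunk fetch described above can be sketched roughly as follows. This is an illustrative sketch only: the function name, the chunk size, and the `fetch_chunk` callable are placeholders, not the actual gcsfs internals.

```python
import asyncio

CHUNK_SIZE = 1024 * 1024  # hypothetical chunk size, not gcsfs's actual constant

async def cat_range_concurrently(fetch_chunk, start, end, concurrency=4):
    # Split [start, end) into fixed-size chunks and fetch them in parallel,
    # bounding in-flight requests with a semaphore sized by `concurrency`.
    sem = asyncio.Semaphore(concurrency)

    async def bounded(off):
        async with sem:
            return await fetch_chunk(off, min(off + CHUNK_SIZE, end))

    # gather() preserves task order, so the chunks concatenate correctly.
    parts = await asyncio.gather(
        *(bounded(o) for o in range(start, end, CHUNK_SIZE))
    )
    return b"".join(parts)
```

The semaphore is what the `concurrency` keyword argument would bound in such a design; a single small read degenerates to one chunk and one request.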
Force-pushed from 82240a5 to 44d34f3
        fsspec.asyn.sync(self.loop, _start)
        logger.debug("BackgroundPrefetcher initialization complete.")

    def __enter__(self):
we can remove the context manager for now and see if this is something the community needs in the future. At this point the prefetcher is inserted automatically in fetch.
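A context-manager-free lifecycle like the one suggested here can be sketched with plain start/close methods. This is a toy illustration of the pattern (a background worker started in `__init__` and stopped via `close()`), not the gcsfs BackgroundPrefetcher implementation; all names are hypothetical.

```python
import queue
import threading

class BackgroundWorkerSketch:
    # Starts its worker thread eagerly in __init__, mirroring the idea that
    # the prefetcher is inserted automatically rather than entered explicitly.
    def __init__(self):
        self._q = queue.Queue()
        self._stop = threading.Event()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        while not self._stop.is_set():
            try:
                task = self._q.get(timeout=0.1)
            except queue.Empty:
                continue
            task()  # e.g. fetch the next range in the background

    def submit(self, task):
        self._q.put(task)

    def close(self):
        # Explicit shutdown replaces __exit__; callers (or the owning file
        # object) are responsible for invoking it.
        self._stop.set()
        self._worker.join()
```

The trade-off is that without `__enter__`/`__exit__`, shutdown must be wired into whatever owns the prefetcher (for example, the file's close path).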
        This clears any previous wakeup events and spawns the main loop task.
        """
        logger.info("Starting PrefetchProducer loop.")
        Args:
            value (int): The integer value to add to the history.
        """
        if value <= 0:
let's remove this and instead let the error pop up here
    use_prefetch_reader = kwargs.get(
        "use_experimental_adaptive_prefetching", False
    ) or os.environ.get(
        "use_experimental_adaptive_prefetching", "false"
let's capitalize this, to be consistent with the naming convention of other env variables
Can someone please explain why we cannot use normal cache_type= and cache_options= alone rather than having to invent a set of new environment variables (not to mention the extra kwargs)?
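For context on this question, fsspec's existing `cache_type=` mechanism works by wrapping a file's range reads in a cache object chosen by name. A toy readahead cache (purely illustrative, not fsspec's implementation; `fetcher` stands in for a remote range read) behaves roughly like:

```python
class ReadaheadCacheSketch:
    # On a cache miss, fetch a whole block starting at the requested offset,
    # so subsequent sequential reads are served from memory.
    # Assumes each read length is <= block_size.
    def __init__(self, fetcher, block_size):
        self.fetcher = fetcher
        self.block_size = block_size
        self.start = self.end = 0
        self.buf = b""

    def read(self, offset, length):
        if not (self.start <= offset and offset + length <= self.end):
            # Miss: pull a full block ahead of the requested range.
            self.buf = self.fetcher(offset, offset + self.block_size)
            self.start, self.end = offset, offset + self.block_size
        rel = offset - self.start
        return self.buf[rel : rel + length]
```

The reviewer's point is that selecting such behavior per file via `cache_type=` and `cache_options=` is the established extension point, rather than new env variables and kwargs.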
    # there currently causes instantiation errors. We are holding off on introducing
    # them as explicit keyword arguments to ensure existing user workloads are not
    # disrupted. This will be refactored once the upstream `fsspec` changes are merged.
    use_prefetch_reader = kwargs.get(
let's only use the env variable for the flag, not kwargs
    ) or os.environ.get(
        "use_experimental_adaptive_prefetching", "false"
    ).lower() in (
        "true",
"true", "1" should be fine here
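The suggested truthiness check could be factored into a small helper along these lines. This is a sketch, not code from the PR; the helper name and the capitalized variable name are assumptions following the naming suggestion above.

```python
import os

def env_flag(name, default=False):
    # Treat "true" and "1" (case-insensitive) as enabled,
    # matching the review suggestion; anything else is disabled.
    val = os.environ.get(name)
    if val is None:
        return default
    return val.strip().lower() in ("true", "1")
```

Used as `env_flag("USE_EXPERIMENTAL_ADAPTIVE_PREFETCHING")`, this keeps the flag parsing in one place instead of inlining the `.lower() in (...)` chain at each call site.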
Description generated by AI
Asynchronous Background Prefetcher
A new BackgroundPrefetcher class has been implemented in gcsfs/prefetcher.py. This component is designed to:
Core Refactoring for Concurrency
The file-fetching logic in gcsfs/core.py has been refactored to enable parallel downloads: