
Add prefetcher reader for standard buckets. #795

Open

googlyrahman wants to merge 1 commit into fsspec:main from ankitaluthra1:regional

Conversation

@googlyrahman
Contributor

Description generated by AI

Asynchronous Background Prefetcher

A new BackgroundPrefetcher class has been implemented in gcsfs/prefetcher.py. This component is designed to:

  • Proactively Fetch Data: It spawns a background producer task that fetches sequential blocks of data before they are explicitly requested.
  • Adaptive Blocksize: The engine dynamically adjusts its blocksize based on the history of requested read sizes.
  • Sequential Streak Detection: Prefetching is triggered after detecting a "streak" of sequential reads.
  • Optimized Slicing: Uses ctypes for a fast, low-overhead slice implementation (_fast_slice) to manage internal buffers.
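
The streak-detection and adaptive-blocksize behavior described above can be sketched roughly as follows. Note this is an illustrative model only: `StreakDetector`, `adaptive_blocksize`, and the specific thresholds are assumptions for exposition, not the names or values used in gcsfs/prefetcher.py.

```python
from collections import deque


class StreakDetector:
    """Toy model of sequential-streak detection: prefetching is enabled
    only after `threshold` consecutive reads pick up exactly where the
    previous read ended."""

    def __init__(self, threshold=2):
        self.threshold = threshold
        self.streak = 0
        self.next_expected = None

    def record(self, start, end):
        if start == self.next_expected:
            self.streak += 1
        else:
            self.streak = 0  # a random access resets the streak
        self.next_expected = end
        return self.streak >= self.threshold  # True -> start prefetching


def adaptive_blocksize(read_sizes, minimum=256 * 1024, maximum=32 * 1024 * 1024):
    """Toy adaptive blocksize: average the recent request sizes and
    clamp the result into [minimum, maximum]."""
    recent = deque(read_sizes, maxlen=8)  # keep only recent history
    avg = sum(recent) // len(recent)
    return max(minimum, min(maximum, avg))
```

With `threshold=2`, two back-to-back sequential reads after the first one would flip the detector on, and a seek anywhere else turns it off again.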

Core Refactoring for Concurrency

The file-fetching logic in gcsfs/core.py has been refactored to enable parallel downloads:

  • _cat_file Decomposition: _cat_file is now split into _cat_file_sequential and _cat_file_concurrent.
  • Threshold-Based Routing: Concurrent fetching is automatically utilized when the requested data size exceeds MIN_CHUNK_SIZE_FOR_CONCURRENCY (defaulting to 5MB) and multiple concurrency slots are requested.
  • Integration: The GCSFile object now optionally initializes the _prefetch_engine when the use_prefetch_reader flag is provided.
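
The threshold-based routing can be sketched as below. MIN_CHUNK_SIZE_FOR_CONCURRENCY matches the constant named above, but `choose_fetch_strategy` and the exact condition are an assumption about the routing logic, not the PR's code.

```python
# Name taken from the PR description; 5 MB is the stated default.
MIN_CHUNK_SIZE_FOR_CONCURRENCY = 5 * 1024 * 1024


def choose_fetch_strategy(start, end, concurrency):
    """Route a byte-range request to sequential or concurrent fetching:
    go concurrent only when the range is large enough to amortize the
    extra requests and more than one slot was asked for."""
    size = end - start
    if size > MIN_CHUNK_SIZE_FOR_CONCURRENCY and concurrency > 1:
        return "concurrent"
    return "sequential"
```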

@googlyrahman googlyrahman changed the title Add prefetcher engine for regional buckets. Add prefetcher reader for regional buckets. Mar 30, 2026
@codecov

codecov bot commented Mar 30, 2026

Codecov Report

❌ Patch coverage is 99.17808% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.78%. Comparing base (3121be9) to head (5919b67).
⚠️ Report is 5 commits behind head on main.

Files with missing lines | Patch % | Lines
gcsfs/prefetcher.py | 99.05% | 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #795      +/-   ##
==========================================
+ Coverage   76.64%   79.78%   +3.14%     
==========================================
  Files          14       15       +1     
  Lines        2663     3028     +365     
==========================================
+ Hits         2041     2416     +375     
+ Misses        622      612      -10     

☔ View full report in Codecov by Sentry.
@googlyrahman googlyrahman marked this pull request as ready for review March 30, 2026 07:38
@googlyrahman googlyrahman changed the title Add prefetcher reader for regional buckets. Add prefetcher reader for standard buckets. Mar 30, 2026
gcsfs/core.py Outdated
) or os.environ.get("use_prefetch_reader", False)
if use_prefetch_reader:
max_prefetch_size = kwargs.get("max_prefetch_size", None)
concurrency = kwargs.get("concurrency", 4)
Contributor

let's call this prefetcher_concurrency

Contributor Author

I think concurrency is the better term here, because _cat_file also uses this concurrency parameter when the prefetcher engine is disabled, so it is not specific to the prefetcher.

gcsfs/core.py Outdated
await asyncio.gather(*tasks, return_exceptions=True)
raise e

async def _cat_file(self, path, start=None, end=None, concurrency=4, **kwargs):
Contributor

let's also move this to constant namely DEFAULT_PREFETCHER_CONCURRENCY

Member

Aren't we mixing concerns here? _cat_file does not necessarily use the prefetcher at all. Indeed, why is prefetcher an option, when this is a single blob read, not sequential?

Contributor Author

> let's also move this to constant namely DEFAULT_PREFETCHER_CONCURRENCY

Introduced a default in zb_hns_util.py named DEFAULT_CONCURRENCY.

> Aren't we mixing concerns here? _cat_file does not necessarily use the prefetcher at all. Indeed, why is prefetcher an option, when this is a single blob read, not sequential?

_cat_file doesn't actually do any prefetching. We are just completing the existing call path (GCSFile -> cache -> GCSFile._fetch -> GCSFileSystem._cat_file). While it does not prefetch, it will now fetch chunks concurrently if the requested size exceeds 5MB.
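
A minimal sketch of concurrent chunked fetching along the lines described in this thread. `fetch_range`, the `CHUNK` constant, and the semaphore cap are hypothetical stand-ins for illustration, not the PR's actual helpers.

```python
import asyncio

# Illustrative chunk size; the PR's real threshold is
# MIN_CHUNK_SIZE_FOR_CONCURRENCY in gcsfs/core.py.
CHUNK = 5 * 1024 * 1024


async def fetch_range(fetch_chunk, start, end, concurrency=4):
    """Fetch [start, end) as fixed-size chunks in parallel and
    reassemble them in order. `fetch_chunk(s, e)` stands in for a
    single ranged GET against object storage."""
    ranges = [(s, min(s + CHUNK, end)) for s in range(start, end, CHUNK)]
    sem = asyncio.Semaphore(concurrency)

    async def bounded(s, e):
        async with sem:  # cap in-flight requests at `concurrency`
            return await fetch_chunk(s, e)

    parts = await asyncio.gather(*(bounded(s, e) for s, e in ranges))
    return b"".join(parts)  # gather preserves submission order
```

Because `asyncio.gather` returns results in submission order, the chunks can be joined directly without tracking offsets.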

@googlyrahman googlyrahman force-pushed the regional branch 5 times, most recently from 82240a5 to 44d34f3 on March 31, 2026 21:48
fsspec.asyn.sync(self.loop, _start)
logger.debug("BackgroundPrefetcher initialization complete.")

def __enter__(self):
Contributor

we can remove the context manager for now and see if this is something the community needs in the future. At this point the prefetcher is inserted automatically in fetch.


This clears any previous wakeup events and spawns the main loop task.
"""
logger.info("Starting PrefetchProducer loop.")
Contributor

let's make this debug

Args:
value (int): The integer value to add to the history.
"""
if value <= 0:
Contributor

let's remove this and instead let the error pop up here

use_prefetch_reader = kwargs.get(
"use_experimental_adaptive_prefetching", False
) or os.environ.get(
"use_experimental_adaptive_prefetching", "false"
Contributor

let's capitalize this, to be consistent with other env variable naming convention

Member

Can someone please explain why we cannot use normal cache_type= and cache_options= alone rather than having to invent a set of new environment variables (not to mention the extra kwargs)?

# there currently causes instantiation errors. We are holding off on introducing
# them as explicit keyword arguments to ensure existing user workloads are not
# disrupted. This will be refactored once the upstream `fsspec` changes are merged.
use_prefetch_reader = kwargs.get(
Contributor

let's only use the env variable for the flag, not a kwarg

) or os.environ.get(
"use_experimental_adaptive_prefetching", "false"
).lower() in (
"true",
Contributor

"true", "1" should be fine here
