
Add prefetcher reader for standard buckets. #795

Open

googlyrahman wants to merge 1 commit into fsspec:main from ankitaluthra1:regional

Conversation

@googlyrahman
Contributor

Description generated by AI

Asynchronous Background Prefetcher

A new BackgroundPrefetcher class has been implemented in gcsfs/prefetcher.py. This component is designed to:

  • Proactively Fetch Data: It spawns a background producer task that fetches sequential blocks of data before they are explicitly requested.
  • Adaptive Blocksize: The engine dynamically adjusts its blocksize based on the history of requested read sizes.
  • Sequential Streak Detection: Prefetching is triggered after detecting a "streak" of sequential reads.
  • Optimized Slicing: Uses ctypes for a fast, low-overhead slice implementation (_fast_slice) to manage internal buffers.
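
The streak-detection and adaptive-blocksize behavior described above can be sketched roughly as follows. Note this is an illustrative model only: `StreakDetector`, `adaptive_blocksize`, and the specific thresholds are assumptions for exposition, not the names or values used in gcsfs/prefetcher.py.

```python
from collections import deque


class StreakDetector:
    """Toy model of sequential-streak detection: prefetching is enabled
    only after `threshold` consecutive reads pick up exactly where the
    previous read ended."""

    def __init__(self, threshold=2):
        self.threshold = threshold
        self.streak = 0
        self.next_expected = None

    def record(self, start, end):
        if start == self.next_expected:
            self.streak += 1
        else:
            self.streak = 0  # a random access resets the streak
        self.next_expected = end
        return self.streak >= self.threshold  # True -> start prefetching


def adaptive_blocksize(read_sizes, minimum=256 * 1024, maximum=32 * 1024 * 1024):
    """Toy adaptive blocksize: average the recent request sizes and
    clamp the result into [minimum, maximum]."""
    recent = deque(read_sizes, maxlen=8)  # keep only recent history
    avg = sum(recent) // len(recent)
    return max(minimum, min(maximum, avg))
```

With `threshold=2`, two back-to-back sequential reads after the first one would flip the detector on, and a seek anywhere else turns it off again.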

Core Refactoring for Concurrency

The file-fetching logic in gcsfs/core.py has been refactored to enable parallel downloads:

  • _cat_file Decomposition: _cat_file is now split into _cat_file_sequential and _cat_file_concurrent.
  • Threshold-Based Routing: Concurrent fetching is automatically utilized when the requested data size exceeds MIN_CHUNK_SIZE_FOR_CONCURRENCY (defaulting to 5MB) and multiple concurrency slots are requested.
  • Integration: The GCSFile object now optionally initializes the _prefetch_engine when the use_prefetch_reader flag is provided.
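
The threshold-based routing can be sketched as below. MIN_CHUNK_SIZE_FOR_CONCURRENCY matches the constant named above, but `choose_fetch_strategy` and the exact condition are an assumption about the routing logic, not the PR's code.

```python
# Name taken from the PR description; 5 MB is the stated default.
MIN_CHUNK_SIZE_FOR_CONCURRENCY = 5 * 1024 * 1024


def choose_fetch_strategy(start, end, concurrency):
    """Route a byte-range request to sequential or concurrent fetching:
    go concurrent only when the range is large enough to amortize the
    extra requests and more than one slot was asked for."""
    size = end - start
    if size > MIN_CHUNK_SIZE_FOR_CONCURRENCY and concurrency > 1:
        return "concurrent"
    return "sequential"
```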

@googlyrahman googlyrahman changed the title Add prefetcher engine for regional buckets. Add prefetcher reader for regional buckets. Mar 30, 2026
@codecov

codecov bot commented Mar 30, 2026

Codecov Report

❌ Patch coverage is 99.17808% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.78%. Comparing base (3121be9) to head (5919b67).
⚠️ Report is 5 commits behind head on main.

Files with missing lines | Patch % | Lines
gcsfs/prefetcher.py | 99.05% | 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #795      +/-   ##
==========================================
+ Coverage   76.64%   79.78%   +3.14%     
==========================================
  Files          14       15       +1     
  Lines        2663     3028     +365     
==========================================
+ Hits         2041     2416     +375     
+ Misses        622      612      -10     

☔ View full report in Codecov by Sentry.
@googlyrahman googlyrahman marked this pull request as ready for review March 30, 2026 07:38
@googlyrahman googlyrahman changed the title Add prefetcher reader for regional buckets. Add prefetcher reader for standard buckets. Mar 30, 2026
gcsfs/core.py Outdated
) or os.environ.get("use_prefetch_reader", False)
if use_prefetch_reader:
max_prefetch_size = kwargs.get("max_prefetch_size", None)
concurrency = kwargs.get("concurrency", 4)
Contributor

let's call this prefetcher_concurrency

Contributor Author

I think concurrency is the better term here, because _cat_file also uses this concurrency parameter when the prefetcher engine is disabled, so it is not specific to the prefetcher.

gcsfs/core.py Outdated
await asyncio.gather(*tasks, return_exceptions=True)
raise e

async def _cat_file(self, path, start=None, end=None, concurrency=4, **kwargs):
Contributor

let's also move this to constant namely DEFAULT_PREFETCHER_CONCURRENCY

Member

Aren't we mixing concerns here? _cat_file does not necessarily use the prefetcher at all. Indeed, why is prefetcher an option, when this is a single blob read, not sequential?

Contributor Author

> let's also move this to constant namely DEFAULT_PREFETCHER_CONCURRENCY

Introduced a default in zb_hns_util.py named DEFAULT_CONCURRENCY.

> Aren't we mixing concerns here? _cat_file does not necessarily use the prefetcher at all. Indeed, why is prefetcher an option, when this is a single blob read, not sequential?

_cat_file doesn't actually do any prefetching. We are just completing the existing call path (GCSFile -> cache -> GCSFile._fetch -> GCSFileSystem._cat_file). While it does not prefetch, it will now fetch chunks concurrently if the requested size exceeds 5MB.
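
A minimal sketch of concurrent chunked fetching along the lines described in this thread. `fetch_range`, the `CHUNK` constant, and the semaphore cap are hypothetical stand-ins for illustration, not the PR's actual helpers.

```python
import asyncio

# Illustrative chunk size; the PR's real threshold is
# MIN_CHUNK_SIZE_FOR_CONCURRENCY in gcsfs/core.py.
CHUNK = 5 * 1024 * 1024


async def fetch_range(fetch_chunk, start, end, concurrency=4):
    """Fetch [start, end) as fixed-size chunks in parallel and
    reassemble them in order. `fetch_chunk(s, e)` stands in for a
    single ranged GET against object storage."""
    ranges = [(s, min(s + CHUNK, end)) for s in range(start, end, CHUNK)]
    sem = asyncio.Semaphore(concurrency)

    async def bounded(s, e):
        async with sem:  # cap in-flight requests at `concurrency`
            return await fetch_chunk(s, e)

    parts = await asyncio.gather(*(bounded(s, e) for s, e in ranges))
    return b"".join(parts)  # gather preserves submission order
```

Because `asyncio.gather` returns results in submission order, the chunks can be joined directly without tracking offsets.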

@googlyrahman googlyrahman force-pushed the regional branch 5 times, most recently from 82240a5 to 44d34f3 on March 31, 2026 21:48
fsspec.asyn.sync(self.loop, _start)
logger.debug("BackgroundPrefetcher initialization complete.")

def __enter__(self):
Contributor

we can remove the context manager for now and see if this is something the community needs in the future. At this point the prefetcher is inserted automatically in fetch.


This clears any previous wakeup events and spawns the main loop task.
"""
logger.info("Starting PrefetchProducer loop.")
Contributor

let's make this debug

Args:
value (int): The integer value to add to the history.
"""
if value <= 0:
Contributor

let's remove this and instead let the error pop up here

use_prefetch_reader = kwargs.get(
"use_experimental_adaptive_prefetching", False
) or os.environ.get(
"use_experimental_adaptive_prefetching", "false"
Contributor

let's capitalize this, to be consistent with other env variable naming convention

Member

Can someone please explain why we cannot use normal cache_type= and cache_options= alone rather than having to invent a set of new environment variables (not to mention the extra kwargs)?

# there currently causes instantiation errors. We are holding off on introducing
# them as explicit keyword arguments to ensure existing user workloads are not
# disrupted. This will be refactored once the upstream `fsspec` changes are merged.
use_prefetch_reader = kwargs.get(
Contributor

let's only use the env variable for the flag, not a kwarg

) or os.environ.get(
"use_experimental_adaptive_prefetching", "false"
).lower() in (
"true",
Contributor

"true", "1" should be fine here
