
Implement retries for Storage control client calls #787

Open
Mahalaxmibejugam wants to merge 10 commits into fsspec:main from ankitaluthra1:robust-gcsfs-retries

Conversation

@Mahalaxmibejugam
Contributor

Implement a robust, time-bounded, count-bounded, idempotent retry strategy for Google Cloud Storage Control (HNS) folder operations.

Implementation Details

  • Time-Bound Wrapper: Created an asynchronous executor that enforces a strict per-attempt timeout based on a configured retry_deadline. It includes a 1.0-second grace window so that native GCS server-side timeout errors can bubble up first.
  • Exact Retry Count: Caps attempts at a hard maximum of 6 (max_retries=6).
  • Strict Idempotence: Instantiates request objects (including their UUID request_id) outside the retry loop, so the same request ID is carried unchanged across all transport retries and GCS can safely deduplicate requests.
  • Backoff & Jitter: Uses exponential backoff with jitter, min(random.random() + 2**(attempt-1), 32), for standard transient exceptions.
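The strategy above can be sketched roughly as follows. This is a simplified stand-in, not the actual gcsfs code: the retryable error types and the `request_id` plumbing are illustrative only.

```python
import asyncio
import random
import uuid

MAX_RETRIES = 6     # hard cap on attempts, per the PR description
GRACE_WINDOW = 1.0  # lets native GCS server-side timeouts surface first


def backoff_delay(attempt):
    # Exponential backoff with jitter, capped at 32 seconds.
    return min(random.random() + 2 ** (attempt - 1), 32)


async def execute_with_timebound_retry(func, *args, retry_deadline=30.0, **kwargs):
    # Create the request identity ONCE, outside the loop, so every
    # transport retry carries the same ID and GCS can deduplicate.
    request_id = str(uuid.uuid4())
    last_exc = None
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            return await asyncio.wait_for(
                func(*args, request_id=request_id, **kwargs),
                timeout=retry_deadline + GRACE_WINDOW,
            )
        except (asyncio.TimeoutError, ConnectionError) as exc:
            last_exc = exc
            if attempt < MAX_RETRIES:
                await asyncio.sleep(backoff_delay(attempt))
    raise last_exc
```

The key idempotence point is that `request_id` is fixed before the loop begins, so a retried call is indistinguishable from a server-side duplicate.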

Verification

  • Detailed unit tests verifying time-bound limits, accurate Request IDs, and threshold capping.
  • Verified whether error translations and Request IDs map correctly between our custom methods and the underlying cloud client.
  • Verified end-to-end retry and timeout interactions against the real Google Cloud Storage server.

@Mahalaxmibejugam changed the title Implemented rretries for Storage control client calls → Implemented retries for Storage control client calls Mar 25, 2026
@Mahalaxmibejugam changed the title Implemented retries for Storage control client calls → Implement retries for Storage control client calls Mar 25, 2026
Contributor

@jasha26 Mar 25, 2026


Instead of the custom asyncio.wait_for logic, using a library like tenacity would look like this. Also, did we evaluate google.api_core's AsyncRetry?

import asyncio

from google.api_core import exceptions as api_exceptions
from tenacity import retry, wait_exponential, stop_after_attempt, retry_if_exception_type

@retry(
    wait=wait_exponential(multiplier=1, min=2, max=32),
    stop=stop_after_attempt(6),
    retry=retry_if_exception_type((api_exceptions.ServiceUnavailable, asyncio.TimeoutError)),
    reraise=True
)
async def call_with_retry(func, *args, **kwargs):
    return await func(*args, **kwargs)

Contributor Author


We need the asyncio.wait_for logic on the client side to make sure we handle request stalls and don't wait indefinitely for a call to return. I replaced the custom logic with tenacity, since it provides built-in support for retries. AsyncRetry provides the same functionality but would still need the asyncio.wait_for logic on the client side.

Member


Do we really need another dependency?

Contributor Author


I had a similar opinion, but @jasha26 recommended using tenacity as it might help with future integrations like client-side throttling.

AsyncRetry from google.api_core supports retries based on max_timeout rather than max_retries (which is what gcsfs follows for its other JSON API retries). To keep the same retry behaviour for JSON APIs and storage control client calls (i.e., limiting the number of retries), I implemented the custom logic. To use AsyncRetry with a constraint on the number of attempts, we would have to maintain a wrapper that tracks attempts, which would be almost the same as the initial version of this PR, without gaining much benefit from AsyncRetry.

So we can either keep an entirely custom implementation, or use tenacity if we want to maintain the max_attempts behaviour across gcsfs.

gcsfs/retry.py Outdated


async def execute_with_timebound_retry(
    func, *args, retry_deadline=30.0, max_retries=6, **kwargs
Contributor

@jasha26 Mar 25, 2026


We can't really hard-code these values; we need the ability for them to be overridden via multiple mechanisms (call-site overrides, fsspec config overrides, etc.).

So I'd recommend we do something like the below:

import asyncio
from dataclasses import dataclass

from fsspec.config import conf
from google.api_core import exceptions as api_exceptions
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

@dataclass
class RetryConfig:
    max_retries: int = 6
    min_delay: float = 2.0
    max_delay: float = 32.0
    retry_deadline: float = 30.0

def get_resolved_retry_config(call_kwargs) -> RetryConfig:
    """
    Resolves retry configuration with a clear hierarchy of overrides:
    1. Explicit call-site arguments (e.g., max_retries=10)
    2. fsspec.config settings (e.g., ~/.config/fsspec/conf.json)
    3. Hardcoded Defaults from the RetryConfig template
    """
    # 1. Start with the default template
    default = RetryConfig()

    # 2. Resolve from call-site kwargs or fsspec.config, falling back to defaults
    resolved_max_retries = int(
        call_kwargs.get("max_retries")
        or conf.get("gcsfs.retry.max_retries", default.max_retries)
    )
    
    resolved_deadline = float(
        call_kwargs.get("retry_deadline")
        or conf.get("gcsfs.retry.deadline", default.retry_deadline)
    )

    return RetryConfig(
        max_retries=resolved_max_retries,
        retry_deadline=resolved_deadline,
        min_delay=default.min_delay,
        max_delay=default.max_delay
    )

async def with_retry(func, *args, **kwargs):
    # Pop retry settings so they are not forwarded to the wrapped call.
    retry_kwargs = {
        k: kwargs.pop(k, None) for k in ("max_retries", "retry_deadline")
    }
    config = get_resolved_retry_config(retry_kwargs)
    
    # Define transient errors consistent with GCS client best practices.
    RETRYABLE_ERRORS = (
        api_exceptions.ServiceUnavailable,
        api_exceptions.DeadlineExceeded,
        api_exceptions.InternalServerError,
        api_exceptions.TooManyRequests,
        asyncio.TimeoutError,
    )

    # Replaces custom loop with a declarative tenacity decorator.
    @retry(
        stop=stop_after_attempt(config.max_retries),
        wait=wait_exponential(multiplier=1, min=config.min_delay, max=config.max_delay),
        retry=retry_if_exception_type(RETRYABLE_ERRORS),
        reraise=True
    )
    async def _wrapped_call():
        return await func(*args, **kwargs)
    
    return await _wrapped_call()

Member


I am not a fan of adding all these environment variables. fsspec already has a way to specify instantiation kwargs using specially formatted environment variables or files ( https://filesystem-spec.readthedocs.io/en/latest/features.html#configuration ).

Since retries are useful in multiple backends and have the same concepts (number of times, backoff factor, max wait, etc.), we could even make a general fsspec class for this.
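Such a general class could start as a dataclass plus a resolver. A rough sketch follows; the names `RetryPolicy` and `resolve_policy` are hypothetical, not existing fsspec API.

```python
from dataclasses import dataclass, fields


@dataclass
class RetryPolicy:
    # Retry concepts shared by any fsspec backend.
    max_retries: int = 6
    backoff_factor: float = 2.0
    max_wait: float = 32.0
    retry_deadline: float = 30.0


def resolve_policy(call_kwargs, conf):
    # Precedence: explicit call-site kwargs > fsspec conf > dataclass defaults.
    values = {}
    for f in fields(RetryPolicy):
        if call_kwargs.get(f.name) is not None:
            values[f.name] = call_kwargs[f.name]
        elif f.name in conf:
            values[f.name] = conf[f.name]
    return RetryPolicy(**values)
```

The `conf` argument here would be whatever fsspec's existing configuration machinery (files or FSSPEC_* environment variables) already resolves, so no new env-var plumbing is needed.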

@codecov

codecov bot commented Mar 31, 2026

Codecov Report

❌ Patch coverage is 97.14286% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 76.79%. Comparing base (1ec7a98) to head (5a19937).

Files with missing lines | Patch % | Lines
gcsfs/retry.py | 96.55% | 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #787      +/-   ##
==========================================
+ Coverage   75.96%   76.79%   +0.82%     
==========================================
  Files          14       14              
  Lines        2663     2693      +30     
==========================================
+ Hits         2023     2068      +45     
+ Misses        640      625      -15     


Member

@martindurant left a comment


Is there anything about the retry logic here that's specific to gRPC? It doesn't appear so to me.

def _get_retry_config(
    retry_deadline=None, max_retries=None
) -> StorageControlRetryConfig:
    conf = fsspec.config.conf.get("gcs", {})
Member


the purpose of the conf is to pass default values to the class constructor at instantiation, so there is not normally any need to query the conf directly, but the values should be passed in.

) -> StorageControlRetryConfig:
    conf = fsspec.config.conf.get("gcs", {})

    return StorageControlRetryConfig(
Member


As a dataclass, this already has these defaults.
I think you would be better off passing kwargs around, like in other places in the codebase. The Config class isn't doing anything for you, except obscuring where the values are being set.

    It passes timeout to the function itself and uses timeout + 1.0 for wait_for.
    """
    return await asyncio.wait_for(
        func(*args, timeout=timeout, **kwargs), timeout=timeout + 1.0
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does the arbitrary + 1.0 come from?
How do we know that func takes a timeout parameter (and if it does, why do we need another one)?

Contributor Author

@Mahalaxmibejugam Mar 31, 2026


The + 1.0 buffer: It gives the underlying gRPC library a 1-second grace period to cleanly abort and raise its own specific timeout error (e.g., DeadlineExceeded) before Python's asyncio aggressively kills the task.

How we know func takes timeout: This wrapper is specifically built for Google Cloud GAPIC (gRPC) client methods (like the ones in storage_control_v2 - create_folder, rename_folder). By standard design, all of these methods accept a timeout kwarg to set the RPC deadline.

Why we need both: The inner timeout is the actual RPC deadline sent to the server and gRPC core. The outer asyncio.wait_for is a hard fail-safe to ensure the Python event loop doesn't hang forever if the network layer freezes and fails to respect the inner deadline.

This was added specifically because of the request-stalling issues observed in GCSFuse.
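The two layers can be seen in a minimal stand-alone sketch. The GAPIC method is simulated here; only the timeout layering is the point.

```python
import asyncio

GRACE = 1.0  # window for the RPC layer to raise its own DeadlineExceeded first


async def execute_with_timeout(func, *args, timeout=30.0, **kwargs):
    # Inner timeout: the actual RPC deadline forwarded to the GAPIC method.
    # Outer wait_for: a hard fail-safe in case the transport stalls and
    # never honors that inner deadline.
    return await asyncio.wait_for(
        func(*args, timeout=timeout, **kwargs),
        timeout=timeout + GRACE,
    )
```

A well-behaved call returns inside its own deadline and the outer guard never fires; only a stalled transport trips the asyncio-level timeout.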

        reraise=True,
    )

    return await retryer(
Member


I don't see that we need an extra library for this one thing. One might argue that this retry logic belongs in fsspec, but it would be fine to write it out here.

Contributor Author


Are you referring to removing the entire tenacity dependency and going back to the custom implementation?

Contributor


Hi @martindurant,

I completely understand the hesitation around adding new dependencies; I'm usually of the same mind when it comes to keeping the project's footprint light. You're absolutely right that this logic might eventually be better suited to fsspec globally.

The main reason I proposed tenacity is that it has become an industry standard for production-grade retries. It handles the nuances of exponential backoff with jitter and async/sync compatibility right out of the box, which can be surprisingly tricky to get 100% right in a custom implementation.

While we can certainly roll a simpler custom version as you suggested, I thought using a battle-tested library might save us from 'reinventing the wheel' and maintenance overhead later. However, if the priority is keeping the dependency list short, we're happy to strip it back.

What do you think is the best balance for the project?

@Mahalaxmibejugam
Contributor Author

Is there anything about the retry logic here that's specific to gRPC? It doesn't appear so to me.

The execute_with_timeout method explicitly injects timeout=timeout into the function call (func(*args, timeout=timeout, **kwargs)). This is specific to the signature of the GAPIC gRPC storage_control_v2 methods.
