Implement retries for Storage control client calls #787
base: main
```diff
@@ -7,6 +7,7 @@
 import google.auth.exceptions
 import requests.exceptions
 from decorator import decorator
+from google.api_core import exceptions as api_exceptions

 logger = logging.getLogger("gcsfs")
```
```diff
@@ -176,3 +177,61 @@ async def retry_request(func, retries=6, *args, **kwargs):
                 continue
             logger.exception(f"{func.__name__} non-retriable exception: {e}")
             raise e
+
+
+async def execute_with_timebound_retry(
+    func, *args, retry_deadline=30.0, max_retries=6, **kwargs
+):
+    """
+    Execute a gRPC storage control API call with a strict per-attempt timeout
+    and an overall maximum number of retries. Transient errors and timeouts
+    trigger an exponential backoff loop.
+    """
+    attempt = 0
+    while True:
+        try:
+            # Enforce a per-call timeout by passing `timeout=retry_deadline` to the API call.
+            # asyncio.wait_for serves as a hard local fallback to cancel the task
+            # if the gRPC timeout fails to abort.
+            return await asyncio.wait_for(
+                func(*args, timeout=retry_deadline, retry=None, **kwargs),
+                timeout=retry_deadline + 1.0,
+            )
+        except Exception as e:
+            # Determine if the exception is transient and should be retried.
+            is_transient = isinstance(
+                e,
+                (
+                    api_exceptions.RetryError,
+                    api_exceptions.DeadlineExceeded,
+                    api_exceptions.ServiceUnavailable,
+                    api_exceptions.InternalServerError,
+                    api_exceptions.TooManyRequests,
+                    api_exceptions.ResourceExhausted,
+                    api_exceptions.Unknown,
+                    asyncio.TimeoutError,
+                ),
+            )
+
+            # Workaround: retry on 401s / Unauthenticated during transient token lapses.
+            if (
+                not is_transient
+                and isinstance(e, api_exceptions.Unauthenticated)
+                and "Invalid Credentials" in str(e)
+            ):
+                is_transient = True
+
+            if not is_transient:
+                raise e
+
+            attempt += 1
+
+            if max_retries is not None and attempt >= max_retries:
+                logger.exception(
+                    f"{func.__name__} out of max retries ({max_retries}) on exception: {e}"
+                )
+                raise e
+
+            sleep_time = min(random.random() + 2 ** (attempt - 1), 32)
+            logger.debug(
+                f"{func.__name__} retrying (attempt {attempt}) after {sleep_time:.2f}s due to exception: {e}"
+            )
+            await asyncio.sleep(sleep_time)
```
Instead of the custom `asyncio.wait_for` logic, using a library like tenacity would look like this. Also, did we evaluate `google.api_core`'s `AsyncRetry`?
We need the `asyncio.wait_for` logic on the client side to make sure we handle request stalls and don't wait indefinitely for the call to return. I replaced the custom retry loop with tenacity, since it provides built-in support for retries. `AsyncRetry` provides the same functionality, but it would still need the `asyncio.wait_for` logic on the client side.
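The stall-handling point can be illustrated with a toy example (the coroutine names here are hypothetical): `asyncio.wait_for` cancels a coroutine that never returns, turning an indefinite hang into a catchable `TimeoutError`:

```python
import asyncio


async def stalled_call():
    # Simulates a gRPC call that hangs because the server-side timeout never fires.
    await asyncio.sleep(3600)


async def main():
    try:
        # wait_for cancels stalled_call after 0.1s and raises TimeoutError locally.
        await asyncio.wait_for(stalled_call(), timeout=0.1)
    except asyncio.TimeoutError:
        return "timed out"


print(asyncio.run(main()))  # prints "timed out"
```

This is why the per-attempt `wait_for` wrapper stays in place regardless of which retry library drives the backoff loop.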
Do we really need another dependency?
I had a similar opinion, but @jasha26 recommended using tenacity, as it might help with future integrations like client-side throttling.

`AsyncRetry` from `google.api_core` bounds retries by an overall timeout rather than by a maximum number of attempts (which is what gcsfs follows for its other JSON API retries). To keep the retry behaviour consistent between the JSON APIs and the storage control client calls (i.e., limiting the number of retries), I implemented the custom logic. Using `AsyncRetry` with a constraint on the number of attempts would require maintaining a wrapper to track attempt counts, which would be almost the same as the initial version of this PR while adding little of `AsyncRetry`'s benefit.

So we can either keep an entirely custom implementation, or use tenacity if we want to maintain the max_attempts behaviour across gcsfs.