Prevent retry loop large uploads #2195

Merged
MarcusSorealheis merged 4 commits into main from prevent-retry-loop-large-uploads on Mar 2, 2026
MarcusSorealheis commented Feb 28, 2026

Description

Large artifact uploads time out because the GrpcStore rpc_timeout_s default is 120 seconds.

Fixes #2185

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to
    not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Unit tests

Checklist

  • Updated documentation if needed
  • Tests added/amended
  • bazel test //... passes locally
  • PR is contained in a single commit, using git amend (see some docs)


amankrx and others added 3 commits February 27, 2026 00:44
Tests the conditional that converts NotFound errors containing
"not found in either fast or slow store" to FailedPrecondition,
and verifies other NotFound errors still return InternalError.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

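
For readers who want the conditional from the test commit above spelled out, here is a minimal sketch. It is not the project's code: `Code`, `MISSING_BLOB_MARKER`, and `remap_error` are hypothetical names, and passing non-NotFound codes through unchanged is an assumption, since the commit message only describes the NotFound cases.

```python
from enum import Enum


class Code(Enum):
    """Hypothetical stand-in for the status codes named in the commit message."""
    NOT_FOUND = "NotFound"
    FAILED_PRECONDITION = "FailedPrecondition"
    INTERNAL = "Internal"


# Substring quoted in the commit message.
MISSING_BLOB_MARKER = "not found in either fast or slow store"


def remap_error(code: Code, message: str) -> Code:
    """Map a store error to the code reported to the caller (sketch only)."""
    if code is Code.NOT_FOUND:
        if MISSING_BLOB_MARKER in message:
            # A blob missing from both stores becomes FailedPrecondition.
            return Code.FAILED_PRECONDITION
        # Any other NotFound is still reported as an internal error.
        return Code.INTERNAL
    # Assumption: other codes pass through unchanged.
    return code
```

Under these assumptions, `remap_error(Code.NOT_FOUND, "blob not found in either fast or slow store")` yields `Code.FAILED_PRECONDITION`, while a NotFound with any other message yields `Code.INTERNAL`.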
The GrpcStore rpc_timeout_s defaulted to 120 seconds, which is too
short for multi-GB uploads. This caused DeadlineExceeded errors that
triggered retries, restarting the upload and compounding the problem.
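
The arithmetic behind the retry loop can be sketched as follows. This is a toy model, not the GrpcStore retry logic: the function names, the 25 MiB/s throughput, and the retry count are all illustrative assumptions. The one property taken from the description above is that a retry restarts the upload from the beginning, so every attempt needs the same amount of time.

```python
from typing import Optional

GIB = 1024 ** 3
MIB = 1024 ** 2


def attempt_succeeds(size_bytes: int, throughput_bps: float,
                     timeout_s: Optional[float]) -> bool:
    """One attempt finishes only if the whole transfer fits inside the deadline."""
    needed_s = size_bytes / throughput_bps
    return timeout_s is None or needed_s <= timeout_s


def upload(size_bytes: int, throughput_bps: float,
           timeout_s: Optional[float], max_retries: int) -> Optional[int]:
    """Return the attempt number that succeeded, or None if every attempt timed out."""
    for attempt in range(1, max_retries + 2):
        # Each retry starts again from byte zero, so nothing carries over.
        if attempt_succeeds(size_bytes, throughput_bps, timeout_s):
            return attempt
    return None
```

With these hypothetical numbers, a 4 GiB upload at 25 MiB/s needs about 164 seconds, so `upload(4 * GIB, 25 * MIB, 120, 3)` returns `None` no matter how many retries are allowed, while the same upload with `timeout_s=None` succeeds on the first attempt.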

Dead connections are already detected by HTTP/2 keepalive (30s ping,
20s timeout) and TCP keepalive (30s) on each endpoint, so the per-RPC
total timeout is unnecessary for that purpose.

Setting rpc_timeout_s=0 now correctly disables the timeout instead of
silently falling through to the 120s default.
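
The difference in how an explicit rpc_timeout_s=0 is handled can be illustrated with a small sketch. Both functions are hypothetical and only model the behavior described in the commit message; they are not the GrpcStore implementation, and they say nothing about what happens when the field is left unset.

```python
from typing import Optional

LEGACY_DEFAULT_S = 120  # the old default named in the commit message


def old_effective_timeout(rpc_timeout_s: int) -> Optional[int]:
    """Behavior before the fix, as described: 0 silently became 120 seconds."""
    return rpc_timeout_s if rpc_timeout_s > 0 else LEGACY_DEFAULT_S


def new_effective_timeout(rpc_timeout_s: int) -> Optional[int]:
    """Behavior after the fix, as described: 0 means no per-RPC deadline (None)."""
    return rpc_timeout_s if rpc_timeout_s > 0 else None
```

In this sketch `None` stands for "no deadline". A positive value is honored in both versions; only the handling of 0 changes.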

Fixes #2185
MarcusSorealheis enabled auto-merge (squash) March 2, 2026 14:40
MarcusSorealheis merged commit 2a2ca64 into main Mar 2, 2026
28 checks passed
MarcusSorealheis deleted the prevent-retry-loop-large-uploads branch March 2, 2026 14:43

Development

Successfully merging this pull request may close these issues.

Large uploads gets stuck in a retry loop
