feat: add automatic S3 request retry with exponential backoff#1701
feat: add automatic S3 request retry with exponential backoff#1701allanrogerr wants to merge 18 commits intominio:masterfrom
Conversation
Implements SDK-level automatic retry matching the minio-go retry-mechanism spec. All requests through BaseS3Client.executeAsync() now retry on retryable S3 error codes (InternalError, SlowDown, RequestTimeout, etc.), retryable HTTP status codes (408, 429, 499, 500, 502, 503, 504, 520), and transient IO errors (connection reset, EOF, server closed idle connection). Non-seekable request bodies (raw okhttp3.RequestBody) get a single attempt; seekable bodies (byte[], ByteBuffer, RandomAccessFile) retry up to maxRetries (default 10). Backoff is full-jitter exponential: rand(0, min(1s, 200ms*2^n)) before the nth retry, matching minio-go's DefaultRetryUnit/DefaultRetryCap. The "RetryHead" S3 code is explicitly excluded so executeHeadAsync region redirect logic is unaffected. maxRetries is configurable via builder and setMaxRetries() post-construction. Closes minio#1700.
49fda52 to
9524c1f
Compare
There was a problem hiding this comment.
Pull request overview
This PR introduces SDK-level automatic retry for S3 requests with full-jitter exponential backoff, aligning the Java SDK’s behavior with minio-go’s retry policy and making transient S3/HTTP/IO failures less likely to leak to callers.
Changes:
- Added a new package-private
Retryhelper for retryability classification (S3 codes, HTTP status codes, IO exceptions) and jittered exponential backoff. - Wired a retry loop into
BaseS3Client.executeAsync()with delayed retries via a daemonScheduledExecutorService, plus a seekability-based single-attempt gate for non-replayable bodies. - Exposed configuration via
maxRetries(...)on builders andsetMaxRetries(...), and added anInvalidResponseException.responseCode()accessor plus a comprehensive newRetryTestsuite.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
api/src/main/java/io/minio/BaseS3Client.java |
Adds execute-with-retry loop, backoff scheduling, and max-retry configuration. |
api/src/main/java/io/minio/Retry.java |
Defines retryable error sets and backoff computation. |
api/src/main/java/io/minio/MinioAsyncClient.java |
Adds maxRetries(...) builder option and applies it to the built client. |
api/src/main/java/io/minio/MinioClient.java |
Exposes setMaxRetries(...) and maxRetries(...) on the synchronous client/builder. |
api/src/main/java/io/minio/Http.java |
Adds Http.S3Request.body() accessor used for retry seekability gating. |
api/src/main/java/io/minio/errors/InvalidResponseException.java |
Stores/returns the triggering HTTP response code for retry classification. |
api/src/test/java/io/minio/RetryTest.java |
Adds unit + MockWebServer integration tests covering retry classification and behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Http.Body now captures the initial file pointer (fileOffset) when
created for RandomAccessFile bodies; toRequestBody() seeks back to
that offset before each attempt, fixing both first-attempt and retry
payload corruption when checksum computation has advanced the pointer
- BaseS3Client.createBody() explicitly restores the file pointer after
computing checksums so Http.Body captures the correct starting offset
- Retry scheduler now dispatches actual retry work (credential fetch +
request build) to ForkJoinPool.commonPool(), freeing the single
scheduler thread from head-of-line blocking by slow providers
- Replace package-private {@link Retry#MAX_RETRY} Javadoc references in
MinioClient and MinioAsyncClient with {@code 10} to satisfy doclint
- Fix RETRYABLE_HTTP_CODES closing paren indentation to match codebase
style (ImmutableSet.of() in Http.java and Signer.java)
- Add testRandomAccessFileBodyRetried integration test: verifies that
both the failed and retried PUT requests carry identical body content
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…he.remove The no-retry tests for 404/403 were asserting only assertNotNull, which would pass for any exception including NPEs. Stronger assertions now verify ErrorResponseException type and S3 error code/HTTP status, and the retry-exhausted test verifies InvalidResponseException with status 500. While adding those assertions, the tests revealed a pre-existing NPE: regionCache.remove(s3request.bucket()) was called unconditionally when the S3 error code was NoSuchBucket/RetryHead, but bucket is null for operations like listBuckets(). Added a null guard so the cache invalidation is skipped when there is no bucket in the request.
…gs JUA SpotBugs JUA flags Assert.assertTrue(e instanceof T) because the message hides what type was actually thrown on failure. Using a typed catch block is stricter: any unexpected exception type propagates as a test error with the full stack trace, which is more informative than a failed boolean assertion.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…strengthen test assertions Replace the shared static Random instance with ThreadLocalRandom.current() at the backoff call site and for the fan-out name generator in MinioAsyncClient — eliminating shared-state CAS contention under concurrent retry load. Disable OkHttp's retryOnConnectionFailure in doExecuteAsync so the SDK's executeWithRetry is the sole retry policy; previously OkHttp could silently add an extra attempt on stale-connection IOExceptions, causing more total attempts than maxRetries. Strengthen the two remaining assertNotNull catch blocks in RetryTest (testMaxRetriesOneDisablesRetry, testSetMaxRetriesPostConstruction) to typed InvalidResponseException catches with responseCode()==500 assertions, matching the pattern applied to the other tests in the previous commit.
Per @balamurugana's review on PR minio#1701, replace the CompletableFuture/scheduler-based retry orchestration with an OkHttp interceptor (Http.RetryInterceptor) that retries on retryable HTTP status codes (408/429/499/500/502/503/504/520) and IOExceptions with exponential backoff and full jitter. - Drop S3-error-code retry classification; well-behaved S3/MinIO servers return retryable conditions with non-2xx HTTP status, so RETRYABLE_S3_CODES is redundant. - Drop RETRY_SCHEDULER and the executeWithRetry recursive chain. - Non-replayable bodies (raw okhttp3.RequestBody) are sent once via Http.RequestBody.isReplayable(); byte[]/ByteBuffer/RandomAccessFile bodies replay safely because writeTo rewinds the source. - Wire interceptor at construction (BaseS3Client.wrapWithRetry) so setMaxRetries() takes effect on subsequent requests.
Replace `throws Exception` with the actually-thrown checked exceptions (`IOException, MinioException`, plus `InterruptedException` for tests calling `MockWebServer.takeRequest()`). Drops `throws` entirely on `testSetMaxRetriesValidation` since it throws only an unchecked `IllegalArgumentException`. Fixes spotbugsTest THROWS_METHOD_THROWS_CLAUSE_BASIC_EXCEPTION findings on Java 8/11/17/21/25.
Per @balamurugana's review on PR minio#1701, replace the CompletableFuture/scheduler retry orchestration with a self-contained OkHttp interceptor that retries on retryable HTTP status codes with exponential backoff and full jitter. - Http.RetryInterceptor: hard-coded MAX_RETRIES=5, BASE_DELAY_MS=200, MAX_DELAY_MS=30s, RETRYABLE_STATUS_CODES = 408/429/499/500/502/503/ 504/520. Bit-shift exponent capped to avoid overflow at high attempt counts; jitter bound guarded with Math.max(1, cap). - Wired into Http.newDefaultClient() with a single addInterceptor call. - Drop public maxRetries API on MinioClient/MinioAsyncClient/BaseS3Client and Builder.maxRetries; the interceptor's params are constants. - Drop Retry.java entirely — RETRYABLE_HTTP_CODES inlined; the S3-error- code retry set, isRetryable(Throwable), isRetryableIOException, and computeBackoffMs helpers are no longer needed. - IOException retry now relies on OkHttp's default retryOnConnectionFailure(true); only the interceptor handles status codes. No double-retry, no overlap. - Drop Http.RequestBody.isReplayable() and the body-replayability gate; the raw okhttp3 RequestBody body path is unused in production code, and byte[]/ByteBuffer/RandomAccessFile bodies replay safely via writeTo(). - Drop InvalidResponseException.responseCode() accessor (added solely for the deleted retry classifier). - RetryTest: trim from 539 lines to two integration smoke tests (retry on 503->200, no retry on 404). Validated locally: - :api:check (unit + spotbugs + spotless) green on Java 17 - :functional:runFunctionalTest passes through CORS test (modulo a pre-existing CORSConfiguration.equals() bug unrelated to this PR) - mint suite (overlayed JARs in mint container, erasure mode): "All tests ran successfully", PASS in 42.8s
Reinstate the retry triggers, classifiers, and configurability that the strip-back trimmed away, while keeping the orchestration in a single self-contained OkHttp interceptor. - Retry.java: status-code/S3-code sets, IOException filter, and the full-jitter exponential-backoff formula. Defaults: MAX_RETRY=10, DEFAULT_RETRY_UNIT_MS=200, DEFAULT_RETRY_CAP_MS=1000, MAX_JITTER=1.0. - Http.RetryInterceptor: retries on retryable IOException, retryable HTTP status, and retryable S3 <Code> in non-2xx response bodies (peek capped at 5 MiB). Per-instance attempt budget via IntSupplier. - BaseS3Client.wrapWithRetry installs the interceptor on every supplied client (default or user-passed) and disables OkHttp's retryOnConnectionFailure to avoid double retry. - Restore the public maxRetries API: MinioClient.setMaxRetries, MinioClient.Builder.maxRetries, MinioAsyncClient.Builder.maxRetries, BaseS3Client.setMaxRetries. Pass 1 to disable retries. - Drop the dead Http.S3Request.body() getter that was orphaned by the earlier strip-back. RetryTest expanded from 2 to 22 tests covering the helpers, the interceptor, every retryable status code, S3-code retry on a non-2xx body, retry exhaustion, mid-sequence retry, post-construction setter, and builder validation. Validated locally on Java 17: - :api:check (compile + spotless + spotbugs + tests) PASS - ./gradlew build (api, adminapi, examples, functional) PASS - mint suite, erasure mode, JARs from this commit: PASS in 44.3s
…status assertions Address Copilot review feedback (PR minio#1701, review 4246873092): - RetryInterceptor Javadoc now declares it as part of the SDK's supported public API and documents how BaseS3Client wraps every supplied OkHttpClient. Drops {@link} references to package-private Retry helpers so the public Javadoc no longer points at internal symbols. Adds a Threading note about Thread.sleep blocking the OkHttp dispatcher slot during backoff and points high-concurrency callers at Dispatcher pool sizing. - BaseS3Client.executeAsync Javadoc reworded to reflect the wrapWithRetry behaviour (interceptor installed on every supplied client, default or caller-provided) and to note the maxRetries supplier is read per request. - RetryTest: strengthen testRetryExhaustedReturnsLastResponse to assert the InvalidResponseException reflects HTTP 500 rather than allowing any exception type to pass. Add testRetryExhaustedSurfacesXmlErrorResponse for the XML/ErrorResponseException path, asserting both response.code() and the parsed S3 code after retries are exhausted. Validated locally on Java 17: - :api:check (compile + spotless + spotbugs + tests) PASS - :api:javadoc PASS
… tests
Address findings from the deep-review pass on the retry feature.
Behaviour
- RetryInterceptor.intercept now consults chain.call().isCanceled() at the
top of each attempt and after every IOException from chain.proceed.
Without this a cancelled call would surface IOException("Canceled"),
fall through Retry.isRequestErrorRetryable as "retryable", and burn the
full attempt budget in sleep+retry — a regression in real OkHttp
cancellation flows. Mirrors minio-go's `errors.Is(err, context.Canceled)`
short-circuit.
API hardening
- BaseS3Client.maxRetries narrowed from protected to package-private; the
only out-of-class writer is MinioAsyncClient.Builder (same package),
which now goes through the validating setMaxRetries() instead of a raw
field assignment.
- Http.Body(okhttp3.RequestBody) now validates non-null and documents the
retry-replay caveat for one-shot bodies.
Documentation
- RetryInterceptor Javadoc adds explicit notes on cancellation handling,
request-body replayability (with a warning about caller-supplied
isOneShot bodies), and the deliberate divergence from minio-go around
per-attempt request rebuild / re-signing / region & STS refresh.
- BaseS3Client.wrapWithRetry Javadoc states that retryOnConnectionFailure
is forced to false on every supplied client.
- BaseS3Client.traceOn Javadoc notes per-retry attempts are not surfaced
to the trace stream and points at OkHttp's HttpLoggingInterceptor.
- MinioAsyncClient.Builder.httpClient overloads now document the wrap
behaviour.
- Retry.java header references minio-go's retry.go as the contract source.
- docs/API.md builder-methods table gains a maxRetries() row.
Tests (RetryTest.java)
- All 13 integration tests now close the MinioClient via a closeQuietly
helper, eliminating the per-test dispatcher leak.
- New: testCancelDuringRetryStopsLoop — wraps a real OkHttpClient with the
interceptor, fires Call.cancel() ~150ms in, asserts the loop short-
circuits in <5s (without the fix it would run all 10 attempts).
- New: testRetryOnConnectionDropThenSuccess — exercises the IOException
retry path end-to-end via MockWebServer SocketPolicy.DISCONNECT_AT_START.
- New: testUserSuppliedClientStillRetries — caller-supplied OkHttpClient
with retryOnConnectionFailure(true) still gets the SDK interceptor.
- New: testIsRequestErrorRetryable_certPathErrorsNotRetryable — covers
the SSLException-with-cert-path-cause and SSLPeerUnverifiedException
branches that were previously untested.
- New: testMaxRetriesBuilderValidation_negative,
testSetMaxRetriesValidation_negative — extends validation coverage
beyond the 0-only case.
Style
- Replaced fully-qualified java.util.regex.Pattern/Matcher with imports
on Http.java.
- Bumped 2025-only copyright headers on touched files to 2025-2026 (and
2022 -> 2022-2026, 2015-2021 -> 2015-2026).
Validated locally on Java 17:
- :api:check (compile + spotless + spotbugs + tests) PASS
- :api:javadoc PASS
- ./gradlew build (api, adminapi, examples, functional) PASS
minio-go contract: cancellation handling is *added*, bringing parity with
minio-go's errors.Is(context.Canceled) check. No retryable status, S3 code,
backoff formula, or jitter constant was changed.
… prior interceptor from network chain too; correct backoff range docs Address Copilot review (PR minio#1701, review 4247997916): - BaseS3Client.setMaxRetries and MinioAsyncClient.Builder.maxRetries Javadocs no longer reference {@link Retry#MAX_RETRY}. The Retry class is package-private, so the link target was unresolvable from generated public docs. Replaced with the literal default {@code 10}, which is what callers actually see. - BaseS3Client.wrapWithRetry now also strips any prior RetryInterceptor from networkInterceptors(), not only the application-interceptor list. Previously the Javadoc claimed every prior RetryInterceptor was replaced, but only application interceptors were. A caller that had (mis)registered RetryInterceptor as a network interceptor would have ended up with two retry layers. Defends against that and brings the Javadoc claim into line with the code. - Retry.exponentialBackoffMs Javadoc previously stated the result range as [0, sleep]. The actual range is [1, sleep] because nextDouble() is in [0.0, 1.0) and the (long) cast truncates rand * sleep to at most sleep - 1, so the subtraction never reaches sleep itself. minio-go has the same off-by-one (Float64() is also [0.0, 1.0)); this is documented now rather than papered over, so callers reading the formula see the same bounds the implementation produces. Validated locally on Java 17: - :api:check (compile + spotless + spotbugs + tests) PASS - :api:javadoc PASS
| @@ -0,0 +1,136 @@ | |||
| /* | |||
There was a problem hiding this comment.
Retry.java is a technically correct, direct port from minio-go. Removing it would disrupt that alignment.
This comment was marked as duplicate.
This comment was marked as duplicate.
Sorry, something went wrong.
There was a problem hiding this comment.
Strictly speaking none of this is needed. I am seeking to maintain alignment with minio-go as much as possible. It is a single source of truth, with pure-function testability, policy versus mechanism separation, as well as parity with minio-go/retry.go
Perhaps more substantive: it is the SDK's only retry policy: with an HTTP-status set, IOException classifier, and jitter formula - none of which OkHttp provides. They are kept in one package-private class so the synch path and the interceptor can't drift, and so the predicates are unit-testable as pure functions.
As an alternative I could inline the four policy pieces as private statics inside Http.RetryInterceptor. What do you prefer @balamurugana ?
There was a problem hiding this comment.
Please remove it. We should not copy/convert go code. We need to use each language features.
There was a problem hiding this comment.
Spoke to Bala - I am removing this entire file and refactoring to suit.
Five concessions from bala's review batch — each verified against minio-go's contract or the file's own pre-PR conventions: BaseS3Client.java - Restore the `protected static final Random RANDOM` field (with its java.security.SecureRandom + java.util.Random imports) that 60aeaed dropped when switching backoff jitter to ThreadLocalRandom. Per bala's r3205617158: do not remove protected/public API surface. The field is part of the public subclassing contract on BaseS3Client and is the Java analogue of minio-go's per-client `c.random` source (api.go:307); SDK-internal code paths still use ThreadLocalRandom.current() for the retry backoff hot path so jitter is unchanged. - Drop the null-guard around `regionCache.remove(s3request.bucket())` introduced in beead6e. The guard was only load-bearing for the contrived testNoRetryOn404 (listBuckets + 404 NoSuchBucket — a wire shape no spec-compliant S3 server emits since NoSuchBucket only applies to bucket-scoped ops); the test is rewritten in this commit to a realistic scenario, removing the dependency on the guard. Http.java - Drop the two-line `// Our RetryInterceptor handles transient failures with full-jitter backoff; defer entirely to it instead of layering OkHttp's connection-level retry on top.` comment from newDefaultClient() per bala's r3205589355: the retryOnConnectionFailure(false) call paired with addInterceptor(new RetryInterceptor()) is self-explanatory and the prose duplicated the RetryInterceptor Javadoc. - Drop the `Utils.validateNotNull(requestBody, "request body")` guard from the Body(okhttp3.RequestBody) constructor per bala's r3205613399; the three other Body constructors do not validate non-null and let the next-line dereference fail with the JVM's helpful-NPE message — restoring that local convention. - Drop the `private long fileOffset` field, its capture in the Body(RandomAccessFile,...) constructor, and the seek in toRequestBody() per bala's r3205614300. Empirical truth-table probe (4 trials toggling F = Body.fileOffset against C+R = createBody save/restore + RequestBody.position+seek) confirmed F is redundant when C+R are present: removing F alone keeps both first-attempt and retry-attempt body content correct (Trial 3), since the pre-Body-construction seek in createBody already places the file pointer where RequestBody.position will capture it on construction and writeTo() will seek back to on each network attempt. RetryTest.java - Rewrite testNoRetryOn404 to call removeBucket on a non-existent bucket. NoSuchBucket only applies to bucket-scoped operations on the S3 wire; pairing it with the bucket-less listBuckets() is a contrived scenario AWS S3 / MinIO never produce. The rewrite preserves coverage (404 does not retry, ErrorResponseException carries code "NoSuchBucket" and status 404, request count == 1) while exercising a wire shape the SDK actually has to handle, and uses a non-null bucket so the regionCache invalidation path runs with a real ConcurrentHashMap.remove(String) call. Validated locally on Java 17: - :api:check (compile + spotless + spotbugs + tests) PASS - :api:javadoc PASS
| @@ -0,0 +1,136 @@ | |||
| /* | |||
This comment was marked as duplicate.
This comment was marked as duplicate.
Sorry, something went wrong.
| } catch (IOException e) { | ||
| throw new MinioException(e); | ||
| } | ||
| } |
This comment was marked as duplicate.
This comment was marked as duplicate.
Sorry, something went wrong.
Five reviewer threads addressed plus a WHY-only comment trim across the
touched files. Each thread is mapped to the reviewer's discussion ID and
verified locally against :api:check / :api:javadoc / a production-cluster
repro on the local mint-erasure compose.
r3205524097 / r3206746125 ("we should not wrap")
- BaseS3Client.java: drop wrapWithRetry() and the matching call sites in
both ctors. Caller-supplied OkHttpClient is now used verbatim — no
newBuilder(), no interceptor strip-and-add, no retryOnConnectionFailure
override — per bala's principle that the user's client is the user's
client. Field/Javadoc updates to setMaxRetries, executeAsync, the
copy-ctor, and the maxRetries field reflect the new contract.
- Http.java: add newDefaultClient(IntSupplier) overload so the SDK-default
client can have its retry budget bound to BaseS3Client.maxRetries via a
supplier. The no-arg form delegates with () -> Retry.MAX_RETRY for
backward compatibility (MinioAdminClient, functional/TestArgs).
- MinioAsyncClient.java: Builder.build() constructs the client first, then
assigns httpClient via the supplier-bound newDefaultClient when the
caller didn't supply one. Builder.httpClient(...) and Builder.maxRetries
Javadocs spell out the new caller-supplied-client contract.
- RetryTest.java: replace testUserSuppliedClientStillRetries (which
exercised the now-gone wrap behaviour) with two contract tests:
testUserSuppliedClientWithoutInterceptorDoesNotRetry and
testUserSuppliedClientWithInterceptorRetries.
r3205557447 (thread-safety on this.httpClient reassignment)
- BaseS3Client.java:114: protected OkHttpClient httpClient ->
protected volatile OkHttpClient httpClient. Matches the precedent of
volatile int maxRetries on the same class. Provides JMM happens-before
between writers (setTimeout / ignoreCertCheck reassigning the field)
and readers (executeAsync). The local-snapshot pattern bala originally
proposed does not solve the visibility problem; the field qualifier
does. The "no two timeouts mid-call" property bala raised separately
is already guaranteed by OkHttp Call binding (a Call captures its
OkHttpClient at newCall time and the chain reuses it for retries).
r3205580978 ("createBody save/restore not needed")
- Checksum.java: rewrite the RandomAccessFile branch of update() to use
positional FileChannel.read(ByteBuffer, long) so the method no longer
mutates the file pointer. Side-effect-free helper now; callers don't
need to bracket with getFilePointer / seek.
- BaseS3Client.java: drop the createBody fileStartPos capture and the
matching seek-back. Empirically verified on a live MinIO erasure
cluster: client.putObject(PutObjectAPIArgs) with a RandomAccessFile
body and no pre-set checksum headers now completes successfully (the
prior incorrect "remove these lines" world fails with EOFException
because Http.RequestBody captures the post-checksum EOF position).
- MinioAsyncClient.java: drop the two cousin save/restore patterns
around Checksum.update at lines ~2010 and ~3560 — both are now
redundant for the same reason.
r3205604391 ("don't probe S3 error codes")
- Retry.java: drop RETRYABLE_S3_CODES (12-element set) and
isS3CodeRetryable(String). Class is now narrower: HTTP status set,
IOException classifier, and full-jitter backoff formula. minio-go-
parity comment removed since the SDK no longer mirrors that list.
- Http.java: drop MAX_PEEK_BYTES, S3_ERROR_CODE_PATTERN, the
peekS3ErrorCode helper, and the body-probing branch in
RetryInterceptor.intercept. RetryInterceptor Javadoc updated to drop
the S3-code bullet from "Retries on:". Unused
java.util.regex.Pattern import removed (Matcher is still used
elsewhere in the file).
- RetryTest.java: drop testIsS3CodeRetryable_retryable,
testIsS3CodeRetryable_notRetryable, and the integration test
testRetryOnRetryableS3CodeIn400Body. 28 RetryTest cases remain,
all PASS.
r3206792852 ("we already have RANDOM")
- MinioAsyncClient.java:96, 3382: drop the
java.util.concurrent.ThreadLocalRandom import and swap
ThreadLocalRandom.current() for the inherited
BaseS3Client.RANDOM at the fan-out object-name suffix site. The
ThreadLocalRandom usage in Retry.exponentialBackoffMs is a
separate, hot-path concern (added in 60aeaed to address Copilot
review 3191258329 about RANDOM contention); bala did not flag it
and it remains.
Comment trim
- WHY-only comments across the six files: removed restatements of
what the code does, kept only non-obvious rationale (order
constraints, cross-class contracts, gotchas). Javadocs trimmed to
the user-facing contract; mechanics moved out.
Validated on Java 17:
- :api:check (compile + spotless + spotbugs + 28 RetryTest cases) PASS
- :api:javadoc PASS
- Live mint-erasure compose cluster + ReproUploadObject driving the
no-pre-set-checksum-headers PUT path: PASS
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five reviewer threads addressed plus a WHY-only comment trim across the
touched files. Each thread is mapped to the reviewer's discussion ID and
verified locally against :api:check / :api:javadoc / a production-cluster
repro on the local mint-erasure compose.
r3205524097 / r3206746125 ("we should not wrap")
- BaseS3Client.java: drop wrapWithRetry() and the matching call sites in
both ctors. Caller-supplied OkHttpClient is now used verbatim — no
newBuilder(), no interceptor strip-and-add, no retryOnConnectionFailure
override — per bala's principle that the user's client is the user's
client. Field/Javadoc updates to setMaxRetries, executeAsync, the
copy-ctor, and the maxRetries field reflect the new contract.
- Http.java: add newDefaultClient(IntSupplier) overload so the SDK-default
client can have its retry budget bound to BaseS3Client.maxRetries via a
supplier. The no-arg form delegates with () -> Retry.MAX_RETRY for
backward compatibility (MinioAdminClient, functional/TestArgs).
- MinioAsyncClient.java: Builder.build() constructs the client first, then
assigns httpClient via the supplier-bound newDefaultClient when the
caller didn't supply one. Builder.httpClient(...) and Builder.maxRetries
Javadocs spell out the new caller-supplied-client contract.
- RetryTest.java: replace testUserSuppliedClientStillRetries (which
exercised the now-gone wrap behaviour) with two contract tests:
testUserSuppliedClientWithoutInterceptorDoesNotRetry and
testUserSuppliedClientWithInterceptorRetries.
r3205557447 (thread-safety on this.httpClient reassignment)
- BaseS3Client.java:114: protected OkHttpClient httpClient ->
protected volatile OkHttpClient httpClient. Matches the precedent of
volatile int maxRetries on the same class. Provides JMM happens-before
between writers (setTimeout / ignoreCertCheck reassigning the field)
and readers (executeAsync). The local-snapshot pattern bala originally
proposed does not solve the visibility problem; the field qualifier
does. The "no two timeouts mid-call" property bala raised separately
is already guaranteed by OkHttp Call binding (a Call captures its
OkHttpClient at newCall time and the chain reuses it for retries).
r3205580978 ("createBody save/restore not needed")
- Checksum.java: rewrite the RandomAccessFile branch of update() to use
positional FileChannel.read(ByteBuffer, long) so the method no longer
mutates the file pointer. Side-effect-free helper now; callers don't
need to bracket with getFilePointer / seek.
- BaseS3Client.java: drop the createBody fileStartPos capture and the
matching seek-back. Empirically verified on a live MinIO erasure
cluster: client.putObject(PutObjectAPIArgs) with a RandomAccessFile
body and no pre-set checksum headers now completes successfully (the
prior incorrect "remove these lines" world fails with EOFException
because Http.RequestBody captures the post-checksum EOF position).
- MinioAsyncClient.java: drop the two cousin save/restore patterns
around Checksum.update at lines ~2010 and ~3560 — both are now
redundant for the same reason.
r3205604391 ("don't probe S3 error codes")
- Retry.java: drop RETRYABLE_S3_CODES (12-element set) and
isS3CodeRetryable(String). Class is now narrower: HTTP status set,
IOException classifier, and full-jitter backoff formula. minio-go-
parity comment removed since the SDK no longer mirrors that list.
- Http.java: drop MAX_PEEK_BYTES, S3_ERROR_CODE_PATTERN, the
peekS3ErrorCode helper, and the body-probing branch in
RetryInterceptor.intercept. RetryInterceptor Javadoc updated to drop
the S3-code bullet from "Retries on:". Unused
java.util.regex.Pattern import removed (Matcher is still used
elsewhere in the file).
- RetryTest.java: drop testIsS3CodeRetryable_retryable,
testIsS3CodeRetryable_notRetryable, and the integration test
testRetryOnRetryableS3CodeIn400Body. 28 RetryTest cases remain,
all PASS.
r3206792852 ("we already have RANDOM")
- MinioAsyncClient.java:96, 3382: drop the
java.util.concurrent.ThreadLocalRandom import and swap
ThreadLocalRandom.current() for the inherited
BaseS3Client.RANDOM at the fan-out object-name suffix site. The
ThreadLocalRandom usage in Retry.exponentialBackoffMs is a
separate, hot-path concern (added in 60aeaed to address Copilot
review 3191258329 about RANDOM contention); bala did not flag it
and it remains.
Comment trim
- WHY-only comments across the six files: removed restatements of
what the code does, kept only non-obvious rationale (order
constraints, cross-class contracts, gotchas). Javadocs trimmed to
the user-facing contract; mechanics moved out.
Validated on Java 17:
- :api:check (compile + spotless + spotbugs + 28 RetryTest cases) PASS
- :api:javadoc PASS
- Live mint-erasure compose cluster + ReproUploadObject driving the
no-pre-set-checksum-headers PUT path: PASS
14bcc29 to
93e4f47
Compare
|
All comments addressed |
Three changes: (1) Retry.java is folded into Http.RetryInterceptor as package-private static members (MAX_RETRY, DEFAULT_RETRY_UNIT_MS, DEFAULT_RETRY_CAP_MS, MAX_JITTER, RETRYABLE_HTTP_STATUS_CODES, isHttpStatusRetryable, isRequestErrorRetryable, exponentialBackoffMs). The class was a one-use container for the interceptor's helpers; folding shrinks the package surface. RetryTest call sites updated to Http.RetryInterceptor.X. (2) BaseS3Client.createBody save/restore bracket dropped (r3206778779, "It is not needed for Java"). For the file-PUT paths that go through MinioAsyncClient, the upstream brackets at L1995/L3540 leave the file pointer at start-of-payload, and the pre-computed Checksum.makeHeaders headers short-circuit every Checksum.update guard inside createBody -- so the pointer never moves there. Http.RequestBody.writeTo seeks to the captured start-of-payload on every attempt, so the retry interceptor's repeated writeTo calls replay the same bytes. createBody is now byte-identical to master. (3) Checksum.java reverted to master form. The positional FileChannel.read(ByteBuffer, long) variant added in 93e4f47 exists only to make the bracket in createBody redundant; once the bracket is gone, the variant is no longer needed and Checksum.update returns to the master file.read(buf, off, len) form. MinioAsyncClient L1995/L3540 brackets remain (they match master verbatim). Empirical validation -- two new tests in RetryTest: - testRetryWithFileBody_replaysSameBytesOnEveryAttempt: 8 KiB random payload, MockWebServer queue 503/503/503/200; asserts 4 PUT requests recorded and each body is byte-identical to the original payload. - testRetryWithFileBody_transportFailureReplaysFromStart: 4 KiB payload, DISCONNECT_AT_START followed by 200; asserts the retry attempt's body matches the original payload after seek-back. :api:check (compile + spotless + spotbugs + 30 RetryTest cases) PASS :api:javadoc PASS
c2224b1 to
0ae488b
Compare
Description
Adds SDK-level automatic retry on transient HTTP failures via a single OkHttp interceptor (
Http.RetryInterceptor) installed on the SDK's defaultOkHttpClient(built byHttp.newDefaultClient()/Http.newDefaultClient(IntSupplier)). The interceptor retries on retryable HTTP status codes and retryable transport-levelIOExceptions, then returns the final response (or the terminal exception). The retry budget is exposed asBuilder.maxRetries(int)/setMaxRetries(int)and defaults to 10 attempts.Caller-supplied clients are used verbatim. A custom
OkHttpClientpassed viaBuilder.httpClient(...)is not modified: no interceptor injection, noretryOnConnectionFailureoverride, no rebuild vianewBuilder(). To opt into the SDK's retry policy on a custom client, registerHttp.RetryInterceptoryourself and pair it withretryOnConnectionFailure(false).Builder.maxRetries(...)andMinioAsyncClient.setMaxRetries(...)are no-ops on the caller-supplied path unless the caller has wired the interceptor to read from there.On the SDK-default path,
OkHttpClient.retryOnConnectionFailure(false)is set so the SDK's interceptor is the single source of retry policy — no double-retry layered on top. The interceptor's attempt budget is bound to the per-instanceBaseS3Client.maxRetriesfield via anIntSuppliersosetMaxRetries(int)takes effect on the very next request.Retryable HTTP status codes (8): 408, 429, 499, 500, 502, 503, 504, 520.
Retryable transport errors: connection reset, EOF, socket timeout, idle-connection close, etc. SSL handshake / unknown-CA / cert-path errors and the HTTP-on-HTTPS protocol mismatch are explicitly not retried.
Backoff: full-jitter exponential, base 200 ms, per-attempt cap 1 s, exponent capped at 30 to avoid bit-shift overflow. Result range per attempt is
[1, min(cap, base · 2ⁿ)].Cancellation:
chain.call().isCanceled()is consulted before each attempt and after every IOException so cancelled calls do not burn the retry budget.Region/HEAD redirect: untouched —
executeHeadAsyncregion rediscovery still works.Public API:
Retry(helper class) is package-private; the public surface isHttp.RetryInterceptor,Http.newDefaultClient(...), and the builder/setter methods.Side-effect-free Checksum.update
Checksum.update'sRandomAccessFilebranch was rewritten to use positionalFileChannel.read(ByteBuffer, long position)so the method no longer mutates the file pointer. As a result,BaseS3Client.createBodyand the two cousin patterns insideMinioAsyncClient.putObjectno longer need to bracket their checksum reads withgetFilePointer()/seek(). Net effect: ~25 lines deleted across three call sites, callers simpler, behaviour identical end-to-end.minio-go alignment
The retry contract is intentionally aligned with
minio-go/retry.go:MaxRetryDefaultRetryUnitDefaultRetryCapMaxJittermin(cap, unit·2ⁿ) − rand·…Two deliberate divergences:
SlowDown/InternalError/Throttling→ 503/500), so observable coverage is unchanged in real-world workloads. Removed: 5 MiB body peek, XML regex parse, the 12-element S3-code set.okhttp3.Requestbecause retries run inside the OkHttp interceptor layer. For the default attempt budget the worst-case wall-clock retry window is well under the 15-minute S3 signing window. Documented inline onHttp.RetryInterceptor.Motivation and context
The SDK had no retry capability for transient HTTP failures. Throttling (503), gateway errors (502/504), request timeouts (408), and Cloudflare 520s propagated directly to callers, requiring every consumer to implement its own retry loop.
This PR went through several iterations driven by reviewer feedback:
CompletableFuture-based retry loop with a dedicated scheduler thread.BaseS3Client.httpClientisvolatilefor safe publication acrosssetTimeout/ignoreCertCheckreassignments (r3205557447).Checksum.updatemade side-effect-free on the file pointer;createBodysave/restore (and two cousins inMinioAsyncClient.putObject) dropped (r3205580978).BaseS3Client.RANDOMfor fan-out object-name suffix (r3206792852).Does not address #1700: error code
BucketNotEmpty(HTTP 409) is not natively retryable in minio-go and stays out of the retryable set here.How to test this PR
Unit + integration + lint (Java 17):
28
RetryTestcases: eachRetrypredicate, the backoff formula, and MockWebServer integration tests covering retry per status code, no-retry on 4xx, retry exhaustion (bothInvalidResponseExceptionandErrorResponseExceptionpaths),maxRetries(1)disables retry,setMaxRetriespost-construction, builder validation (0 and -1), connection-drop IOException retry, caller-supplied client contracts (testUserSuppliedClientWithoutInterceptorDoesNotRetryandtestUserSuppliedClientWithInterceptorRetries), cert-path SSLException not retryable, and a 10 s timeout-guarded test that firesCall.cancel()mid-retry-loop and asserts the loop short-circuits in <5 s.Functional tests against a real server:
Mint suite via Docker:
Verify the caller-supplied-client contract (no interceptor injection,
retryOnConnectionFailureleft alone):To opt into SDK retry on a custom client, register the interceptor explicitly:
Both behaviours are locked in by
RetryTest.Validation performed
:api:check(compile + spotless + spotbugs + 28RetryTestcases, Java 17):api:javadoc./gradlew build(api, adminapi, examples, functional):functional:runFunctionalTestagainst play.min.ioTestMinioClient.testBucketCorsonplay.min.iois unrelated to this PR —CORSConfigurationlacks anequalsoverride, and the branch does not touch CORS code)registry.min.dev/public/mint:aistorimage)Checksum.updatechange:client.putObject(PutObjectAPIArgs)with aRandomAccessFilebody and no pre-set checksum headersEOFExceptionafter 0 bytes — confirms the fix is load-bearing on this code path)