Fix bug in MultipartS3AsyncClient GetObject Retryable Errors #6309

Open
wants to merge 9 commits into base: master

davidh44
Contributor

@davidh44 davidh44 commented Jul 31, 2025

Motivation and Context

This PR fixes a bug in the Java-based multipart S3 client (MultipartS3AsyncClient) that causes incorrect retry behavior and duplicate request logging during GetObject operations.

Currently, when S3 returns a retryable error, e.g., 503 Slow Down, the SDK may do one of two things:

  1. Fail the request right away instead of retrying
  2. Retry the error, but incorrectly process subsequent responses. A successful 200 response returned by the server will be ignored. Instead, the SDK will log the initial error and continue to retry until retry attempts are exhausted. This may happen instead of scenario 1 (failing right away) if many concurrent requests are in progress and a race condition occurs.

Modifications

SplittingTransformer

  • Retryable errors on the first part will be retried. Errors on subsequent parts will not be retried
    • Part numbers will be tracked in IndividualTransformer
    • Update to only call publisherToUpstream.error() in IndividualTransformer.exceptionOccurred() for part numbers greater than 1
  • Retry attempt responses will be properly processed
    • Remove invocation of individualFuture.completeExceptionally() in IndividualTransformer.prepare() when the resultFuture completes
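The first-part rule above can be sketched in isolation. This is a minimal illustration, not the SDK's SplittingTransformer: the class `PartErrorRouting`, its `upstreamErrors` list, and the boolean return value are all hypothetical names introduced here, and the list merely stands in for a call such as publisherToUpstream.error().

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the error routing described above; not SDK code. */
final class PartErrorRouting {
    /** Records errors that would have been pushed to the upstream publisher. */
    final List<Throwable> upstreamErrors = new ArrayList<>();

    /**
     * An error on part 1 is not propagated upstream, leaving room for a retry.
     * An error on any later part is propagated immediately.
     *
     * @return true if the error was propagated upstream
     */
    boolean exceptionOccurred(int partNumber, Throwable error) {
        if (partNumber > 1) {
            upstreamErrors.add(error);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        PartErrorRouting routing = new PartErrorRouting();
        System.out.println(routing.exceptionOccurred(1, new RuntimeException("503")));
        System.out.println(routing.exceptionOccurred(2, new RuntimeException("503")));
    }
}
```

In this sketch only the second call records an error, which mirrors the intent that a retryable failure on the first part is left to the retry mechanism.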

DownloadObjectHelper

  • Forward exceptions from the subscriber future to the result future
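As a rough illustration of forwarding a failure from one future to another, the sketch below uses only the JDK's CompletableFuture. The class and method names (`ForwardExceptionSketch`, `forwardExceptionTo`) are chosen for this example and are not taken from the PR's diff.

```java
import java.util.concurrent.CompletableFuture;

/** Plain-JDK sketch of exception forwarding; names are illustrative only. */
final class ForwardExceptionSketch {
    /**
     * When source fails, fail target with the same throwable.
     * A successful completion of source is deliberately not forwarded.
     */
    static <T> void forwardExceptionTo(CompletableFuture<?> source, CompletableFuture<T> target) {
        source.whenComplete((ignored, error) -> {
            if (error != null) {
                target.completeExceptionally(error);
            }
        });
    }

    public static void main(String[] args) {
        CompletableFuture<Void> subscriberFuture = new CompletableFuture<>();
        CompletableFuture<String> resultFuture = new CompletableFuture<>();
        forwardExceptionTo(subscriberFuture, resultFuture);
        subscriberFuture.completeExceptionally(new IllegalStateException("part failed"));
        System.out.println(resultFuture.isCompletedExceptionally());
    }
}
```

Forwarding only the exceptional path leaves the result future free to be completed normally by whatever produces the final response.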

MultipartDownloaderSubscriber

  • Keep track of each part GET request and cancel all in-flight ones if onError() is invoked
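One way to track in-flight requests and cancel them on error is sketched below. It assumes each part GET is represented by a CompletableFuture; `InFlightTracker` and its methods are hypothetical and do not reflect the actual fields of MultipartDownloaderSubscriber.

```java
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Hypothetical tracker for outstanding part requests; not SDK code. */
final class InFlightTracker {
    private final Queue<CompletableFuture<?>> inFlight = new ConcurrentLinkedQueue<>();

    /** Registers a part request and drops it from the queue once it completes. */
    <T> CompletableFuture<T> track(CompletableFuture<T> partFuture) {
        inFlight.add(partFuture);
        partFuture.whenComplete((result, error) -> inFlight.remove(partFuture));
        return partFuture;
    }

    /** Cancels every request that is still outstanding. */
    void onError(Throwable t) {
        CompletableFuture<?> future;
        while ((future = inFlight.poll()) != null) {
            future.cancel(true);
        }
    }

    int inFlightCount() {
        return inFlight.size();
    }
}
```

Requests that already finished are removed by their completion callback, so onError() in this sketch only touches futures that are still pending.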

Testing

Added mock tests
Integ tests passed

Screenshots (if appropriate)

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)

@davidh44 davidh44 marked this pull request as ready for review July 31, 2025 19:12
@davidh44 davidh44 requested a review from a team as a code owner July 31, 2025 19:12
@zoewangg zoewangg requested a review from Copilot August 1, 2025 17:38
@Copilot Copilot AI left a comment

Pull Request Overview

This PR fixes a bug in the MultipartS3AsyncClient where retryable errors during GetObject operations were failing immediately or causing the initial error to be propagated even after successful retries. The fix improves error handling logic in the SplittingTransformer to properly handle retries for the first part of multipart downloads.

  • Fixed error propagation logic to only retry errors for the first part of multipart downloads
  • Added proper future forwarding to ensure upstream errors are handled correctly
  • Enhanced test coverage with comprehensive WireMock tests for retry scenarios

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

File | Description
--- | ---
SplittingTransformer.java | Core fix - updated error handling logic to properly distinguish first part from subsequent parts and handle retries correctly
MultipartDownloaderSubscriber.java | Added future tracking and cleanup for proper error handling
DownloadObjectHelper.java | Added future forwarding to ensure exceptions are properly propagated
S3MultipartClientGetObjectWiremockTest.java | Comprehensive new test suite covering retry scenarios and error handling
MultipartDownloaderSubscriberWiremockTest.java | Removed obsolete test that was failing on first request errors
MultipartS3AsyncClient.java | Updated javadoc to reflect GET support
pom.xml | Added test dependency for retries module
IndividualPartSubscriberTckTest.java | Updated constructor call to match new signature
bugfix changelog | Documentation of the fix

@dagnir
Contributor

dagnir commented Aug 1, 2025

Please add more context and details in the PR description.

sonarqubecloud bot commented Aug 5, 2025

Quality Gate failed

Failed conditions
78.4% Coverage on New Code (required ≥ 80%)
C Reliability Rating on New Code (required ≥ A)


@@ -314,14 +317,26 @@ public void onStream(SdkPublisher<ByteBuffer> publisher) {
);
}
}
publisher.subscribe(new IndividualPartSubscriber<>(this.individualFuture, response));

CompletableFutureUtils.forwardResultTo(upstreamFuture, resultFuture);
Contributor

Why do we forward result here again and do it for every part?

Contributor Author

I guess we just need to do it for the first part. When this is removed, an error on a subsequent part won't get propagated.

Contributor Author

Specifically, in the new test I added, if we don't call this at all, instead of S3Exception being thrown, we get:

SdkClientException: Unable to execute HTTP request: onError() was already invoked. (SDK Attempt Count: 2)

assertThatThrownBy(() -> multipartClient.getObject(b -> b.bucket(BUCKET).key(KEY),

Contributor

Hmm, we should probably figure out why onError was invoked multiple times.

Contributor Author

This is the flow of events:

  1. Error response is returned
  2. IndividualTransformer.exceptionOccurred() is invoked, which calls publisherToUpstream.error(error)
  3. In SimplePublisher.doProcessQueue(), in the ON_ERROR case failureMessage is set
  4. IndividualPartSubscriber.onNext() is invoked, because there is still outstanding demand. Here, publisherToUpstream.send(byteBuffer) is called
  5. In SimplePublisher.doProcessQueue(), the entry is onNext, but because the failureMessage is set, entry.resultFuture.completeExceptionally(failureMessage.get()) is invoked

I'm not sure how we can prevent step 4 from happening. Is it necessary to prevent it?
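The sequence above can be modelled with a heavily simplified publisher. This is only a sketch of the described behaviour, not the real SimplePublisher: `FailedPublisherSketch` is a hypothetical class that remembers the first failure and fails every later send with it.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;

/** Simplified model of a publisher that remembers its first failure; not SDK code. */
final class FailedPublisherSketch {
    private final AtomicReference<Throwable> failure = new AtomicReference<>();

    /** Steps 2 and 3: record the failure (only the first one is kept). */
    void error(Throwable t) {
        failure.compareAndSet(null, t);
    }

    /** Steps 4 and 5: a send after a failure completes exceptionally with that failure. */
    CompletableFuture<Void> send(byte[] data) {
        CompletableFuture<Void> result = new CompletableFuture<>();
        Throwable recorded = failure.get();
        if (recorded != null) {
            result.completeExceptionally(recorded);
        } else {
            result.complete(null);
        }
        return result;
    }
}
```

In this model, any data that arrives after error() has been called surfaces the original failure a second time through the future returned by send().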

Contributor

We could always prevent sending bytes to the publisherToUpstream after IndividualTransformer received an error, would that be a solution?
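A guard of the kind suggested here might look roughly like the following. It is a sketch under assumptions: `GuardedForwarder` is a hypothetical class, and its `forwarded` list stands in for a call such as publisherToUpstream.send(byteBuffer).

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical guard that stops forwarding bytes after an error; not SDK code. */
final class GuardedForwarder {
    private final AtomicBoolean failed = new AtomicBoolean(false);
    /** Stand-in for the upstream publisher: buffers that were forwarded. */
    final List<ByteBuffer> forwarded = new ArrayList<>();

    /** Marks the transformer as failed so later buffers are dropped. */
    void exceptionOccurred(Throwable error) {
        failed.set(true);
    }

    /**
     * Forwards the buffer unless an error was already seen.
     *
     * @return true if the buffer was forwarded, false if it was dropped
     */
    boolean onNext(ByteBuffer buffer) {
        if (failed.get()) {
            return false;
        }
        forwarded.add(buffer);
        return true;
    }
}
```

The flag check and the forward are not atomic in this sketch, so a buffer that is already past the check when the error arrives could still be sent; whether that window matters would depend on how the real classes are synchronized.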

/**
* Tracks the part number. Errors will only be retried for the first part.
*/
private final AtomicInteger partNumber = new AtomicInteger(0);
Contributor

NIT (non-blocking): I think we should avoid referring to 'parts' in the SplittingTransformer. That's an S3 concept, whereas this class is more abstract and lives in a core package. Maybe rename it to something like 'onNextSignalsSent' or 'onNextNumber', which would track the total number of 'onNext' signals sent to the downstreamSubscriber.

@@ -259,28 +262,27 @@ private void handleFutureCancel(Throwable e) {
* body publisher.
*/
private class IndividualTransformer implements AsyncResponseTransformer<ResponseT, ResponseT> {
private final int partNumber;
Contributor

NIT: same here, avoid referring to S3-related terms.
