Skip to content

Conversation

@ywangd
Copy link
Member

@ywangd ywangd commented Nov 14, 2025

This PR adds a new setting to configure AWS SDK API call timeout (https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/timeouts.html). Defaults to 0, i.e. no timeout.

@ywangd ywangd added >non-issue :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v9.3.0 labels Nov 14, 2025
Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this'd be better as an internal-cluster test akin to e.g. S3BlobStoreRepositoryTests or S3BlobContainerRetriesTests. That way we can call writeBlob directly, rather than having to construct a shard of the right sort of size to trigger such a call during the snapshot, and also we can cancel the sleep when the request times out.

@ywangd ywangd marked this pull request as ready for review November 28, 2025 07:34
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination Meta label for Distributed Coordination team label Nov 28, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@ywangd
Copy link
Member Author

ywangd commented Nov 28, 2025

Replaced the REST test with an internal cluster test as suggested in 3c1a680

Also added a new api_call_timeout setting. It defaults to 0, i.e. disabled, so that it is a no-op for existing clusters. We can look into enabling it separately.

@ywangd ywangd requested a review from DaveCTurner November 28, 2025 07:37
@ywangd ywangd changed the title Test for S3 upload time unresponsiveness Add a new setting for s3 API call timeout Nov 28, 2025
@elasticsearchmachine
Copy link
Collaborator

Hi @ywangd, I've created a changelog YAML for you.

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some suggestions/nits but otherwise looks like a sensible approach to me.

static final TimeValue CONNECTION_MAX_IDLE_TIME = TimeValue.timeValueSeconds(60);
static final int MAX_CONNECTIONS = 50;
static final int RETRY_COUNT = 3;
static final TimeValue API_CALL_TIMEOUT = TimeValue.ZERO; // default to no API call timeout
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we use TimeValue.MINUS_ONE to mean "no timeout", leaving TimeValue.ZERO to mean "time out immediately"? Or at least to defer the meaning of zero down to the SDK, I haven't followed through to determine whether zero means zero or infinity within the SDK but I'd prefer we had some way to say "don't even invoke clientOverrideConfiguration.apiCallTimeout".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out zero is invalid here:

org.elasticsearch.repositories.RepositoryVerificationException: [repository] path [base_path_integration_tests] is not accessible on master node
Caused by: java.lang.IllegalArgumentException: apiCallTimeout must be positive
        at software.amazon.awssdk.utils.Validate.isPositive(Validate.java:694) ~[?:?]
        at software.amazon.awssdk.utils.Validate.isPositiveOrNull(Validate.java:711) ~[?:?]
        at software.amazon.awssdk.core.client.config.ClientOverrideConfiguration.<init>(ClientOverrideConfiguration.java:172) ~[?:?]
        at software.amazon.awssdk.core.client.config.ClientOverrideConfiguration$DefaultBuilder.build(ClientOverrideConfiguration.java:1126) ~[?:?]
        at software.amazon.awssdk.core.client.config.ClientOverrideConfiguration$DefaultBuilder.build(ClientOverrideConfiguration.java:786) ~[?:?]
        at org.elasticsearch.repositories.s3.S3Service.buildConfiguration(S3Service.java:347) ~[?:?]
        at org.elasticsearch.repositories.s3.S3Service.buildClientBuilder(S3Service.java:200) ~[?:?]
        at org.elasticsearch.repositories.s3.S3Service.buildClient(S3Service.java:193) ~[?:?]
        at org.elasticsearch.repositories.s3.S3Service.buildClientReference(S3Service.java:178) ~[?:?]
        at org.elasticsearch.repositories.s3.S3ClientsManager$ClientsHolder.client(S3ClientsManager.java:326) ~[?:?]
        at org.elasticsearch.repositories.s3.S3ClientsManager.client(S3ClientsManager.java:192) ~[?:?]
        at org.elasticsearch.repositories.s3.S3Service.client(S3Service.java:163) ~[?:?]
        at org.elasticsearch.repositories.s3.S3BlobStore.clientReference(S3BlobStore.java:270) ~[?:?]
        at org.elasticsearch.repositories.s3.S3BlobContainer.executeSingleUpload(S3BlobContainer.java:550) ~[?:?]
        at org.elasticsearch.repositories.s3.S3BlobContainer.writeBlob(S3BlobContainer.java:148) ~[?:?]
        at org.elasticsearch.common.blobstore.BlobContainer.writeBlob(BlobContainer.java:123) ~[elasticsearch-9.3.0-SNAPSHOT.jar:?]
        at org.elasticsearch.repositories.s3.S3BlobContainer.writeBlobAtomic(S3BlobContainer.java:332) ~[?:?]
        at org.elasticsearch.repositories.blobstore.BlobStoreRepository.startVerification(BlobStoreRepository.java:2314) ~[elasticsearch-9.3.0-SNAPSHOT.jar:?]
        at org.elasticsearch.repositories.RepositoriesService.lambda$validatePutRepositoryRequest$11(RepositoriesService.java:403) ~[elasticsearch-9.3.0-SNAPSHOT.jar:?]
        at org.elasticsearch.action.ActionRunnable$1.doRun(ActionRunnable.java:37) ~[elasticsearch-9.3.0-SNAPSHOT.jar:?]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:1113) ~[elasticsearch-9.3.0-SNAPSHOT.jar:?]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:27) ~[elasticsearch-9.3.0-SNAPSHOT.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1090) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:614) ~[?:?]
        at java.lang.Thread.run(Thread.java:1474) ~[?:?]

Still I'd prefer to use -1 to mean "missing".

/**
* The maximum time for a single attempt of an API operation
*/
final int apiCallTimeoutMillis;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we keep this as a TimeValue until it's actually used?

new BytesArray(randomBytes((int) ByteSizeValue.ofMb(10).getBytes())),
randomBoolean()
);
fail("should have timed out");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we use expectThrows() here?

} catch (IOException e) {
final var cause = ExceptionsHelper.unwrap(e, ApiCallTimeoutException.class);
assertNotNull(cause);
assertThat(cause.getMessage(), containsString("Client execution did not complete before the specified timeout configuration"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we assert something about how these outcomes are captured in the metrics? Should we add something specific to S3RepositoriesMetrics to track this case separately from other exceptions?

headerDecodedContentLength
);
try {
final var released = latch.await(60, TimeUnit.SECONDS);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we use safeAwait()? At least, I'd like us to be asserting that the latch is released in a reasonably timely manner.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >enhancement Team:Distributed Coordination Meta label for Distributed Coordination team v9.3.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants