-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Add a new setting for s3 API call timeout #138072
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
DaveCTurner
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this'd be better as an internal-cluster test akin to e.g. S3BlobStoreRepositoryTests or S3BlobContainerRetriesTests. That way we can call writeBlob directly, rather than having to construct a shard of the right sort of size to trigger such a call during the snapshot, and also we can cancel the sleep when the request times out.
|
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination) |
|
Replaced the REST test with an internal cluster test as suggested in 3c1a680 Also added a new |
|
Hi @ywangd, I've created a changelog YAML for you. |
DaveCTurner
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some suggestions/nits but otherwise looks like a sensible approach to me.
| static final TimeValue CONNECTION_MAX_IDLE_TIME = TimeValue.timeValueSeconds(60); | ||
| static final int MAX_CONNECTIONS = 50; | ||
| static final int RETRY_COUNT = 3; | ||
| static final TimeValue API_CALL_TIMEOUT = TimeValue.ZERO; // default to no API call timeout |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we use TimeValue.MINUS_ONE to mean "no timeout", leaving TimeValue.ZERO to mean "time out immediately"? Or at least to defer the meaning of zero down to the SDK, I haven't followed through to determine whether zero means zero or infinity within the SDK but I'd prefer we had some way to say "don't even invoke clientOverrideConfiguration.apiCallTimeout".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Turns out zero is invalid here:
org.elasticsearch.repositories.RepositoryVerificationException: [repository] path [base_path_integration_tests] is not accessible on master node
Caused by: java.lang.IllegalArgumentException: apiCallTimeout must be positive
at software.amazon.awssdk.utils.Validate.isPositive(Validate.java:694) ~[?:?]
at software.amazon.awssdk.utils.Validate.isPositiveOrNull(Validate.java:711) ~[?:?]
at software.amazon.awssdk.core.client.config.ClientOverrideConfiguration.<init>(ClientOverrideConfiguration.java:172) ~[?:?]
at software.amazon.awssdk.core.client.config.ClientOverrideConfiguration$DefaultBuilder.build(ClientOverrideConfiguration.java:1126) ~[?:?]
at software.amazon.awssdk.core.client.config.ClientOverrideConfiguration$DefaultBuilder.build(ClientOverrideConfiguration.java:786) ~[?:?]
at org.elasticsearch.repositories.s3.S3Service.buildConfiguration(S3Service.java:347) ~[?:?]
at org.elasticsearch.repositories.s3.S3Service.buildClientBuilder(S3Service.java:200) ~[?:?]
at org.elasticsearch.repositories.s3.S3Service.buildClient(S3Service.java:193) ~[?:?]
at org.elasticsearch.repositories.s3.S3Service.buildClientReference(S3Service.java:178) ~[?:?]
at org.elasticsearch.repositories.s3.S3ClientsManager$ClientsHolder.client(S3ClientsManager.java:326) ~[?:?]
at org.elasticsearch.repositories.s3.S3ClientsManager.client(S3ClientsManager.java:192) ~[?:?]
at org.elasticsearch.repositories.s3.S3Service.client(S3Service.java:163) ~[?:?]
at org.elasticsearch.repositories.s3.S3BlobStore.clientReference(S3BlobStore.java:270) ~[?:?]
at org.elasticsearch.repositories.s3.S3BlobContainer.executeSingleUpload(S3BlobContainer.java:550) ~[?:?]
at org.elasticsearch.repositories.s3.S3BlobContainer.writeBlob(S3BlobContainer.java:148) ~[?:?]
at org.elasticsearch.common.blobstore.BlobContainer.writeBlob(BlobContainer.java:123) ~[elasticsearch-9.3.0-SNAPSHOT.jar:?]
at org.elasticsearch.repositories.s3.S3BlobContainer.writeBlobAtomic(S3BlobContainer.java:332) ~[?:?]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.startVerification(BlobStoreRepository.java:2314) ~[elasticsearch-9.3.0-SNAPSHOT.jar:?]
at org.elasticsearch.repositories.RepositoriesService.lambda$validatePutRepositoryRequest$11(RepositoriesService.java:403) ~[elasticsearch-9.3.0-SNAPSHOT.jar:?]
at org.elasticsearch.action.ActionRunnable$1.doRun(ActionRunnable.java:37) ~[elasticsearch-9.3.0-SNAPSHOT.jar:?]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:1113) ~[elasticsearch-9.3.0-SNAPSHOT.jar:?]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:27) ~[elasticsearch-9.3.0-SNAPSHOT.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1090) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:614) ~[?:?]
at java.lang.Thread.run(Thread.java:1474) ~[?:?]
Still I'd prefer to use -1 to mean "missing".
| /** | ||
| * The maximum time for a single attempt of an API operation | ||
| */ | ||
| final int apiCallTimeoutMillis; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we keep this as a TimeValue until it's actually used?
| new BytesArray(randomBytes((int) ByteSizeValue.ofMb(10).getBytes())), | ||
| randomBoolean() | ||
| ); | ||
| fail("should have timed out"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we use expectThrows() here?
| } catch (IOException e) { | ||
| final var cause = ExceptionsHelper.unwrap(e, ApiCallTimeoutException.class); | ||
| assertNotNull(cause); | ||
| assertThat(cause.getMessage(), containsString("Client execution did not complete before the specified timeout configuration")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we assert something about how these outcomes are captured in the metrics? Should we add something specific to S3RepositoriesMetrics to track this case separately from other exceptions?
| headerDecodedContentLength | ||
| ); | ||
| try { | ||
| final var released = latch.await(60, TimeUnit.SECONDS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we use safeAwait()? At least, I'd like us to be asserting that the latch is released in a reasonably timely manner.
This PR adds a new setting to configure AWS SDK API call timeout (https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/timeouts.html). Defaults to 0, i.e. no timeout.