Fix flaky testWriteLargeBlob in AzureBlobContainerRetriesTests#145748
ankurs47 wants to merge 6 commits into elastic:main from
Conversation
Pinging @elastic/es-distributed (Team:Distributed)
```java
if (timeout != null) {
    clientSettings.put(TIMEOUT_SETTING.getConcreteSettingForNamespace(clientName).getKey(), timeout);
} else {
    clientSettings.put(TIMEOUT_SETTING.getConcreteSettingForNamespace(clientName).getKey(), TimeValue.timeValueSeconds(1));
}
```

I think you're right, 1 second is too short for test machines in general. I'd suggest org.elasticsearch.test.ESTestCase#SAFE_AWAIT_TIMEOUT, that's what we tend to use for "long enough" in these tests.

I'm concerned that the default remains 1s though. Most tests won't want this to time out, so maybe SAFE_AWAIT_TIMEOUT is the right default, and then any tests that do need to time out could set something shorter here instead.
```diff
-assertThat(countDownUploads.get(), equalTo(0));
+assertThat(countDownUploads.get(), lessThanOrEqualTo(0));
```

I'd rather we kept this as an exact check; we need to know if we're triggering more retries than we expected.
@DaveCTurner Addressed your PR comments. Let me know if it looks ok.
```diff
     return createBlobContainer(maxRetries, null, null, null, null, null, null);
 }

 private BlobContainer createBlobContainer(int maxRetries, @Nullable TimeValue timeout) {
     clientSettings.put(MAX_RETRIES_SETTING.getConcreteSettingForNamespace(clientName).getKey(), maxRetries);
-    clientSettings.put(TIMEOUT_SETTING.getConcreteSettingForNamespace(clientName).getKey(), TimeValue.timeValueSeconds(1));
+    if (timeout != null) {
+        clientSettings.put(TIMEOUT_SETTING.getConcreteSettingForNamespace(clientName).getKey(), timeout);
+    } else {
+        clientSettings.put(TIMEOUT_SETTING.getConcreteSettingForNamespace(clientName).getKey(), SAFE_AWAIT_TIMEOUT);
+    }
 }
```
Is this ok for the timeout-related tests or does it slow them down too far?

Yes, before this change testReadBlobWithReadTimeouts takes about 15s, but now it takes far longer. I think we should use SAFE_AWAIT_TIMEOUT by default, but a much shorter timeout for the tests that expect timeouts.
DaveCTurner left a comment:

LGTM, good stuff, thanks for the extra iterations.
Background

`testWriteLargeBlob` simulates a multi-block upload to Azure Blob Storage where every block must fail at least once before succeeding on retry. The test sets up an in-process HTTP server that uses `countDownUploads`, an `AtomicInteger` initialized to `nbErrors * nbBlocks`, as an interleave counter: odd decrements fail, even decrements succeed. At the end, the test asserted `countDownUploads.get() == 0`, meaning exactly the expected number of attempts had been made.

Root Cause

Sometimes on CI machines, the 1-second client timeout (`TIMEOUT_SETTING`) was tight enough that a block upload request could time out before the mock HTTP server responded, triggering an extra retry beyond the intended `nbErrors * nbBlocks` attempts. This extra retry decremented `countDownUploads` past zero (to -2, as seen in CI logs), causing the `equalTo(0)` assertion to fail even though the upload itself completed correctly:

```
--> succeeding block CfVQUJ0BmnAin91gFPX6, countDownUploads: -2
```

Root cause from the CI log:

```
Caused by: java.util.concurrent.TimeoutException: Did not observe any item or terminal signal within 1000ms in 'source(MonoDefer)'
```
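The interleave-counter scheme described in the Background can be sketched in isolation. This is a hypothetical, standalone illustration (simplified names, not the actual test code) of why one unexpected extra retry pushes the counter below zero:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Standalone sketch of the interleave counter used by the test's mock HTTP
// handler. Assumption: nbErrors = 2 and nbBlocks = 3 are example values.
public class InterleaveCounterSketch {
    // Counter starts at nbErrors * nbBlocks; each upload attempt decrements it.
    static AtomicInteger countDownUploads = new AtomicInteger(2 * 3);

    // Odd values after the decrement fail the request (forcing a client
    // retry); even values serve it successfully.
    static boolean shouldSucceed() {
        return countDownUploads.decrementAndGet() % 2 == 0;
    }

    public static void main(String[] args) {
        // Happy path: exactly nbErrors * nbBlocks attempts drain the counter to 0.
        while (countDownUploads.get() > 0) {
            System.out.println(shouldSucceed() ? "served" : "failed (client will retry)");
        }
        System.out.println("after expected attempts: " + countDownUploads.get()); // 0

        // One timeout-induced extra retry per surviving request keeps
        // decrementing, so the final value drops below zero.
        shouldSucceed();
        shouldSucceed();
        System.out.println("after two extra retries: " + countDownUploads.get()); // -2
    }
}
```

This is exactly the `-2` seen in the CI log above: the upload completes, but the strict `equalTo(0)` check no longer holds.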
The test was muted in `muted-tests.yml` pending this fix (#145654).

Fix
- Increase the timeout from 1 s to 5 s: a new `createBlobContainer(maxRetries, TimeValue)` method creates the container with the specified retry count and timeout.
- Widen the `maxRetries` range from [2, 5] to [4, 8]: the lower bound of 2 was not enough in case an extra retry happened because of a timeout.
- Relax the assertion from `equalTo(0)` to `lessThanOrEqualTo(0)`: the counter can legitimately go negative if any block receives an extra retry due to a timeout that resolves just after the server has already responded.
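The timeout-selection logic behind the new overload can be sketched as follows. This is a hypothetical, simplified stand-in (plain `Map` and `Duration` instead of the real Elasticsearch `Settings` API; the setting keys and the `SAFE_AWAIT_TIMEOUT` value are illustrative): a null timeout falls back to a long "safe" default so ordinary tests never hit it, while timeout-expecting tests pass an explicit short value.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;

// Sketch of the fix's "null means safe default" timeout selection.
public class TimeoutDefaultSketch {
    // Stand-in for org.elasticsearch.test.ESTestCase#SAFE_AWAIT_TIMEOUT;
    // the actual value is an assumption here.
    static final Duration SAFE_AWAIT_TIMEOUT = Duration.ofSeconds(30);

    static Map<String, String> clientSettings(int maxRetries, Duration timeout) {
        Map<String, String> settings = new HashMap<>();
        // Illustrative setting keys, not the exact ones from the plugin.
        settings.put("azure.client.default.max_retries", Integer.toString(maxRetries));
        Duration effective = timeout != null ? timeout : SAFE_AWAIT_TIMEOUT;
        settings.put("azure.client.default.timeout", effective.toMillis() + "ms");
        return settings;
    }

    public static void main(String[] args) {
        // Most tests: no explicit timeout, so the safe long default applies.
        System.out.println(clientSettings(5, null));
        // Timeout-expecting tests: pass a deliberately short timeout.
        System.out.println(clientSettings(5, Duration.ofMillis(500)));
    }
}
```

The design point from the review thread is that the short timeout becomes opt-in rather than the default, so only the tests that want to observe a timeout ever run with one.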