Fix flaky testWriteLargeBlob in AzureBlobContainerRetriesTests #145748

Open

ankurs47 wants to merge 6 commits into elastic:main from ankurs47:145654_AzureBlobContainerRetriesTests_fix

Conversation

@ankurs47 (Member) commented Apr 6, 2026

Background

testWriteLargeBlob simulates a multi-block upload to Azure Blob Storage where every block must fail at least once before succeeding on retry. The test sets up an in-process HTTP server that uses countDownUploads — an AtomicInteger initialized to nbErrors * nbBlocks — as an interleave counter: odd decrements fail, even decrements succeed. At the end, the test asserted countDownUploads.get() == 0, meaning exactly the expected number of attempts had been made.
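The counter mechanics described above can be sketched in a minimal, self-contained form. This is illustrative only, not the test's actual code: the `nbErrors` and `nbBlocks` values are hypothetical, the real test wires the counter into a mock HTTP handler, and its error schedule is more elaborate than the simple fail-once-then-succeed interleave shown here.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class InterleaveCounterSketch {
    // Simulates the interleave counter: every upload attempt decrements it;
    // an odd result means the mock server returns an error (the client retries),
    // an even result means the block is accepted.
    static int simulate(int nbErrors, int nbBlocks) {
        AtomicInteger countDownUploads = new AtomicInteger(nbErrors * nbBlocks);
        for (int block = 0; block < nbBlocks; block++) {
            while (countDownUploads.decrementAndGet() % 2 != 0) {
                // odd decrement: error response, client retries this block
            }
            // even decrement: block upload succeeds
        }
        return countDownUploads.get();
    }

    public static void main(String[] args) {
        // With nbErrors = 2 each block fails exactly once before succeeding,
        // and the counter lands exactly on zero, satisfying equalTo(0).
        System.out.println(simulate(2, 3)); // prints 0
    }
}
```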

Root Cause

On some CI machines, the 1-second client timeout (TIMEOUT_SETTING) was tight enough that a block upload request could time out before the mock HTTP server responded, triggering an extra retry beyond the intended nbErrors * nbBlocks attempts. The extra retry decremented countDownUploads past zero (to -2, as seen in CI logs), causing the equalTo(0) assertion to fail even though the upload itself completed correctly:

--> succeeding block CfVQUJ0BmnAin91gFPX6, countDownUploads: -2

Root cause from the CI log:

Caused by: java.util.concurrent.TimeoutException: Did not observe any item or terminal signal within 1000ms in 'source(MonoDefer)'
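The failure mode can be reproduced with the same toy counter: a timeout-induced resend adds decrements after the happy path has already drained the counter to zero, pushing it negative, which matches the -2 seen in the CI log. The `simulate` helper and its values below are illustrative assumptions, not the test's actual code.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ExtraRetrySketch {
    // Happy path (each block fails once, then succeeds) plus `extraAttempts`
    // timeout-induced resends that arrive after the counter has reached zero.
    static int simulate(int nbBlocks, int extraAttempts) {
        AtomicInteger countDownUploads = new AtomicInteger(2 * nbBlocks); // nbErrors = 2 (assumption)
        for (int block = 0; block < nbBlocks; block++) {
            while (countDownUploads.decrementAndGet() % 2 != 0) {
                // odd: error response, client retries
            }
        }
        // A client-side timeout makes the SDK resend a block even though the
        // server already handled it; each resend decrements the counter again.
        for (int i = 0; i < extraAttempts; i++) {
            countDownUploads.decrementAndGet();
        }
        return countDownUploads.get();
    }

    public static void main(String[] args) {
        System.out.println(simulate(3, 2)); // prints -2, as in the CI log
    }
}
```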

The test was muted in muted-tests.yml pending this fix (#145654).

Fix

  1. Increase the timeout from 1 s to 5 s: a new createBlobContainer(maxRetries, TimeValue) overload creates the container with the specified retry count and timeout.

  2. Widen the maxRetries range from [2, 5] to [4, 8]: the lower bound of 2 left no headroom if a timeout triggered an extra retry.

  3. Relax the assertion from equalTo(0) to lessThanOrEqualTo(0): the counter can legitimately go negative if any block receives an extra retry because a timeout fires just after the server has already responded.
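A hedged sketch of the timeout plumbing from fix (1): an optional timeout parameter that falls back to a default when absent. Here java.time.Duration stands in for Elasticsearch's TimeValue, and the helper name and default value are assumptions for illustration, not the PR's exact code.

```java
import java.time.Duration;

public class BlobContainerTimeoutSketch {
    // Assumed default; the PR description uses 5 s, and review later suggested
    // SAFE_AWAIT_TIMEOUT instead.
    static final Duration DEFAULT_TEST_TIMEOUT = Duration.ofSeconds(5);

    // Mirrors the overload's logic: use the caller's timeout if given,
    // otherwise fall back to the default.
    static Duration resolveTimeout(Duration timeout) {
        return timeout != null ? timeout : DEFAULT_TEST_TIMEOUT;
    }

    public static void main(String[] args) {
        System.out.println(resolveTimeout(null).getSeconds());                   // default for most tests
        System.out.println(resolveTimeout(Duration.ofSeconds(1)).getSeconds());  // short timeout for timeout tests
    }
}
```

Keeping the default generous while letting timeout-specific tests pass a short value explicitly is the design direction the reviewers converge on later in this thread.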

@ankurs47 requested a review from BrianRothermich April 6, 2026 15:00

@ankurs47 added labels Apr 6, 2026: >test (Issues or PRs that are addressing/adding tests), :Distributed/Snapshot/Restore (Anything directly related to the `_snapshot/*` APIs), Team:Distributed (Meta label for distributed team).
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-distributed (Team:Distributed)

@BrianRothermich (Contributor) left a comment


LGTM 👍

if (timeout != null) {
clientSettings.put(TIMEOUT_SETTING.getConcreteSettingForNamespace(clientName).getKey(), timeout);
} else {
clientSettings.put(TIMEOUT_SETTING.getConcreteSettingForNamespace(clientName).getKey(), TimeValue.timeValueSeconds(1));
Member

I think you're right, 1 second is too short for test machines in general. I'd suggest org.elasticsearch.test.ESTestCase#SAFE_AWAIT_TIMEOUT, that's what we tend to use for "long enough" in these tests.

I'm concerned that the default remains 1s tho. Most tests won't want this to time out, so maybe SAFE_AWAIT_TIMEOUT is the right default, and then any tests that do need to time out could set something shorter here instead.

}

assertThat(countDownUploads.get(), equalTo(0));
assertThat(countDownUploads.get(), lessThanOrEqualTo(0));
Member

I'd rather we kept this as an exact check, we need to know if we're triggering more retries than we expected.

@ankurs47 (Member, Author) commented Apr 8, 2026

@DaveCTurner Addressed your PR comments. Let me know if it looks ok.

return createBlobContainer(maxRetries, null, null, null, null, null, null);
}

private BlobContainer createBlobContainer(int maxRetries, @Nullable TimeValue timeout) {
Member

Is this used anywhere?

clientSettings.put(MAX_RETRIES_SETTING.getConcreteSettingForNamespace(clientName).getKey(), maxRetries);
}
clientSettings.put(TIMEOUT_SETTING.getConcreteSettingForNamespace(clientName).getKey(), TimeValue.timeValueSeconds(1));
if (timeout != null) {
Member

Is this ever not null?

if (timeout != null) {
clientSettings.put(TIMEOUT_SETTING.getConcreteSettingForNamespace(clientName).getKey(), timeout);
} else {
clientSettings.put(TIMEOUT_SETTING.getConcreteSettingForNamespace(clientName).getKey(), SAFE_AWAIT_TIMEOUT);
Member

Is this ok for the timeout-related tests or does it slow them down too far?

Member

Yes before this change testReadBlobWithReadTimeouts takes about 15s, but now it takes way way longer. I think we should use SAFE_AWAIT_TIMEOUT by default, but a much shorter timeout for the tests that expect timeouts.

@DaveCTurner (Member) left a comment

LGTM good stuff, thanks for the extra iterations


Labels

:Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs
Team:Distributed Meta label for distributed team.
>test Issues or PRs that are addressing/adding tests
v9.4.0
