Fix flaky testWriteLargeBlob in AzureBlobContainerRetriesTests #145748

Open

ankurs47 wants to merge 6 commits into elastic:main from ankurs47:145654_AzureBlobContainerRetriesTests_fix

Conversation

@ankurs47 (Member) commented Apr 6, 2026

Background

testWriteLargeBlob simulates a multi-block upload to Azure Blob Storage where every block must fail at least once before succeeding on retry. The test sets up an in-process HTTP server that uses countDownUploads — an AtomicInteger initialized to nbErrors * nbBlocks — as an interleave counter: odd decrements fail, even decrements succeed. At the end, the test asserted countDownUploads.get() == 0, meaning exactly the expected number of attempts had been made.
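The counter mechanics described above can be sketched in a minimal, self-contained form. This is illustrative only, not the test's actual code: the `nbErrors` and `nbBlocks` values are hypothetical, the real test wires the counter into a mock HTTP handler, and its error schedule is more elaborate than the simple fail-once-then-succeed interleave shown here.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class InterleaveCounterSketch {
    // Simulates the interleave counter: every upload attempt decrements it;
    // an odd result means the mock server returns an error (the client retries),
    // an even result means the block is accepted.
    static int simulate(int nbErrors, int nbBlocks) {
        AtomicInteger countDownUploads = new AtomicInteger(nbErrors * nbBlocks);
        for (int block = 0; block < nbBlocks; block++) {
            while (countDownUploads.decrementAndGet() % 2 != 0) {
                // odd decrement: error response, client retries this block
            }
            // even decrement: block upload succeeds
        }
        return countDownUploads.get();
    }

    public static void main(String[] args) {
        // With nbErrors = 2 each block fails exactly once before succeeding,
        // and the counter lands exactly on zero, satisfying equalTo(0).
        System.out.println(simulate(2, 3)); // prints 0
    }
}
```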

Root Cause

On some CI machines, the 1-second client timeout (TIMEOUT_SETTING) was tight enough that a block upload request could time out before the mock HTTP server responded, triggering an extra retry beyond the intended nbErrors * nbBlocks attempts. The extra retry decremented countDownUploads past zero (to -2, as seen in CI logs), causing the equalTo(0) assertion to fail even though the upload itself completed correctly:

--> succeeding block CfVQUJ0BmnAin91gFPX6, countDownUploads: -2

Root cause from the CI log:

Caused by: java.util.concurrent.TimeoutException: Did not observe any item or terminal signal within 1000ms in 'source(MonoDefer)'
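The failure mode can be reproduced with the same toy counter: a timeout-induced resend adds decrements after the happy path has already drained the counter to zero, pushing it negative, which matches the -2 seen in the CI log. The `simulate` helper and its values below are illustrative assumptions, not the test's actual code.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ExtraRetrySketch {
    // Happy path (each block fails once, then succeeds) plus `extraAttempts`
    // timeout-induced resends that arrive after the counter has reached zero.
    static int simulate(int nbBlocks, int extraAttempts) {
        AtomicInteger countDownUploads = new AtomicInteger(2 * nbBlocks); // nbErrors = 2 (assumption)
        for (int block = 0; block < nbBlocks; block++) {
            while (countDownUploads.decrementAndGet() % 2 != 0) {
                // odd: error response, client retries
            }
        }
        // A client-side timeout makes the SDK resend a block even though the
        // server already handled it; each resend decrements the counter again.
        for (int i = 0; i < extraAttempts; i++) {
            countDownUploads.decrementAndGet();
        }
        return countDownUploads.get();
    }

    public static void main(String[] args) {
        System.out.println(simulate(3, 2)); // prints -2, as in the CI log
    }
}
```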

The test was muted in muted-tests.yml pending this fix (#145654).

Fix

  1. Increase the timeout from 1 s to 5 s: a new createBlobContainer(maxRetries, TimeValue) overload creates the container with the specified retry count and timeout.

  2. Widen the maxRetries range from [2, 5] to [4, 8]: the lower bound of 2 left no headroom if a timeout triggered an extra retry.

  3. Relax the assertion from equalTo(0) to lessThanOrEqualTo(0): the counter can legitimately go negative if any block receives an extra retry because a timeout fires just after the server has already responded.
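A hedged sketch of the timeout plumbing from fix (1): an optional timeout parameter that falls back to a default when absent. Here java.time.Duration stands in for Elasticsearch's TimeValue, and the helper name and default value are assumptions for illustration, not the PR's exact code.

```java
import java.time.Duration;

public class BlobContainerTimeoutSketch {
    // Assumed default; the PR description uses 5 s, and review later suggested
    // SAFE_AWAIT_TIMEOUT instead.
    static final Duration DEFAULT_TEST_TIMEOUT = Duration.ofSeconds(5);

    // Mirrors the overload's logic: use the caller's timeout if given,
    // otherwise fall back to the default.
    static Duration resolveTimeout(Duration timeout) {
        return timeout != null ? timeout : DEFAULT_TEST_TIMEOUT;
    }

    public static void main(String[] args) {
        System.out.println(resolveTimeout(null).getSeconds());                   // default for most tests
        System.out.println(resolveTimeout(Duration.ofSeconds(1)).getSeconds());  // short timeout for timeout tests
    }
}
```

Keeping the default generous while letting timeout-specific tests pass a short value explicitly is the design direction the reviewers converge on later in this thread.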

@ankurs47 requested a review from BrianRothermich April 6, 2026 15:00

@ankurs47 added labels Apr 6, 2026: >test (Issues or PRs that are addressing/adding tests), :Distributed/Snapshot/Restore (Anything directly related to the `_snapshot/*` APIs), Team:Distributed (Meta label for distributed team).
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-distributed (Team:Distributed)

@BrianRothermich (Contributor) left a comment


LGTM 👍

if (timeout != null) {
clientSettings.put(TIMEOUT_SETTING.getConcreteSettingForNamespace(clientName).getKey(), timeout);
} else {
clientSettings.put(TIMEOUT_SETTING.getConcreteSettingForNamespace(clientName).getKey(), TimeValue.timeValueSeconds(1));
Member

I think you're right, 1 second is too short for test machines in general. I'd suggest org.elasticsearch.test.ESTestCase#SAFE_AWAIT_TIMEOUT, that's what we tend to use for "long enough" in these tests.

I'm concerned that the default remains 1s tho. Most tests won't want this to time out, so maybe SAFE_AWAIT_TIMEOUT is the right default, and then any tests that do need to time out could set something shorter here instead.

}

assertThat(countDownUploads.get(), equalTo(0));
assertThat(countDownUploads.get(), lessThanOrEqualTo(0));
Member

I'd rather we kept this as an exact check, we need to know if we're triggering more retries than we expected.

@ankurs47 (Member, Author) commented Apr 8, 2026

@DaveCTurner Addressed your PR comments. Let me know if it looks ok.

return createBlobContainer(maxRetries, null, null, null, null, null, null);
}

private BlobContainer createBlobContainer(int maxRetries, @Nullable TimeValue timeout) {
Member

Is this used anywhere?

clientSettings.put(MAX_RETRIES_SETTING.getConcreteSettingForNamespace(clientName).getKey(), maxRetries);
}
clientSettings.put(TIMEOUT_SETTING.getConcreteSettingForNamespace(clientName).getKey(), TimeValue.timeValueSeconds(1));
if (timeout != null) {
Member

Is this ever not null?

if (timeout != null) {
clientSettings.put(TIMEOUT_SETTING.getConcreteSettingForNamespace(clientName).getKey(), timeout);
} else {
clientSettings.put(TIMEOUT_SETTING.getConcreteSettingForNamespace(clientName).getKey(), SAFE_AWAIT_TIMEOUT);
Member

Is this ok for the timeout-related tests or does it slow them down too far?

Member

Yes before this change testReadBlobWithReadTimeouts takes about 15s, but now it takes way way longer. I think we should use SAFE_AWAIT_TIMEOUT by default, but a much shorter timeout for the tests that expect timeouts.

@DaveCTurner (Member) left a comment

LGTM good stuff, thanks for the extra iterations


Labels

:Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs
Team:Distributed Meta label for distributed team.
>test Issues or PRs that are addressing/adding tests
v9.4.0
