
fixed concurrent translog download failover #20906

Open
ThyTran1402 wants to merge 3 commits into opensearch-project:main from ThyTran1402:fix/concurrent_translog_download_failover

Conversation

@ThyTran1402
Contributor

@ThyTran1402 ThyTran1402 commented Mar 18, 2026

Description

During failover and primary relocations, OpenSearch downloads all translog generations from the remote store before opening the engine. Previously this was done sequentially, one generation at a time, in a blocking for-loop inside RemoteFsTranslog.downloadOnce(). Since translog generations are independent of each other, this was a pure serial I/O bottleneck that scaled linearly with the number of retained generations.

This change parallelises translog generation downloads using the existing TRANSLOG_TRANSFER threadpool (already used for uploads), following the same CountDownLatch + worker-queue pattern established for concurrent segment downloads in #10519.
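The CountDownLatch + worker-queue pattern can be sketched outside OpenSearch using only java.util.concurrent. This is a minimal illustration of the approach, not the PR's actual code; the class name, the download placeholder, and the timeout parameter are all assumptions:

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

public class ParallelDownloadSketch {

    // Workers drain a shared queue; the first failure is recorded and stops
    // the others by clearing the queue. Returns the number of generations
    // successfully downloaded.
    public static int downloadAll(List<String> generations, ExecutorService executor,
                                  int threads, long timeoutMillis) throws Exception {
        ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<>(generations);
        CountDownLatch latch = new CountDownLatch(threads);
        AtomicReference<Exception> failure = new AtomicReference<>();
        AtomicInteger downloaded = new AtomicInteger();
        for (int i = 0; i < threads; i++) {
            executor.submit(() -> {
                try {
                    String gen;
                    while (failure.get() == null && (gen = queue.poll()) != null) {
                        download(gen); // stand-in for the real blob download
                        downloaded.incrementAndGet();
                    }
                } catch (Exception e) {
                    failure.compareAndSet(null, e);
                    queue.clear(); // stop other workers from taking new items
                } finally {
                    latch.countDown();
                }
            });
        }
        if (!latch.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
            queue.clear();
            throw new IOException("Timed out waiting for parallel downloads");
        }
        if (failure.get() != null) {
            throw failure.get();
        }
        return downloaded.get();
    }

    static void download(String generation) {
        // placeholder: fetch translog-<generation>.tlog and .ckp from the remote store
    }
}
```

Because each worker loops until the queue is empty, the pattern needs no per-generation task bookkeeping: the latch counts workers, not downloads.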

Related Issues

Resolves #10826

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Contributor

github-actions bot commented Mar 18, 2026

PR Reviewer Guide 🔍

(Review updated until commit f5f393a)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Add new dynamic setting for max concurrent translog download streams

Relevant files:

  • server/src/main/java/org/opensearch/indices/recovery/RecoverySettings.java
  • server/src/main/java/org/opensearch/common/settings/ClusterSettings.java

Sub-PR theme: Implement parallel translog download in TranslogTransferManager

Relevant files:

  • server/src/main/java/org/opensearch/index/translog/transfer/TranslogTransferManager.java
  • server/src/test/java/org/opensearch/index/translog/transfer/TranslogTransferManagerTests.java

Sub-PR theme: Wire parallel translog download into RemoteFsTranslog and IndexShard

Relevant files:

  • server/src/main/java/org/opensearch/index/translog/RemoteFsTranslog.java
  • server/src/main/java/org/opensearch/index/shard/IndexShard.java
  • server/src/test/java/org/opensearch/index/translog/RemoteFsTranslogTests.java

⚡ Recommended focus areas for review

Error Propagation

In downloadTranslogsParallel, when a timeout occurs, queue.clear() signals workers to stop but the latch may not have counted down yet. After the timeout IOException is thrown, the method returns without waiting for in-flight workers to complete. This could lead to concurrent file writes to location after the caller believes the operation has failed, potentially corrupting partially-written translog files.

try {
    long timeoutMillis = remoteStoreSettings.getClusterRemoteTranslogTransferTimeout().millis();
    if (!latch.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
        queue.clear(); // signal workers to stop
        throw new IOException("Timed out waiting for parallel translog downloads");
    }
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
    throw new IOException("Interrupted during parallel translog download", e);
}

if (failure.get() != null) {
    throw failure.get();
}
Thread Pool Null Check

In downloadTranslogsParallel, when threadPool is null, the code falls back to threads = 1 and executes sequentially. However, the null check happens after Math.min(generationPrimaryTermPairs.size(), maxConcurrentStreams), and the threadPool executor is accessed unconditionally at line 372 inside the parallel branch. If threadPool is null but threads > 1 somehow (which cannot happen with current logic but is fragile), it would NPE. More importantly, the threadPool.info(...) call at line 354 could NPE if threadPool is non-null but the TRANSLOG_TRANSFER executor is not registered.

int threads = Math.min(generationPrimaryTermPairs.size(), maxConcurrentStreams);
if (threadPool != null) {
    threads = Math.min(threads, threadPool.info(ThreadPool.Names.TRANSLOG_TRANSFER).getMax());
} else {
    threads = 1;
}
Backward Compatibility

The old single-arg constructor for RemoteFsTranslog now delegates to the new constructor with null for recoverySettings, and the downloadOnce path uses maxConcurrentStreams = 1 when recoverySettings is null. This means existing callers that don't pass RecoverySettings will always download sequentially. Verify that all production call sites (especially in IndexShard) pass the RecoverySettings correctly, as missing it silently degrades to sequential behavior without any warning.

int maxConcurrentStreams = recoverySettings != null
    ? recoverySettings.getMaxConcurrentTranslogDownloadStreams()
    : 1;
download(translogTransferManager, location, logger, config.shouldSeedRemote(), 0, maxConcurrentStreams);
Test Flakiness

testDownloadTranslogsParallelMultipleGenerations uses Thread.sleep(delayMs) (50ms) inside mock answers to simulate concurrent downloads. This timing-based approach can be flaky in slow CI environments. The test also doesn't verify that downloads actually ran in parallel (e.g., by checking elapsed time or a concurrency counter), so it doesn't validate the core behavior being tested.

long delayMs = 50;
for (long gen = 1; gen <= 3; gen++) {
    String tlogFile = "translog-" + gen + ".tlog";
    String ckpFile = "translog-" + gen + ".ckp";
    when(transferService.downloadBlob(any(BlobPath.class), eq(tlogFile))).thenAnswer(invocation -> {
        Thread.sleep(delayMs);
        return new java.io.ByteArrayInputStream(tlogBytes);
    });
    when(transferService.downloadBlob(any(BlobPath.class), eq(ckpFile))).thenAnswer(invocation -> {
        Thread.sleep(delayMs);
        return new java.io.ByteArrayInputStream(ckpBytes);
    });
}

Path location = createTempDir();
List<org.opensearch.common.collect.Tuple<String, String>> pairs = new ArrayList<>();
pairs.add(new org.opensearch.common.collect.Tuple<>("1", "3"));
pairs.add(new org.opensearch.common.collect.Tuple<>("1", "2"));
pairs.add(new org.opensearch.common.collect.Tuple<>("1", "1"));

parallelManager.downloadTranslogsParallel(pairs, location, 4);

// All 6 files should exist
for (long gen = 1; gen <= 3; gen++) {
    assertTrue(Files.exists(location.resolve("translog-" + gen + ".tlog")));
    assertTrue(Files.exists(location.resolve("translog-" + gen + ".ckp")));
}
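One non-timing way to verify that the downloads actually overlap, as the observation above suggests, is a small probe that records peak concurrency. This helper is hypothetical (not part of the PR); the test's mock answers would call enter()/exit() around the simulated delay and assert peak() > 1 afterwards:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Records how many callers are inside an enter()/exit() pair at once.
public class ConcurrencyProbe {
    private final AtomicInteger current = new AtomicInteger();
    private final AtomicInteger peak = new AtomicInteger();

    public void enter() {
        int now = current.incrementAndGet();
        peak.accumulateAndGet(now, Math::max); // remember the high-water mark
    }

    public void exit() {
        current.decrementAndGet();
    }

    public int peak() {
        return peak.get();
    }
}
```

Unlike an elapsed-time bound, an assertion on peak() cannot produce false failures on a slow CI machine: it only checks that more than one download was in flight at the same instant.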

@github-actions
Contributor

github-actions bot commented Mar 18, 2026

PR Code Suggestions ✨

Latest suggestions up to f5f393a

Explore these optional code suggestions:

Possible issue
Catch unchecked exceptions in worker threads

The worker only catches IOException, but downloadTranslog or underlying code could
throw unchecked exceptions (e.g., RuntimeException). If an unchecked exception
escapes the worker's run method, the latch.countDown() in finally will still
execute, but the exception will be swallowed by the executor and the failure will
not be propagated to the caller, potentially causing a silent partial download.

server/src/main/java/org/opensearch/index/translog/transfer/TranslogTransferManager.java [378-383]

 } catch (IOException e) {
     failure.compareAndSet(null, e);
+    queue.clear();
+} catch (Exception e) {
+    failure.compareAndSet(null, new IOException("Unexpected error during parallel translog download", e));
     queue.clear();
 } finally {
     latch.countDown();
 }
Suggestion importance[1-10]: 7


Why: This is a valid concern — unchecked exceptions thrown by worker threads would be swallowed by the executor, leading to silent partial downloads. Adding a catch for Exception to wrap and propagate them via failure is a correct and important fix for robustness.

Medium
Guard against null thread pool info

threadPool.info(ThreadPool.Names.TRANSLOG_TRANSFER) can return null if the thread
pool name does not exist, which would cause a NullPointerException. A null check
should be added before calling .getMax().

server/src/main/java/org/opensearch/index/translog/transfer/TranslogTransferManager.java [352-357]

 int threads = Math.min(generationPrimaryTermPairs.size(), maxConcurrentStreams);
 if (threadPool != null) {
-    threads = Math.min(threads, threadPool.info(ThreadPool.Names.TRANSLOG_TRANSFER).getMax());
+    ThreadPool.Info info = threadPool.info(ThreadPool.Names.TRANSLOG_TRANSFER);
+    if (info != null) {
+        threads = Math.min(threads, info.getMax());
+    } else {
+        threads = 1;
+    }
 } else {
     threads = 1;
 }
Suggestion importance[1-10]: 6


Why: threadPool.info() can return null for an unknown thread pool name, which would cause a NullPointerException. The null check is a valid defensive improvement, though in practice TRANSLOG_TRANSFER should always be registered.

Low
Handle lingering threads after timeout

When a timeout occurs, latch.await returns but the worker threads may still be
running and accessing shared state (the queue and failure reference). Clearing the
queue signals them to stop, but there is no guarantee they have finished before the
method returns and the caller potentially deletes or modifies the location
directory. Consider adding a secondary bounded wait or tracking thread completion
more robustly to avoid race conditions on the filesystem after timeout.

server/src/main/java/org/opensearch/index/translog/transfer/TranslogTransferManager.java [387-396]

 try {
     long timeoutMillis = remoteStoreSettings.getClusterRemoteTranslogTransferTimeout().millis();
     if (!latch.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
         queue.clear(); // signal workers to stop
+        // Wait a short grace period for workers to observe the cleared queue and exit
+        latch.await(5, TimeUnit.SECONDS);
         throw new IOException("Timed out waiting for parallel translog downloads");
     }
 } catch (InterruptedException e) {
     Thread.currentThread().interrupt();
     throw new IOException("Interrupted during parallel translog download", e);
 }
Suggestion importance[1-10]: 4


Why: The suggestion adds a secondary latch.await after timeout to give workers a grace period to finish, which is a reasonable defensive measure. However, the improved_code itself has a bug: the second latch.await call can throw InterruptedException but is not handled, and the overall approach of waiting after a timeout may not be the cleanest solution. The issue is real but the fix is incomplete.

Low
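A variant of the grace-period idea that also covers the unhandled InterruptedException noted in the review above might look like the following. The method name, its parameters, and the Runnable cancel hook are illustrative, not code from the PR:

```java
import java.io.IOException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class GraceAwait {

    // Waits for the latch; on timeout, cancels outstanding work and gives
    // in-flight workers a bounded grace period before failing. Interruption
    // during either wait is handled by the single catch block.
    public static void awaitWithGrace(CountDownLatch latch, long timeoutMillis,
                                      long graceMillis, Runnable cancel) throws IOException {
        try {
            if (!latch.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
                cancel.run(); // e.g. queue.clear() to stop workers taking new items
                latch.await(graceMillis, TimeUnit.MILLISECONDS); // bounded second wait
                throw new IOException("Timed out waiting for parallel translog downloads");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IOException("Interrupted during parallel translog download", e);
        }
    }
}
```

Keeping both awaits inside one try block is what closes the gap: the second await's InterruptedException flows into the same handler as the first.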
General
Log fallback to sequential download mode

When recoverySettings is null, the fallback value is 1 (sequential), which silently
disables the parallel download feature. This can happen when the no-RecoverySettings
constructor is used. Consider logging a debug/warning message when falling back to
sequential mode so operators are aware that parallel download is not active.

server/src/main/java/org/opensearch/index/translog/RemoteFsTranslog.java [189-192]

-int maxConcurrentStreams = recoverySettings != null
-    ? recoverySettings.getMaxConcurrentTranslogDownloadStreams()
-    : 1;
+int maxConcurrentStreams;
+if (recoverySettings != null) {
+    maxConcurrentStreams = recoverySettings.getMaxConcurrentTranslogDownloadStreams();
+} else {
+    logger.debug("RecoverySettings not available; falling back to sequential translog download");
+    maxConcurrentStreams = 1;
+}
 download(translogTransferManager, location, logger, config.shouldSeedRemote(), 0, maxConcurrentStreams);
Suggestion importance[1-10]: 2


Why: Adding a debug log when falling back to sequential mode is a minor observability improvement. The suggestion is valid but has low impact since the fallback behavior is already documented in the code structure.

Low

Previous suggestions

Suggestions up to commit 680aff8
Possible issue
Add cancellation flag to stop workers on timeout

When a timeout occurs, the method throws an IOException but the worker threads may
still be running since queue.clear() only prevents new work from being picked up —
threads already inside downloadTranslog() will continue executing. This can lead to
partial file writes or resource contention after the caller has already failed.
Consider using an AtomicBoolean cancellation flag checked inside the worker loop, or
interrupting the worker threads explicitly on timeout.

server/src/main/java/org/opensearch/index/translog/transfer/TranslogTransferManager.java [387-396]

+AtomicBoolean cancelled = new AtomicBoolean(false);
+// ... in worker loop: while (!cancelled.get() && failure.get() == null && (pair = queue.poll()) != null)
+
 try {
     long timeoutMillis = remoteStoreSettings.getClusterRemoteTranslogTransferTimeout().millis();
     if (!latch.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
-        queue.clear(); // signal workers to stop
+        cancelled.set(true);
+        queue.clear();
         throw new IOException("Timed out waiting for parallel translog downloads");
     }
 } catch (InterruptedException e) {
+    cancelled.set(true);
     Thread.currentThread().interrupt();
     throw new IOException("Interrupted during parallel translog download", e);
 }
Suggestion importance[1-10]: 5


Why: The concern about lingering worker threads after timeout is valid, but the improved_code only shows a partial solution (the comment // ... in worker loop is not actual code), making it incomplete. The issue is real but the fix as presented is not directly applicable.

Low
General
Restore coverage for both retry-triggering exception types

The test for "file found in second attempt" removed the original
when(...).thenThrow(toThrow).thenReturn(true) mock but the new doAnswer mock
increments downloadCounter on each call to downloadTranslogsParallel. However, the
retry loop in download() calls downloadOnce which calls downloadTranslogsParallel
once per attempt — so the counter logic is correct. But the test no longer verifies
that a FileNotFoundException (as opposed to only NoSuchFileException) also triggers
a retry, reducing test coverage for the failover behavior.

server/src/test/java/org/opensearch/index/translog/RemoteFsTranslogTests.java [1720-1728]

+String msg = "File not found";
+Exception toThrow = randomBoolean() ? new NoSuchFileException(msg) : new FileNotFoundException(msg);
+AtomicLong downloadCounter = new AtomicLong();
 doAnswer(invocation -> {
     if (downloadCounter.incrementAndGet() <= 1) {
-        throw new NoSuchFileException("File not found");
+        if (toThrow instanceof NoSuchFileException) throw (NoSuchFileException) toThrow;
+        else throw (FileNotFoundException) toThrow;
     } else if (downloadCounter.get() == 2) {
         Files.createFile(location.resolve(Translog.getCommitCheckpointFileName(generation)));
     }
     return null;
 }).when(mockTransfer).downloadTranslogsParallel(any(), any(), anyInt());
 
-// no exception thrown
-RemoteFsTranslog.download(mockTransfer, location, logger, false, 0);
-
Suggestion importance[1-10]: 5


Why: The suggestion correctly identifies that the refactored test no longer covers FileNotFoundException as a retry trigger, which reduces test coverage for an important failover behavior. The improved code restores this coverage with a valid approach.

Low
Consolidate null threadPool check with sequential fallback

If threadPool is null, the code falls back to threads = 1 (sequential), but the
threadPool.executor(...) call later will throw a NullPointerException if somehow
threads > 1 is reached through a different code path. More critically, when
threadPool is null and threads was already set to Math.min(size,
maxConcurrentStreams) before this block, setting threads = 1 is correct but the
sequential fallback block below only triggers on threads <= 1, so this is safe —
however, the null check should be consolidated with the sequential fallback to make
the logic clearer and avoid future bugs.

server/src/main/java/org/opensearch/index/translog/transfer/TranslogTransferManager.java [353-357]

-if (threadPool != null) {
-    threads = Math.min(threads, threadPool.info(ThreadPool.Names.TRANSLOG_TRANSFER).getMax());
-} else {
-    threads = 1;
+if (threadPool == null || threads <= 1) {
+    for (Tuple<String, String> p : generationPrimaryTermPairs) {
+        downloadTranslog(p.v1(), p.v2(), location);
+    }
+    return;
+}
+threads = Math.min(threads, threadPool.info(ThreadPool.Names.TRANSLOG_TRANSFER).getMax());
+if (threads <= 1) {
+    for (Tuple<String, String> p : generationPrimaryTermPairs) {
+        downloadTranslog(p.v1(), p.v2(), location);
+    }
+    return;
 }
Suggestion importance[1-10]: 4


Why: The suggestion improves code clarity by consolidating the null check with the sequential fallback, but the improved_code duplicates the sequential loop twice which is not clean. The existing logic is functionally correct and safe, so this is a minor style improvement.

Low
Replace magic number with named constant

The default fallback value of 1 when recoverySettings is null means sequential
download, which is safe but may be inconsistent with the intent of the feature. More
importantly, the recoverySettings field is set just before this code block, so it
will always be the value passed to the constructor (possibly null). The null check
is correct, but the magic number 1 should be extracted to a named constant for
clarity and maintainability.

server/src/main/java/org/opensearch/index/translog/RemoteFsTranslog.java [189-192]

 int maxConcurrentStreams = recoverySettings != null
     ? recoverySettings.getMaxConcurrentTranslogDownloadStreams()
-    : 1;
+    : RecoverySettings.DEFAULT_MAX_CONCURRENT_TRANSLOG_DOWNLOAD_STREAMS;
 download(translogTransferManager, location, logger, config.shouldSeedRemote(), 0, maxConcurrentStreams);
Suggestion importance[1-10]: 3


Why: Extracting the magic number 1 to a named constant improves maintainability, but the suggested constant RecoverySettings.DEFAULT_MAX_CONCURRENT_TRANSLOG_DOWNLOAD_STREAMS doesn't exist in the PR and would require additional changes to RecoverySettings.java.

Low
Suggestions up to commit de9745c
Possible issue
Prevent NullPointerException when threadPool is null

When threadPool is null, the code proceeds to the parallel path and calls
threadPool.executor(...), which will throw a NullPointerException. If threadPool is
null and threads > 1, the code should fall back to sequential execution instead of
attempting parallel execution.

server/src/main/java/org/opensearch/index/translog/transfer/TranslogTransferManager.java [352-363]

 int threads = Math.min(generationPrimaryTermPairs.size(), maxConcurrentStreams);
 if (threadPool != null) {
     threads = Math.min(threads, threadPool.info(ThreadPool.Names.TRANSLOG_TRANSFER).getMax());
+} else {
+    threads = 1;
 }
 
 // Fall back to sequential for single thread (avoids overhead)
 if (threads <= 1) {
     for (Tuple<String, String> p : generationPrimaryTermPairs) {
         downloadTranslog(p.v1(), p.v2(), location);
     }
     return;
 }
Suggestion importance[1-10]: 9


Why: This is a critical bug: when threadPool is null and maxConcurrentStreams > 1 with multiple generations, the code skips the null-check branch but then calls threadPool.executor(...) at line 370, causing a NullPointerException. Setting threads = 1 when threadPool is null correctly forces sequential fallback.

High
Clear queue on timeout to stop worker threads

When a timeout occurs, the worker threads may still be running and processing items
from the queue. You should clear the queue and signal workers to stop when a timeout
is detected, similar to how failures are handled. Without this, threads may continue
running after the calling thread has already thrown the timeout exception.

server/src/main/java/org/opensearch/index/translog/transfer/TranslogTransferManager.java [385-397]

 try {
     long timeoutMillis = remoteStoreSettings.getClusterRemoteTranslogTransferTimeout().millis();
     if (!latch.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
+        queue.clear();
         throw new IOException("Timed out waiting for parallel translog downloads");
     }
 } catch (InterruptedException e) {
     Thread.currentThread().interrupt();
+    queue.clear();
     throw new IOException("Interrupted during parallel translog download", e);
 }
 
 if (failure.get() != null) {
     throw failure.get();
 }
Suggestion importance[1-10]: 6


Why: When a timeout occurs, worker threads continue consuming from the queue unnecessarily. Adding queue.clear() on timeout and interrupt paths is a valid improvement to prevent wasted work, though threads will naturally stop once the queue is empty anyway.

Low
General
Replace flaky timing assertion with concurrency verification

This timing-based assertion is inherently flaky and can fail on slow CI machines or
under high load. The test asserts that parallel execution completes in less than 3 *
2 * 50ms = 300ms, but this is not a reliable guarantee. Consider verifying
parallelism through other means, such as tracking concurrent execution counts,
rather than wall-clock time.

server/src/test/java/org/opensearch/index/translog/transfer/TranslogTransferManagerTests.java [1019-1049]

-long delayMs = 50;
-...
-// Parallel execution should be faster than sequential (3 * 2 * delayMs)
-assertTrue("Expected parallel download faster than sequential, elapsed=" + elapsed, elapsed < 3 * 2 * delayMs);
+// Use a counter to verify concurrent execution instead of timing
+AtomicInteger maxConcurrent = new AtomicInteger(0);
+AtomicInteger currentConcurrent = new AtomicInteger(0);
+// ... instrument mocks to track concurrency and assert maxConcurrent.get() > 1
Suggestion importance[1-10]: 4


Why: The timing-based assertion elapsed < 3 * 2 * delayMs is inherently flaky on slow CI machines. However, the improved code snippet is incomplete and doesn't provide a ready-to-use replacement, making it harder to apply directly.

Low
Suggestions up to commit d9a49de
Possible issue
Guard against null threadPool in parallel path

When threadPool is null, the code falls through to the parallel path and submits
tasks via threadPool.executor(...), which will throw a NullPointerException. If
threadPool is null and threads > 1, the parallel execution path must not be taken.
Add a null-check guard to force sequential execution when threadPool is null.

server/src/main/java/org/opensearch/index/translog/transfer/TranslogTransferManager.java [352-366]

 int threads = Math.min(generationPrimaryTermPairs.size(), maxConcurrentStreams);
 if (threadPool != null) {
     threads = Math.min(threads, threadPool.info(ThreadPool.Names.TRANSLOG_TRANSFER).getMax());
 }
 
-// Fall back to sequential for single thread (avoids overhead)
-if (threads <= 1) {
+// Fall back to sequential for single thread or no threadPool (avoids overhead / NPE)
+if (threads <= 1 || threadPool == null) {
     for (Tuple<String, String> p : generationPrimaryTermPairs) {
         downloadTranslog(p.v1(), p.v2(), location);
     }
     return;
 }
 
 ConcurrentLinkedQueue<Tuple<String, String>> queue = new ConcurrentLinkedQueue<>(generationPrimaryTermPairs);
 CountDownLatch latch = new CountDownLatch(threads);
Suggestion importance[1-10]: 9


Why: When threadPool is null and maxConcurrentStreams > 1 with multiple generations, the code skips the null-check branch and proceeds to call threadPool.executor(...), causing a NullPointerException. This is a real bug that needs to be fixed.

High
Signal workers to stop on timeout

When a timeout occurs, the worker threads are still running and may continue to
write to location or modify shared state after the calling thread has already thrown
an exception. You should interrupt or signal the workers to stop (e.g., by setting a
flag or clearing the queue) before throwing the timeout exception to avoid resource
leaks and partial writes.

server/src/main/java/org/opensearch/index/translog/transfer/TranslogTransferManager.java [386-388]

 if (!latch.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
+    queue.clear(); // signal workers to stop
     throw new IOException("Timed out waiting for parallel translog downloads");
 }
Suggestion importance[1-10]: 7


Why: When a timeout occurs, worker threads continue running and may cause partial writes or resource leaks. Clearing the queue on timeout is a simple and effective way to signal workers to stop, improving correctness.

Medium
General
Fix mismatched mock method in retry test

The test for "file found in second attempt" uses mockTransfer.readMetadata()
(no-arg) while the first mock used readMetadata(0). The download method called in
the test uses a timestamp of 0, so it likely calls readMetadata(0). If the no-arg
and single-arg overloads are different, the mock won't match and the test will fail
or behave unexpectedly.

server/src/test/java/org/opensearch/index/translog/RemoteFsTranslogTests.java [1715-1728]

 // File not found in first attempt . File found in second attempt.
 mockTransfer = mock(TranslogTransferManager.class);
-when(mockTransfer.readMetadata()).thenReturn(translogTransferMetadata);
+when(mockTransfer.readMetadata(0)).thenReturn(translogTransferMetadata);
 when(mockTransfer.getRemoteTranslogTransferTracker()).thenReturn(remoteTranslogTransferTracker);
 
 AtomicLong downloadCounter = new AtomicLong();
 doAnswer(invocation -> {
     if (downloadCounter.incrementAndGet() <= 1) {
         throw new NoSuchFileException("File not found");
     } else if (downloadCounter.get() == 2) {
         Files.createFile(location.resolve(Translog.getCommitCheckpointFileName(generation)));
     }
     return null;
 }).when(mockTransfer).downloadTranslogsParallel(any(), any(), anyInt());
Suggestion importance[1-10]: 6


Why: The second mock uses readMetadata() (no-arg) while the download call passes timestamp=0, which likely invokes readMetadata(0). This mismatch could cause the test to fail or behave unexpectedly, making this a valid correctness concern.

Low
Reduce flakiness in timing-based test assertion

The timing-based assertion (elapsed < 3 * 2 * delayMs) is fragile and can produce
flaky test failures in slow CI environments or under load. The assertion boundary of
300ms (3 * 2 * 50ms) provides no safety margin. Consider either removing the timing
assertion or using a more generous threshold (e.g., elapsed < 3 * 2 * delayMs * 3)
to reduce flakiness.

server/src/test/java/org/opensearch/index/translog/transfer/TranslogTransferManagerTests.java [1039-1049]

 long start = System.currentTimeMillis();
 parallelManager.downloadTranslogsParallel(pairs, location, 4);
 long elapsed = System.currentTimeMillis() - start;
 
 // All 6 files should exist
 for (long gen = 1; gen <= 3; gen++) {
     assertTrue(Files.exists(location.resolve("translog-" + gen + ".tlog")));
     assertTrue(Files.exists(location.resolve("translog-" + gen + ".ckp")));
 }
-// Parallel execution should be faster than sequential (3 * 2 * delayMs)
-assertTrue("Expected parallel download faster than sequential, elapsed=" + elapsed, elapsed < 3 * 2 * delayMs);
+// Parallel execution should be significantly faster than sequential (3 * 2 * delayMs)
+// Use a generous multiplier to avoid flakiness in slow environments
+assertTrue("Expected parallel download faster than sequential, elapsed=" + elapsed, elapsed < 3 * 2 * delayMs * 3);
Suggestion importance[1-10]: 5


Why: The timing assertion with a tight boundary of 300ms is prone to flakiness in slow CI environments. Using a more generous multiplier reduces false failures while still validating parallel behavior.

Low

@github-actions
Contributor

Persistent review updated to latest commit de9745c

@github-actions
Contributor

❌ Gradle check result for de9745c: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@ThyTran1402 ThyTran1402 force-pushed the fix/concurrent_translog_download_failover branch from de9745c to 680aff8 on March 18, 2026 at 16:18
@github-actions
Contributor

Persistent review updated to latest commit 680aff8

Signed-off-by: Thy Tran <58045538+ThyTran1402@users.noreply.github.com>
@ThyTran1402 ThyTran1402 force-pushed the fix/concurrent_translog_download_failover branch from 680aff8 to f5f393a on March 18, 2026 at 16:26
@github-actions
Contributor

Persistent review updated to latest commit f5f393a

@github-actions
Contributor

❌ Gradle check result for f5f393a: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?


Labels

enhancement (Enhancement or improvement to existing feature or request), good first issue (Good for newcomers), Storage:Remote

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

[Remote Translog] Download Translog files concurrently

1 participant