
fixed concurrent translog download failover #20906

Open
ThyTran1402 wants to merge 3 commits into opensearch-project:main from ThyTran1402:fix/concurrent_translog_download_failover

Conversation

@ThyTran1402
Contributor

@ThyTran1402 ThyTran1402 commented Mar 18, 2026

Description

During failover and primary relocations, OpenSearch downloads all translog generations from the remote store before opening the engine. Previously this was done sequentially, one generation at a time, in a blocking for-loop inside RemoteFsTranslog.downloadOnce(). Since translog generations are independent of each other, this was a pure serial I/O bottleneck that scaled linearly with the number of retained generations.

This change parallelises translog generation downloads using the existing TRANSLOG_TRANSFER threadpool (already used for uploads), following the same CountDownLatch + worker-queue pattern established for concurrent segment downloads in #10519.
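The CountDownLatch + worker-queue pattern can be sketched outside OpenSearch using only java.util.concurrent. This is a minimal illustration of the approach, not the PR's actual code; the class name, the download placeholder, and the timeout parameter are all assumptions:

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

public class ParallelDownloadSketch {

    // Workers drain a shared queue; the first failure is recorded and stops
    // the others by clearing the queue. Returns the number of generations
    // successfully downloaded.
    public static int downloadAll(List<String> generations, ExecutorService executor,
                                  int threads, long timeoutMillis) throws Exception {
        ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<>(generations);
        CountDownLatch latch = new CountDownLatch(threads);
        AtomicReference<Exception> failure = new AtomicReference<>();
        AtomicInteger downloaded = new AtomicInteger();
        for (int i = 0; i < threads; i++) {
            executor.submit(() -> {
                try {
                    String gen;
                    while (failure.get() == null && (gen = queue.poll()) != null) {
                        download(gen); // stand-in for the real blob download
                        downloaded.incrementAndGet();
                    }
                } catch (Exception e) {
                    failure.compareAndSet(null, e);
                    queue.clear(); // stop other workers from taking new items
                } finally {
                    latch.countDown();
                }
            });
        }
        if (!latch.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
            queue.clear();
            throw new IOException("Timed out waiting for parallel downloads");
        }
        if (failure.get() != null) {
            throw failure.get();
        }
        return downloaded.get();
    }

    static void download(String generation) {
        // placeholder: fetch translog-<generation>.tlog and .ckp from the remote store
    }
}
```

Because each worker loops until the queue is empty, the pattern needs no per-generation task bookkeeping: the latch counts workers, not downloads.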

Related Issues

Resolves #10826

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Contributor

github-actions bot commented Mar 18, 2026

PR Reviewer Guide 🔍

(Review updated until commit f5f393a)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Add new dynamic setting for max concurrent translog download streams

Relevant files:

  • server/src/main/java/org/opensearch/indices/recovery/RecoverySettings.java
  • server/src/main/java/org/opensearch/common/settings/ClusterSettings.java

Sub-PR theme: Implement parallel translog download in TranslogTransferManager

Relevant files:

  • server/src/main/java/org/opensearch/index/translog/transfer/TranslogTransferManager.java
  • server/src/test/java/org/opensearch/index/translog/transfer/TranslogTransferManagerTests.java

Sub-PR theme: Wire parallel translog download into RemoteFsTranslog and IndexShard

Relevant files:

  • server/src/main/java/org/opensearch/index/translog/RemoteFsTranslog.java
  • server/src/main/java/org/opensearch/index/shard/IndexShard.java
  • server/src/test/java/org/opensearch/index/translog/RemoteFsTranslogTests.java

⚡ Recommended focus areas for review

Error Propagation

In downloadTranslogsParallel, when a timeout occurs, queue.clear() signals workers to stop but the latch may not have counted down yet. After the timeout IOException is thrown, the method returns without waiting for in-flight workers to complete. This could lead to concurrent file writes to location after the caller believes the operation has failed, potentially corrupting partially-written translog files.

try {
    long timeoutMillis = remoteStoreSettings.getClusterRemoteTranslogTransferTimeout().millis();
    if (!latch.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
        queue.clear(); // signal workers to stop
        throw new IOException("Timed out waiting for parallel translog downloads");
    }
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
    throw new IOException("Interrupted during parallel translog download", e);
}

if (failure.get() != null) {
    throw failure.get();
}
Thread Pool Null Check

In downloadTranslogsParallel, when threadPool is null, the code falls back to threads = 1 and executes sequentially. However, the null check happens after Math.min(generationPrimaryTermPairs.size(), maxConcurrentStreams), and the threadPool executor is accessed unconditionally at line 372 inside the parallel branch. If threadPool is null but threads > 1 somehow (which cannot happen with current logic but is fragile), it would NPE. More importantly, the threadPool.info(...) call at line 354 could NPE if threadPool is non-null but the TRANSLOG_TRANSFER executor is not registered.

int threads = Math.min(generationPrimaryTermPairs.size(), maxConcurrentStreams);
if (threadPool != null) {
    threads = Math.min(threads, threadPool.info(ThreadPool.Names.TRANSLOG_TRANSFER).getMax());
} else {
    threads = 1;
}
Backward Compatibility

The old single-arg constructor for RemoteFsTranslog now delegates to the new constructor with null for recoverySettings, and the downloadOnce path uses maxConcurrentStreams = 1 when recoverySettings is null. This means existing callers that don't pass RecoverySettings will always download sequentially. Verify that all production call sites (especially in IndexShard) pass the RecoverySettings correctly, as missing it silently degrades to sequential behavior without any warning.

int maxConcurrentStreams = recoverySettings != null
    ? recoverySettings.getMaxConcurrentTranslogDownloadStreams()
    : 1;
download(translogTransferManager, location, logger, config.shouldSeedRemote(), 0, maxConcurrentStreams);
Test Flakiness

testDownloadTranslogsParallelMultipleGenerations uses Thread.sleep(delayMs) (50ms) inside mock answers to simulate concurrent downloads. This timing-based approach can be flaky in slow CI environments. The test also doesn't verify that downloads actually ran in parallel (e.g., by checking elapsed time or a concurrency counter), so it doesn't validate the core behavior being tested.

long delayMs = 50;
for (long gen = 1; gen <= 3; gen++) {
    String tlogFile = "translog-" + gen + ".tlog";
    String ckpFile = "translog-" + gen + ".ckp";
    when(transferService.downloadBlob(any(BlobPath.class), eq(tlogFile))).thenAnswer(invocation -> {
        Thread.sleep(delayMs);
        return new java.io.ByteArrayInputStream(tlogBytes);
    });
    when(transferService.downloadBlob(any(BlobPath.class), eq(ckpFile))).thenAnswer(invocation -> {
        Thread.sleep(delayMs);
        return new java.io.ByteArrayInputStream(ckpBytes);
    });
}

Path location = createTempDir();
List<org.opensearch.common.collect.Tuple<String, String>> pairs = new ArrayList<>();
pairs.add(new org.opensearch.common.collect.Tuple<>("1", "3"));
pairs.add(new org.opensearch.common.collect.Tuple<>("1", "2"));
pairs.add(new org.opensearch.common.collect.Tuple<>("1", "1"));

parallelManager.downloadTranslogsParallel(pairs, location, 4);

// All 6 files should exist
for (long gen = 1; gen <= 3; gen++) {
    assertTrue(Files.exists(location.resolve("translog-" + gen + ".tlog")));
    assertTrue(Files.exists(location.resolve("translog-" + gen + ".ckp")));
}
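One non-timing way to verify that the downloads actually overlap, as the observation above suggests, is a small probe that records peak concurrency. This helper is hypothetical (not part of the PR); the test's mock answers would call enter()/exit() around the simulated delay and assert peak() > 1 afterwards:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Records how many callers are inside an enter()/exit() pair at once.
public class ConcurrencyProbe {
    private final AtomicInteger current = new AtomicInteger();
    private final AtomicInteger peak = new AtomicInteger();

    public void enter() {
        int now = current.incrementAndGet();
        peak.accumulateAndGet(now, Math::max); // remember the high-water mark
    }

    public void exit() {
        current.decrementAndGet();
    }

    public int peak() {
        return peak.get();
    }
}
```

Unlike an elapsed-time bound, an assertion on peak() cannot produce false failures on a slow CI machine: it only checks that more than one download was in flight at the same instant.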

@github-actions
Contributor

github-actions bot commented Mar 18, 2026

PR Code Suggestions ✨

Latest suggestions up to f5f393a

Explore these optional code suggestions:

Possible issue
Catch unchecked exceptions in worker threads

The worker only catches IOException, but downloadTranslog or underlying code could
throw unchecked exceptions (e.g., RuntimeException). If an unchecked exception
escapes the worker's run method, the latch.countDown() in finally will still
execute, but the exception will be swallowed by the executor and the failure will
not be propagated to the caller, potentially causing a silent partial download.

server/src/main/java/org/opensearch/index/translog/transfer/TranslogTransferManager.java [378-383]

 } catch (IOException e) {
     failure.compareAndSet(null, e);
+    queue.clear();
+} catch (Exception e) {
+    failure.compareAndSet(null, new IOException("Unexpected error during parallel translog download", e));
     queue.clear();
 } finally {
     latch.countDown();
 }
Suggestion importance[1-10]: 7


Why: This is a valid concern — unchecked exceptions thrown by worker threads would be swallowed by the executor, leading to silent partial downloads. Adding a catch for Exception to wrap and propagate them via failure is a correct and important fix for robustness.

Medium
Guard against null thread pool info

threadPool.info(ThreadPool.Names.TRANSLOG_TRANSFER) can return null if the thread
pool name does not exist, which would cause a NullPointerException. A null check
should be added before calling .getMax().

server/src/main/java/org/opensearch/index/translog/transfer/TranslogTransferManager.java [352-357]

 int threads = Math.min(generationPrimaryTermPairs.size(), maxConcurrentStreams);
 if (threadPool != null) {
-    threads = Math.min(threads, threadPool.info(ThreadPool.Names.TRANSLOG_TRANSFER).getMax());
+    ThreadPool.Info info = threadPool.info(ThreadPool.Names.TRANSLOG_TRANSFER);
+    if (info != null) {
+        threads = Math.min(threads, info.getMax());
+    } else {
+        threads = 1;
+    }
 } else {
     threads = 1;
 }
Suggestion importance[1-10]: 6


Why: threadPool.info() can return null for an unknown thread pool name, which would cause a NullPointerException. The null check is a valid defensive improvement, though in practice TRANSLOG_TRANSFER should always be registered.

Low
Handle lingering threads after timeout

When a timeout occurs, latch.await returns but the worker threads may still be
running and accessing shared state (the queue and failure reference). Clearing the
queue signals them to stop, but there is no guarantee they have finished before the
method returns and the caller potentially deletes or modifies the location
directory. Consider adding a secondary bounded wait or tracking thread completion
more robustly to avoid race conditions on the filesystem after timeout.

server/src/main/java/org/opensearch/index/translog/transfer/TranslogTransferManager.java [387-396]

 try {
     long timeoutMillis = remoteStoreSettings.getClusterRemoteTranslogTransferTimeout().millis();
     if (!latch.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
         queue.clear(); // signal workers to stop
+        // Wait a short grace period for workers to observe the cleared queue and exit
+        latch.await(5, TimeUnit.SECONDS);
         throw new IOException("Timed out waiting for parallel translog downloads");
     }
 } catch (InterruptedException e) {
     Thread.currentThread().interrupt();
     throw new IOException("Interrupted during parallel translog download", e);
 }
Suggestion importance[1-10]: 4


Why: The suggestion adds a secondary latch.await after timeout to give workers a grace period to finish, which is a reasonable defensive measure. However, the improved_code itself has a bug: the second latch.await call can throw InterruptedException but is not handled, and the overall approach of waiting after a timeout may not be the cleanest solution. The issue is real but the fix is incomplete.

Low
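A variant of the grace-period idea that also covers the unhandled InterruptedException noted in the review above might look like the following. The method name, its parameters, and the Runnable cancel hook are illustrative, not code from the PR:

```java
import java.io.IOException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class GraceAwait {

    // Waits for the latch; on timeout, cancels outstanding work and gives
    // in-flight workers a bounded grace period before failing. Interruption
    // during either wait is handled by the single catch block.
    public static void awaitWithGrace(CountDownLatch latch, long timeoutMillis,
                                      long graceMillis, Runnable cancel) throws IOException {
        try {
            if (!latch.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
                cancel.run(); // e.g. queue.clear() to stop workers taking new items
                latch.await(graceMillis, TimeUnit.MILLISECONDS); // bounded second wait
                throw new IOException("Timed out waiting for parallel translog downloads");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IOException("Interrupted during parallel translog download", e);
        }
    }
}
```

Keeping both awaits inside one try block is what closes the gap: the second await's InterruptedException flows into the same handler as the first.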
General
Log fallback to sequential download mode

When recoverySettings is null, the fallback value is 1 (sequential), which silently
disables the parallel download feature. This can happen when the no-RecoverySettings
constructor is used. Consider logging a debug/warning message when falling back to
sequential mode so operators are aware that parallel download is not active.

server/src/main/java/org/opensearch/index/translog/RemoteFsTranslog.java [189-192]

-int maxConcurrentStreams = recoverySettings != null
-    ? recoverySettings.getMaxConcurrentTranslogDownloadStreams()
-    : 1;
+int maxConcurrentStreams;
+if (recoverySettings != null) {
+    maxConcurrentStreams = recoverySettings.getMaxConcurrentTranslogDownloadStreams();
+} else {
+    logger.debug("RecoverySettings not available; falling back to sequential translog download");
+    maxConcurrentStreams = 1;
+}
 download(translogTransferManager, location, logger, config.shouldSeedRemote(), 0, maxConcurrentStreams);
Suggestion importance[1-10]: 2


Why: Adding a debug log when falling back to sequential mode is a minor observability improvement. The suggestion is valid but has low impact since the fallback behavior is already documented in the code structure.

Low

Previous suggestions

Suggestions up to commit 680aff8
Possible issue
Add cancellation flag to stop workers on timeout

When a timeout occurs, the method throws an IOException but the worker threads may
still be running since queue.clear() only prevents new work from being picked up —
threads already inside downloadTranslog() will continue executing. This can lead to
partial file writes or resource contention after the caller has already failed.
Consider using an AtomicBoolean cancellation flag checked inside the worker loop, or
interrupting the worker threads explicitly on timeout.

server/src/main/java/org/opensearch/index/translog/transfer/TranslogTransferManager.java [387-396]

+AtomicBoolean cancelled = new AtomicBoolean(false);
+// ... in worker loop: while (!cancelled.get() && failure.get() == null && (pair = queue.poll()) != null)
+
 try {
     long timeoutMillis = remoteStoreSettings.getClusterRemoteTranslogTransferTimeout().millis();
     if (!latch.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
-        queue.clear(); // signal workers to stop
+        cancelled.set(true);
+        queue.clear();
         throw new IOException("Timed out waiting for parallel translog downloads");
     }
 } catch (InterruptedException e) {
+    cancelled.set(true);
     Thread.currentThread().interrupt();
     throw new IOException("Interrupted during parallel translog download", e);
 }
Suggestion importance[1-10]: 5


Why: The concern about lingering worker threads after timeout is valid, but the improved_code only shows a partial solution (the comment // ... in worker loop is not actual code), making it incomplete. The issue is real but the fix as presented is not directly applicable.

Low
General
Restore coverage for both retry-triggering exception types

The test for "file found in second attempt" removed the original
when(...).thenThrow(toThrow).thenReturn(true) mock but the new doAnswer mock
increments downloadCounter on each call to downloadTranslogsParallel. However, the
retry loop in download() calls downloadOnce which calls downloadTranslogsParallel
once per attempt — so the counter logic is correct. But the test no longer verifies
that a FileNotFoundException (as opposed to only NoSuchFileException) also triggers
a retry, reducing test coverage for the failover behavior.

server/src/test/java/org/opensearch/index/translog/RemoteFsTranslogTests.java [1720-1728]

+String msg = "File not found";
+Exception toThrow = randomBoolean() ? new NoSuchFileException(msg) : new FileNotFoundException(msg);
+AtomicLong downloadCounter = new AtomicLong();
 doAnswer(invocation -> {
     if (downloadCounter.incrementAndGet() <= 1) {
-        throw new NoSuchFileException("File not found");
+        if (toThrow instanceof NoSuchFileException) throw (NoSuchFileException) toThrow;
+        else throw (FileNotFoundException) toThrow;
     } else if (downloadCounter.get() == 2) {
         Files.createFile(location.resolve(Translog.getCommitCheckpointFileName(generation)));
     }
     return null;
 }).when(mockTransfer).downloadTranslogsParallel(any(), any(), anyInt());
 
-// no exception thrown
-RemoteFsTranslog.download(mockTransfer, location, logger, false, 0);
-
Suggestion importance[1-10]: 5


Why: The suggestion correctly identifies that the refactored test no longer covers FileNotFoundException as a retry trigger, which reduces test coverage for an important failover behavior. The improved code restores this coverage with a valid approach.

Low
Consolidate null threadPool check with sequential fallback

If threadPool is null, the code falls back to threads = 1 (sequential), but the
threadPool.executor(...) call later will throw a NullPointerException if somehow
threads > 1 is reached through a different code path. More critically, when
threadPool is null and threads was already set to Math.min(size,
maxConcurrentStreams) before this block, setting threads = 1 is correct but the
sequential fallback block below only triggers on threads <= 1, so this is safe —
however, the null check should be consolidated with the sequential fallback to make
the logic clearer and avoid future bugs.

server/src/main/java/org/opensearch/index/translog/transfer/TranslogTransferManager.java [353-357]

-if (threadPool != null) {
-    threads = Math.min(threads, threadPool.info(ThreadPool.Names.TRANSLOG_TRANSFER).getMax());
-} else {
-    threads = 1;
+if (threadPool == null || threads <= 1) {
+    for (Tuple<String, String> p : generationPrimaryTermPairs) {
+        downloadTranslog(p.v1(), p.v2(), location);
+    }
+    return;
+}
+threads = Math.min(threads, threadPool.info(ThreadPool.Names.TRANSLOG_TRANSFER).getMax());
+if (threads <= 1) {
+    for (Tuple<String, String> p : generationPrimaryTermPairs) {
+        downloadTranslog(p.v1(), p.v2(), location);
+    }
+    return;
 }
Suggestion importance[1-10]: 4


Why: The suggestion improves code clarity by consolidating the null check with the sequential fallback, but the improved_code duplicates the sequential loop twice which is not clean. The existing logic is functionally correct and safe, so this is a minor style improvement.

Low
Replace magic number with named constant

The default fallback value of 1 when recoverySettings is null means sequential
download, which is safe but may be inconsistent with the intent of the feature. More
importantly, the recoverySettings field is set just before this code block, so it
will always be the value passed to the constructor (possibly null). The null check
is correct, but the magic number 1 should be extracted to a named constant for
clarity and maintainability.

server/src/main/java/org/opensearch/index/translog/RemoteFsTranslog.java [189-192]

 int maxConcurrentStreams = recoverySettings != null
     ? recoverySettings.getMaxConcurrentTranslogDownloadStreams()
-    : 1;
+    : RecoverySettings.DEFAULT_MAX_CONCURRENT_TRANSLOG_DOWNLOAD_STREAMS;
 download(translogTransferManager, location, logger, config.shouldSeedRemote(), 0, maxConcurrentStreams);
Suggestion importance[1-10]: 3


Why: Extracting the magic number 1 to a named constant improves maintainability, but the suggested constant RecoverySettings.DEFAULT_MAX_CONCURRENT_TRANSLOG_DOWNLOAD_STREAMS doesn't exist in the PR and would require additional changes to RecoverySettings.java.

Low
Suggestions up to commit de9745c
Possible issue
Prevent NullPointerException when threadPool is null

When threadPool is null, the code proceeds to the parallel path and calls
threadPool.executor(...), which will throw a NullPointerException. If threadPool is
null and threads > 1, the code should fall back to sequential execution instead of
attempting parallel execution.

server/src/main/java/org/opensearch/index/translog/transfer/TranslogTransferManager.java [352-363]

 int threads = Math.min(generationPrimaryTermPairs.size(), maxConcurrentStreams);
 if (threadPool != null) {
     threads = Math.min(threads, threadPool.info(ThreadPool.Names.TRANSLOG_TRANSFER).getMax());
+} else {
+    threads = 1;
 }
 
 // Fall back to sequential for single thread (avoids overhead)
 if (threads <= 1) {
     for (Tuple<String, String> p : generationPrimaryTermPairs) {
         downloadTranslog(p.v1(), p.v2(), location);
     }
     return;
 }
Suggestion importance[1-10]: 9


Why: This is a critical bug: when threadPool is null and maxConcurrentStreams > 1 with multiple generations, the code skips the null-check branch but then calls threadPool.executor(...) at line 370, causing a NullPointerException. Setting threads = 1 when threadPool is null correctly forces sequential fallback.

High
Clear queue on timeout to stop worker threads

When a timeout occurs, the worker threads may still be running and processing items
from the queue. You should clear the queue and signal workers to stop when a timeout
is detected, similar to how failures are handled. Without this, threads may continue
running after the calling thread has already thrown the timeout exception.

server/src/main/java/org/opensearch/index/translog/transfer/TranslogTransferManager.java [385-397]

 try {
     long timeoutMillis = remoteStoreSettings.getClusterRemoteTranslogTransferTimeout().millis();
     if (!latch.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
+        queue.clear();
         throw new IOException("Timed out waiting for parallel translog downloads");
     }
 } catch (InterruptedException e) {
     Thread.currentThread().interrupt();
+    queue.clear();
     throw new IOException("Interrupted during parallel translog download", e);
 }
 
 if (failure.get() != null) {
     throw failure.get();
 }
Suggestion importance[1-10]: 6


Why: When a timeout occurs, worker threads continue consuming from the queue unnecessarily. Adding queue.clear() on timeout and interrupt paths is a valid improvement to prevent wasted work, though threads will naturally stop once the queue is empty anyway.

Low
General
Replace flaky timing assertion with concurrency verification

This timing-based assertion is inherently flaky and can fail on slow CI machines or
under high load. The test asserts that parallel execution completes in less than 3 *
2 * 50ms = 300ms, but this is not a reliable guarantee. Consider verifying
parallelism through other means, such as tracking concurrent execution counts,
rather than wall-clock time.

server/src/test/java/org/opensearch/index/translog/transfer/TranslogTransferManagerTests.java [1019-1049]

-long delayMs = 50;
-...
-// Parallel execution should be faster than sequential (3 * 2 * delayMs)
-assertTrue("Expected parallel download faster than sequential, elapsed=" + elapsed, elapsed < 3 * 2 * delayMs);
+// Use a counter to verify concurrent execution instead of timing
+AtomicInteger maxConcurrent = new AtomicInteger(0);
+AtomicInteger currentConcurrent = new AtomicInteger(0);
+// ... instrument mocks to track concurrency and assert maxConcurrent.get() > 1
Suggestion importance[1-10]: 4


Why: The timing-based assertion elapsed < 3 * 2 * delayMs is inherently flaky on slow CI machines. However, the improved code snippet is incomplete and doesn't provide a ready-to-use replacement, making it harder to apply directly.

Low
Suggestions up to commit d9a49de
Possible issue
Guard against null threadPool in parallel path

When threadPool is null, the code falls through to the parallel path and submits
tasks via threadPool.executor(...), which will throw a NullPointerException. If
threadPool is null and threads > 1, the parallel execution path must not be taken.
Add a null-check guard to force sequential execution when threadPool is null.

server/src/main/java/org/opensearch/index/translog/transfer/TranslogTransferManager.java [352-366]

 int threads = Math.min(generationPrimaryTermPairs.size(), maxConcurrentStreams);
 if (threadPool != null) {
     threads = Math.min(threads, threadPool.info(ThreadPool.Names.TRANSLOG_TRANSFER).getMax());
 }
 
-// Fall back to sequential for single thread (avoids overhead)
-if (threads <= 1) {
+// Fall back to sequential for single thread or no threadPool (avoids overhead / NPE)
+if (threads <= 1 || threadPool == null) {
     for (Tuple<String, String> p : generationPrimaryTermPairs) {
         downloadTranslog(p.v1(), p.v2(), location);
     }
     return;
 }
 
 ConcurrentLinkedQueue<Tuple<String, String>> queue = new ConcurrentLinkedQueue<>(generationPrimaryTermPairs);
 CountDownLatch latch = new CountDownLatch(threads);
Suggestion importance[1-10]: 9


Why: When threadPool is null and maxConcurrentStreams > 1 with multiple generations, the code skips the null-check branch and proceeds to call threadPool.executor(...), causing a NullPointerException. This is a real bug that needs to be fixed.

High
Signal workers to stop on timeout

When a timeout occurs, the worker threads are still running and may continue to
write to location or modify shared state after the calling thread has already thrown
an exception. You should interrupt or signal the workers to stop (e.g., by setting a
flag or clearing the queue) before throwing the timeout exception to avoid resource
leaks and partial writes.

server/src/main/java/org/opensearch/index/translog/transfer/TranslogTransferManager.java [386-388]

 if (!latch.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
+    queue.clear(); // signal workers to stop
     throw new IOException("Timed out waiting for parallel translog downloads");
 }
Suggestion importance[1-10]: 7


Why: When a timeout occurs, worker threads continue running and may cause partial writes or resource leaks. Clearing the queue on timeout is a simple and effective way to signal workers to stop, improving correctness.

Medium
General
Fix mismatched mock method in retry test

The test for "file found in second attempt" uses mockTransfer.readMetadata()
(no-arg) while the first mock used readMetadata(0). The download method called in
the test uses a timestamp of 0, so it likely calls readMetadata(0). If the no-arg
and single-arg overloads are different, the mock won't match and the test will fail
or behave unexpectedly.

server/src/test/java/org/opensearch/index/translog/RemoteFsTranslogTests.java [1715-1728]

 // File not found in first attempt . File found in second attempt.
 mockTransfer = mock(TranslogTransferManager.class);
-when(mockTransfer.readMetadata()).thenReturn(translogTransferMetadata);
+when(mockTransfer.readMetadata(0)).thenReturn(translogTransferMetadata);
 when(mockTransfer.getRemoteTranslogTransferTracker()).thenReturn(remoteTranslogTransferTracker);
 
 AtomicLong downloadCounter = new AtomicLong();
 doAnswer(invocation -> {
     if (downloadCounter.incrementAndGet() <= 1) {
         throw new NoSuchFileException("File not found");
     } else if (downloadCounter.get() == 2) {
         Files.createFile(location.resolve(Translog.getCommitCheckpointFileName(generation)));
     }
     return null;
 }).when(mockTransfer).downloadTranslogsParallel(any(), any(), anyInt());
Suggestion importance[1-10]: 6


Why: The second mock uses readMetadata() (no-arg) while the download call passes timestamp=0, which likely invokes readMetadata(0). This mismatch could cause the test to fail or behave unexpectedly, making this a valid correctness concern.

Low
Reduce flakiness in timing-based test assertion

The timing-based assertion (elapsed < 3 * 2 * delayMs) is fragile and can produce
flaky test failures in slow CI environments or under load. The assertion boundary of
300ms (3 * 2 * 50ms) provides no safety margin. Consider either removing the timing
assertion or using a more generous threshold (e.g., elapsed < 3 * 2 * delayMs * 3)
to reduce flakiness.

server/src/test/java/org/opensearch/index/translog/transfer/TranslogTransferManagerTests.java [1039-1049]

 long start = System.currentTimeMillis();
 parallelManager.downloadTranslogsParallel(pairs, location, 4);
 long elapsed = System.currentTimeMillis() - start;
 
 // All 6 files should exist
 for (long gen = 1; gen <= 3; gen++) {
     assertTrue(Files.exists(location.resolve("translog-" + gen + ".tlog")));
     assertTrue(Files.exists(location.resolve("translog-" + gen + ".ckp")));
 }
-// Parallel execution should be faster than sequential (3 * 2 * delayMs)
-assertTrue("Expected parallel download faster than sequential, elapsed=" + elapsed, elapsed < 3 * 2 * delayMs);
+// Parallel execution should be significantly faster than sequential (3 * 2 * delayMs)
+// Use a generous multiplier to avoid flakiness in slow environments
+assertTrue("Expected parallel download faster than sequential, elapsed=" + elapsed, elapsed < 3 * 2 * delayMs * 3);
Suggestion importance[1-10]: 5


Why: The timing assertion with a tight boundary of 300ms is prone to flakiness in slow CI environments. Using a more generous multiplier reduces false failures while still validating parallel behavior.

Low

@github-actions
Contributor

Persistent review updated to latest commit de9745c

@github-actions
Contributor

❌ Gradle check result for de9745c: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@ThyTran1402 ThyTran1402 force-pushed the fix/concurrent_translog_download_failover branch from de9745c to 680aff8 on March 18, 2026 at 16:18
@github-actions
Contributor

Persistent review updated to latest commit 680aff8

Signed-off-by: Thy Tran <58045538+ThyTran1402@users.noreply.github.com>
@ThyTran1402 ThyTran1402 force-pushed the fix/concurrent_translog_download_failover branch from 680aff8 to f5f393a on March 18, 2026 at 16:26
@github-actions
Contributor

Persistent review updated to latest commit f5f393a

@github-actions
Contributor

❌ Gradle check result for f5f393a: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?


Labels

enhancement (Enhancement or improvement to existing feature or request), good first issue (Good for newcomers), Storage:Remote

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

[Remote Translog] Download Translog files concurrently

1 participant