@RS146BIJAY
Contributor

Description

Adding chaos testing for Context Aware Segments during indexing, refresh, and update/delete operations for the following scenarios:

  1. Disk full scenarios
  2. JRE crash scenarios
  3. OutOfFileDescriptors scenarios

@coderabbitai
Contributor

coderabbitai bot commented Jan 23, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough


This pull request introduces test infrastructure enhancements for engine reliability testing. It adds a utility class for suppressing merge scheduler exceptions, three new chaos test classes for simulating disk-full, JRE-crash, and file descriptor exhaustion scenarios, and extends the test framework with context-aware document creation helpers.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Test Utility: server/src/test/java/org/apache/lucene/index/IndexWriterUtil.java | New utility class with suppressMergePolicyException() method that suppresses thread-exception output from ConcurrentMergeScheduler during test execution. |
| Chaos Test Classes: server/src/test/java/org/opensearch/index/engine/chaos/InternalEngineOnDiskWriterFullTests.java, InternalEngineOnJRECrashTests.java, InternalEngineOnOutOfFileDescriptorsTests.java | Three new test classes for engine failure scenarios: InternalEngineOnDiskWriterFullTests verifies exception handling when disk fills during document indexing and refresh; InternalEngineOnJRECrashTests simulates JRE interruption with variable crash delays (10ms–10s) and validates data persistence; InternalEngineOnOutOfFileDescriptorsTests injects IO errors and verifies engine behavior under file descriptor exhaustion. All classes use mock directory wrappers, custom IndexWriter configuration, and CONTEXT_AWARE_MIGRATION_EXPERIMENTAL_FLAG. |
| Test Framework Enhancement: test/framework/src/main/java/org/opensearch/index/engine/EngineTestCase.java | Adds testContextSpecificDocumentWithTenantField() helper method for creating context-aware documents with tenant grouping criteria; enhances context-aware index mapping with grouping_criteria field and script-based context-aware grouping configuration. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

lucene, Indexing & Search

🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Description check | ❓ Inconclusive | The description covers the main purpose (chaos testing for Context Aware Segments) and lists three key scenarios, but is missing related issue reference and checklist completion required by the template. | Add the 'Resolves #[Issue number]' section and complete the checklist items to fully align with the repository's pull request description template. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately summarizes the main change: adding chaos test cases for Context Aware Segments, which matches the pull request's objective of introducing multiple chaos testing scenarios. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@RS146BIJAY changed the title from "Adding chaos testing for Context Aware Segments" to "Adding chaos test cases for Context Aware Segments" on Jan 23, 2026
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In
`@server/src/test/java/org/opensearch/index/engine/chaos/InternalEngineOnJRECrashTests.java`:
- Around line 95-99: The test currently calls started.await() after
testThread.start(), which can hang the CI if the worker fails to signal; modify
the wait to use a bounded timeout (e.g., started.await(timeout, timeUnit)) and
fail the test with a clear error if the latch times out. Update the code around
testThread.start()/started.await() in InternalEngineOnJRECrashTests so the start
latch uses a sensible timeout and asserts that the await returned true (or
throws/records failure) before proceeding.

In
`@server/src/test/java/org/opensearch/index/engine/chaos/InternalEngineOnOutOfFileDescriptorsTests.java`:
- Around line 44-101: The test reuses hitException for both the initial
low-error phase and the forced-failure phase, which can hide a missing exception
in phase‑2; introduce a separate boolean (e.g., hitForcedFailure) and use it
only for the second try/catch that indexes with
dir.setRandomIOExceptionRateOnOpen(rate) and the subsequent
engine.refresh("testing"), set hitForcedFailure = true in that catch
(EngineException e) and assertTrue(hitForcedFailure) after that block while
keeping the original hitException checks/assertFalse for the first phase (the
variables to change are hitException and the new hitForcedFailure used around
the second indexing/refresh try/catch).

In
`@test/framework/src/main/java/org/opensearch/index/engine/EngineTestCase.java`:
- Around line 826-836: The override of index(Index index) unconditionally calls
ParseContext.Document.setGroupingCriteria("grouping_criteria") which will
overwrite caller-provided tenant/grouping values (e.g., those set by
testContextSpecificDocumentWithTenantField); change the logic in the index(Index
index) override (inside the config().getIndexSettings().isContextAwareEnabled()
branch) to only call setGroupingCriteria("grouping_criteria") when the document
has no grouping criteria already (e.g., check the document's existing grouping
value via the appropriate getter or null/empty check before calling
setGroupingCriteria), so provided grouping criteria are preserved.
♻️ Duplicate comments (1)
test/framework/src/main/java/org/opensearch/index/engine/EngineTestCase.java (1)

854-863: Same grouping-criteria overwrite concern here.
This override has the same unconditional clobbering behavior as the earlier block.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5041d43 and 231cb15.

📒 Files selected for processing (5)
  • server/src/test/java/org/apache/lucene/index/IndexWriterUtil.java
  • server/src/test/java/org/opensearch/index/engine/chaos/InternalEngineOnDiskWriterFullTests.java
  • server/src/test/java/org/opensearch/index/engine/chaos/InternalEngineOnJRECrashTests.java
  • server/src/test/java/org/opensearch/index/engine/chaos/InternalEngineOnOutOfFileDescriptorsTests.java
  • test/framework/src/main/java/org/opensearch/index/engine/EngineTestCase.java
🧰 Additional context used
🧬 Code graph analysis (2)
server/src/test/java/org/opensearch/index/engine/chaos/InternalEngineOnJRECrashTests.java (3)
libs/core/src/main/java/org/opensearch/core/common/bytes/BytesArray.java (1)
  • BytesArray (48-151)
server/src/main/java/org/opensearch/index/engine/InternalEngine.java (1)
  • InternalEngine (143-3057)
server/src/main/java/org/opensearch/common/util/FeatureFlags.java (1)
  • FeatureFlags (29-341)
server/src/test/java/org/opensearch/index/engine/chaos/InternalEngineOnOutOfFileDescriptorsTests.java (2)
server/src/test/java/org/apache/lucene/index/IndexWriterUtil.java (1)
  • IndexWriterUtil (15-24)
server/src/main/java/org/opensearch/common/util/FeatureFlags.java (1)
  • FeatureFlags (29-341)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (21)
  • GitHub Check: gradle-check
  • GitHub Check: detect-breaking-change
  • GitHub Check: precommit (21, windows-2025, true)
  • GitHub Check: precommit (21, macos-15)
  • GitHub Check: precommit (21, ubuntu-latest)
  • GitHub Check: precommit (25, macos-15-intel)
  • GitHub Check: precommit (25, ubuntu-24.04-arm)
  • GitHub Check: precommit (25, ubuntu-latest)
  • GitHub Check: precommit (21, macos-15-intel)
  • GitHub Check: precommit (25, macos-15)
  • GitHub Check: precommit (21, ubuntu-24.04-arm)
  • GitHub Check: precommit (21, windows-latest)
  • GitHub Check: precommit (25, windows-latest)
  • GitHub Check: Analyze (java)
  • GitHub Check: assemble (21, ubuntu-latest)
  • GitHub Check: assemble (21, windows-latest)
  • GitHub Check: assemble (25, ubuntu-24.04-arm)
  • GitHub Check: assemble (21, ubuntu-24.04-arm)
  • GitHub Check: assemble (25, ubuntu-latest)
  • GitHub Check: assemble (25, windows-latest)
  • GitHub Check: Mend Security Check
🔇 Additional comments (7)
server/src/test/java/org/apache/lucene/index/IndexWriterUtil.java (1)

15-23: LGTM — utility cleanly suppresses merge-scheduler noise in tests.

test/framework/src/main/java/org/opensearch/index/engine/EngineTestCase.java (2)

398-403: Nice addition for tenant-scoped docs.


1695-1696: LGTM — mapping update aligns with context-aware grouping.

server/src/test/java/org/opensearch/index/engine/chaos/InternalEngineOnDiskWriterFullTests.java (2)

90-133: LGTM — refresh failure path is asserted clearly.


71-83: Improve exception handling to catch the actual failures from disk-full.
The suggested IOException | AlreadyClosedException does not account for RefreshFailedEngineException, which is thrown by engine.refresh() during disk-full conditions (confirmed by the companion test testEngineRefreshOnDiskFull that explicitly expects this). Since the loop calls both engine.index() and engine.refresh(), catch EngineException instead, which covers RefreshFailedEngineException and provides a more precise failure specification than broad Exception.

🧹 Suggested tightening
-            } catch (Exception ex) {
+            } catch (EngineException ex) {
                 hitException = true;
             }

Likely an incorrect or invalid review comment.

server/src/test/java/org/opensearch/index/engine/chaos/InternalEngineOnJRECrashTests.java (2)

37-55: LGTM — workload loop is straightforward and tenant-aware.


122-140: LGTM — good coverage across crash delays.

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

Comment on lines +95 to +96
// Start the indexing thread.
testThread.start();
// Wait for the indexing thread to start.
started.await();


⚠️ Potential issue | 🟡 Minor

Add a timeout to the start latch to prevent hangs.
started.await() without a timeout can hang CI if the worker thread fails to start.

⏱️ Suggested safeguard
-                started.await();
+                assertTrue(started.await(5, TimeUnit.SECONDS));
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
  // Start the indexing thread.
  testThread.start();
  // Wait for the indexing thread to start.
- started.await();
+ assertTrue(started.await(5, TimeUnit.SECONDS));
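
For reference, the bounded-wait pattern can be sketched in isolation. This is a minimal illustration, not the code from InternalEngineOnJRECrashTests: the class BoundedLatchDemo, the method startAndAwait, and the signalStart switch are hypothetical names introduced here. Only CountDownLatch.await(long, TimeUnit) and its boolean return value come from the JDK.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; names are illustrative and not part of the PR.
final class BoundedLatchDemo {

    /**
     * Starts a worker thread and waits at most timeoutMillis for it to
     * signal that it is running. Returns false if the signal never arrives.
     */
    static boolean startAndAwait(Runnable work, boolean signalStart, long timeoutMillis) {
        CountDownLatch started = new CountDownLatch(1);
        Thread worker = new Thread(() -> {
            if (signalStart) {
                started.countDown(); // tell the caller the worker is running
            }
            work.run();
        }, "demo-indexing-thread");
        worker.setDaemon(true);
        worker.start();
        try {
            // Bounded wait: returns false on timeout instead of blocking forever.
            return started.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("healthy worker signalled: " + startAndAwait(() -> {}, true, 5_000));
        System.out.println("silent worker signalled:  " + startAndAwait(() -> {}, false, 100));
    }
}
```

In a test, the returned boolean would feed assertTrue, so a worker that never signals fails the test after the timeout rather than stalling the CI run.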

Comment on lines 44 to 117
boolean hitException = false;
final AtomicLong globalCheckpoint = new AtomicLong(SequenceNumbers.NO_OPS_PERFORMED);
final IndexSettings indexSettings = IndexSettingsModule.newIndexSettings(
    "test",
    Settings.builder()
        .put(defaultSettings.getSettings())
        .put(IndexSettings.INDEX_CONTEXT_AWARE_ENABLED_SETTING.getKey(), true)
        .build()
);
try (
    Store store = createStore(dir);
    InternalEngine engine = createEngine(
        indexSettings,
        store,
        primaryTranslogDir,
        newMergePolicy(),
        indexWriterFactory,
        null,
        globalCheckpoint::get
    )
) {
    int numDocsFirstSegment = randomIntBetween(50, 100);
    try {
        for (int i = 0; i < numDocsFirstSegment; i++) {
            String id = Integer.toString(i);
            ParsedDocument doc = testParsedDocument(id, null, testDocument(), B_1, null);
            engine.index(indexForDoc(doc));
        }
    } catch (IOException ex) {
        hitException = true;
    }

    assertFalse(hitException);
    assertTrue(DirectoryReader.indexExists(dir));

    try {
        engine.refresh("testing");
    } catch (EngineException e) {
        hitException = true;
    }

    assertTrue(DirectoryReader.indexExists(dir));
    rate = 1.0;
    dir.setRandomIOExceptionRateOnOpen(rate);
    try {
        for (int i = numDocsFirstSegment; i < numDocsFirstSegment + numDocsFirstSegment; i++) {
            String id = Integer.toString(i);
            ParsedDocument doc = testParsedDocument(id, null, testDocument(), B_1, null);
            engine.index(indexForDoc(doc));
        }

        engine.refresh("testing");
    } catch (EngineException e) {
        hitException = true;
    }

    assertTrue(hitException);
    assertTrue(DirectoryReader.indexExists(dir));

⚠️ Potential issue | 🟡 Minor

Separate failure flags for the two phases.
hitException is reused across the low-rate and 100% rate phases, so a phase‑1 failure can mask a missing phase‑2 failure. A dedicated flag for the forced-failure phase makes the assertion precise.

🧪 Suggested adjustment
-            rate = 1.0;
+            boolean phase2Failed = false;
+            rate = 1.0;
             dir.setRandomIOExceptionRateOnOpen(rate);
             try {
                 for (int i = numDocsFirstSegment; i < numDocsFirstSegment + numDocsFirstSegment; i++) {
                     String id = Integer.toString(i);
                     ParsedDocument doc = testParsedDocument(id, null, testDocument(), B_1, null);
                     engine.index(indexForDoc(doc));
                 }

                 engine.refresh("testing");
             } catch (EngineException e) {
-                hitException = true;
+                phase2Failed = true;
             }

-            assertTrue(hitException);
+            assertTrue(phase2Failed);
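
The masking effect can be reproduced outside the engine with a small sketch. Everything here is an assumption made for illustration: FlakyStore is a hypothetical stand-in for the mock directory, and PhaseFlagsDemo and runPhases are invented names. The point is only that a flag shared by both phases stays true after an early failure, while a dedicated flag reflects the second phase alone.

```java
import java.io.IOException;

// Hypothetical demo class; it does not use any OpenSearch or Lucene types.
final class PhaseFlagsDemo {

    /** Stand-in for a directory that fails every write once failureRate reaches 1.0. */
    static final class FlakyStore {
        double failureRate = 0.0;

        void write(int docId) throws IOException {
            if (failureRate >= 1.0) {
                throw new IOException("simulated failure for doc " + docId);
            }
        }
    }

    /** Runs two write phases and returns {phase1Failed, phase2Failed}. */
    static boolean[] runPhases(double phase1Rate, double phase2Rate) {
        FlakyStore store = new FlakyStore();

        store.failureRate = phase1Rate;
        boolean phase1Failed = false;
        try {
            for (int i = 0; i < 10; i++) {
                store.write(i);
            }
        } catch (IOException e) {
            phase1Failed = true;
        }

        store.failureRate = phase2Rate;
        boolean phase2Failed = false; // dedicated flag, never set by phase 1
        try {
            for (int i = 10; i < 20; i++) {
                store.write(i);
            }
        } catch (IOException e) {
            phase2Failed = true;
        }
        return new boolean[] { phase1Failed, phase2Failed };
    }

    public static void main(String[] args) {
        // Phase 1 fails, phase 2 unexpectedly succeeds.
        boolean[] result = runPhases(1.0, 0.0);
        boolean sharedFlag = result[0] || result[1]; // what a single reused flag would hold
        System.out.println("shared flag:    " + sharedFlag);
        System.out.println("dedicated flag: " + result[1]);
    }
}
```

With a shared flag, an assertion on the forced-failure phase would pass in the scenario above even though that phase raised nothing. The dedicated flag makes the same assertion fail.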

Comment on lines +826 to +837

@Override
public IndexResult index(Index index) throws IOException {
    if (config().getIndexSettings().isContextAwareEnabled()) {
        for (ParseContext.Document doc : index.docs()) {
            doc.setGroupingCriteria("grouping_criteria");
        }
    }

    return super.index(index);
}

⚠️ Potential issue | 🟡 Minor

Avoid overwriting an explicit grouping criteria.
The unconditional setGroupingCriteria("grouping_criteria") will clobber tenant-specific values set by testContextSpecificDocumentWithTenantField(...). Consider only applying a default when none is provided.

🔧 Suggested guard to preserve caller-provided criteria
-                    if (config().getIndexSettings().isContextAwareEnabled()) {
-                        for (ParseContext.Document doc : index.docs()) {
-                            doc.setGroupingCriteria("grouping_criteria");
-                        }
-                    }
+                    if (config().getIndexSettings().isContextAwareEnabled()) {
+                        for (ParseContext.Document doc : index.docs()) {
+                            if (doc.getGroupingCriteria() == null) {
+                                doc.setGroupingCriteria("grouping_criteria");
+                            }
+                        }
+                    }
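
The default-only-when-absent behavior can be checked with a standalone sketch. The Doc class below is a hypothetical stand-in and is not ParseContext.Document. Whether the real class exposes a getGroupingCriteria() getter is an assumption carried over from the suggestion above, and GroupingDefaultDemo and applyDefault are invented names.

```java
import java.util.List;

// Hypothetical demo class; Doc only mimics the getter/setter pair discussed above.
final class GroupingDefaultDemo {

    /** Stand-in document holding an optional grouping criteria value. */
    static final class Doc {
        private String groupingCriteria;

        Doc(String groupingCriteria) {
            this.groupingCriteria = groupingCriteria;
        }

        String getGroupingCriteria() {
            return groupingCriteria;
        }

        void setGroupingCriteria(String groupingCriteria) {
            this.groupingCriteria = groupingCriteria;
        }
    }

    /** Applies the default only to documents that carry no criteria yet. */
    static void applyDefault(List<Doc> docs, String defaultCriteria) {
        for (Doc doc : docs) {
            if (doc.getGroupingCriteria() == null) {
                doc.setGroupingCriteria(defaultCriteria);
            }
        }
    }

    public static void main(String[] args) {
        List<Doc> docs = List.of(new Doc("tenant-a"), new Doc(null));
        applyDefault(docs, "grouping_criteria");
        for (Doc doc : docs) {
            System.out.println(doc.getGroupingCriteria());
        }
    }
}
```

A document created with an explicit tenant value keeps it, and only the document without a value receives the default.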

@github-actions
Contributor

❌ Gradle check result for 231cb15: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: RS146BIJAY <rishavsagar4b1@gmail.com>
@github-actions
Contributor

❌ Gradle check result for dbedebd: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
