-
Notifications
You must be signed in to change notification settings - Fork 2.4k
Adding chaos test cases for Context Aware Segments #20468
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the 📝 WalkthroughWalkthroughThis pull request introduces test infrastructure enhancements for engine reliability testing. It adds a utility class for suppressing merge scheduler exceptions, three new chaos test classes for simulating disk-full, JRE-crash, and file descriptor exhaustion scenarios, and extends the test framework with context-aware document creation helpers. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested labels
🚥 Pre-merge checks | ✅ 1 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In
`@server/src/test/java/org/opensearch/index/engine/chaos/InternalEngineOnJRECrashTests.java`:
- Around line 95-99: The test currently calls started.await() after
testThread.start(), which can hang the CI if the worker fails to signal; modify
the wait to use a bounded timeout (e.g., started.await(timeout, timeUnit)) and
fail the test with a clear error if the latch times out. Update the code around
testThread.start()/started.await() in InternalEngineOnJRECrashTests so the start
latch uses a sensible timeout and asserts that the await returned true (or
throws/records failure) before proceeding.
In
`@server/src/test/java/org/opensearch/index/engine/chaos/InternalEngineOnOutOfFileDescriptorsTests.java`:
- Around line 44-101: The test reuses hitException for both the initial
low-error phase and the forced-failure phase, which can hide a missing exception
in phase‑2; introduce a separate boolean (e.g., hitForcedFailure) and use it
only for the second try/catch that indexes with
dir.setRandomIOExceptionRateOnOpen(rate) and the subsequent
engine.refresh("testing"), set hitForcedFailure = true in that catch
(EngineException e) and assertTrue(hitForcedFailure) after that block while
keeping the original hitException checks/assertFalse for the first phase (the
variables to change are hitException and the new hitForcedFailure used around
the second indexing/refresh try/catch).
In
`@test/framework/src/main/java/org/opensearch/index/engine/EngineTestCase.java`:
- Around line 826-836: The override of index(Index index) unconditionally calls
ParseContext.Document.setGroupingCriteria("grouping_criteria") which will
overwrite caller-provided tenant/grouping values (e.g., those set by
testContextSpecificDocumentWithTenantField); change the logic in the index(Index
index) override (inside the config().getIndexSettings().isContextAwareEnabled()
branch) to only call setGroupingCriteria("grouping_criteria") when the document
has no grouping criteria already (e.g., check the document's existing grouping
value via the appropriate getter or null/empty check before calling
setGroupingCriteria), so provided grouping criteria are preserved.
♻️ Duplicate comments (1)
test/framework/src/main/java/org/opensearch/index/engine/EngineTestCase.java (1)
854-863: Same grouping-criteria overwrite concern here.
This override has the same unconditional clobbering behavior as the earlier block.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
server/src/test/java/org/apache/lucene/index/IndexWriterUtil.javaserver/src/test/java/org/opensearch/index/engine/chaos/InternalEngineOnDiskWriterFullTests.javaserver/src/test/java/org/opensearch/index/engine/chaos/InternalEngineOnJRECrashTests.javaserver/src/test/java/org/opensearch/index/engine/chaos/InternalEngineOnOutOfFileDescriptorsTests.javatest/framework/src/main/java/org/opensearch/index/engine/EngineTestCase.java
🧰 Additional context used
🧬 Code graph analysis (2)
server/src/test/java/org/opensearch/index/engine/chaos/InternalEngineOnJRECrashTests.java (3)
libs/core/src/main/java/org/opensearch/core/common/bytes/BytesArray.java (1)
BytesArray(48-151)server/src/main/java/org/opensearch/index/engine/InternalEngine.java (1)
InternalEngine(143-3057)server/src/main/java/org/opensearch/common/util/FeatureFlags.java (1)
FeatureFlags(29-341)
server/src/test/java/org/opensearch/index/engine/chaos/InternalEngineOnOutOfFileDescriptorsTests.java (2)
server/src/test/java/org/apache/lucene/index/IndexWriterUtil.java (1)
IndexWriterUtil(15-24)server/src/main/java/org/opensearch/common/util/FeatureFlags.java (1)
FeatureFlags(29-341)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (21)
- GitHub Check: gradle-check
- GitHub Check: detect-breaking-change
- GitHub Check: precommit (21, windows-2025, true)
- GitHub Check: precommit (21, macos-15)
- GitHub Check: precommit (21, ubuntu-latest)
- GitHub Check: precommit (25, macos-15-intel)
- GitHub Check: precommit (25, ubuntu-24.04-arm)
- GitHub Check: precommit (25, ubuntu-latest)
- GitHub Check: precommit (21, macos-15-intel)
- GitHub Check: precommit (25, macos-15)
- GitHub Check: precommit (21, ubuntu-24.04-arm)
- GitHub Check: precommit (21, windows-latest)
- GitHub Check: precommit (25, windows-latest)
- GitHub Check: Analyze (java)
- GitHub Check: assemble (21, ubuntu-latest)
- GitHub Check: assemble (21, windows-latest)
- GitHub Check: assemble (25, ubuntu-24.04-arm)
- GitHub Check: assemble (21, ubuntu-24.04-arm)
- GitHub Check: assemble (25, ubuntu-latest)
- GitHub Check: assemble (25, windows-latest)
- GitHub Check: Mend Security Check
🔇 Additional comments (7)
server/src/test/java/org/apache/lucene/index/IndexWriterUtil.java (1)
15-23: LGTM — utility cleanly suppresses merge-scheduler noise in tests.test/framework/src/main/java/org/opensearch/index/engine/EngineTestCase.java (2)
398-403: Nice addition for tenant-scoped docs.
1695-1696: LGTM — mapping update aligns with context-aware grouping.server/src/test/java/org/opensearch/index/engine/chaos/InternalEngineOnDiskWriterFullTests.java (2)
90-133: LGTM — refresh failure path is asserted clearly.
71-83: Improve exception handling to catch the actual failures from disk-full.
The suggestedIOException | AlreadyClosedExceptiondoes not account forRefreshFailedEngineException, which is thrown byengine.refresh()during disk-full conditions (confirmed by the companion testtestEngineRefreshOnDiskFullthat explicitly expects this). Since the loop calls bothengine.index()andengine.refresh(), catchEngineExceptioninstead, which coversRefreshFailedEngineExceptionand provides a more precise failure specification than broadException.🧹 Suggested tightening
- } catch (Exception ex) { + } catch (EngineException ex) { hitException = true; }Likely an incorrect or invalid review comment.
server/src/test/java/org/opensearch/index/engine/chaos/InternalEngineOnJRECrashTests.java (2)
37-55: LGTM — workload loop is straightforward and tenant-aware.
122-140: LGTM — good coverage across crash delays.
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
| // Start the indexing thread. | ||
| testThread.start(); | ||
| // Wait for the indexing thread to start. | ||
| started.await(); | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a timeout to the start latch to prevent hangs.
started.await() without a timeout can hang CI if the worker thread fails to start.
⏱️ Suggested safeguard
- started.await();
+ assertTrue(started.await(5, TimeUnit.SECONDS));📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // Start the indexing thread. | |
| testThread.start(); | |
| // Wait for the indexing thread to start. | |
| started.await(); | |
| // Start the indexing thread. | |
| testThread.start(); | |
| // Wait for the indexing thread to start. | |
| assertTrue(started.await(5, TimeUnit.SECONDS)); | |
🤖 Prompt for AI Agents
In
`@server/src/test/java/org/opensearch/index/engine/chaos/InternalEngineOnJRECrashTests.java`
around lines 95 - 99, The test currently calls started.await() after
testThread.start(), which can hang the CI if the worker fails to signal; modify
the wait to use a bounded timeout (e.g., started.await(timeout, timeUnit)) and
fail the test with a clear error if the latch times out. Update the code around
testThread.start()/started.await() in InternalEngineOnJRECrashTests so the start
latch uses a sensible timeout and asserts that the await returned true (or
throws/records failure) before proceeding.
| boolean hitException = false; | ||
| final AtomicLong globalCheckpoint = new AtomicLong(SequenceNumbers.NO_OPS_PERFORMED); | ||
| final IndexSettings indexSettings = IndexSettingsModule.newIndexSettings( | ||
| "test", | ||
| Settings.builder() | ||
| .put(defaultSettings.getSettings()) | ||
| .put(IndexSettings.INDEX_CONTEXT_AWARE_ENABLED_SETTING.getKey(), true) | ||
| .build() | ||
| ); | ||
| try ( | ||
| Store store = createStore(dir); | ||
| InternalEngine engine = createEngine( | ||
| indexSettings, | ||
| store, | ||
| primaryTranslogDir, | ||
| newMergePolicy(), | ||
| indexWriterFactory, | ||
| null, | ||
| globalCheckpoint::get | ||
| ) | ||
| ) { | ||
| int numDocsFirstSegment = randomIntBetween(50, 100); | ||
| try { | ||
| for (int i = 0; i < numDocsFirstSegment; i++) { | ||
| String id = Integer.toString(i); | ||
| ParsedDocument doc = testParsedDocument(id, null, testDocument(), B_1, null); | ||
| engine.index(indexForDoc(doc)); | ||
| } | ||
| } catch (IOException ex) { | ||
| hitException = true; | ||
| } | ||
|
|
||
| assertFalse(hitException); | ||
| assertTrue(DirectoryReader.indexExists(dir)); | ||
|
|
||
| try { | ||
| engine.refresh("testing"); | ||
| } catch (EngineException e) { | ||
| hitException = true; | ||
| } | ||
|
|
||
| assertTrue(DirectoryReader.indexExists(dir)); | ||
| rate = 1.0; | ||
| dir.setRandomIOExceptionRateOnOpen(rate); | ||
| try { | ||
| for (int i = numDocsFirstSegment; i < numDocsFirstSegment + numDocsFirstSegment; i++) { | ||
| String id = Integer.toString(i); | ||
| ParsedDocument doc = testParsedDocument(id, null, testDocument(), B_1, null); | ||
| engine.index(indexForDoc(doc)); | ||
| } | ||
|
|
||
| engine.refresh("testing"); | ||
| } catch (EngineException e) { | ||
| hitException = true; | ||
| } | ||
|
|
||
| assertTrue(hitException); | ||
| assertTrue(DirectoryReader.indexExists(dir)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Separate failure flags for the two phases.
hitException is reused across the low-rate and 100% rate phases, so a phase‑1 failure can mask a missing phase‑2 failure. A dedicated flag for the forced-failure phase makes the assertion precise.
🧪 Suggested adjustment
- rate = 1.0;
+ boolean phase2Failed = false;
+ rate = 1.0;
dir.setRandomIOExceptionRateOnOpen(rate);
try {
for (int i = numDocsFirstSegment; i < numDocsFirstSegment + numDocsFirstSegment; i++) {
String id = Integer.toString(i);
ParsedDocument doc = testParsedDocument(id, null, testDocument(), B_1, null);
engine.index(indexForDoc(doc));
}
engine.refresh("testing");
} catch (EngineException e) {
- hitException = true;
+ phase2Failed = true;
}
- assertTrue(hitException);
+ assertTrue(phase2Failed);🤖 Prompt for AI Agents
In
`@server/src/test/java/org/opensearch/index/engine/chaos/InternalEngineOnOutOfFileDescriptorsTests.java`
around lines 44 - 101, The test reuses hitException for both the initial
low-error phase and the forced-failure phase, which can hide a missing exception
in phase‑2; introduce a separate boolean (e.g., hitForcedFailure) and use it
only for the second try/catch that indexes with
dir.setRandomIOExceptionRateOnOpen(rate) and the subsequent
engine.refresh("testing"), set hitForcedFailure = true in that catch
(EngineException e) and assertTrue(hitForcedFailure) after that block while
keeping the original hitException checks/assertFalse for the first phase (the
variables to change are hitException and the new hitForcedFailure used around
the second indexing/refresh try/catch).
|
|
||
| @Override | ||
| public IndexResult index(Index index) throws IOException { | ||
| if (config().getIndexSettings().isContextAwareEnabled()) { | ||
| for (ParseContext.Document doc : index.docs()) { | ||
| doc.setGroupingCriteria("grouping_criteria"); | ||
| } | ||
| } | ||
|
|
||
| return super.index(index); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoid overwriting an explicit grouping criteria.
The unconditional setGroupingCriteria("grouping_criteria") will clobber tenant-specific values set by testContextSpecificDocumentWithTenantField(...). Consider only applying a default when none is provided.
🔧 Suggested guard to preserve caller-provided criteria
- if (config().getIndexSettings().isContextAwareEnabled()) {
- for (ParseContext.Document doc : index.docs()) {
- doc.setGroupingCriteria("grouping_criteria");
- }
- }
+ if (config().getIndexSettings().isContextAwareEnabled()) {
+ for (ParseContext.Document doc : index.docs()) {
+ if (doc.getGroupingCriteria() == null) {
+ doc.setGroupingCriteria("grouping_criteria");
+ }
+ }
+ }🤖 Prompt for AI Agents
In `@test/framework/src/main/java/org/opensearch/index/engine/EngineTestCase.java`
around lines 826 - 836, The override of index(Index index) unconditionally calls
ParseContext.Document.setGroupingCriteria("grouping_criteria") which will
overwrite caller-provided tenant/grouping values (e.g., those set by
testContextSpecificDocumentWithTenantField); change the logic in the index(Index
index) override (inside the config().getIndexSettings().isContextAwareEnabled()
branch) to only call setGroupingCriteria("grouping_criteria") when the document
has no grouping criteria already (e.g., check the document's existing grouping
value via the appropriate getter or null/empty check before calling
setGroupingCriteria), so provided grouping criteria are preserved.
|
❌ Gradle check result for 231cb15: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Signed-off-by: RS146BIJAY <rishavsagar4b1@gmail.com>
231cb15 to
dbedebd
Compare
|
❌ Gradle check result for dbedebd: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
|
❌ Gradle check result for dbedebd: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Description
Adding chaos testing for Context Aware Segments during indexing, refresh and update/delete operations for following scenarios: