Fix and add a test for failure store with Incremental bulk #115866

ankikuma · 2024-10-29T17:58:12Z

When a document is rejected because of indexing pressure, it should not be redirected to the failure store.

The failure store is not meant to be a dead letter queue; it is a best-effort storage location for documents that cannot be ingested because of a fault in their shape or content, so that a user can fix them.

In the case of indexing pressure there is nothing wrong with the document itself. In this PR we fix the redirection to the failure store and we add an integration test to test the interaction of the failure store and incremental bulk's short circuit failure feature.

Closes ES-9577.

elasticsearchmachine · 2024-10-30T18:29:51Z

Pinging @elastic/es-distributed (Team:Distributed)

Tim-Brooks

LGTM if someone from the data management team confirms this is the expected behavior and what they were looking to test.

Tim-Brooks · 2024-10-31T00:21:03Z

...ClusterTest/java/org/elasticsearch/datastreams/FailureStoreMetricsWithIncrementalBulkIT.java

+        int docs_in_fs = 0;
+        for (int i = (int) hits.get(); i < bulkResponse.getItems().length; ++i) {
+            BulkItemResponse item = bulkResponse.getItems()[i];
+            if (item.isFailed()) {


Just so I understand this correctly:

If an indexing operation fails and a failure store is configured, we will then try to store that failure in the failure store. If that operation succeeds, the indexing operation is reported to the user as "successful"?

As long as that is the expected behavior this mitigation looks good to me.

Yes your understanding is correct @Tim-Brooks. If we successfully index into the failure store, the indexing operation is considered successful, even though the operation failed to index into the original index.

Hi @ankikuma & @Tim-Brooks ,

As long as that is the expected behavior this mitigation looks good to me.

I do not think this is expected behaviour.

I am afraid I misled @ankikuma during our last chat.

If we successfully index into the failure store, the indexing operation is considered successful, even though the operation failed to index into the original index.

This is correct, and it explains the behaviour that the code exhibits. But if I remember correctly the conversations we looked into, the purpose of the failure store is to store failures that result from user misconfiguration, not from technical limitations or failures.

Considering this, I would expect that this type of failure should not be redirected to the failure store and it should result in a failed response.

As a way forward, I see two options depending on the scope of this work:

If the purpose of this test is to check that incremental indexing and failure store work as expected, I would say that we need to fix the bug that this test has unearthed.

If the purpose is to only add a test to cover the failure store and incremental bulk indexing working together, we should write the test to work as expected and then open a bug and mute this test.

I would prefer the first option, but I do not fully know the scope of this work, so I would like to offer an alternative as well.

@gmarouli The scope of this test was just to test the interaction between the failure store and incremental bulk failures caused by the short circuit feature, based on the comment from @nielsbauman here. In this original comment, it looks like we expected the short-circuited requests to go to the failure store.

However, this quote from James Baiera indicates that failures caused by resource constraints should not go to the failure store:

I think if we reject requests due to resource constraints that’s ok, since the failure store is not meant to be a dead letter queue - it’s a best effort storage location for documents that cannot be ingested because there is some kind of fault in their shape or content.
For instance, if the failure store index on a data stream is not allocated, we simply reject the document, nothing to be done. If there’s no memory to execute a write, or if there is no thread capacity, there’s nothing we can do

Now a short circuit failure is triggered due to a previous failure. It just so happens that in this test we simulate that failure based on indexing pressure. I am not sure how one would distinguish between failures caused by resource constraints vs. other types of failures (for the benefit of the failure store).

If I am not mistaken the exception thrown is EsRejectedExecutionException right? In this case we can extend the conditions at the point where we determine if a request should be redirected to the failure store or not. See:

elasticsearch/server/src/main/java/org/elasticsearch/action/bulk/BulkOperation.java

Lines 544 to 546 in e0a4584

    if (isFailureStoreRequest == false
        && failureStoreCandidate.isFailureStoreEnabled()
        && error instanceof VersionConflictEngineException == false) {

If you agree I think it's worth the effort because right now the assertions are much more complex than they need to be. I could also give it a go if you want and we can ask @jbaiera to review. Would you feel more comfortable with that approach?
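To make the suggestion above concrete, here is a minimal sketch of what the extended condition could look like. It is an illustration only, not the code that was merged: the two exception classes are local stand-ins defined so the example compiles on its own, and `shouldRedirect` and its boolean parameters are hypothetical names that flatten the `isFailureStoreRequest` and `failureStoreCandidate.isFailureStoreEnabled()` checks from the quoted snippet.

```java
// Sketch only: the exception classes below are local stand-ins so the example
// compiles on its own; they are not the real Elasticsearch classes.
final class FailureStoreRedirectSketch {

    static class VersionConflictEngineException extends RuntimeException {
        VersionConflictEngineException(String message) {
            super(message);
        }
    }

    static class EsRejectedExecutionException extends RuntimeException {
        EsRejectedExecutionException(String message) {
            super(message);
        }
    }

    /**
     * Decides whether a failed document should be redirected to the failure store.
     * The first three checks mirror the quoted condition; the last check is the
     * proposed extension for rejections caused by resource pressure.
     */
    static boolean shouldRedirect(boolean isFailureStoreRequest, boolean failureStoreEnabled, Exception error) {
        return isFailureStoreRequest == false
            && failureStoreEnabled
            && error instanceof VersionConflictEngineException == false
            && error instanceof EsRejectedExecutionException == false;
    }

    public static void main(String[] args) {
        // A fault in the document itself is still a candidate for the failure store.
        System.out.println(shouldRedirect(false, true, new IllegalArgumentException("bad field")));
        // A rejection caused by indexing pressure is reported as a failure instead.
        System.out.println(shouldRedirect(false, true, new EsRejectedExecutionException("pressure")));
    }
}
```

With a check like this, a document rejected under indexing pressure would surface as a failed bulk item, so the test assertions would no longer depend on which node hosts the failure store's backing index.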

Tim-Brooks · 2024-10-31T00:21:25Z

...ClusterTest/java/org/elasticsearch/datastreams/FailureStoreMetricsWithIncrementalBulkIT.java

+        }
+
+        int docs_redirected_to_fs = 0;
+        int docs_in_fs = 0;


Java variables should be camel case. docsInFS

Oops. Done.

Tim-Brooks · 2024-10-31T00:21:42Z

...ClusterTest/java/org/elasticsearch/datastreams/FailureStoreMetricsWithIncrementalBulkIT.java

+            assertTrue(bulkResponse.getItems()[i].getFailureStoreStatus().getLabel().equalsIgnoreCase("NOT_APPLICABLE_OR_UNKNOWN"));
+        }
+
+        int docs_redirected_to_fs = 0;


Java variables should be camel case. docsRedirectedToFs

elasticsearchmachine · 2024-11-06T14:56:59Z

Pinging @elastic/es-data-management (Team:Data Management)

ankikuma · 2024-11-06T15:00:14Z

@gmarouli @nielsbauman Could you please take a look at this test? It was added as a follow-up to the discussion here. The test behavior is as follows:

We first try to index the document into the datastream.
When the document is rejected (due to indexing pressure in our testcase), the document is redirected to the failure store.
If the failure store's backing index is on the same node as the datastream's backing index, the same indexing pressure will apply to the failure store and the request will in fact fail because we are unable to index the doc into the failure store. In this case the getFailureStoreStatus() for this request will be FAILED.
But if the failure store's backing index is on a different node, the request will succeed because the doc is successfully stored in the failure store. In this case the getFailureStoreStatus() for this request will be USED.

Based on the Slack thread linked above, we don't want to use the failure store to store docs that were rejected due to a resource-level failure. But my test case shows that we do. So perhaps we need a follow-up change to the failure store code.
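The outcomes described in this comment can be summarized as a small decision function. This is only a sketch of the behavior the test observed at this point in the discussion: the `Status` enum is a local stand-in that reuses the labels mentioned in this thread, `observedStatus` and its parameters are hypothetical names, and mapping a normally indexed document to `NOT_APPLICABLE_OR_UNKNOWN` is an assumption based on the assertion in the test diff.

```java
// Sketch only: a local enum standing in for the failure store status labels
// mentioned in this thread; it is not the real Elasticsearch type.
final class FailureStoreStatusSketch {

    enum Status {
        NOT_APPLICABLE_OR_UNKNOWN,
        USED,
        FAILED
    }

    /**
     * Models the behavior described above, before any change to the redirect logic.
     *
     * @param rejectedByPressure            the write to the data stream's backing index was rejected
     * @param failureStoreAlsoUnderPressure the failure store's backing index is on the same pressured node
     */
    static Status observedStatus(boolean rejectedByPressure, boolean failureStoreAlsoUnderPressure) {
        if (rejectedByPressure == false) {
            // Assumption: a document indexed normally never touches the failure store.
            return Status.NOT_APPLICABLE_OR_UNKNOWN;
        }
        // The redirected write either hits the same pressure and fails, or lands in the failure store.
        return failureStoreAlsoUnderPressure ? Status.FAILED : Status.USED;
    }

    public static void main(String[] args) {
        System.out.println(observedStatus(false, false));
        System.out.println(observedStatus(true, true));
        System.out.println(observedStatus(true, false));
    }
}
```

The node-dependent split between `FAILED` and `USED` is what made the original assertions hard to write deterministically.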

nielsbauman · 2024-11-06T15:06:09Z

Hi @ankikuma, I'm currently not on the failure store project anymore so I unfortunately won't be able to have a look at this. I'll leave you in the good hands of @gmarouli :)

gmarouli

Hi @ankikuma, I will hold off on the review until I better understand the scope of this PR. But it looks good in general :).

Thank you for including the failure store in your test suite!

gmarouli · 2024-11-11T08:15:45Z

...ClusterTest/java/org/elasticsearch/datastreams/FailureStoreMetricsWithIncrementalBulkIT.java

+    private String dataStream = "data-stream-incremental";
+    private String template = "template-incremental";


Could we convert these to constants? They look like they are.
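A sketch of what the constant form might look like. `DATA_STREAM_NAME` matches the identifier used in later hunks of this PR; `TEMPLATE_NAME` and the holder class `IncrementalBulkTestNames` are hypothetical names chosen for illustration.

```java
// Sketch only: in the real test these would live in the test class itself.
final class IncrementalBulkTestNames {

    // Fixed identifiers shared by every test method, so they are declared once as constants.
    static final String DATA_STREAM_NAME = "data-stream-incremental";
    static final String TEMPLATE_NAME = "template-incremental"; // hypothetical constant name

    private IncrementalBulkTestNames() {
        // Holder class, not meant to be instantiated.
    }

    public static void main(String[] args) {
        System.out.println(DATA_STREAM_NAME + " / " + TEMPLATE_NAME);
    }
}
```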

gmarouli · 2024-11-11T08:30:30Z

...ClusterTest/java/org/elasticsearch/datastreams/FailureStoreMetricsWithIncrementalBulkIT.java

+        int docs_in_fs = 0;
+        for (int i = (int) hits.get(); i < bulkResponse.getItems().length; ++i) {
+            BulkItemResponse item = bulkResponse.getItems()[i];
+            if (item.isFailed()) {


If I am not mistaken the exception thrown is EsRejectedExecutionException right? In this case we can extend the conditions at the point where we determine if a request should be redirected to the failure store or not. See:

elasticsearch/server/src/main/java/org/elasticsearch/action/bulk/BulkOperation.java

Lines 544 to 546 in e0a4584

    if (isFailureStoreRequest == false
        && failureStoreCandidate.isFailureStoreEnabled()
        && error instanceof VersionConflictEngineException == false) {

If you agree I think it's worth the effort because right now the assertions are much more complex than they need to be. I could also give it a go if you want and we can ask @jbaiera to review. Would you feel more comfortable with that approach?

ankikuma · 2024-11-11T16:48:18Z

Thank you Mary for offering to try out a fix. I can incorporate your changes into my PR.

elasticsearchmachine · 2024-11-11T21:02:50Z

Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing)

gmarouli

Hi @ankikuma, as I started working on changing the test to match the fix, I noticed a few things that were confusing me a bit. I added an explanation for every change I made that was not related to the actual fix.

This is still your PR so if you find that something does not suit you feel free to revert or improve.

gmarouli · 2024-11-11T21:03:36Z

...ClusterTest/java/org/elasticsearch/datastreams/FailureStoreMetricsWithIncrementalBulkIT.java

+    }
+
+    public void testShortCircuitFailure() throws Exception {
+        createDataStreamWithFailureStore();


@ankikuma , I thought this test is pretty targeted to failure store so we could merge the template and data stream creation in one method.

gmarouli · 2024-11-11T21:05:44Z

...ClusterTest/java/org/elasticsearch/datastreams/FailureStoreMetricsWithIncrementalBulkIT.java

+        try (IncrementalBulkService.Handler handler = incrementalBulkService.newBulkRequest()) {
+
+            AtomicBoolean nextRequested = new AtomicBoolean(true);
+            int successfullyStored = 0;


@ankikuma I renamed this to successfullyStored because hits confused me since I have associated it with search results. I also did not see the need to use an atomic.

gmarouli · 2024-11-11T21:06:45Z

...ClusterTest/java/org/elasticsearch/datastreams/FailureStoreMetricsWithIncrementalBulkIT.java

+            assertDataStreamMetric(metrics, FailureStoreMetrics.METRIC_REJECTED, DATA_STREAM_NAME, 0);
+
+            // Introduce artificial pressure that will reject the following requests
+            String node = findNodeOfPrimaryShard(DATA_STREAM_NAME);


@ankikuma I abstracted finding the node of the primary shard in one method because I thought the retrieval did not add much to the flow of the test.

gmarouli · 2024-11-11T21:10:29Z

...ClusterTest/java/org/elasticsearch/datastreams/FailureStoreMetricsWithIncrementalBulkIT.java

+    private void assertDataStreamMetric(Map<String, List<Measurement>> metrics, String metric, String dataStreamName, int expectedValue) {
+        List<Measurement> measurements = metrics.get(metric);
+        assertThat(measurements, notNullValue());
+        long totalValue = measurements.stream()
+            .filter(m -> m.attributes().get("data_stream").equals(dataStreamName))
+            .mapToLong(Measurement::getLong)
+            .sum();
+        assertThat(totalValue, equalTo((long) expectedValue));
+    }


@ankikuma I cleaned up the measurements assertion here a bit, not all of them were used and I found this structure a bit easier to follow.

The way I understand it, we provide the observed telemetry, the metric and the data streams we are interested in and the expected value, then it will sum the values up and verify.

The previous one was also correct, but I find this a bit more intuitive; before, I got a bit lost as to why we needed the size of the list ;).

I am also using it at every point where we check the measurements, even when the measurements would be empty, because I find the symmetry easier to read.
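As a standalone illustration of the filter-and-sum idea described above, the sketch below uses a local stand-in for `Measurement` so it can run without the Elasticsearch test framework. The metric key `"rejected"`, the class `MetricSumSketch`, and the helper name `sumForDataStream` are hypothetical; the `data_stream` attribute key is taken from the hunk above.

```java
import java.util.List;
import java.util.Map;

// Sketch only: Measurement is a local stand-in, not the real telemetry type.
final class MetricSumSketch {

    static final class Measurement {
        private final long value;
        private final Map<String, Object> attributes;

        Measurement(long value, Map<String, Object> attributes) {
            this.value = value;
            this.attributes = attributes;
        }

        long getLong() {
            return value;
        }

        Map<String, Object> attributes() {
            return attributes;
        }
    }

    /** Sums every measurement of the given metric that belongs to the given data stream. */
    static long sumForDataStream(Map<String, List<Measurement>> metrics, String metric, String dataStreamName) {
        List<Measurement> measurements = metrics.get(metric);
        if (measurements == null) {
            throw new AssertionError("no measurements recorded for metric " + metric);
        }
        return measurements.stream()
            // Comparing from the known-non-null side tolerates measurements without the attribute.
            .filter(m -> dataStreamName.equals(m.attributes().get("data_stream")))
            .mapToLong(Measurement::getLong)
            .sum();
    }

    public static void main(String[] args) {
        Map<String, List<Measurement>> metrics = Map.of(
            "rejected",
            List.of(
                new Measurement(2, Map.of("data_stream", "data-stream-incremental")),
                new Measurement(3, Map.of("data_stream", "data-stream-incremental")),
                new Measurement(7, Map.of("data_stream", "other-stream"))
            )
        );
        // Only the two measurements for the requested data stream are counted: prints 5.
        System.out.println(sumForDataStream(metrics, "rejected", "data-stream-incremental"));
    }
}
```

Summing the values, rather than checking the size of the measurement list, keeps the assertion independent of how many separate measurements the telemetry layer happened to record.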

gmarouli · 2024-11-12T06:32:16Z

@elasticmachine update branch

gmarouli · 2024-11-12T10:42:48Z

@elasticmachine update branch

ankikuma · 2024-11-14T04:57:09Z

@jbaiera could you please review the test and fix in this PR to see if they are aligned with the expected behavior of the failure store? Thank you Mary for making the changes!

jbaiera

The code changes and line of thinking LGTM.

✅ assuming green CI

…15866)

When a document is rejected because of indexing pressure, it should not be redirected to the failure store.

The failure store is not meant to be a dead letter queue - it’s a best effort storage location for documents that cannot be ingested because there is some kind of fault in their shape or content, this way a user can fix them.

In the case of indexing pressure there is nothing wrong with the document itself. In this PR we fix the redirection to the failure store and we add an integration test to test the interaction of the failure store and incremental bulk's short circuit failure feature.

Closes ES-9577.

Co-authored-by: gmarouli <[email protected]>

ankikuma added 4 commits October 22, 2024 20:21

just experimenting

9bd52a0

Add test

9b2117d

Merge remote-tracking branch 'upstream/main' into test/ES9577

dd01919

Refresh to latest

Merge remote-tracking branch 'upstream/main' into test/ES9577

ce2d154

Refresh

elasticsearchmachine added the v9.0.0 label Oct 29, 2024

ankikuma added 6 commits October 29, 2024 14:31

Cleanup code

3af8970

Fix test

05e4d7d

Merge remote-tracking branch 'upstream/main' into test/ES9577

4d0ddc6

Refresh

Remove extra lines

4fda9c2

Extra lines

e06a1f7

Extra lines

33d5b34

ankikuma marked this pull request as ready for review October 30, 2024 18:28

elasticsearchmachine added the needs:triage label Oct 30, 2024

ankikuma added the :Distributed Indexing/CRUD and >test labels and removed the needs:triage label Oct 30, 2024

elasticsearchmachine added the Team:Distributed (Obsolete) label Oct 30, 2024

ankikuma assigned Tim-Brooks Oct 30, 2024

Tim-Brooks removed their assignment Oct 30, 2024

Tim-Brooks self-requested a review October 30, 2024 19:10

Tim-Brooks reviewed Oct 31, 2024

ankikuma requested review from gmarouli and nielsbauman November 6, 2024 14:53

nielsbauman added the :Data Management/Data streams and Team:Data Management labels Nov 6, 2024

elasticsearchmachine removed the Team:Distributed (Obsolete) label Nov 6, 2024

nielsbauman removed their request for review November 6, 2024 15:06

ankikuma added 3 commits November 6, 2024 16:34

Camelcase names

ebbdced

Merge remote-tracking branch 'upstream/main' into test/ES9577

ce489d1

Refresh

Merge remote-tracking branch 'upstream/main' into test/ES9577

38639e4

Refresh

gmarouli reviewed Nov 8, 2024

gmarouli reviewed Nov 11, 2024

gmarouli added 2 commits November 11, 2024 23:00

Do not redirect EsRejectedExecutionException failures to failure store

915a1da

Update the test to reflect the change

46255d4

elasticsearchmachine added the Team:Distributed Indexing label Nov 11, 2024

gmarouli reviewed Nov 11, 2024

gmarouli changed the title ~~Add a test for failure store with Incremental bulk~~ Fix and add a test for failure store with Incremental bulk Nov 11, 2024

gmarouli requested a review from jbaiera November 11, 2024 21:16

Remove unused import

aa60e1c

Merge branch 'main' into test/ES9577

6590580

elasticmachine and others added 2 commits November 12, 2024 11:42

Merge branch 'main' into test/ES9577

17bfbdd

Merge remote-tracking branch 'upstream/main' into test/ES9577

594ed59

Refresh

jbaiera approved these changes Nov 14, 2024

Merge remote-tracking branch 'upstream/main' into test/ES9577

2eb6f72

Refresh

ankikuma merged commit 713788d into elastic:main Nov 15, 2024
16 checks passed
