Skip to content

Conversation

ankikuma
Copy link
Contributor

@ankikuma ankikuma commented Oct 29, 2024

When a document is rejected because of indexing pressure, it should not be redirected to the failure store.

The failure store is not meant to be a dead letter queue - it’s a best effort storage location for documents that cannot be ingested because there is some kind of fault in their shape or content, this way a user can fix them.

In the case of indexing pressure there is nothing wrong with the document itself. In this PR we fix the redirection to the failure store and we add an integration test to test the interaction of the failure store and incremental bulk's short circuit failure feature.

Closes ES-9577.

@ankikuma ankikuma marked this pull request as ready for review October 30, 2024 18:28
@elasticsearchmachine elasticsearchmachine added the needs:triage Requires assignment of a team area label label Oct 30, 2024
@ankikuma ankikuma added :Distributed Indexing/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search. >test Issues or PRs that are addressing/adding tests and removed needs:triage Requires assignment of a team area label labels Oct 30, 2024
@elasticsearchmachine elasticsearchmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Oct 30, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@Tim-Brooks Tim-Brooks removed their assignment Oct 30, 2024
@Tim-Brooks Tim-Brooks self-requested a review October 30, 2024 19:10
Copy link
Contributor

@Tim-Brooks Tim-Brooks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM if someone from the data management team confirms this is the expected behavior and what they were looking to test.

int docs_in_fs = 0;
for (int i = (int) hits.get(); i < bulkResponse.getItems().length; ++i) {
BulkItemResponse item = bulkResponse.getItems()[i];
if (item.isFailed()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just so I understand this correctly:

If an indexing operation fails and there is a failure store configured will then try to store that failure in the failure store. If that operation succeeds, then the indexing operations is indicated to the user as "successful"?

As long as that is the expected behavior this mitigation looks good to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes your understanding is correct @Tim-Brooks. If we successfully index into the failure store, the indexing operation is considered successful, even though the operation failed to index into the original index.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ankikuma & @Tim-Brooks ,

As long as that is the expected behavior this mitigation looks good to

I do not think this is expected behaviour.

I am afraid I mislead @ankikuma during our last chat.

If we successfully index into the failure store, the indexing operation is considered successful, even though the operation failed to index into the original index.

This is correct, and it explains the behaviour that the code exhibits. But if I remember correctly the conversations we looked into, the purpose of the failure store is to store failures as a result from a user misconfiguration and not technical limitations/failures.

Considering this, I would expect that this type of failure should not be redirected to the failure store and it should result in a failed response.

As a way forward, I see two options depending on the scope of this work:

  • If the purpose of this test is to check that incremental indexing and failure store work as expected, I would say that we need to fix the bug that this test has unearthed.
  • If the purpose is to only add a test to cover the failure store and incremental bulk indexing working together, we should write the test to work as expected and then open a bug and mute this test.

Preferably, I would prefer the first but I do not fully know the scope of this work, so I would like to offer an alternative as well.

Copy link
Contributor Author

@ankikuma ankikuma Nov 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gmarouli The scope of this test was to just test the interaction of incremental bulk failures caused by the short circuit feature with failure store, based on the comment from @nielsbauman here. And in this original comment , it looks like we expected the short circuited requests to got to the failure store.

However, as this quote of James Baiera indicated that if failures occurred due to resource constraints, they should not go to the failure store:

I think if we reject requests due to resource constraints that’s ok, since the failure store is not meant to be a dead letter queue - it’s a best effort storage location for documents that cannot be ingested because there is some kind of fault in their shape or content.
For instance, if the failure store index on a data stream is not allocated, we simply reject the document, nothing to be done. If there’s no memory to execute a write, or if there is no thread capacity, there’s nothing we can do

Now a short circuit failure is triggered due to a previous failure. It just so happens that in this test we simulate that failure based on indexing pressure. I am not sure how one would distinguish between failures caused by resource constraints vs. other types of failures (for the benefit of the failure store).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I am not mistaken the exception thrown is EsRejectedExecutionException right? In this case we can extend the conditions at the point where we determine if a request should be redirected to the failure store or not. See:

if (isFailureStoreRequest == false
&& failureStoreCandidate.isFailureStoreEnabled()
&& error instanceof VersionConflictEngineException == false) {

If you agree I think it's worth the effort because right now the assertions are much more complex than they need to be. I could also give it a go if you want and we can ask @jbaiera to review. Would you feel more comfortable with that approach?

}

int docs_redirected_to_fs = 0;
int docs_in_fs = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Java variables should be camel case. docsInFS

Copy link
Contributor Author

@ankikuma ankikuma Nov 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops. Done.

assertTrue(bulkResponse.getItems()[i].getFailureStoreStatus().getLabel().equalsIgnoreCase("NOT_APPLICABLE_OR_UNKNOWN"));
}

int docs_redirected_to_fs = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Java variables should be camel case. docsRedirectedToFs

@nielsbauman nielsbauman added :Data Management/Data streams Data streams and their lifecycles Team:Data Management Meta label for data/management team labels Nov 6, 2024
@elasticsearchmachine elasticsearchmachine removed the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Nov 6, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@ankikuma
Copy link
Contributor Author

ankikuma commented Nov 6, 2024

@gmarouli @nielsbauman Could you please take a look at this test. This test is added as a follow up to the discussion here. The test behavior is as follows:

  1. We first try to index the document into the datastream.
  2. When the document is rejected (due to indexing pressure in our testcase), the document is redirected to the failure store.
  3. If the failure store's backing index is on the same node as the datastream's backing index, the same indexing pressure will apply to the failure store and the request will in fact fail because we are unable to index the doc into the failure store. In this case the getFailureStoreStatus() for this request will be FAILED.
  4. But if the failure store's backing index is on a different node, the request will succeed because the doc is successfully stored in the failure store. In this case the getFailureStoreStatus() for this request will be USED.

Based on the slack thread linked above, we don't want to use the failure store to store docs that were rejected due to a resource level failure. But my testcase shows that we do. So perhaps we need a follow up change to the failure store code.

@nielsbauman
Copy link
Contributor

Hi @ankikuma, I'm currently not on the failure store project anymore so I unfortunately won't be able to have a look at this. I'll leave you in the good hands of @gmarouli :)

@nielsbauman nielsbauman removed their request for review November 6, 2024 15:06
Copy link
Contributor

@gmarouli gmarouli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ankikuma I hold off the review until I understand better the scope of this PR. But it looks good in general :).

Thank you for including the failure store in your test suite!

Comment on lines 69 to 70
private String dataStream = "data-stream-incremental";
private String template = "template-incremental";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we convert them these to constants? They look like they are.

int docs_in_fs = 0;
for (int i = (int) hits.get(); i < bulkResponse.getItems().length; ++i) {
BulkItemResponse item = bulkResponse.getItems()[i];
if (item.isFailed()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I am not mistaken the exception thrown is EsRejectedExecutionException right? In this case we can extend the conditions at the point where we determine if a request should be redirected to the failure store or not. See:

if (isFailureStoreRequest == false
&& failureStoreCandidate.isFailureStoreEnabled()
&& error instanceof VersionConflictEngineException == false) {

If you agree I think it's worth the effort because right now the assertions are much more complex than they need to be. I could also give it a go if you want and we can ask @jbaiera to review. Would you feel more comfortable with that approach?

@ankikuma
Copy link
Contributor Author

Thank you Mary for offering to try out a fix. I can incorporate your changes into my PR.

@elasticsearchmachine elasticsearchmachine added the Team:Distributed Indexing Meta label for Distributed Indexing team label Nov 11, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing)

Copy link
Contributor

@gmarouli gmarouli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ankikuma , as I started working on changing the test to match the fix, I noticed a few things that were confusing me a bit. I added an explanation for every change I performed that was not related with the actual fix.

This is still your PR so if you find that something does not suit you feel free to revert or improve.

}

public void testShortCircuitFailure() throws Exception {
createDataStreamWithFailureStore();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ankikuma , I thought this test is pretty targeted to failure store so we could merge the template and data stream creation in one method.

try (IncrementalBulkService.Handler handler = incrementalBulkService.newBulkRequest()) {

AtomicBoolean nextRequested = new AtomicBoolean(true);
int successfullyStored = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ankikuma I renamed this to successfullyStored because hits confused me since I have associated it with search results. I also did not see the need to use an atomic.

assertDataStreamMetric(metrics, FailureStoreMetrics.METRIC_REJECTED, DATA_STREAM_NAME, 0);

// Introduce artificial pressure that will reject the following requests
String node = findNodeOfPrimaryShard(DATA_STREAM_NAME);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ankikuma I abstracted finding the node of the primary shard in one method because I thought the retrieval did not add much to the flow of the test.

Comment on lines +208 to +216
private void assertDataStreamMetric(Map<String, List<Measurement>> metrics, String metric, String dataStreamName, int expectedValue) {
List<Measurement> measurements = metrics.get(metric);
assertThat(measurements, notNullValue());
long totalValue = measurements.stream()
.filter(m -> m.attributes().get("data_stream").equals(dataStreamName))
.mapToLong(Measurement::getLong)
.sum();
assertThat(totalValue, equalTo((long) expectedValue));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ankikuma I cleaned up the measurements assertion here a bit, not all of them were used and I found this structure a bit easier to follow.

The way I understand it, we provide the observed telemetry, the metric and the data streams we are interested in and the expected value, then it will sum the values up and verify.

The previous one was also correct, but I find this a bit more intuitive while before I got a bit lost why do we need the size of the list ;).

I am also using at every point when we check the measurements, even when the measurements would be empty because I also find the symmetry easier to read.

@gmarouli gmarouli changed the title Add a test for failure store with Incremental bulk Fix and add a test for failure store with Incremental bulk Nov 11, 2024
@gmarouli gmarouli requested a review from jbaiera November 11, 2024 21:16
@gmarouli
Copy link
Contributor

@elasticmachine update branch

@gmarouli
Copy link
Contributor

@elasticmachine update branch

@ankikuma
Copy link
Contributor Author

ankikuma commented Nov 14, 2024

@jbaiera could you please review the test and fix in this PR to see if it is aligned with the expected behavior of the failure store. Thank you Mary for making the changes!

Copy link
Member

@jbaiera jbaiera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code changes and line of thinking LGTM.

✅ assuming green CI

@ankikuma ankikuma merged commit 713788d into elastic:main Nov 15, 2024
16 checks passed
salvatore-campagna pushed a commit to salvatore-campagna/elasticsearch that referenced this pull request Nov 18, 2024
…15866)

When a document is rejected because of indexing pressure, it should not be redirected to the failure store.

The failure store is not meant to be a dead letter queue - it’s a best effort storage location for documents that cannot be ingested because there is some kind of fault in their shape or content, this way a user can fix them.

In the case of indexing pressure there is nothing wrong with the document itself. In this PR we fix the redirection to the failure store and we add an integration test to test the interaction of the failure store and incremental bulk's short circuit failure feature.

Closes ES-9577.

Co-authored-by: gmarouli <[email protected]>
alexey-ivanov-es pushed a commit to alexey-ivanov-es/elasticsearch that referenced this pull request Nov 28, 2024
…15866)

When a document is rejected because of indexing pressure, it should not be redirected to the failure store.

The failure store is not meant to be a dead letter queue - it’s a best effort storage location for documents that cannot be ingested because there is some kind of fault in their shape or content, this way a user can fix them.

In the case of indexing pressure there is nothing wrong with the document itself. In this PR we fix the redirection to the failure store and we add an integration test to test the interaction of the failure store and incremental bulk's short circuit failure feature.

Closes ES-9577.

Co-authored-by: gmarouli <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Data Management/Data streams Data streams and their lifecycles :Distributed Indexing/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search. Team:Data Management Meta label for data/management team Team:Distributed Indexing Meta label for Distributed Indexing team >test Issues or PRs that are addressing/adding tests v9.0.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants