Commit 959bfd8

jbaierakilfoyle and David Kilfoyle authored
Apply suggestions from code review
Co-authored-by: David Kilfoyle <[email protected]>
1 parent 8869cda commit 959bfd8

2 files changed (+29, -29 lines)

manage-data/data-store/data-streams/failure-store-recipes.md

Lines changed: 28 additions & 28 deletions
@@ -30,7 +30,7 @@ POST my-datastream-ingest/_doc
},
"_seq_no": 2,
"_primary_term": 1,
- "failure_store": "used" // The document was sent to the failure store
+ "failure_store": "used" // The document was sent to the failure store.
}
```

@@ -66,7 +66,7 @@ GET my-datastream-ingest::failures/_search
"index": "my-datastream-ingest",
"source": { // When an ingest pipeline fails, the document stored is what was originally sent to the cluster.
"important": {
- "info": "The rain in Spain falls mainly on the plain" // The important info that we failed to find was originally present on the document.
+ "info": "The rain in Spain falls mainly on the plain" // The important information that we failed to find was originally present in the document.
},
"@timestamp": "2025-04-21T00:00:00Z"
}
@@ -93,7 +93,7 @@ GET my-datastream-ingest::failures/_search
}
```

- Despite not knowing the pipelines beforehand, we have some places to start looking. The `ingest-step-2` pipeline cannot find the `important.info` field despite it being present on the document that was sent to the cluster. If we pull that pipeline definition we find the following:
+ Despite not knowing the pipelines beforehand, we have some places to start looking. The `ingest-step-2` pipeline cannot find the `important.info` field despite it being present in the document that was sent to the cluster. If we pull that pipeline definition we find the following:

```console
GET _ingest/pipeline/ingest-step-2
@@ -126,7 +126,7 @@ GET _ingest/pipeline/ingest-step-1
"processors": [
{
"remove": {
- "field": "important.info" // A remove processor that is incorrectly getting rid of our important field.
+ "field": "important.info" // A remove processor that is incorrectly removing our important field.
}
},
{
@@ -143,7 +143,7 @@ We find a remove processor in the first pipeline that is the root cause of the p

## Troubleshooting complicated ingest pipelines [failure-store-recipes-complicated-ingest-troubleshoot]

- Ingest processors can be labeled with [tags](./failure-store.md). These tags are user provided information that names or describes the processor's purpose in the pipeline. When documents are redirected to the failure store due to a processor issue, they capture the tag from the processor in which the failure occurred if it exists. Because of this, it is a good practice to tag the processors in your pipeline so that the location of a failure can be identified quickly.
+ Ingest processors can be labeled with [tags](./failure-store.md). These tags are user-provided information that names or describes the processor's purpose in the pipeline. When documents are redirected to the failure store due to a processor issue, they capture the tag from the processor in which the failure occurred, if it exists. Because of this behavior, it is a good practice to tag the processors in your pipeline so that the location of a failure can be identified quickly.

Here we have a needlessly complicated pipeline. It is made up of several set and remove processors. Beneficially, they are all tagged with descriptive names.
```console
@@ -194,7 +194,7 @@ PUT _ingest/pipeline/complicated-processor
}
```

- We ingest some data and find that it was sent to the failure store
+ We ingest some data and find that it was sent to the failure store.
```console
POST my-datastream-ingest/_doc?pipeline=complicated-processor
{
@@ -220,7 +220,7 @@ POST my-datastream-ingest/_doc?pipeline=complicated-processor
}
```

- Upon checking the failure, we can quickly identify the tagged processor that caused the problem
+ On checking the failure, we can quickly identify the tagged processor that caused the problem.
```console
GET my-datastream-ingest::failures/_search
```
@@ -268,7 +268,7 @@ GET my-datastream-ingest::failures/_search
],
"pipeline": "complicated-processor",
"processor_type": "set", // Helpful, but which set processor on the pipeline could it be?
- "processor_tag": "copy to new counter again" // The tag of the exact processor that it failed on.
+ "processor_tag": "copy to new counter again" // The tag of the exact processor that the document failed on.
}
}
}
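
For reference, the `tag` option that produces the `processor_tag` value above can be set on any ingest processor. A minimal sketch, with a hypothetical pipeline name and illustrative fields, might look like this:

```console
PUT _ingest/pipeline/tagged-pipeline-example
{
  "processors": [
    {
      "set": {
        "tag": "copy to new counter", <1>
        "field": "counter_copy",
        "copy_from": "counter"
      }
    },
    {
      "remove": {
        "tag": "drop original counter",
        "field": "counter"
      }
    }
  ]
}
```
1. The `tag` value is what surfaces as `processor_tag` in a failure document if this processor fails.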
@@ -277,11 +277,11 @@ GET my-datastream-ingest::failures/_search
}
```

- Without tags in place it would not be as clear where in the pipeline we encountered the problem. Tags provide a unique identifier for a processor that can be quickly referenced in case of an ingest failure.
+ Without tags in place it would not be as clear where in the pipeline the indexing problem occurred. Tags provide a unique identifier for a processor that can be quickly referenced in case of an ingest failure.

## Alerting on failed ingestion [failure-store-recipes-alerting]

- Since failure stores can be searched just like a normal data stream, we can use them as inputs to [alerting rules](./failure-store.md) in Kibana. Here is a simple alerting example to trigger on more than ten failures in the last five minutes for a data stream:
+ Since failure stores can be searched just like a normal data stream, we can use them as inputs to [alerting rules](./failure-store.md) in Kibana. Here is a simple alerting example that is triggered when more than ten indexing failures have occurred in the last five minutes for a data stream:

:::::{stepper}

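A search of the following shape could back such a rule. This is a sketch only, assuming the failure documents carry a top-level `@timestamp` recording when each failure happened; the alert would fire when `hits.total.value` exceeds ten:

```console
GET my-datastream-ingest::failures/_search
{
  "size": 0, <1>
  "track_total_hits": true,
  "query": {
    "range": {
      "@timestamp": {
        "gte": "now-5m" <2>
      }
    }
  }
}
```
1. Only the hit count is needed, not the failure documents themselves.
2. Count failures recorded in the last five minutes.
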
@@ -349,17 +349,17 @@ Care should be taken when replaying data into a data stream from a failure store

We recommend a few best practices for remediating failure data.

- **Separate your failures beforehand.** As described in the [failure document source](#use-failure-store-document-source) section above, failure documents are structured differently depending on when the document failed during ingestion. We recommend to separate documents by ingest pipeline failures and indexing failures at minimum. Ingest pipeline failures often need to have the original pipeline re-executed, while index failures should skip any pipelines. Further separating failures by index or specific failure type may also be beneficial.
+ **Separate your failures beforehand.** As described in the [failure document source](#use-failure-store-document-source) section above, failure documents are structured differently depending on when the document failed during ingestion. We recommend separating documents by ingest pipeline failures and indexing failures at minimum. Ingest pipeline failures often need to have the original pipeline re-run, while index failures should skip any pipelines. Further separating failures by index or specific failure type may also be beneficial.

**Perform a failure store rollover.** Consider rolling over the failure store before attempting to remediate failures. This will create a new failure index that will collect any new failures during the remediation process.

- **Use an ingest pipeline to convert failure documents back into their original document.** Failure documents store failure information along with the document that failed ingestion. The first step for remediating documents should be to use an ingest pipeline to extract the original source from the failure document and discard any other info on it.
+ **Use an ingest pipeline to convert failure documents back into their original document.** Failure documents store failure information along with the document that failed ingestion. The first step for remediating documents should be to use an ingest pipeline to extract the original source from the failure document and then discard any other information about the failure.

- **Simulate first to avoid repeat failures.** If you must execute a pipeline as part of your remediation process, it is best to simulate the pipeline against the failure first. This will catch any unforeseen issues that may fail the document a second time. Remember, ingest pipeline failures will capture the document before an ingest pipeline was applied to it, which can further complicate remediation when a failure document becomes nested inside a new failure.
+ **Simulate first to avoid repeat failures.** If you must run a pipeline as part of your remediation process, it is best to simulate the pipeline against the failure first. This will catch any unforeseen issues that may fail the document a second time. Remember, ingest pipeline failures will capture the document before an ingest pipeline is applied to it, which can further complicate remediation when a failure document becomes nested inside a new failure.

### Remediating ingest node failures [failure-store-recipes-remediation-ingest]

- Failures that occurred during an ingest processor will be stored as they were before any pipelines were executed. To replay the document into the data stream we will need to rerun any applicable pipelines for the document.
+ Failures that occurred during ingest processing will be stored as they were before any pipelines were run. To replay the document into the data stream we will need to re-run any applicable pipelines for the document.

:::::{stepper}

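The rollover mentioned in the best practices above could be as simple as the following sketch, assuming the `::failures` selector used in the search examples is also accepted by the rollover API:

```console
POST my-datastream-ingest::failures/_rollover
```
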
@@ -466,7 +466,7 @@ Take note of the documents that are returned. We can use these to simulate that
::::

::::{step} Fix the original problem
- Because ingest pipeline failures need to be reprocessed by their original pipelines, any problems with those pipeline should be fixed before remediating failures. Investigating the pipeline mentioned in the example above shows that there is a processor that expects a field to be present that is not always present.
+ Because ingest pipeline failures need to be reprocessed by their original pipelines, any problems with those pipelines should be fixed before remediating failures. Investigating the pipeline mentioned in the example above shows that there is a processor that expects a field to be present that is not always present.

```console-result
{
@@ -500,7 +500,7 @@ PUT _ingest/pipeline/my-datastream-default-pipeline
]
}
```
- 1. Only conditionally run the processor if the field exists.
+ 1. Conditionally run the processor only if the field exists.

::::

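The pipeline body itself falls outside this hunk, but the guard described in the callout is the standard `if` option on a processor. A sketch with a hypothetical pipeline name, an illustrative processor type, and the field from the earlier example:

```console
PUT _ingest/pipeline/conditional-processor-example
{
  "processors": [
    {
      "uppercase": {
        "field": "important.info",
        "if": "ctx.important?.info != null" <1>
      }
    }
  ]
}
```
1. The processor only runs when `important.info` exists, so documents that lack the field no longer fail the pipeline.
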
@@ -536,7 +536,7 @@ PUT _ingest/pipeline/my-datastream-remediation-pipeline
```
1. Copy the original index name from the failure document over into the document's metadata. If you use custom document routing, copy that over too.
2. Capture the source of the original document.
- 3. Discard the `error` field since it wont be needed for the remediation.
+ 3. Discard the `error` field since it won't be needed for the remediation.
4. Also discard the `document` field.
5. We extract all the fields from the original document's source back to the root of the document.
6. Since the pipeline that failed was the default pipeline on `my-datastream-ingest-example`, we will use the `reroute` processor to send any remediated documents to that data stream's default pipeline again to be reprocessed.
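
The pipeline these callouts annotate is not visible in this hunk. A sketch that is consistent with them, assuming the failure document layout shown earlier (`document.index`, `document.source`, and `error`), could look like the following; treat it as an illustration rather than the documented pipeline:

```console
PUT _ingest/pipeline/remediation-pipeline-sketch
{
  "processors": [
    {
      "set": {
        "field": "_index",
        "copy_from": "document.index" <1>
      }
    },
    {
      "set": {
        "field": "original",
        "copy_from": "document.source" <2>
      }
    },
    {
      "remove": {
        "field": ["error", "document"] <3>
      }
    },
    {
      "script": {
        "source": "def original = ctx.remove('original'); for (entry in original.entrySet()) { ctx[entry.getKey()] = entry.getValue(); }" <4>
      }
    },
    {
      "reroute": {
        "destination": "my-datastream-ingest-example" <5>
      }
    }
  ]
}
```
1. Copy the original target index from the failure document into the document metadata.
2. Stash the original document source in a temporary field.
3. Drop the failure-specific fields.
4. Move every field of the original source back to the root of the document and remove the temporary field.
5. Reroute the remediated document so that the destination data stream's default pipeline runs again.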
@@ -632,10 +632,10 @@ POST _ingest/pipeline/_simulate
}
```
1. The index has been updated via the reroute processor.
- 2. The id has stayed the same.
- 3. The source should cleanly match what the original document should have been.
+ 2. The document ID has stayed the same.
+ 3. The source should cleanly match the contents of the original document.

- Now that the remediation pipeline has been tested, be sure to test the end to end ingestion to verify that no further problems will arise. To do this, we will use the [simulate ingestion API](./failure-store.md) to test multiple pipeline executions.
+ Now that the remediation pipeline has been tested, be sure to test the end-to-end ingestion to verify that no further problems will arise. To do this, we will use the [simulate ingestion API](./failure-store.md) to test multiple pipeline executions.

```console
POST _ingest/_simulate?pipeline=my-datastream-remediation-pipeline <1>
@@ -699,7 +699,7 @@ POST _ingest/_simulate?pipeline=my-datastream-remediation-pipeline <1>
]
}
```
- 1. Set the pipeline to be the remediation pipeline name, otherwise, the default pipeline for the document's index is used.
+ 1. Set the pipeline to be the remediation pipeline name; otherwise, the default pipeline for the document's index is used.
2. The contents of the remediation pipeline in previous steps.
3. The contents of the previously identified example failure document.

@@ -806,7 +806,7 @@ POST _reindex
1. The failures have been remediated.

:::{tip}
- Since the failure store is enabled on this data stream, it would be wise to check it for any further failures from the reindexing process. Failures that happen at this point in the process may end up as nested failures in the failure store. Remediating nested failures can quickly become a hassle as the original document gets nested multiple levels deep in the failure document. For this reason, it is suggested to remediate data during a quiet period where no other failures will arise. Furthermore, rolling over the failure store before executing the remediation would allow easier discarding of any new nested failures and only operate on the original failure documents.
+ Since the failure store is enabled on this data stream, it would be wise to check it for any further failures from the reindexing process. Failures that happen at this point in the process may end up as nested failures in the failure store. Remediating nested failures can quickly become a hassle as the original document gets nested multiple levels deep in the failure document. For this reason, it is suggested to remediate data during a quiet period when no other failures are likely to arise. Furthermore, rolling over the failure store before executing the remediation would allow easier discarding of any new nested failures and only operate on the original failure documents.
:::

::::{step} Done
@@ -816,7 +816,7 @@ Since the failure store is enabled on this data stream, it would be wise to chec

### Remediating mapping and shard failures [failure-store-recipes-remediation-mapping]

- As described in the [failure document source](#use-failure-store-document-source) section above, failures that occur due to a mapping or indexing issue will be stored as they were after any pipelines had executed. This means that to replay the document into the data stream we will need to make sure to skip any pipelines that have already run.
+ As described in the previous [failure document source](#use-failure-store-document-source) section, failures that occur due to a mapping or indexing issue will be stored as they were after any pipelines had executed. This means that to replay the document into the data stream we will need to make sure to skip any pipelines that have already run.

:::{tip}
You can greatly simplify this remediation process by writing any ingest pipelines to be idempotent. In that case, any document that has already been processed that passes through a pipeline again would be unchanged.
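
As a loose illustration of that tip (the pipeline and field names here are made up), processors can be written so that running them a second time changes nothing:

```console
PUT _ingest/pipeline/idempotent-pipeline-example
{
  "processors": [
    {
      "set": {
        "field": "event.normalized",
        "value": true,
        "override": false <1>
      }
    },
    {
      "rename": {
        "field": "msg",
        "target_field": "message",
        "ignore_missing": true <2>
      }
    }
  ]
}
```
1. `override: false` leaves the field alone if an earlier run already set it.
2. On a document that has already been processed, `msg` no longer exists and the rename becomes a no-op.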
@@ -869,7 +869,7 @@ POST my-datastream-indexing-example::failures/_search
3. Further narrow which kind of failure you are attempting to remediate. In this example we are targeting a specific type of error.
4. Filter on timestamp to only retrieve failures before a certain point in time. This provides a stable set of documents.

- Take note of the documents that are returned. We can use these to simulate that our remediation logic makes sense
+ Take note of the documents that are returned. We can use these to simulate that our remediation logic makes sense.
```console-result
{
"took": 1,
@@ -930,7 +930,7 @@ Caused by: j.l.IllegalArgumentException: data stream timestamp field [@timestamp

::::{step} Fix the original problem

- There are a broad set of possible indexing failures. Most of these problems stem from incorrect values for a particular mapping. Sometimes a large number of new fields are dynamically mapped and the maximum number of mapping fields is reached and no more can be added. In our example above, the document being indexed is missing a required timestamp.
+ There is a broad set of possible indexing failures. Most of these problems stem from incorrect values for a particular mapping. Sometimes a large number of new fields are dynamically mapped and the maximum number of mapping fields is reached, so no more can be added. In our example above, the document being indexed is missing a required timestamp.

These problems can occur in a number of places: Data sent from a client may be incomplete, ingest pipelines may not be producing the correct result, or the index mapping may need to be updated to account for changes in data.

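For the missing timestamp case specifically, one hedged sketch of a fix (the pipeline name and its placement are illustrative) is a `set` processor that fills in `@timestamp` from the ingest timestamp only when the field is absent:

```console
PUT _ingest/pipeline/add-missing-timestamp-example
{
  "processors": [
    {
      "set": {
        "field": "@timestamp",
        "value": "{{{_ingest.timestamp}}}", <1>
        "override": false <2>
      }
    }
  ]
}
```
1. `_ingest.timestamp` is the time at which the node processes the document.
2. Documents that already carry a `@timestamp` keep their original value.
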
@@ -970,7 +970,7 @@ PUT _ingest/pipeline/my-datastream-remediation-pipeline
5. We extract all the fields from the original document's source back to the root of the document. The `@timestamp` field is not overwritten and thus will be present in the final document.

:::{important}
- Remember that a document that has failed during indexing has already been processed by the ingest processor! It shouldn't need to be processed again unless you made changes to your pipeline to fix the original problem. Make sure that any fixes applied to the ingest pipeline is reflected in the pipeline logic here.
+ Remember that a document that has failed during indexing has already been processed by the ingest processor! It shouldn't need to be processed again unless you made changes to your pipeline to fix the original problem. Make sure that any fixes applied to the ingest pipeline are reflected in the pipeline logic here.
:::

::::
@@ -1115,8 +1115,8 @@ POST _reindex
```
1. Read from the failure store.
2. Only reindex failure documents that match the ones we are replaying.
- 3. Set the destination to the data stream the failures originally were sent to. The remediation pipeline above updates the index to be the correct one, but a destination is still required.
- 4. Replace the pipeline with the remediation pipeline. This will keep any default pipelines from running.
+ 3. Set the destination to the data stream the failures originally were sent to. The remediation pipeline in the example updates the index to be the correct one, but a destination is still required.
+ 4. Replace the original pipeline with the remediation pipeline. This will keep any default pipelines from running.

```console-result
{

manage-data/data-store/data-streams/failure-store.md

Lines changed: 1 addition & 1 deletion
@@ -63,7 +63,7 @@ PUT _data_stream/my-datastream-existing/_options
1. The failure store option will now be enabled.


- The failure store redirection can be disabled using this API as well. When the failure store is deactivated, only failed document redirection is halted. Any existing failure data in the data stream will remain until removed by manual deletion or by retention.
+ The failure store redirection can be disabled using this API as well. When the failure store is deactivated, only failed document redirection is halted. Any existing failure data in the data stream will remain until removed by manual deletion or until the data expires due to reaching its max configured retention.

```console
PUT _data_stream/my-datastream-existing/_options
