manage-data/data-store/data-streams/failure-store-recipes.md (28 additions, 28 deletions)
@@ -30,7 +30,7 @@ POST my-datastream-ingest/_doc
 },
 "_seq_no": 2,
 "_primary_term": 1,
-"failure_store": "used" // The document was sent to the failure store
+"failure_store": "used" // The document was sent to the failure store.
 }
 ```

@@ -66,7 +66,7 @@ GET my-datastream-ingest::failures/_search
 "index": "my-datastream-ingest",
 "source": { // When an ingest pipeline fails, the document stored is what was originally sent to the cluster.
 "important": {
-"info": "The rain in Spain falls mainly on the plain" // The important info that we failed to find was originally present on the document.
+"info": "The rain in Spain falls mainly on the plain" // The important information that we failed to find was originally present in the document.
 },
 "@timestamp": "2025-04-21T00:00:00Z"
 }
@@ -93,7 +93,7 @@ GET my-datastream-ingest::failures/_search
 }
 ```

-Despite not knowing the pipelines beforehand, we have some places to start looking. The `ingest-step-2` pipeline cannot find the `important.info` field despite it being present on the document that was sent to the cluster. If we pull that pipeline definition we find the following:
+Despite not knowing the pipelines beforehand, we have some places to start looking. The `ingest-step-2` pipeline cannot find the `important.info` field despite it being present in the document that was sent to the cluster. If we pull that pipeline definition, we find the following:

 ```console
 GET _ingest/pipeline/ingest-step-2
@@ -126,7 +126,7 @@ GET _ingest/pipeline/ingest-step-1
 "processors": [
 {
 "remove": {
-"field": "important.info" // A remove processor that is incorrectly getting rid of our important field.
+"field": "important.info" // A remove processor that is incorrectly removing our important field.
 }
 },
 {
@@ -143,7 +143,7 @@ We find a remove processor in the first pipeline that is the root cause of the p
-Ingest processors can be labeled with [tags](./failure-store.md). These tags are userprovided information that names or describes the processor's purpose in the pipeline. When documents are redirected to the failure store due to a processor issue, they capture the tag from the processor in which the failure occurred if it exists. Because of this, it is a good practice to tag the processors in your pipeline so that the location of a failure can be identified quickly.
+Ingest processors can be labeled with [tags](./failure-store.md). These tags are user-provided information that names or describes the processor's purpose in the pipeline. When documents are redirected to the failure store due to a processor issue, they capture the tag from the processor in which the failure occurred, if it exists. Because of this behavior, it is a good practice to tag the processors in your pipeline so that the location of a failure can be identified quickly.

 Here we have a needlessly complicated pipeline. It is made up of several set and remove processors. Beneficially, they are all tagged with descriptive names.
 ```console
@@ -194,7 +194,7 @@ PUT _ingest/pipeline/complicated-processor
 }
 ```

-We ingest some data and find that it was sent to the failure store
+We ingest some data and find that it was sent to the failure store.
 ```console
 POST my-datastream-ingest/_doc?pipeline=complicated-processor
 {
@@ -220,7 +220,7 @@ POST my-datastream-ingest/_doc?pipeline=complicated-processor
 }
 ```

-Upon checking the failure, we can quickly identify the tagged processor that caused the problem
+On checking the failure, we can quickly identify the tagged processor that caused the problem.
 ```console
 GET my-datastream-ingest::failures/_search
 ```
@@ -268,7 +268,7 @@ GET my-datastream-ingest::failures/_search
 ],
 "pipeline": "complicated-processor",
 "processor_type": "set", // Helpful, but which set processor on the pipeline could it be?
-"processor_tag": "copy to new counter again" // The tag of the exact processor that it failed on.
+"processor_tag": "copy to new counter again" // The tag of the exact processor that the document failed on.
 }
 }
 }
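
Context for the tag lookup above: a processor tag is set with the `tag` option on an ingest processor definition. A minimal sketch of how the tag seen in this failure document could have been declared — the field names here are illustrative assumptions, not taken from the original file:

```console
PUT _ingest/pipeline/complicated-processor
{
  "processors": [
    {
      "set": {
        "tag": "copy to new counter again", // This tag is reported as `processor_tag` in failure documents.
        "field": "new_counter", // Illustrative field names.
        "copy_from": "counter"
      }
    }
  ]
}
```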
@@ -277,11 +277,11 @@ GET my-datastream-ingest::failures/_search
 }
 ```

-Without tags in place it would not be as clear where in the pipeline we encountered the problem. Tags provide a unique identifier for a processor that can be quickly referenced in case of an ingest failure.
+Without tags in place, it would not be as clear where in the pipeline the indexing problem occurred. Tags provide a unique identifier for a processor that can be quickly referenced in case of an ingest failure.

 ## Alerting on failed ingestion [failure-store-recipes-alerting]

-Since failure stores can be searched just like a normal data stream, we can use them as inputs to [alerting rules](./failure-store.md) in Kibana. Here is a simple alerting example to trigger on more than ten failures in the last five minutes for a data stream:
+Since failure stores can be searched just like a normal data stream, we can use them as inputs to [alerting rules](./failure-store.md) in Kibana. Here is a simple alerting example that is triggered when more than ten indexing failures have occurred in the last five minutes for a data stream:

 :::::{stepper}
@@ -349,17 +349,17 @@ Care should be taken when replaying data into a data stream from a failure store

 We recommend a few best practices for remediating failure data.

-**Separate your failures beforehand.** As described in the [failure document source](#use-failure-store-document-source) section above, failure documents are structured differently depending on when the document failed during ingestion. We recommend to separate documents by ingest pipeline failures and indexing failures at minimum. Ingest pipeline failures often need to have the original pipeline re-executed, while index failures should skip any pipelines. Further separating failures by index or specific failure type may also be beneficial.
+**Separate your failures beforehand.** As described in the [failure document source](#use-failure-store-document-source) section above, failure documents are structured differently depending on when the document failed during ingestion. We recommend separating documents into ingest pipeline failures and indexing failures at a minimum. Ingest pipeline failures often need to have the original pipeline re-run, while index failures should skip any pipelines. Further separating failures by index or specific failure type may also be beneficial.

 **Perform a failure store rollover.** Consider rolling over the failure store before attempting to remediate failures. This will create a new failure index that will collect any new failures during the remediation process.

-**Use an ingest pipeline to convert failure documents back into their original document.** Failure documents store failure information along with the document that failed ingestion. The first step for remediating documents should be to use an ingest pipeline to extract the original source from the failure document and discard any other info on it.
+**Use an ingest pipeline to convert failure documents back into their original document.** Failure documents store failure information along with the document that failed ingestion. The first step for remediating documents should be to use an ingest pipeline to extract the original source from the failure document and then discard any other information about the failure.

-**Simulate first to avoid repeat failures.** If you must execute a pipeline as part of your remediation process, it is best to simulate the pipeline against the failure first. This will catch any unforeseen issues that may fail the document a second time. Remember, ingest pipeline failures will capture the document before an ingest pipeline was applied to it, which can further complicate remediation when a failure document becomes nested inside a new failure.
+**Simulate first to avoid repeat failures.** If you must run a pipeline as part of your remediation process, it is best to simulate the pipeline against the failure first. This will catch any unforeseen issues that may fail the document a second time. Remember, ingest pipeline failures will capture the document before an ingest pipeline is applied to it, which can further complicate remediation when a failure document becomes nested inside a new failure.

-Failures that occurred during an ingest processor will be stored as they were before any pipelines were executed. To replay the document into the data stream we will need to rerun any applicable pipelines for the document.
+Failures that occurred during ingest processing will be stored as they were before any pipelines were run. To replay the document into the data stream, we will need to re-run any applicable pipelines for the document.

 :::::{stepper}
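
The "simulate first" best practice above maps onto the pipeline simulation API. A minimal sketch against the remediation pipeline named later in this file — the document body is an illustrative assumption:

```console
POST _ingest/pipeline/my-datastream-remediation-pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "important": {
          "info": "The rain in Spain falls mainly on the plain"
        },
        "@timestamp": "2025-04-21T00:00:00Z"
      }
    }
  ]
}
```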
@@ -466,7 +466,7 @@ Take note of the documents that are returned. We can use these to simulate that 
 ::::

 ::::{step} Fix the original problem
-Because ingest pipeline failures need to be reprocessed by their original pipelines, any problems with those pipeline should be fixed before remediating failures. Investigating the pipeline mentioned in the example above shows that there is a processor that expects a field to be present that is not always present.
+Because ingest pipeline failures need to be reprocessed by their original pipelines, any problems with those pipelines should be fixed before remediating failures. Investigating the pipeline mentioned in the example above shows that there is a processor that expects a field that is not always present.

 ```console-result
 {
@@ -500,7 +500,7 @@ PUT _ingest/pipeline/my-datastream-default-pipeline
 ]
 }
 ```
-1.Only conditionally run the processor if the field exists.
+1. Conditionally run the processor only if the field exists.

 ::::
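
The conditional execution described in callout 1 above is normally expressed with the processor-level `if` option, which takes a Painless condition. A sketch — the processor body is an assumption; only the `if` line is the point:

```console
PUT _ingest/pipeline/my-datastream-default-pipeline
{
  "processors": [
    {
      "set": {
        "field": "summary",
        "copy_from": "important.info",
        "if": "ctx.important?.info != null" // Run the processor only when the field exists.
      }
    }
  ]
}
```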
@@ -536,7 +536,7 @@ PUT _ingest/pipeline/my-datastream-remediation-pipeline
 ```
 1. Copy the original index name from the failure document over into the document's metadata. If you use custom document routing, copy that over too.
 2. Capture the source of the original document.
-3. Discard the `error` field since it wont be needed for the remediation.
+3. Discard the `error` field since it won't be needed for the remediation.
 4. Also discard the `document` field.
 5. We extract all the fields from the original document's source back to the root of the document.
 6. Since the pipeline that failed was the default pipeline on `my-datastream-ingest-example`, we will use the `reroute` processor to send any remediated documents to that data stream's default pipeline again to be reprocessed.
@@ -632,10 +632,10 @@ POST _ingest/pipeline/_simulate
 }
 ```
 1. The index has been updated via the reroute processor.
-2. The id has stayed the same.
-3. The source should cleanly match what the original document should have been.
+2. The document ID has stayed the same.
+3. The source should cleanly match the contents of the original document.

-Now that the remediation pipeline has been tested, be sure to test the end to end ingestion to verify that no further problems will arise. To do this, we will use the [simulate ingestion API](./failure-store.md) to test multiple pipeline executions.
+Now that the remediation pipeline has been tested, be sure to test the end-to-end ingestion to verify that no further problems will arise. To do this, we will use the [simulate ingestion API](./failure-store.md) to test multiple pipeline executions.

 ```console
 POST _ingest/_simulate?pipeline=my-datastream-remediation-pipeline <1>
@@ -699,7 +699,7 @@ POST _ingest/_simulate?pipeline=my-datastream-remediation-pipeline <1>
 ]
 }
 ```
-1. Set the pipeline to be the remediation pipeline name, otherwise, the default pipeline for the document's index is used.
+1. Set the pipeline to be the remediation pipeline name; otherwise, the default pipeline for the document's index is used.
 2. The contents of the remediation pipeline in previous steps.
 3. The contents of the previously identified example failure document.
@@ -806,7 +806,7 @@ POST _reindex
 1. The failures have been remediated.

 :::{tip}
-Since the failure store is enabled on this data stream, it would be wise to check it for any further failures from the reindexing process. Failures that happen at this point in the process may end up as nested failures in the failure store. Remediating nested failures can quickly become a hassle as the original document gets nested multiple levels deep in the failure document. For this reason, it is suggested to remediate data during a quiet period where no other failures will arise. Furthermore, rolling over the failure store before executing the remediation would allow easier discarding of any new nested failures and only operate on the original failure documents.
+Since the failure store is enabled on this data stream, it would be wise to check it for any further failures from the reindexing process. Failures that happen at this point in the process may end up as nested failures in the failure store. Remediating nested failures can quickly become a hassle as the original document gets nested multiple levels deep in the failure document. For this reason, it is suggested to remediate data during a quiet period when no other failures are likely to arise. Furthermore, rolling over the failure store before running the remediation makes it easier to discard any new nested failures and operate only on the original failure documents.
 :::

 ::::{step} Done
@@ -816,7 +816,7 @@ Since the failure store is enabled on this data stream, it would be wise to chec

 ### Remediating mapping and shard failures [failure-store-recipes-remediation-mapping]

-As described in the [failure document source](#use-failure-store-document-source) section above, failures that occur due to a mapping or indexing issue will be stored as they were after any pipelines had executed. This means that to replay the document into the data stream we will need to make sure to skip any pipelines that have already run.
+As described in the previous [failure document source](#use-failure-store-document-source) section, failures that occur due to a mapping or indexing issue will be stored as they were after any pipelines had executed. This means that to replay the document into the data stream, we will need to make sure to skip any pipelines that have already run.

 :::{tip}
 You can greatly simplify this remediation process by writing any ingest pipelines to be idempotent. In that case, any document that has already been processed would pass through a pipeline again unchanged.
@@ -869,7 +869,7 @@ POST my-datastream-indexing-example::failures/_search
 3. Further narrow which kind of failure you are attempting to remediate. In this example we are targeting a specific type of error.
 4. Filter on timestamp to only retrieve failures before a certain point in time. This provides a stable set of documents.

-Take note of the documents that are returned. We can use these to simulate that our remediation logic makes sense
+Take note of the documents that are returned. We can use these to check that our remediation logic makes sense.
 ```console-result
 {
 "took": 1,
@@ -930,7 +930,7 @@ Caused by: j.l.IllegalArgumentException: data stream timestamp field [@timestamp

 ::::{step} Fix the original problem

-There are a broad set of possible indexing failures. Most of these problems stem from incorrect values for a particular mapping. Sometimes a large number of new fields are dynamically mapped and the maximum number of mapping fields is reached and no more can be added. In our example above, the document being indexed is missing a required timestamp.
+There is a broad set of possible indexing failures. Most of these problems stem from incorrect values for a particular mapping. Sometimes a large number of new fields are dynamically mapped and the maximum number of mapped fields is reached, so no more can be added. In our example above, the document being indexed is missing a required timestamp.

 These problems can occur in a number of places: Data sent from a client may be incomplete, ingest pipelines may not be producing the correct result, or the index mapping may need to be updated to account for changes in data.
@@ -970,7 +970,7 @@ PUT _ingest/pipeline/my-datastream-remediation-pipeline
 5. We extract all the fields from the original document's source back to the root of the document. The `@timestamp` field is not overwritten and thus will be present in the final document.

 :::{important}
-Remember that a document that has failed during indexing has already been processed by the ingest processor! It shouldn't need to be processed again unless you made changes to your pipeline to fix the original problem. Make sure that any fixes applied to the ingest pipeline is reflected in the pipeline logic here.
+Remember that a document that has failed during indexing has already been processed by the ingest pipeline! It shouldn't need to be processed again unless you made changes to your pipeline to fix the original problem. Make sure that any fixes applied to the ingest pipeline are reflected in the pipeline logic here.
 :::

 ::::
@@ -1115,8 +1115,8 @@ POST _reindex
 ```
 1. Read from the failure store.
 2. Only reindex failure documents that match the ones we are replaying.
-3. Set the destination to the data stream the failures originally were sent to. The remediation pipeline above updates the index to be the correct one, but a destination is still required.
-4. Replace the pipeline with the remediation pipeline. This will keep any default pipelines from running.
+3. Set the destination to the data stream the failures were originally sent to. The remediation pipeline in the example updates the index to be the correct one, but a destination is still required.
+4. Replace the original pipeline with the remediation pipeline. This will keep any default pipelines from running.
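
Callouts 1–4 above describe a `_reindex` call that reads from the failure store and forces the remediation pipeline. A sketch of how those pieces could fit together — the names follow the examples in this file, and the query is an illustrative placeholder:

```console
POST _reindex
{
  "source": {
    "index": "my-datastream-indexing-example::failures", // 1: read from the failure store
    "query": {
      "term": {
        "error.type": "document_parsing_exception" // 2: match only the failures being replayed (illustrative)
      }
    }
  },
  "dest": {
    "index": "my-datastream-indexing-example", // 3: a destination is still required
    "op_type": "create",
    "pipeline": "my-datastream-remediation-pipeline" // 4: overrides any default pipeline
  }
}
```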
manage-data/data-store/data-streams/failure-store.md (1 addition, 1 deletion)
@@ -63,7 +63,7 @@ PUT _data_stream/my-datastream-existing/_options
 1. The failure store option will now be enabled.

-The failure store redirection can be disabled using this API as well. When the failure store is deactivated, only failed document redirection is halted. Any existing failure data in the data stream will remain until removed by manual deletion or by retention.
+The failure store redirection can be disabled using this API as well. When the failure store is deactivated, only failed document redirection is halted. Any existing failure data in the data stream will remain until it is removed by manual deletion or until it expires due to reaching its configured maximum retention.
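
For reference, the redirection toggle discussed above uses the same options endpoint shown in the hunk; disabling it is a sketch like the following:

```console
PUT _data_stream/my-datastream-existing/_options
{
  "failure_store": {
    "enabled": false // Halts redirection only; existing failure data remains until deleted or expired by retention.
  }
}
```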