
Commit d521fd7: Review comments
1 parent 3a88cc7 commit d521fd7

File tree: 2 files changed, +42 -24 lines changed

manage-data/data-store/data-streams/failure-store-recipes.md

Lines changed: 30 additions & 17 deletions
@@ -30,9 +30,10 @@ POST my-datastream-ingest/_doc
  },
  "_seq_no": 2,
  "_primary_term": 1,
-  "failure_store": "used" // The document was sent to the failure store.
+  "failure_store": "used" <1>
}
```
+1. The document was sent to the failure store.

Now we search the failure store to check the failure document to see what went wrong.
```console
@@ -64,34 +65,40 @@ GET my-datastream-ingest::failures/_search
          "@timestamp": "2025-05-09T06:24:48.381Z",
          "document": {
            "index": "my-datastream-ingest",
-           "source": { // When an ingest pipeline fails, the document stored is what was originally sent to the cluster.
+           "source": { <1>
              "important": {
-               "info": "The rain in Spain falls mainly on the plain" // The important information that we failed to find was originally present in the document.
+               "info": "The rain in Spain falls mainly on the plain" <2>
              },
              "@timestamp": "2025-04-21T00:00:00Z"
            }
          },
          "error": {
            "type": "illegal_argument_exception",
-           "message": "field [info] not present as part of path [important.info]", // The info field was not present when the failure occurred.
+           "message": "field [info] not present as part of path [important.info]", <3>
            "stack_trace": """j.l.IllegalArgumentException: field [info] not present as part of path [important.info]
        at o.e.i.IngestDocument.getFieldValue(IngestDocument.java:202)
        at o.e.i.c.SetProcessor.execute(SetProcessor.java:86)
        ... 19 more
        """,
-           "pipeline_trace": [ // The first pipeline called the second pipeline.
+           "pipeline_trace": [ <4>
              "ingest-step-1",
              "ingest-step-2"
            ],
-           "pipeline": "ingest-step-2", // The document failed in the second pipeline.
-           "processor_type": "set" // It failed in the pipeline's set processor.
+           "pipeline": "ingest-step-2", <5>
+           "processor_type": "set" <6>
          }
        }
      }
    ]
  }
}
```
+1. When an ingest pipeline fails, the document stored is what was originally sent to the cluster.
+2. The important information that we failed to find was originally present in the document.
+3. The info field was not present when the failure occurred.
+4. The first pipeline called the second pipeline.
+5. The document failed in the second pipeline.
+6. It failed in the pipeline's set processor.

Despite not knowing the pipelines beforehand, we have some places to start looking. The `ingest-step-2` pipeline cannot find the `important.info` field despite it being present in the document that was sent to the cluster. If we pull that pipeline definition we find the following:

@@ -104,15 +111,17 @@ GET _ingest/pipeline/ingest-step-2
  "ingest-step-2": {
    "processors": [
      {
-       "set": { // There is only one processor here.
+       "set": { <1>
          "field": "copy.info",
-         "copy_from": "important.info" // This field was missing from the document at this point.
+         "copy_from": "important.info" <2>
        }
      }
    ]
  }
}
```
+1. There is only one processor here.
+2. This field was missing from the document at this point.

There is only a set processor in the `ingest-step-2` pipeline so this is likely not where the root problem is. Remembering the `pipeline_trace` field on the failure we find that `ingest-step-1` was the original pipeline called for this document. It is likely the data stream's default pipeline. Pulling its definition we find the following:

@@ -126,18 +135,20 @@ GET _ingest/pipeline/ingest-step-1
    "processors": [
      {
        "remove": {
-         "field": "important.info" // A remove processor that is incorrectly removing our important field.
+         "field": "important.info" <1>
        }
      },
      {
        "pipeline": {
-         "name": "ingest-step-2" // The call to the second pipeline.
+         "name": "ingest-step-2" <2>
        }
      }
    ]
  }
}
```
+1. A remove processor that is incorrectly removing our important field.
+2. The call to the second pipeline.

We find a remove processor in the first pipeline that is the root cause of the problem! The pipeline should be updated to not remove important data, or the downstream pipeline should be changed to not expect the important data to be always present.
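One possible fix, sketched here against the pipeline definitions above: guard the downstream `set` processor with an `if` condition so it only runs when the field is present. (Dropping the `remove` processor from `ingest-step-1` is the other option; which change is right depends on whether the field should ever be removed.)

```console
PUT _ingest/pipeline/ingest-step-2
{
  "processors": [
    {
      "set": {
        "field": "copy.info",
        "copy_from": "important.info",
        "if": "ctx.important?.info != null" <1>
      }
    }
  ]
}
```

1. One possible guard, assuming the field is legitimately optional: the processor is skipped when `important.info` is missing, so such documents no longer fail here.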

@@ -267,15 +278,17 @@ GET my-datastream-ingest::failures/_search
            "complicated-processor"
          ],
          "pipeline": "complicated-processor",
-         "processor_type": "set", // Helpful, but which set processor on the pipeline could it be?
-         "processor_tag": "copy to new counter again" // The tag of the exact processor that the document failed on.
+         "processor_type": "set", <1>
+         "processor_tag": "copy to new counter again" <2>
          }
        }
      }
    ]
  }
}
```
+1. Helpful, but which set processor on the pipeline could it be?
+2. The tag of the exact processor that the document failed on.

Without tags in place it would not be as clear where in the pipeline the indexing problem occurred. Tags provide a unique identifier for a processor that can be quickly referenced in case of an ingest failure.
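For reference, a tag is just the optional `tag` parameter on a processor definition. A minimal sketch, with illustrative field names:

```console
PUT _ingest/pipeline/complicated-processor
{
  "processors": [
    {
      "set": {
        "tag": "copy to new counter again", <1>
        "field": "counter_copy",
        "copy_from": "counter"
      }
    }
  ]
}
```

1. The `tag` value is echoed back as `processor_tag` in any failure document this processor produces, which is how the failing processor was pinpointed above.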

@@ -352,11 +365,11 @@ We recommend a few best practices for remediating failure data.

**Separate your failures beforehand.** As described in the previous [failure document source](./failure-store.md#use-failure-store-document-source) section, failure documents are structured differently depending on when the document failed during ingestion. We recommend separating documents by ingest pipeline failures and indexing failures at a minimum. Ingest pipeline failures often need to have the original pipeline re-run, while index failures should skip any pipelines. Further separating failures by index or specific failure type may also be beneficial.

-**Perform a failure store rollover.** Consider rolling over the failure store before attempting to remediate failures. This will create a new failure index that will collect any new failures during the remediation process.
+**Perform a failure store rollover.** Consider [rolling over the failure store](./failure-store.md#failure-store-rollover-manage-failure-store-rollover) before attempting to remediate failures. This will create a new failure index that will collect any new failures during the remediation process.

**Use an ingest pipeline to convert failure documents back into their original document.** Failure documents store failure information along with the document that failed ingestion. The first step for remediating documents should be to use an ingest pipeline to extract the original source from the failure document and then discard any other information about the failure.

-**Simulate first to avoid repeat failures.** If you must run a pipeline as part of your remediation process, it is best to simulate the pipeline against the failure first. This will catch any unforeseen issues that may fail the document a second time. Remember, ingest pipeline failures will capture the document before an ingest pipeline is applied to it, which can further complicate remediation when a failure document becomes nested inside a new failure.
+**Simulate first to avoid repeat failures.** If you must run a pipeline as part of your remediation process, it is best to simulate the pipeline against the failure first. This will catch any unforeseen issues that may fail the document a second time. Remember, ingest pipeline failures will capture the document before an ingest pipeline is applied to it, which can further complicate remediation when a failure document becomes nested inside a new failure. The easiest way to simulate these changes is via the [pipeline simulate API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-ingest-simulate) or the [simulate ingest API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-simulate-ingest).

### Remediating ingest node failures [failure-store-recipes-remediation-ingest]
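As a sketch of the simulate-first practice above, the pipeline simulate API accepts candidate documents inline, so the extracted original source can be tested before anything is replayed into the data stream (the pipeline name and document body here are illustrative):

```console
POST _ingest/pipeline/my-remediation-pipeline/_simulate
{
  "docs": [
    {
      "_index": "my-datastream-ingest",
      "_source": {
        "important": {
          "info": "The rain in Spain falls mainly on the plain"
        },
        "@timestamp": "2025-04-21T00:00:00Z"
      }
    }
  ]
}
```

If the simulated run reports a processor error, the remediation pipeline can be adjusted before any real documents are reindexed.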

@@ -811,6 +824,7 @@ Since the failure store is enabled on this data stream, it would be wise to chec
:::

::::{step} Done
+Once any failures have been remediated, you may wish to purge the failures from the failure store to clear up space and to avoid warnings about failed data that has already been replayed. Otherwise, your failures will stay available until the maximum failure store retention period has passed, should you need to reference them.
::::

:::::
@@ -1147,8 +1161,7 @@ Since the failure store is enabled on this data stream, it would be wise to chec
:::

::::{step} Done
+Once any failures have been remediated, you may wish to purge the failures from the failure store to clear up space and to avoid warnings about failed data that has already been replayed. Otherwise, your failures will stay available until the maximum failure store retention period has passed, should you need to reference them.
::::

:::::
-
-Once any failures have been remediated, you may wish to purge the failures from the failure store to clear up space and to avoid warnings about failed data that has already been replayed. Otherwise, your failures will stay available until the maximum failure store retention should you need to reference them.

manage-data/data-store/data-streams/failure-store.md

Lines changed: 12 additions & 7 deletions
@@ -1,14 +1,18 @@
---
applies_to:
-  stack: ga 8.19.0
-  serverless: ga 9.1.0
+  stack: ga 8.19.0, ga 9.1.0
+  serverless: ga
---

# Failure store [failure-store]

A failure store is a secondary set of indices inside a data stream, dedicated to storing failed documents. A failed document is any document that, without the failure store enabled, would cause an ingest pipeline exception or that has a structure that conflicts with a data stream's mappings. In the absence of the failure store, a failed document would cause the indexing operation to fail, with an error message returned in the operation response.

-When a data stream's failure store is enabled, these failures are instead captured in a separate index and persisted to be analysed later. Clients receive a successful response with a flag indicating the failure was redirected. Failure stores do not capture failures caused by backpressure or document version conflicts. These failures are always returned as-is since they warrant specific action by the client.
+When a data stream's failure store is enabled, these failures are instead captured in a separate index and persisted to be analysed later. Clients receive a successful response with a flag indicating the failure was redirected.
+
+:::{important}
+Failure stores do not capture failures caused by backpressure or document version conflicts. These failures are always returned as-is since they warrant specific action by the client.
+:::

## Set up a data stream failure store [set-up-failure-store]

@@ -19,7 +23,7 @@ Each data stream has its own failure store that can be enabled to accept failed
You can specify in a data stream's [index template](../templates.md) if it should enable the failure store when it is first created.

:::{note}
-Unlike the `settings` and `mappings` fields on an [index template](../templates.md) which are repeatedly applied to new data stream write indices on rollover, the `data_stream_options` section of a template is applied to a data stream only once when the data stream is first created. To configure existing data streams, use the put [data stream options API](indices-put-data-stream-options).
+Unlike the `settings` and `mappings` fields on an [index template](../templates.md) which are repeatedly applied to new data stream write indices on rollover, the `data_stream_options` section of a template is applied to a data stream only once when the data stream is first created. To configure existing data streams, use the put [data stream options API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-indices-put-data-stream-options).
:::

To enable the failure store on a new data stream, enable it in the `data_stream_options` of the template:
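For example, a minimal sketch of such a template (the template name and index pattern are placeholders; only the `data_stream_options` block matters here):

```console
PUT _index_template/my-datastream-template
{
  "index_patterns": ["my-datastream-*"],
  "data_stream": {},
  "template": {
    "data_stream_options": {
      "failure_store": {
        "enabled": true <1>
      }
    }
  }
}
```

1. Data streams created from this template start with the failure store enabled.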
@@ -99,16 +103,17 @@ PUT _cluster/settings
  }
}
```
+1. Enabling the failure stores for `my-datastream-*` and `logs-*`
+
```console
PUT _data_stream/my-datastream-1/_options
{
  "failure_store": {
-    "enabled": false <2>
+    "enabled": false <1>
  }
}
```
-1. Enabling the failure stores for `my-datastream-*` and `logs-*`
-2. The failure store for `my-datastream-1` is disabled even though it matches `my-datastream-*`. The data stream options override the cluster setting.
+1. The failure store for `my-datastream-1` is disabled even though it matches `my-datastream-*`. The data stream options override the cluster setting.

## Using a failure store [use-failure-store]
