
Commit d521fd7: Review comments
1 parent 3a88cc7 commit d521fd7

File tree: 2 files changed, +42 -24 lines changed

manage-data/data-store/data-streams/failure-store-recipes.md

Lines changed: 30 additions & 17 deletions
@@ -30,9 +30,10 @@ POST my-datastream-ingest/_doc
  },
  "_seq_no": 2,
  "_primary_term": 1,
-  "failure_store": "used" // The document was sent to the failure store.
+  "failure_store": "used" <1>
}
```
+1. The document was sent to the failure store.

Now we search the failure store to check the failure document to see what went wrong.
```console
@@ -64,34 +65,40 @@ GET my-datastream-ingest::failures/_search
          "@timestamp": "2025-05-09T06:24:48.381Z",
          "document": {
            "index": "my-datastream-ingest",
-           "source": { // When an ingest pipeline fails, the document stored is what was originally sent to the cluster.
+           "source": { <1>
              "important": {
-               "info": "The rain in Spain falls mainly on the plain" // The important information that we failed to find was originally present in the document.
+               "info": "The rain in Spain falls mainly on the plain" <2>
              },
              "@timestamp": "2025-04-21T00:00:00Z"
            }
          },
          "error": {
            "type": "illegal_argument_exception",
-           "message": "field [info] not present as part of path [important.info]", // The info field was not present when the failure occurred.
+           "message": "field [info] not present as part of path [important.info]", <3>
            "stack_trace": """j.l.IllegalArgumentException: field [info] not present as part of path [important.info]
        at o.e.i.IngestDocument.getFieldValue(IngestDocument.java:202)
        at o.e.i.c.SetProcessor.execute(SetProcessor.java:86)
        ... 19 more
        """,
-           "pipeline_trace": [ // The first pipeline called the second pipeline.
+           "pipeline_trace": [ <4>
              "ingest-step-1",
              "ingest-step-2"
            ],
-           "pipeline": "ingest-step-2", // The document failed in the second pipeline.
-           "processor_type": "set" // It failed in the pipeline's set processor.
+           "pipeline": "ingest-step-2", <5>
+           "processor_type": "set" <6>
          }
        }
      }
    ]
  }
}
```
+1. When an ingest pipeline fails, the document stored is what was originally sent to the cluster.
+2. The important information that we failed to find was originally present in the document.
+3. The info field was not present when the failure occurred.
+4. The first pipeline called the second pipeline.
+5. The document failed in the second pipeline.
+6. It failed in the pipeline's set processor.

Despite not knowing the pipelines beforehand, we have some places to start looking. The `ingest-step-2` pipeline cannot find the `important.info` field despite it being present in the document that was sent to the cluster. If we pull that pipeline definition we find the following:

@@ -104,15 +111,17 @@ GET _ingest/pipeline/ingest-step-2
  "ingest-step-2": {
    "processors": [
      {
-       "set": { // There is only one processor here.
+       "set": { <1>
          "field": "copy.info",
-         "copy_from": "important.info" // This field was missing from the document at this point.
+         "copy_from": "important.info" <2>
        }
      }
    ]
  }
}
```
+1. There is only one processor here.
+2. This field was missing from the document at this point.

There is only a set processor in the `ingest-step-2` pipeline so this is likely not where the root problem is. Remembering the `pipeline_trace` field on the failure we find that `ingest-step-1` was the original pipeline called for this document. It is likely the data stream's default pipeline. Pulling its definition we find the following:

@@ -126,18 +135,20 @@ GET _ingest/pipeline/ingest-step-1
    "processors": [
      {
        "remove": {
-         "field": "important.info" // A remove processor that is incorrectly removing our important field.
+         "field": "important.info" <1>
        }
      },
      {
        "pipeline": {
-         "name": "ingest-step-2" // The call to the second pipeline.
+         "name": "ingest-step-2" <2>
        }
      }
    ]
  }
}
```
+1. A remove processor that is incorrectly removing our important field.
+2. The call to the second pipeline.

We find a remove processor in the first pipeline that is the root cause of the problem! The pipeline should be updated to not remove important data, or the downstream pipeline should be changed to not expect the important data to be always present.
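One possible fix, sketched here against the pipeline definitions above: guard the downstream `set` processor with an `if` condition so it only runs when the field is present. (Dropping the `remove` processor from `ingest-step-1` is the other option; which change is right depends on whether the field should ever be removed.)

```console
PUT _ingest/pipeline/ingest-step-2
{
  "processors": [
    {
      "set": {
        "field": "copy.info",
        "copy_from": "important.info",
        "if": "ctx.important?.info != null" <1>
      }
    }
  ]
}
```

1. One possible guard, assuming the field is legitimately optional: the processor is skipped when `important.info` is missing, so such documents no longer fail here.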

@@ -267,15 +278,17 @@ GET my-datastream-ingest::failures/_search
            "complicated-processor"
          ],
          "pipeline": "complicated-processor",
-         "processor_type": "set", // Helpful, but which set processor on the pipeline could it be?
-         "processor_tag": "copy to new counter again" // The tag of the exact processor that the document failed on.
+         "processor_type": "set", <1>
+         "processor_tag": "copy to new counter again" <2>
          }
        }
      }
    ]
  }
}
```
+1. Helpful, but which set processor on the pipeline could it be?
+2. The tag of the exact processor that the document failed on.

Without tags in place it would not be as clear where in the pipeline the indexing problem occurred. Tags provide a unique identifier for a processor that can be quickly referenced in case of an ingest failure.
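For reference, a tag is just the optional `tag` parameter on a processor definition. A minimal sketch, with illustrative field names:

```console
PUT _ingest/pipeline/complicated-processor
{
  "processors": [
    {
      "set": {
        "tag": "copy to new counter again", <1>
        "field": "counter_copy",
        "copy_from": "counter"
      }
    }
  ]
}
```

1. The `tag` value is echoed back as `processor_tag` in any failure document this processor produces, which is how the failing processor was pinpointed above.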

@@ -352,11 +365,11 @@ We recommend a few best practices for remediating failure data.

**Separate your failures beforehand.** As described in the previous [failure document source](./failure-store.md#use-failure-store-document-source) section, failure documents are structured differently depending on when the document failed during ingestion. We recommend separating documents by ingest pipeline failures and indexing failures at a minimum. Ingest pipeline failures often need to have the original pipeline re-run, while index failures should skip any pipelines. Further separating failures by index or specific failure type may also be beneficial.

-**Perform a failure store rollover.** Consider rolling over the failure store before attempting to remediate failures. This will create a new failure index that will collect any new failures during the remediation process.
+**Perform a failure store rollover.** Consider [rolling over the failure store](./failure-store.md#failure-store-rollover-manage-failure-store-rollover) before attempting to remediate failures. This will create a new failure index that will collect any new failures during the remediation process.

**Use an ingest pipeline to convert failure documents back into their original document.** Failure documents store failure information along with the document that failed ingestion. The first step for remediating documents should be to use an ingest pipeline to extract the original source from the failure document and then discard any other information about the failure.

-**Simulate first to avoid repeat failures.** If you must run a pipeline as part of your remediation process, it is best to simulate the pipeline against the failure first. This will catch any unforeseen issues that may fail the document a second time. Remember, ingest pipeline failures will capture the document before an ingest pipeline is applied to it, which can further complicate remediation when a failure document becomes nested inside a new failure.
+**Simulate first to avoid repeat failures.** If you must run a pipeline as part of your remediation process, it is best to simulate the pipeline against the failure first. This will catch any unforeseen issues that may fail the document a second time. Remember, ingest pipeline failures will capture the document before an ingest pipeline is applied to it, which can further complicate remediation when a failure document becomes nested inside a new failure. The easiest way to simulate these changes is via the [pipeline simulate API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-ingest-simulate) or the [simulate ingest API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-simulate-ingest).

### Remediating ingest node failures [failure-store-recipes-remediation-ingest]
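As a sketch of the simulate-first practice above, the pipeline simulate API accepts candidate documents inline, so the extracted original source can be tested before anything is replayed into the data stream (the pipeline name and document body here are illustrative):

```console
POST _ingest/pipeline/my-remediation-pipeline/_simulate
{
  "docs": [
    {
      "_index": "my-datastream-ingest",
      "_source": {
        "important": {
          "info": "The rain in Spain falls mainly on the plain"
        },
        "@timestamp": "2025-04-21T00:00:00Z"
      }
    }
  ]
}
```

If the simulated run reports a processor error, the remediation pipeline can be adjusted before any real documents are reindexed.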

@@ -811,6 +824,7 @@ Since the failure store is enabled on this data stream, it would be wise to chec
:::

::::{step} Done
+Once any failures have been remediated, you may wish to purge the failures from the failure store to clear up space and to avoid warnings about failed data that has already been replayed. Otherwise, your failures will stay available until the maximum failure store retention period has passed, should you need to reference them.
::::

:::::
@@ -1147,8 +1161,7 @@ Since the failure store is enabled on this data stream, it would be wise to chec
:::

::::{step} Done
+Once any failures have been remediated, you may wish to purge the failures from the failure store to clear up space and to avoid warnings about failed data that has already been replayed. Otherwise, your failures will stay available until the maximum failure store retention period has passed, should you need to reference them.
::::

:::::
-
-Once any failures have been remediated, you may wish to purge the failures from the failure store to clear up space and to avoid warnings about failed data that has already been replayed. Otherwise, your failures will stay available until the maximum failure store retention should you need to reference them.

manage-data/data-store/data-streams/failure-store.md

Lines changed: 12 additions & 7 deletions
@@ -1,14 +1,18 @@
---
applies_to:
-  stack: ga 8.19.0
-  serverless: ga 9.1.0
+  stack: ga 8.19.0, ga 9.1.0
+  serverless: ga
---

# Failure store [failure-store]

A failure store is a secondary set of indices inside a data stream, dedicated to storing failed documents. A failed document is any document that, without the failure store enabled, would cause an ingest pipeline exception or that has a structure that conflicts with a data stream's mappings. In the absence of the failure store, a failed document would cause the indexing operation to fail, with an error message returned in the operation response.

-When a data stream's failure store is enabled, these failures are instead captured in a separate index and persisted to be analysed later. Clients receive a successful response with a flag indicating the failure was redirected. Failure stores do not capture failures caused by backpressure or document version conflicts. These failures are always returned as-is since they warrant specific action by the client.
+When a data stream's failure store is enabled, these failures are instead captured in a separate index and persisted to be analysed later. Clients receive a successful response with a flag indicating the failure was redirected.
+
+:::{important}
+Failure stores do not capture failures caused by backpressure or document version conflicts. These failures are always returned as-is since they warrant specific action by the client.
+:::

## Set up a data stream failure store [set-up-failure-store]

@@ -19,7 +23,7 @@ Each data stream has its own failure store that can be enabled to accept failed
You can specify in a data stream's [index template](../templates.md) if it should enable the failure store when it is first created.

:::{note}
-Unlike the `settings` and `mappings` fields on an [index template](../templates.md) which are repeatedly applied to new data stream write indices on rollover, the `data_stream_options` section of a template is applied to a data stream only once when the data stream is first created. To configure existing data streams, use the put [data stream options API](indices-put-data-stream-options).
+Unlike the `settings` and `mappings` fields on an [index template](../templates.md) which are repeatedly applied to new data stream write indices on rollover, the `data_stream_options` section of a template is applied to a data stream only once when the data stream is first created. To configure existing data streams, use the put [data stream options API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-indices-put-data-stream-options).
:::

To enable the failure store on a new data stream, enable it in the `data_stream_options` of the template:
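For example, a minimal sketch of such a template (the template name and index pattern are placeholders; only the `data_stream_options` block matters here):

```console
PUT _index_template/my-datastream-template
{
  "index_patterns": ["my-datastream-*"],
  "data_stream": {},
  "template": {
    "data_stream_options": {
      "failure_store": {
        "enabled": true <1>
      }
    }
  }
}
```

1. Data streams created from this template start with the failure store enabled.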
@@ -99,16 +103,17 @@ PUT _cluster/settings
  }
}
```
+1. Enabling the failure stores for `my-datastream-*` and `logs-*`
+
```console
PUT _data_stream/my-datastream-1/_options
{
  "failure_store": {
-    "enabled": false <2>
+    "enabled": false <1>
  }
}
```
-1. Enabling the failure stores for `my-datastream-*` and `logs-*`
-2. The failure store for `my-datastream-1` is disabled even though it matches `my-datastream-*`. The data stream options override the cluster setting.
+1. The failure store for `my-datastream-1` is disabled even though it matches `my-datastream-*`. The data stream options override the cluster setting.

## Using a failure store [use-failure-store]
