mapped_pages:
  - https://www.elastic.co/guide/en/elasticsearch/reference/current/repeated-snapshot-failures.html
applies_to:
  stack:
products:
  - id: elasticsearch
---

Repeated snapshot failures are usually an indicator of a problem with your deployment. Continuous failures of automated snapshots can leave a deployment without recovery options in cases of data loss or outages.

:::{include} /deploy-manage/_snippets/autoops-callout-with-ech.md
:::

{{es}} keeps track of the number of repeated failures when executing automated snapshots with [{{slm}} ({{slm-init}})](/deploy-manage/tools/snapshot-and-restore/create-snapshots.md#automate-snapshots-slm) policies. If an automated snapshot fails too many times without a successful execution, the health API reports a warning. The number of repeated failures before reporting a warning is controlled by the [`slm.health.failed_snapshot_warn_threshold`](elasticsearch://reference/elasticsearch/configuration-reference/snapshot-restore-settings.md#slm-health-failed-snapshot-warn-threshold) setting.
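For example, the following sketch shows one way to check whether the {{slm-init}} health indicator is reporting a warning and, if the default threshold does not fit your environment, to adjust it dynamically. The threshold value of `10` is only an illustration, not a recommendation:

```console
GET _health_report/slm

PUT _cluster/settings
{
  "persistent": {
    "slm.health.failed_snapshot_warn_threshold": 10
  }
}
```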

## Review snapshot policy failures

If an automated {{slm-init}} policy execution is experiencing repeated failures, follow these steps to get more information about the problem:

:::::::{tab-set}

::::::{tab-item} Using {{kib}}
In {{kib}}, you can view all configured {{slm-init}} policies and review their status and execution history. If the UI does not provide sufficient details about the failure, use the Console to retrieve the [snapshot policy information](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-slm-get-lifecycle) with the {{es}} API.

1. Go to **Snapshot and Restore > Policies** to see the list of configured policies. You can find the **Snapshot and Restore** management page using the navigation menu or the [global search field](/explore-analyze/find-and-organize/find-apps-and-objects.md).

   :::{image} /troubleshoot/images/elasticsearch-reference-slm-policies.png
   :alt: Snapshot lifecycle policies in {{kib}}
   :screenshot:
   :::

2. The policies table lists all configured policies. Click on any of the policies to review the details and execution history.

3. To get more detailed information about the failure, open {{kib}} **Dev Tools > Console**. You can find the **Console** using the navigation menu or the [global search field](/explore-analyze/find-and-organize/find-apps-and-objects.md).

Once the Console is open, execute the steps described in the **Using the {{es}} API** tab to retrieve the affected {{slm-init}} policy information.
::::::

::::::{tab-item} Using the {{es}} API
The following step can be run using either [{{kib}} console](/explore-analyze/query-filter/tools/console.md) or direct [{{es}} API](elasticsearch://reference/elasticsearch/rest-apis/index.md) calls.

[Retrieve the affected {{slm-init}} policy](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-slm-get-lifecycle):

```console
GET _slm/policy/<affected-policy-name>
```

The response looks like this:

```console-result
{
  "affected-policy-name": { <1>
    "version": 1,
    "modified_date": "2099-05-06T01:30:00.000Z",
    "modified_date_millis": 4081757400000,
    "policy" : {
      "schedule": "0 30 1 * * ?",
      "name": "<daily-snap-{now/d}>",
      "repository": "my_repository",
      "config": {
        "indices": ["data-*", "important"],
        "ignore_unavailable": false,
        "include_global_state": false
      },
      "retention": {
        "expire_after": "30d",
        "min_count": 5,
        "max_count": 50
      }
    },
    "last_success" : {
      "snapshot_name" : "daily-snap-2099.05.30-tme_ivjqswgkpryvnao2lg",
      "start_time" : 4083782400000,
      "time" : 4083782400000
    },
    "last_failure" : { <2>
      "snapshot_name" : "daily-snap-2099.06.16-ywe-kgh5rfqfrpnchvsujq",
      "time" : 4085251200000, <3>
      "details" : """{"type":"snapshot_exception","reason":"[daily-snap-2099.06.16-ywe-kgh5rfqfrpnchvsujq] failed to create snapshot successfully, 5 out of 149 total shards failed"}""" <4>
    },
    "stats": {
      "policy": "daily-snapshots",
      "snapshots_taken": 0,
      "snapshots_failed": 0,
      "snapshots_deleted": 0,
      "snapshot_deletion_failures": 0
    },
    "next_execution": "2099-06-17T01:30:00.000Z",
    "next_execution_millis": 4085343000000
  }
}
```

1. The affected snapshot lifecycle policy.
2. The information about the last failure for the policy.
3. The time when the failure occurred in millis. Use the `human=true` request parameter to see a formatted timestamp.
4. Error details containing the reason for the snapshot failure.
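
To see formatted timestamps instead of epoch milliseconds in the response, append the `human=true` request parameter to the same request:

```console
GET _slm/policy/<affected-policy-name>?human=true
```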

::::::

:::::::

## Possible causes

Snapshots can fail for a variety of reasons. If the failures are due to configuration errors, consult the documentation for the repository type that the snapshot policy is using. Refer to the [guide on managing repositories in ECE](/deploy-manage/tools/snapshot-and-restore/cloud-enterprise.md) if you are using an Elastic Cloud Enterprise deployment.
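
If you suspect a repository configuration problem, one quick check is the verify repository API, which confirms that all nodes can write to the repository. The following is a sketch that uses `my_repository`, the repository name from the example policy above; substitute the repository used by your failing policy:

```console
POST _snapshot/my_repository/_verify
```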

One common failure scenario is repository corruption. This occurs most often when multiple instances of {{es}} write to the same repository location. There is a [separate troubleshooting guide](diagnosing-corrupted-repositories.md) to fix this problem.

If snapshots are failing for other reasons, check the logs on the elected master node during the snapshot execution period for more information.
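
As a sketch of that workflow, you can identify the elected master node whose logs you need to inspect and then, optionally, trigger the policy on demand with the execute lifecycle API to reproduce the failure while you watch those logs. The policy name is a placeholder:

```console
GET _cat/master?v=true

POST _slm/policy/<affected-policy-name>/_execute
```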