Commit 17707b3

Clean up ech/ece troubleshooting content (#925)
Follow up on issues raised in #880 (thank you again @florent-leborgne!) TODO: create issue to reconsolidate later
1 parent 1742df2 commit 17707b3

File tree

10 files changed

+205
-56
lines changed

redirects.yml

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -85,4 +85,6 @@ redirects:
8585
'reference/ingestion-tools/fleet/migrate-from-beats-to-elastic-agent.md': 'reference/fleet/migrate-from-beats-to-elastic-agent.md'
8686

8787
## troubleshoot
88-
'troubleshoot/deployments/cloud-enterprise/ask-for-help.md': 'troubleshoot/index.md'
88+
'troubleshoot/deployments/cloud-enterprise/ask-for-help.md': 'troubleshoot/index.md'
89+
'troubleshoot/deployments/serverless-status.md': 'troubleshoot/deployments/serverless.md'
90+
'troubleshoot/deployments/esf/elastic-serverless-forwarder.md': 'troubleshoot/ingest/elastic-serverless-forwarder.md'
Lines changed: 38 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,38 @@
1+
---
2+
navigation_title: "Deployment health warnings"
3+
applies_to:
4+
deployment:
5+
ece: all
6+
mapped_pages:
7+
- https://www.elastic.co/guide/en/cloud-enterprise/current/ece-deployment-no-op.html
8+
---
9+
10+
# Troubleshoot deployment health warnings [ece-deployment-no-op]
11+
12+
The {{ece}} **Deployments** page shows the current status of your active deployments. From time to time you may get one or more health warnings, such as the following:
13+
14+
:::{image} /troubleshoot/images/cloud-ec-ce-deployment-health-warning.png
15+
:alt: A screen capture of the deployment page showing a typical warning: Deployment health warning: Latest change to {{es}} configuration failed.
16+
:::
17+
18+
**Single warning**
19+
20+
To resolve a single health warning, we recommend first running a _no-op_ (no operation) plan. This performs a rolling update on the components in your Elastic Cloud Enterprise deployment without actually applying any configuration changes. This is often all that’s needed to resolve a health warning in the UI.
21+
22+
To run a no-op plan:
23+
24+
1. [Log into the Cloud UI](https://www.elastic.co/guide/en/cloud-enterprise/current/ece-login.html).
25+
2. Select a deployment.
26+
27+
Narrow the list by name or ID, or choose from several other filters. To further refine the list, use a combination of filters.
28+
29+
3. From your deployment menu, go to the **Edit** page.
30+
4. Select **Save**.
31+
32+
**Multiple warnings**
33+
34+
If multiple health warnings appear for one of your deployments, check [](/troubleshoot/deployments/cloud-enterprise/common-issues.md) or [contact us](/troubleshoot/index.md#contact-us).
35+
36+
## Additional resources
37+
* [Elastic Cloud Hosted deployment health warnings](/troubleshoot/monitoring/deployment-health-warnings.md)
38+
* [Troubleshooting overview](/troubleshoot/index.md)
Lines changed: 125 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,125 @@
1+
---
2+
navigation_title: "Node bootlooping"
3+
applies_to:
4+
deployment:
5+
ece: all
6+
mapped_pages:
7+
- https://www.elastic.co/guide/en/cloud-enterprise/current/ece-config-change-errors.html
8+
---
9+
10+
# Troubleshoot node bootlooping in {{ece}} [ece-config-change-errors]
11+
12+
When you attempt to apply a configuration change to a deployment, the attempt may fail with an error indicating that the change could not be applied, and deployment resources may be unable to restart. In some cases, bootlooping may result, where the deployment resources cycle through a continual reboot process.
13+
14+
:::{image} /troubleshoot/images/cloud-ec-ce-configuration-change-failure.png
15+
:alt: A screen capture of the deployment page showing an error: Latest change to {{es}} configuration failed.
16+
:::
17+
18+
To confirm if your Elasticsearch cluster is bootlooping, you can check the most recent plan under your [Deployment Activity page](/deploy-manage/deploy/elastic-cloud/keep-track-of-deployment-activity.md) for the error:
19+
20+
```sh
21+
Plan change failed: Some instances were unable to start properly.
22+
```
23+
24+
Here are some frequent causes of a failed configuration change:
25+
26+
* [Secure settings](#ece-config-change-errors-secure-settings)
27+
* [Expired custom plugins or bundles](#ece-config-change-errors-expired-bundle-extension)
28+
* [OOM errors](#ece-config-change-errors-oom-errors)
29+
* [Existing index](#ece-config-change-errors-existing-index)
30+
* [Insufficient storage](#ece-config-change-errors-insufficient-storage)
31+
32+
If you’re unable to remediate the failing plan’s root cause, you can attempt to reset the deployment to the latest successful {{es}} configuration by performing a [no-op plan](/troubleshoot/monitoring/deployment-health-warnings.md).
33+
34+
## Secure settings [ece-config-change-errors-secure-settings]
35+
36+
The most frequent cause of a failed deployment configuration change is invalid or mislocated [secure settings](/deploy-manage/security/secure-settings.md).
37+
The keystore allows you to safely store sensitive settings, such as passwords, as a key/value pair. You can then access a secret value from a settings file by referencing its key. Importantly, not all settings can be stored in the keystore, and the keystore does not validate the settings that you add. Adding unsupported settings can cause {{es}} or other components to fail to restart. To check whether a setting is supported in the keystore, look for a "Secure" qualifier in the [lists of reloadable settings](/deploy-manage/security/secure-settings.md).
38+
39+
The following sections detail some secure settings problems that can result in a configuration change error that can prevent a deployment from restarting. You might diagnose these plan failures via the logs or via their [related exit codes](/deploy-manage/maintenance/start-stop-services/start-stop-elasticsearch.md#fatal-errors) `1`, `3`, and `78`.
40+
41+
42+
### Invalid or outdated values [ece-config-change-errors-old-values]
43+
44+
The keystore does not validate any settings that you add, so invalid or outdated values are a common source of errors when you apply a configuration change to a deployment.
45+
46+
To check the current set of stored settings:
47+
48+
1. Open the deployment **Security** page.
49+
2. In the **{{es}} keystore** section, check the **Security keys** list. The list is shown only if you currently have settings configured in the keystore.
50+
51+
One frequent cause of errors is when settings in the keystore are no longer valid, such as when SAML settings are added for a test environment, but the settings are either not carried over or no longer valid in a production environment.
52+
53+
54+
### Snapshot repositories [ece-config-change-errors-snapshot-repos]
55+
56+
Sometimes, settings added to the keystore to connect to a snapshot repository may not be valid. When this happens, you may get an error such as `SettingsException[Neither a secret key nor a shared access token was set.]`
57+
58+
For example, when adding an [Azure repository storage setting](/deploy-manage/tools/snapshot-and-restore/azure-repository.md#repository-azure-usage) such as `azure.client.default.account` to the keystore, the associated setting `azure.client.default.key` must also be added for the configuration to be valid.
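
The pairing rule described above can be checked before saving a keystore change. The following is a minimal sketch, not an Elastic tool: the function name is hypothetical, and treating `azure.client.default.sas_token` as an acceptable alternative to `azure.client.default.key` is an assumption inferred from the error message, which mentions both a secret key and a shared access token.

```python
def missing_azure_credentials(keystore_keys, client="default"):
    """Return True if an Azure account is set without any credential.

    keystore_keys: iterable of setting names currently in the keystore.
    client: Azure client name (hypothetical parameter; "default" matches the example).
    """
    keys = set(keystore_keys)
    prefix = f"azure.client.{client}."
    has_account = prefix + "account" in keys
    has_key = prefix + "key" in keys
    # Assumption: a shared access token also satisfies the requirement.
    has_sas_token = prefix + "sas_token" in keys
    return has_account and not (has_key or has_sas_token)
```

A `True` result for the names listed under **Security keys** suggests the configuration is incomplete and the plan is likely to fail.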
59+
60+
61+
### Third-party authentication [ece-config-change-errors-third-party-auth]
62+
63+
When you configure third-party authentication, it’s important that all required configuration elements that are stored in the keystore are included in the {{es}} user settings file. For example, when you [create a SAML realm](/deploy-manage/users-roles/cluster-or-deployment-auth/saml.md#saml-create-realm), omitting a field such as `idp.entity_id` when that setting is present in the keystore results in a failed configuration change.
64+
65+
66+
### Wrong location [ece-config-change-errors-wrong-location]
67+
68+
In some cases, settings may accidentally be added to the keystore that should have been added to the [{{es}} user settings file](/deploy-manage/deploy/elastic-cloud/edit-stack-settings.md). It’s always a good idea to check the [lists of reloadable settings](/deploy-manage/security/secure-settings.md) to determine if a setting can be stored in the keystore. Settings that can safely be added to the keystore are flagged as `Secure`.
69+
70+
71+
## Expired custom plugins or bundles [ece-config-change-errors-expired-bundle-extension]
72+
73+
During the process of applying a configuration change, {{ecloud}} checks to determine if any [uploaded custom plugins or bundles](/deploy-manage/deploy/elastic-cloud/upload-custom-plugins-bundles.md) are expired.
74+
75+
Problematic plugins produce oscillating {{es}} start-up logs like the following:
76+
77+
```sh
78+
Booting at Sun Sep 4 03:06:43 UTC 2022
79+
Installing user plugins.
80+
Installing elasticsearch-analysis-izumo-master-7.10.2-20210618-28f8a97...
81+
/app/elasticsearch.sh: line 169: [: too many arguments
82+
Booting at Sun Sep 4 03:06:58 UTC 2022
83+
Installing user plugins.
84+
Installing elasticsearch-analysis-izumo-master-7.10.2-20210618-28f8a97...
85+
/app/elasticsearch.sh: line 169: [: too many arguments
86+
```
87+
88+
Problematic bundles produce similar oscillations, but their install log looks like the following:
89+
90+
```sh
91+
2024-11-17 15:18:02 https://found-user-plugins.s3.amazonaws.com/XXXXX/XXXXX.zip?response-content-disposition=attachment%3Bfilename%XXXXX%2F4007535947.zip&x-elastic-extension-version=1574194077471&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20241016T133214Z&X-Amz-SignedHeaders=host&X-Amz-Expires=86400&XAmz-Credential=XXXXX%2F20201016%2Fus-east-1%2Fs3%2Faws4_request&X-AmzSignature=XXXXX
92+
```
93+
94+
Note that in this example, the bundle’s signing date `X-Amz-Date=20241016T133214Z` plus its validity period `X-Amz-Expires=86400` (in seconds) falls before the log timestamp `2024-11-17 15:18:02`, so this bundle is considered expired.
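
The date comparison can be scripted when you have many log lines to check. This is a minimal sketch that assumes standard AWS Signature Version 4 presigned URL semantics, where `X-Amz-Date` is the signing time and `X-Amz-Expires` is the lifetime in seconds. The function names and `EXAMPLE_URL` are hypothetical, and log timestamps are assumed to be in UTC.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlsplit


def presigned_url_expiry(url):
    """Return the UTC datetime at which a presigned URL stops being valid."""
    query = parse_qs(urlsplit(url).query)
    signed_at = datetime.strptime(
        query["X-Amz-Date"][0], "%Y%m%dT%H%M%SZ"
    ).replace(tzinfo=timezone.utc)
    lifetime = timedelta(seconds=int(query["X-Amz-Expires"][0]))
    return signed_at + lifetime


def is_expired(url, log_timestamp):
    """Compare a log timestamp (assumed UTC, 'YYYY-MM-DD HH:MM:SS') to the expiry."""
    seen_at = datetime.strptime(log_timestamp, "%Y-%m-%d %H:%M:%S").replace(
        tzinfo=timezone.utc
    )
    return seen_at > presigned_url_expiry(url)


# Hypothetical URL carrying only the two parameters used above.
EXAMPLE_URL = (
    "https://example.invalid/bundle.zip"
    "?X-Amz-Date=20241016T133214Z&X-Amz-Expires=86400"
)
```

With the values from the log excerpt, `is_expired(EXAMPLE_URL, "2024-11-17 15:18:02")` evaluates to `True`.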
95+
96+
To view any added plugins or bundles:
97+
98+
1. Go to the **Features** page and open the **Extensions** tab.
99+
2. Select any extension and then choose **Update extension** to renew it. No other changes are needed, and any associated configuration change failures should now be able to succeed.
100+
101+
102+
## OOM errors [ece-config-change-errors-oom-errors]
103+
104+
Configuration change errors can occur when there is insufficient RAM configured for a data tier. In this case, the cluster typically also shows OOM (out of memory) errors. To resolve these, you need to increase the amount of heap memory, which is half of the amount of memory allocated to a cluster. You might also detect OOM in plan changes via their [related exit codes](/deploy-manage/maintenance/start-stop-services/start-stop-elasticsearch.md#fatal-errors) `127`, `137`, and `158`.
105+
106+
You can also read our detailed blog [Managing and troubleshooting {{es}} memory](https://www.elastic.co/blog/managing-and-troubleshooting-elasticsearch-memory).
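
The exit codes named on this page can be grouped into a small lookup for triaging failed plans. This is only an illustrative sketch: the grouping reflects the codes listed above for secure settings (`1`, `3`, `78`) and OOM (`127`, `137`, `158`), while the function and constant names are hypothetical. A match points to a likely cause, not a confirmed one.

```python
# Exit codes as listed on this page; names are illustrative only.
SECURE_SETTINGS_EXIT_CODES = {1, 3, 78}
OOM_EXIT_CODES = {127, 137, 158}


def likely_cause(exit_code):
    """Map an instance exit code to the troubleshooting section to read first."""
    if exit_code in SECURE_SETTINGS_EXIT_CODES:
        return "secure settings"
    if exit_code in OOM_EXIT_CODES:
        return "out of memory"
    return "unknown"
```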
107+
108+
109+
## Existing index [ece-config-change-errors-existing-index]
110+
111+
In rare cases, when you attempt to upgrade the version of a deployment and the upgrade fails on the first attempt, subsequent attempts to upgrade may fail due to already existing resources. The problem may be due to the system preventing itself from overwriting existing indices, resulting in an error such as this: `Another Kibana instance appears to be migrating the index. Waiting for that migration to complete. If no other Kibana instance is attempting migrations, you can get past this message by deleting index .kibana_2 and restarting Kibana`.
112+
113+
To resolve this:
114+
115+
1. Check that you don’t need the content.
116+
2. Run an {{es}} [Delete index request](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-indices-delete) to remove the existing index.
117+
118+
In this example, the `.kibana_2` index is the rollover of saved objects (such as Kibana visualizations or dashboards) from the original `.kibana_1` index. Since `.kibana_2` was created as part of the failed upgrade process, this index does not yet contain any pertinent data and it can safely be deleted.
119+
120+
3. Retry the deployment configuration change.
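
The delete request in step 2 can be sketched as follows. This example only builds the HTTP request and does not send it. The endpoint URL, the API key, and the function name are placeholders, and API key authentication is an assumption about how your deployment is secured.

```python
import urllib.request
from urllib.parse import quote


def build_delete_index_request(base_url, index, api_key):
    """Build (but do not send) a DELETE request for a single index."""
    url = f"{base_url.rstrip('/')}/{quote(index, safe='')}"
    return urllib.request.Request(
        url,
        method="DELETE",
        headers={"Authorization": f"ApiKey {api_key}"},
    )


# Placeholder endpoint and key; replace with your deployment's values.
request = build_delete_index_request(
    "https://localhost:9200/", ".kibana_2", "PLACEHOLDER_KEY"
)
```

After confirming that the index content is not needed, you could send the request with `urllib.request.urlopen(request)`.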
121+
122+
123+
## Insufficient storage [ece-config-change-errors-insufficient-storage]
124+
125+
Configuration change errors can occur when there is insufficient disk space for a data tier. To resolve this, you need to increase the size of that tier so that it provides enough storage to hold the data in that tier while staying below the [high watermark](elasticsearch://reference/elasticsearch/configuration-reference/cluster-level-shard-allocation-routing-settings.md#disk-based-shard-allocation). For a troubleshooting walkthrough, see [Fix watermark errors](/troubleshoot/elasticsearch/fix-watermark-errors.md).
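
As a rough sizing aid, the relationship between stored data and the high watermark can be worked out with integer arithmetic. This sketch assumes the default high watermark of 90%. The function names are hypothetical, and real sizing should also allow for growth and overhead.

```python
def exceeds_high_watermark(used_bytes, total_bytes, high_watermark_percent=90):
    """Return True if disk usage is above the high watermark."""
    return used_bytes * 100 > total_bytes * high_watermark_percent


def min_tier_size_bytes(data_bytes, high_watermark_percent=90):
    """Smallest disk size that keeps data_bytes at or below the watermark."""
    # Ceiling division without floating point.
    return -(-data_bytes * 100 // high_watermark_percent)
```

For example, 90 GB of data needs at least 100 GB of disk to stay at or below a 90% watermark.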

troubleshoot/deployments/serverless-status.md

Lines changed: 0 additions & 31 deletions
This file was deleted.
Lines changed: 25 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -1,15 +1,35 @@
11
---
2-
navigation_title: "Serverless"
2+
navigation_title: "Serverless status"
33
applies_to:
44
serverless: all
5+
mapped_pages:
6+
- https://www.elastic.co/guide/en/serverless/current/general-serverless-status.html
57
---
68

7-
# Troubleshoot {{serverless-full}}
9+
# Check Serverless status and get updates [general-serverless-status]
810

9-
Use the topics in this section to troubleshoot {{serverless-full}}:
11+
Serverless projects run on cloud platforms, which may undergo changes in availability. When availability changes, Elastic makes sure to provide you with a current service status.
12+
13+
To check current and past service availability, go to the Elastic [service status](https://status.elastic.co/?section=serverless) page.
14+
15+
16+
## Subscribe to updates [general-serverless-status-subscribe-to-updates]
17+
18+
You can be notified about changes to the service status automatically.
19+
20+
To receive service status updates:
21+
22+
1. Go to the Elastic [service status](https://status.elastic.co/?section=serverless) page.
23+
2. Select **SUBSCRIBE TO UPDATES**.
24+
3. You can be notified in the following ways:
25+
26+
* Email
27+
* Slack
28+
* Atom or RSS feeds
29+
30+
31+
After you subscribe, you’ll be notified whenever a service status update is posted.
1032

11-
* [](/troubleshoot/deployments/serverless-status.md)
12-
* [](/troubleshoot/deployments/esf/elastic-serverless-forwarder.md)
1333

1434
## Additional resources
1535
[Troubleshooting overview](/troubleshoot/index.md)

troubleshoot/ingest.md

Lines changed: 1 addition & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -7,14 +7,9 @@ applies_to:
77

88
# Troubleshoot ingestion tools
99

10-
:::{admonition} WIP
11-
⚠️ **This page is a work in progress.** ⚠️
12-
13-
The documentation team is working on restructuring this section.
14-
:::
15-
1610
Use the topics in this section to troubleshoot ingestion tools:
1711

1812
* [](/troubleshoot/ingest/logstash.md)
1913
* [](/troubleshoot/ingest/fleet/fleet-elastic-agent.md)
2014
* [](/troubleshoot/ingest/beats-loggingplugin/elastic-logging-plugin-for-docker.md)
15+
* [](/troubleshoot/ingest/elastic-serverless-forwarder.md)
File renamed without changes.

troubleshoot/monitoring/deployment-health-warnings.md

Lines changed: 7 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -3,10 +3,8 @@ navigation_title: "Deployment health warnings"
33
applies_to:
44
deployment:
55
ess: all
6-
ece: all
76
mapped_pages:
87
- https://www.elastic.co/guide/en/cloud/current/ec-deployment-no-op.html
9-
- https://www.elastic.co/guide/en/cloud-enterprise/current/ece-deployment-no-op.html
108
- https://www.elastic.co/guide/en/cloud-heroku/current/ech-deployment-no-op.html
119
---
1210

@@ -18,13 +16,13 @@ The {{ecloud}} [Deployments](https://cloud.elastic.co/deployments) page shows th
1816
:alt: A screen capture of the deployment page showing a typical warning: Deployment health warning: Latest change to {{es}} configuration failed.
1917
:::
2018

21-
**Seeing only one warning?**
19+
**Single warning**
2220

2321
To resolve a single health warning, we recommend first re-applying any pending changes: Select **Edit** in the deployment menu to open the Edit page and then click **Save** without making any changes. This will check all components for pending changes and will apply the changes as needed. This may impact the uptime of clusters that are not [highly available](/deploy-manage/production-guidance/availability-and-resilience/resilience-in-ech.md).
2422

2523
Re-saving the deployment configuration without making any changes is often all that’s needed to resolve a transient health warning on the UI. Saving will redirect you to the {{ech}} deployment [Activity page](/deploy-manage/deploy/elastic-cloud/keep-track-of-deployment-activity.md) where you can monitor plan completion. Repeat errors should be investigated; for more information refer to [resolving configuration change errors](/troubleshoot/monitoring/node-bootlooping.md).
2624

27-
**Seeing multiple warnings?**
25+
**Multiple warnings**
2826

2927
If multiple health warnings appear for one of your deployments, or if your deployment is unhealthy, we recommend [Getting help](/troubleshoot/index.md) through the Elastic Support Portal.
3028

@@ -34,4 +32,8 @@ If the warning refers to a system change, check the deployment’s [Activity](/d
3432

3533
:::{important}
3634
If you’re using Elastic Cloud Hosted, then you can use AutoOps to monitor your cluster. AutoOps significantly simplifies cluster management with performance recommendations, resource utilization visibility, and real-time issue detection with resolution paths. For more information, refer to [Monitor with AutoOps](/deploy-manage/monitor/autoops.md).
37-
:::
35+
:::
36+
37+
## Additional resources
38+
* [Elastic Cloud Enterprise deployment health warnings](/troubleshoot/deployments/cloud-enterprise/deployment-health-warnings.md)
39+
* [Troubleshooting overview](/troubleshoot/index.md)

troubleshoot/monitoring/node-bootlooping.md

Lines changed: 2 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -3,14 +3,12 @@ navigation_title: "Node bootlooping"
33
applies_to:
44
deployment:
55
ess: all
6-
ece: all
76
mapped_pages:
87
- https://www.elastic.co/guide/en/cloud/current/ec-config-change-errors.html
9-
- https://www.elastic.co/guide/en/cloud-enterprise/current/ece-config-change-errors.html
108
- https://www.elastic.co/guide/en/cloud-heroku/current/ech-config-change-errors.html
119
---
1210

13-
# Troubleshoot node bootlooping [ec-config-change-errors]
11+
# Troubleshoot node bootlooping in {{ech}} [ec-config-change-errors]
1412

1513
When you attempt to apply a configuration change to a deployment, the attempt may fail with an error indicating that the change could not be applied, and deployment resources may be unable to restart. In some cases, bootlooping may result, where the deployment resources cycle through a continual reboot process.
1614

@@ -131,7 +129,7 @@ Configuration change errors can occur when there is insufficient RAM configured
131129
132130
Check the [{{es}} cluster size](/deploy-manage/deploy/elastic-cloud/ec-customize-deployment-components.md#ec-cluster-size) and the [JVM memory pressure indicator](/deploy-manage/monitor/ec-memory-pressure.md) documentation to learn more.
133131
134-
As well, you can read our detailed blog [Managing and troubleshooting {{es}} memory](https://www.elastic.co/blog/managing-and-troubleshooting-elasticsearch-memory).
132+
You can also read our detailed blog [Managing and troubleshooting {{es}} memory](https://www.elastic.co/blog/managing-and-troubleshooting-elasticsearch-memory).
135133
136134
137135
## Existing index [ec-config-change-errors-existing-index]
