Merged
Commits
41 commits
4e92abc
move nodes doc updated
eedugon Jun 5, 2025
7aa65ab
title update
eedugon Jun 5, 2025
f75d235
alert name updated
eedugon Jun 5, 2025
92ea67a
Apply suggestions from code review
eedugon Jun 5, 2025
1f43345
applying other suggestions by reviewers
eedugon Jun 5, 2025
785a285
Update node-moves-outages.md
kunisen Jun 6, 2025
1ddf86c
Update troubleshoot/monitoring/node-moves-outages.md
kunisen Jun 7, 2025
2df3c2d
Update troubleshoot/monitoring/node-moves-outages.md
kunisen Jun 7, 2025
f4e9240
FAQ style removed and minor introductory paragraphs
eedugon Jun 10, 2025
bb9f4f0
titles updated per Kuni suggestion
eedugon Jun 10, 2025
ea69dea
Merge branch 'main' into ech_node_moves_troubleshoot
eedugon Jun 11, 2025
c6a4d23
Apply suggestions from code review
eedugon Jun 12, 2025
1e032a1
Update troubleshoot/monitoring/node-moves-outages.md
kunisen Jun 19, 2025
e2409e3
Update node-moves-outages.md
kunisen Jun 19, 2025
d2d0c5a
Update node-moves-outages.md
kunisen Jun 19, 2025
79654df
Update node-moves-outages.md
kunisen Jun 19, 2025
5bf5a05
Update node-moves-outages.md
kunisen Jun 19, 2025
1f80884
Update node-moves-outages.md
kunisen Jun 19, 2025
b51d31b
Merge branch 'main' into ech_node_moves_troubleshoot
kunisen Jun 19, 2025
6c9e28d
Update node-moves-outages.md
kunisen Jun 19, 2025
0287eb1
Merge branch 'ech_node_moves_troubleshoot' of https://github.com/elas…
kunisen Jun 19, 2025
3aa8c00
Update node-moves-outages.md
kunisen Jun 19, 2025
f7c5fb4
intro message included in code block
eedugon Jun 19, 2025
7e0f38e
admonition title updated
eedugon Jun 19, 2025
546574f
kibana link updated, hopefully fixed
eedugon Jun 19, 2025
7f3bc8d
Update troubleshoot/monitoring/node-moves-outages.md
kunisen Jun 21, 2025
e7edcd5
Update troubleshoot/monitoring/node-moves-outages.md
kunisen Jun 21, 2025
778363f
Update troubleshoot/monitoring/node-moves-outages.md
kunisen Jun 21, 2025
a92dd3e
Update troubleshoot/monitoring/node-moves-outages.md
kunisen Jun 21, 2025
00a48e1
Update troubleshoot/monitoring/node-moves-outages.md
kunisen Jun 21, 2025
a9fe014
Update troubleshoot/monitoring/node-moves-outages.md
kunisen Jun 21, 2025
3c2cd52
Update troubleshoot/monitoring/node-moves-outages.md
kunisen Jun 21, 2025
f330a4d
Update troubleshoot/monitoring/node-moves-outages.md
kunisen Jun 21, 2025
818b947
Update troubleshoot/monitoring/node-moves-outages.md
kunisen Jun 21, 2025
dd6fef8
Update troubleshoot/monitoring/node-moves-outages.md
kunisen Jun 21, 2025
8052c72
Update troubleshoot/monitoring/node-moves-outages.md
kunisen Jun 21, 2025
df3ed0f
Update troubleshoot/monitoring/node-moves-outages.md
kunisen Jun 21, 2025
aa7c843
Merge branch 'main' into ech_node_moves_troubleshoot
kunisen Jun 21, 2025
c4dc91a
Update troubleshoot/monitoring/node-moves-outages.md
florent-leborgne Jun 23, 2025
7f9a114
Apply suggestions from code review
shainaraskas Jun 23, 2025
22745aa
Merge branch 'main' into ech_node_moves_troubleshoot
shainaraskas Jun 23, 2025
92 changes: 77 additions & 15 deletions troubleshoot/monitoring/node-moves-outages.md
@@ -1,5 +1,5 @@
---
navigation_title: Node moves and hardware failures
mapped_pages:
- https://www.elastic.co/guide/en/cloud/current/ec-deployment-node-move.html
applies_to:
@@ -9,33 +9,95 @@ products:
- id: cloud-hosted
---

# Understanding node moves and system maintenance [ec-deployment-node-move]

To ensure that your deployment nodes are located on healthy hosts, we vacate nodes to perform essential system maintenance or to remove a host with hardware issues from service.

All major scheduled maintenance and incidents can be found on the Elastic [status page](https://status.elastic.co/). You can subscribe to that page to be notified about updates.

If events on your deployment don’t correlate to any items listed on the status page, the events are due to minor essential maintenance performed on only a subset of {{ech}} deployments.

This document explains the "`Move nodes off of allocator...`" message that appears on the [activity page](../../deploy-manage/deploy/elastic-cloud/keep-track-of-deployment-activity.md) in {{ech}} deployments, helping you understand its meaning, implications, and what to expect.

![Move nodes off allocator](images/move_nodes_ech_allocator.jpeg)

::::{note}
You can [configure email notifications](#email) to be alerted when this situation occurs.
::::

## Possible causes and impact [ec-node-host-outages]

Potential causes of system maintenance include, but are not limited to, the following situations:

* A host where the Cloud Service Provider (CSP), such as AWS, GCP, or Azure, has reported upcoming hardware deprecation or identified issues requiring remediation.
* A host that has been abruptly terminated by the CSP because of an underlying problem with their infrastructure.
* Mandatory host OS patch or upgrade activities, required for security or compliance reasons.
* Other scheduled maintenance that can be found on the [Elastic status page](https://status.elastic.co/).

Depending on the cause of the node movement, the behavior and expectations differ.

* During planned operations, such as hardware upgrades or host patches, the system attempts to gracefully move the node to another host before shutting down the original one. This process allows shard relocation to complete ahead of time, minimizing any potential disruption.

* In contrast, if a node’s host experiences an unexpected outage, the system automatically vacates the node and displays a related `Don't attempt to gracefully move shards` message on the [activity page](../../deploy-manage/deploy/elastic-cloud/keep-track-of-deployment-activity.md). Because the node and its data path are already unavailable, the system skips its check to ensure the node’s shards have been moved before shutting down the node.

## Frequently Asked Questions (FAQs) [faq]

### Will it cause an outage to my deployment?

Unless the process is overridden or recovery is not possible, the system automatically recovers the vacated node’s data from replicas or snapshots. If your cluster is configured for [high availability (HA)](/deploy-manage/deploy/elastic-cloud/elastic-cloud-hosted-planning.md#ec-ha), all search and indexing requests should continue to work at reduced capacity while the node is replaced.

Overall, having replicas and multiple availability zones helps minimize service interruption.
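
While a vacate is in progress, you can confirm that the cluster is still serving requests at reduced capacity. As a sketch, the standard cluster health API (run from {{kib}} Dev Tools or any {{es}} client) reports the overall shard state:

```console
GET _cluster/health
```

A `yellow` status typically means replica shards are being reallocated while the node is replaced and the cluster remains fully queryable; `red` means at least one primary shard is unassigned.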

### Under what circumstances could this result in data loss?

The system maintenance process always attempts to recover the vacated node's data from replicas or snapshots. However, if the deployment is not configured with HA, including replica shards, the maintenance process may not be able to recover the data from the vacated node.

To minimize this risk, ensure your deployment follows the [high availability best practices](/deploy-manage/deploy/elastic-cloud/elastic-cloud-hosted-planning.md#ec-ha), which recommend:
- Using at least two availability zones for production systems (three for mission-critical systems).
- Configuring one or more replicas for each index (except for searchable snapshot indexes).

As long as these recommendations are followed, system maintenance processes should not impact the availability of the data in the deployment.
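
As a concrete sketch of the replica recommendation, you can check and adjust the replica count of an index through the index settings API (`my-index` is a placeholder name):

```console
GET my-index/_settings/index.number_of_replicas

PUT my-index/_settings
{
  "index": {
    "number_of_replicas": 1
  }
}
```

With at least one replica and nodes spread across two or more availability zones, every shard has a copy that survives the loss of any single node.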

### Could such a system maintenance be avoided or skipped?

No. These are essential tasks that cannot be delayed or avoided.

### I configured multiple availability zones, but I still see data loss during system maintenance. Why?

Configuring multiple availability zones helps your deployment remain available for indexing and search requests if one zone becomes unavailable. However, this alone does not guarantee data availability. To ensure that your data remains accessible, indices must be configured with [replica shards](/deploy-manage/distributed-architecture/clusters-nodes-shards.md).

If an index has no replica shards and its primary shard is located on a node that must be vacated, data loss may occur if the system is unable to move the node gracefully during the maintenance activity.
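
To spot indices at risk, you can list replica counts with the cat indices API, sorted so that indices with zero replicas appear first:

```console
GET _cat/indices?v&h=index,pri,rep,health&s=rep:asc
```

Any index showing `rep` of `0` (other than searchable snapshot indices) has no replica shards and depends entirely on its primary.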

### What about service degradation or service outage during the system maintenance?

As mentioned in [](/deploy-manage/deploy/elastic-cloud/elastic-cloud-hosted-planning.md#ec-ha):

::::{admonition} Availability zones and performance
Increasing the number of zones should not be used to add more resources. The concept of zones is meant for High Availability (2 zones) and Fault Tolerance (3 zones), but neither will work if the cluster relies on the resources from those zones to be operational.

The recommendation is to **scale up the resources within a single zone until the cluster can take the full load (add some buffer to be prepared for a peak of requests)**, then scale out by adding additional zones depending on your requirements: 2 zones for High Availability, 3 zones for Fault Tolerance.
::::

At a minimum, you should size your deployment to tolerate the temporary loss of one node in order to avoid single points of failure and ensure proper HA. For critical systems, ensure that the deployment can continue operating even in the event of losing an entire availability zone.

### What if I still have questions after reviewing this document and its references?

Please reach out to [Elastic support](/troubleshoot/index.md#contact-us) for help.

### How can I be notified when a node is changed? [email]

To receive an email when nodes are added or removed from your deployment:

1. Enable [Stack monitoring](/deploy-manage/monitor/stack-monitoring/ece-ech-stack-monitoring.md#enable-logging-and-monitoring-steps) (logs and metrics) on your deployment. Only metrics collection is required for these notifications to work.

   Then, in the deployment used as the destination of Stack monitoring:

2. Create the [Stack monitoring default rules](/deploy-manage/monitor/monitoring-data/configure-stack-monitoring-alerts.md#_create_default_rules).
3. (Optional) Configure an email [connector](/deploy-manage/manage-connectors.md). If you prefer, use the preconfigured `Elastic-Cloud-SMTP` connector.
4. Edit the **Cluster alerting** → **{{es}} nodes changed** rule and select the email connector.
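
If you prefer to verify the rule from the command line, the Kibana alerting API can list rules matching a search term. This is a sketch under assumptions: replace `<kibana-endpoint>` and the credentials with your own values before running it.

```shell
# List alerting rules whose name matches "nodes changed".
# <kibana-endpoint> and the credentials are placeholders for your deployment.
curl -s -u "elastic:<password>" \
  "https://<kibana-endpoint>/api/alerting/rules/_find?search=nodes%20changed"
```

The response includes each matching rule’s `id`, `enabled` flag, and attached connector actions, so you can confirm the email connector is wired up.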

::::{note}
If your cluster has only one master node, no notification is sent while that master node is being vacated, because Kibana must communicate with the master node to send notifications. One way to avoid this is to ship your deployment metrics to a dedicated monitoring cluster when you enable logging and monitoring.
::::