Commit 1ebd0b2

committed: almost done, landing pages finished and ECE ha doc updated
1 parent 5741985 commit 1ebd0b2

6 files changed: +77 -97 lines changed

6 files changed

+77
-97
lines changed
Lines changed: 22 additions & 42 deletions
@@ -1,16 +1,21 @@
 ---
+navigation_title: High availability
 applies_to:
   deployment:
     ece: all
 mapped_pages:
   - https://www.elastic.co/guide/en/cloud-enterprise/current/ece-ha.html
 ---
 
-# High availability [ece-ha]
+# High availability in ECE
 
-Ensuring high availability in {{ece}} (ECE) requires careful planning and implementation across multiple areas, including availability zones, master nodes, replica shards, snapshot backups, and Zookeeper nodes.
+Ensuring high availability (HA) in {{ece}} (ECE) requires careful planning and implementation across multiple areas, including availability zones, master nodes, replica shards, snapshot backups, and Zookeeper nodes.
 
-This section describes key considerations and best practices to prevent downtime and data loss at both the ECE platform level and within orchestrated deployments.
+::::{note}
+This section focuses on ensuring high availability at the ECE platform level, including infrastructure-related considerations and best practices. For deployment-level HA and resilience strategies within ECE, refer to [Resilience in ECH and ECE deployments](/deploy-manage/production-guidance/availability-and-resilience/resilience-in-ech.md).
+
+To learn more about running {{es}} and {{kib}} in production environments, refer to the general [production guidance](/deploy-manage/production-guidance.md).
+::::
 
 ## Availability zones [ece-ece-ha-1-az]
 
@@ -20,55 +25,30 @@ An availability zone contains resources available to an ECE installation that ar
 
 Planning for a fault-tolerant installation with multiple availability zones means avoiding any single point of failure that could bring down ECE.
 
-The main difference between ECE installations that include two or three availability zones is that three availability zones enable ECE to create clusters with a *tiebreaker*. If you have only two availability zones in total in your installation, no tiebreaker is created.
-
-We recommend that for each deployment you use at least two availability zones for production and three for mission-critical systems. Using more than three availability zones for a deployment is not required nor supported. Availability zones are intended for high availability, not scalability.
-
-::::{warning}
-{{es}} clusters that are set up to use only one availability zone are not [highly available](/deploy-manage/production-guidance/availability-and-resilience.md) and are at risk of data loss. To safeguard against data loss, you must use at least two {{ece}} availability zones.
-::::
-
-::::{warning}
-Increasing the number of zones should not be used to add more resources. The concept of zones is meant for High Availability (2 zones) and Fault Tolerance (3 zones), but neither will work if the cluster relies on the resources from those zones to be operational. The recommendation is to scale up the resources within a single zone until the cluster can take the full load (add some buffer to be prepared for a peak of requests), then scale out by adding additional zones depending on your requirements: 2 zones for High Availability, 3 zones for Fault Tolerance.
-::::
-
+The main difference between ECE installations that include two or three availability zones is that three availability zones enable ECE to create {{es}} clusters with a [voting-only tiebreaker](/deploy-manage/distributed-architecture/clusters-nodes-shards/node-roles.md#voting-only-node) instance. If you have only two availability zones in your installation, no tiebreaker can be placed in a third zone, limiting the cluster’s ability to tolerate certain failures.
 
-## Master nodes [ece-ece-ha-2-master-nodes]
+## Minimum requirements and recommendations
 
-Tiebreakers are used in distributed clusters to avoid cases of [split brain](https://en.wikipedia.org/wiki/Split-brain_(computing)), where an {{es}} cluster splits into multiple, autonomous parts that continue to handle requests independently of each other, at the risk of affecting cluster consistency and data loss. A split-brain scenario is avoided by making sure that a minimum number of [master-eligible nodes](elasticsearch://reference/elasticsearch/configuration-reference/node-settings.md#master-node) must be present in order for any part of the cluster to elect a master node and accept user requests. To prevent multiple parts of a cluster from being eligible, there must be a [quorum-based majority](/deploy-manage/distributed-architecture/discovery-cluster-formation/modules-discovery-quorums.md) of `(n/2)+1` nodes, where `n` is the number of master-eligible nodes in the cluster. The minimum number of master nodes to reach quorum in a two-node cluster is the same as for a three-node cluster: two nodes must be available.
+To maintain high availability, you should deploy at least two ECE hosts for each role—**allocator, constructor, and proxy**—and at least three hosts for the **director** role, which runs ZooKeeper and requires quorum to operate reliably.
 
-When you create a cluster with nodes in two availability zones when a third zone is available, ECE can create a tiebreaker in the third availability zone to help establish quorum in case of loss of an availability zone. The extra tiebreaker node that helps to provide quorum does not have to be a full-fledged and expensive node, as it does not hold data. For example: by tagging allocator hosts in ECE, you can create a cluster with eight nodes each in zones `ece-1a` and `ece-1b`, for a total of 16 nodes, and one tiebreaker node in zone `ece-1c`. This cluster can lose any of the three availability zones whilst maintaining quorum, which means that the cluster can continue to process user requests, provided that there is sufficient capacity available when an availability zone goes down.
+In addition, to improve resiliency at the availability zone level, it’s recommended to deploy ECE across three availability zones, with at least two allocators per zone and spare capacity to accommodate instance failover and workload redistribution in case of failures.
 
-By default, each node in an {{es}} cluster is a master-eligible node and a data node. In larger clusters, such as production clusters, it’s a good practice to split the roles, so that master nodes are not handling search or indexing work. When you create a cluster, you can specify to use dedicated [master-eligible nodes](elasticsearch://reference/elasticsearch/configuration-reference/node-settings.md#master-node), one per availability zone.
+All Elastic-documented architectures recommend using three availability zones with ECE roles distributed across all zones. Refer to [deployment scenarios](./identify-deployment-scenario.md) for examples of small, medium, and large installations.
 
-::::{warning}
-Clusters that have only two or fewer master-eligible nodes are not [highly available](/deploy-manage/production-guidance/availability-and-resilience.md) and are at risk of data loss. You must have [at least three master-eligible nodes](/deploy-manage/distributed-architecture/discovery-cluster-formation/modules-discovery-quorums.md).
+::::{important}
+Regardless of the resiliency level at the platform level, it’s important to also [configure your deployments for high availability](/deploy-manage/production-guidance/availability-and-resilience/resilience-in-ech.md).
 ::::
 
-## Replica shards [ece-ece-ha-3-replica-shards]
-
-With multiple {{es}} nodes in multiple availability zones you have the recommended hardware; the next thing to consider is having the recommended index replication. Each index, with the exception of searchable snapshot indices, should have one or more replicas. Use the index settings API to find any indices with no replica:
-
-```sh
-GET _all/_settings/index.number_of_replicas
-```
-
-::::{warning}
-Indices with no replica, except for [searchable snapshot indices](/deploy-manage/tools/snapshot-and-restore/searchable-snapshots.md), are not highly available. You should use replicas to mitigate against possible data loss.
-::::
-
-Refer to [](../../reference-architectures.md) for information about {{es}} architectures.
-
-## Snapshot backups [ece-ece-ha-4-snapshot]
+## Zookeeper nodes
 
-You should configure and use [{{es}} snapshots](/deploy-manage/tools/snapshot-and-restore.md). Snapshots provide a way to back up and restore your {{es}} indices. They can be used to copy indices for testing, to recover from failures or accidental deletions, or to migrate data to other deployments. We recommend configuring an [{{ece}}-level repository](../../tools/snapshot-and-restore/cloud-enterprise.md) to apply across all deployments. See [Work with snapshots](../../tools/snapshot-and-restore.md) for more guidance.
+Make sure you have three Zookeepers - by default, on the Director host - for your ECE installation. Similar to how three {{es}} master nodes can form a quorum, three Zookeepers can form the quorum needed for high availability. Backing up the Zookeeper data directory is also recommended; read [rebuilding a broken Zookeeper quorum](../../../troubleshoot/deployments/cloud-enterprise/rebuilding-broken-zookeeper-quorum.md) for more guidance.
 
-## Further considerations [ece-ece-ha-5-other]
+## External resources accessibility
 
-* Make sure you have three Zookeepers - by default, on the Director host - for your ECE installation. Similar to three Elasticsearch master nodes can form a quorum, three Zookeepers can form the quorum for high availability purposes. Backing up the Zookeeper data directory is also recommended, read [this doc](../../../troubleshoot/deployments/cloud-enterprise/rebuilding-broken-zookeeper-quorum.md) for more guidance.
+If you’re using a [private Docker registry server](ece-install-offline-with-registry.md) or hosting any [custom bundles and plugins](../../../solutions/search/full-text/search-with-synonyms.md) on a web server, make sure these resources are accessible from all ECE allocators, so they can continue to be accessed in the event of a network partition or zone outage.
 
-* Make sure that if you’re using a [private Docker registry server](ece-install-offline-with-registry.md) or are using any [custom bundles and plugins](../../../solutions/search/full-text/search-with-synonyms.md) hosted on a web server, these are available to all ECE allocators, so that they can continue to be accessed in the event of a network partition or zone outage.
+## Other recommendations
 
-* Don’t delete containers unless guided by Elastic Support or there’s public documentation explicitly describing this as a required action. Otherwise, it can cause issues and you may lose access or functionality of your {{ece}} platform. See [Troubleshooting container engines](../../../troubleshoot/deployments/cloud-enterprise/troubleshooting-container-engines.md) for more information.
+Avoid deleting containers unless explicitly instructed by Elastic Support or official documentation. Doing so may lead to unexpected issues or loss of access to your {{ece}} platform. For more details, refer to [Troubleshooting container engines](../../../troubleshoot/deployments/cloud-enterprise/troubleshooting-container-engines.md).
 
-If in doubt, please [contact support for help](../../../troubleshoot/deployments/cloud-enterprise/ask-for-help.md).
+If you're unsure, don't hesitate to [contact Elastic Support](../../../troubleshoot/deployments/cloud-enterprise/ask-for-help.md).
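The tiebreaker and ZooKeeper guidance in this diff both rest on the same majority rule: a group of `n` voting members (master-eligible nodes or ZooKeeper servers) stays available only while `floor(n/2) + 1` of them can talk to each other. A minimal shell sketch of that arithmetic (illustrative only, not part of the commit):

```shell
# Majority quorum for n voting members is floor(n/2) + 1.
# A 3-member group keeps quorum after losing 1 member; a 2-member
# group cannot lose any - which is why a third zone for a tiebreaker
# (or a third ZooKeeper/director host) matters.
for n in 2 3 5; do
  quorum=$(( n / 2 + 1 ))
  tolerated=$(( n - quorum ))
  echo "members=$n quorum=$quorum tolerated_failures=$tolerated"
done
```

With 3 members the loop reports a quorum of 2 and one tolerated failure, matching the recommendation to run three directors/ZooKeepers rather than two.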

deploy-manage/production-guidance.md

Lines changed: 9 additions & 9 deletions
@@ -10,13 +10,6 @@ applies_to:
 ---
 
 % scope: the scope of this page is just a brief introduction to prod guidance at elastic stack level, links to ES and KIB,
-
-% pending (for shaina's review):
-% to link to other product's prod guidance when we find them
-% pending to link reference architectures / security / authentication (maybe better from the ES / Kib docs)
-
-% other product to link ECE-HA (orchestrator level): /deploy-manage/deploy/cloud-enterprise/ece-ha.md
-
 # Production guidance
 
 Running the {{stack}} in production requires careful planning to ensure resilience, performance, and scalability. This section outlines best practices and recommendations for optimizing {{es}} and {{kib}} in production environments.
@@ -41,6 +34,13 @@ However, certain parts may be relevant only to self-managed clusters, as orchest
 **{{serverless-full}}** projects are fully managed and automatically scaled by Elastic. Your project’s performance and general data retention are controlled by the [Search AI Lake settings](/deploy-manage/deploy/elastic-cloud/project-settings.md#elasticsearch-manage-project-search-ai-lake-settings).
 ::::
 
-## Other products guidance
+## Production guidance for other Elastic products
+(TBD / Work in progress)
+While this section focuses on {{es}} and {{kib}}, the following topics offer production considerations for other Elastic products and components:
+
+* [High availability on ECE orchestrator](/deploy-manage/deploy/cloud-enterprise/ece-ha.md)
+
+* [Fleet Server scalability](https://www.elastic.co/guide/en/fleet/current/fleet-server-scalability.html)
+* [Deploying and scaling Logstash](https://www.elastic.co/guide/en/logstash/current/deploying-and-scaling.html)
 
-* []()
+* [APM scalability and performance](https://www.elastic.co/guide/en/observability/current/apm-processing-and-performance.html)

deploy-manage/production-guidance/availability-and-resilience.md

Lines changed: 2 additions & 1 deletion
@@ -10,7 +10,6 @@ applies_to:
   self: all
 ---
 
-% In the future we could consider merging the ECE/ECH resiliency doc with the content of the 2 original ES docs about resilience (small and large clusters), as some of the topics and concepts overlap
 # Design for resilience [high-availability-cluster-design]
 
 Distributed systems like {{es}} are designed to keep working even if some of their components have failed. As long as there are enough well-connected nodes to take over their responsibilities, an {{es}} cluster can continue operating normally if some of its nodes are unavailable or disconnected.
@@ -29,6 +28,8 @@ In the context of {{es}} deployments, an `availability zone`, or simply `zone`,
 For example, in {{ech}}, availability zones correspond to the cloud provider’s availability zones. Each of these is typically a physically separate data center, ensuring redundancy and fault tolerance at the infrastructure level.
 ::::
 
+Learn more about [nodes and shards](/deploy-manage/distributed-architecture/clusters-nodes-shards.md) and [reference architectures](/deploy-manage/reference-architectures.md).
+
 ## Cluster sizes
 
 There is a limit to how small a resilient cluster can be. All {{es}} clusters require the following components to function:
