
Commit 4c5b154

Added additional entries for troubleshooting unhealthy cluster
Reordered "Re-enable shard allocation" because it is not as common as the other causes. Added additional causes of yellow status. Changed the watermark command to include both the high and low watermarks so users can make their cluster operate once again.
1 parent d4b391d commit 4c5b154

File tree

1 file changed: +42 −22 lines changed

docs/reference/troubleshooting/common-issues/red-yellow-cluster-status.asciidoc

Lines changed: 42 additions & 22 deletions
@@ -74,35 +74,31 @@ A shard can become unassigned for several reasons. The following tips outline the
most common causes and their solutions.

[discrete]
-[[fix-cluster-status-reenable-allocation]]
-===== Re-enable shard allocation
+[[fix-cluster-status-only-one-node]]
+===== Single node cluster

-You typically disable allocation during a <<restart-cluster,restart>> or other
-cluster maintenance. If you forgot to re-enable allocation afterward, {es} will
-be unable to assign shards. To re-enable allocation, reset the
-`cluster.routing.allocation.enable` cluster setting.
+{es} will never assign a replica to the same node as the primary shard. If you only have one node, it is expected for your cluster to report yellow status. If you prefer it to be green, change the <<dynamic-index-number-of-replicas,`index.number_of_replicas`>> setting on each index to `0`.

-[source,console]
-----
-PUT _cluster/settings
-{
-  "persistent" : {
-    "cluster.routing.allocation.enable" : null
-  }
-}
-----
-
-See https://www.youtube.com/watch?v=MiKKUdZvwnI[this video] for a walkthrough of troubleshooting "no allocations are allowed".
+Similarly, if the number of replicas equals or exceeds the number of nodes, it will not be possible to allocate one or more of the shards for the same reason.
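
(A minimal sketch of the replica change described above, assuming a hypothetical index named `my-index`; an index pattern or `_all` in its place would cover every index.)

[source,console]
----
PUT my-index/_settings
{
  "index" : {
    "number_of_replicas" : 0
  }
}
----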

[discrete]
[[fix-cluster-status-recover-nodes]]
===== Recover lost nodes

Shards often become unassigned when a data node leaves the cluster. This can
-occur for several reasons, ranging from connectivity issues to hardware failure.
+occur for several reasons:
+
+* If you manually restart a node, the cluster will be temporarily unhealthy until the node has recovered.
+
+* If a node is overloaded or has stopped operating for any reason, the cluster will be temporarily unhealthy. Nodes may disconnect because of prolonged garbage collection (GC) pauses, which can result from "out of memory" errors or high memory usage due to intensive search operations. See <<fix-cluster-status-jvm,Reduce JVM memory pressure>> for more JVM-related issues.
+
+* If nodes cannot communicate reliably due to networking issues, they may lose contact with one another. This can cause shards to become out of sync. You can often identify this issue by checking the logs for repeated messages about nodes leaving and rejoining the cluster.

After you resolve the issue and recover the node, it will rejoin the cluster.
{es} will then automatically allocate any unassigned shards.

+You can monitor this process by <<cluster-health,checking your cluster health>>. The number of unallocated shards progressively decreases until green status is reached.
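
(A minimal sketch of the health check mentioned above; the `filter_path` parameter is optional and simply narrows the response to the relevant fields.)

[source,console]
----
GET _cluster/health?filter_path=status,unassigned_shards
----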

To avoid wasting resources on temporary issues, {es} <<delayed-allocation,delays
allocation>> by one minute by default. If you've recovered a node and don't want
to wait for the delay period, you can call the <<cluster-reroute,cluster reroute
@@ -151,7 +147,7 @@ replica, it remains unassigned. To fix this, you can:

* Change the `index.number_of_replicas` index setting to reduce the number of
replicas for each primary shard. We recommend keeping at least one replica per
-primary.
+primary for high availability.

[source,console]
----
@@ -162,7 +158,6 @@ PUT _settings
----
// TEST[s/^/PUT my-index\n/]

-
[discrete]
[[fix-cluster-status-disk-space]]
===== Free up or increase disk space
@@ -183,6 +178,8 @@ If your nodes are running low on disk space, you have a few options:

* Upgrade your nodes to increase disk space.

+* Add more nodes to the cluster.
+
* Delete unneeded indices to free up space. If you use {ilm-init}, you can
update your lifecycle policy to use <<ilm-searchable-snapshot,searchable
snapshots>> or add a delete phase. If you no longer need to search the data, you
@@ -215,11 +212,34 @@ watermark or set it to an explicit byte value.
PUT _cluster/settings
{
  "persistent": {
-    "cluster.routing.allocation.disk.watermark.low": "30gb"
+    "cluster.routing.allocation.disk.watermark.low": "90%",
+    "cluster.routing.allocation.disk.watermark.high": "95%"
  }
}
----
// TEST[s/"30gb"/null/]
+**Note that this is usually a temporary solution and may cause instability if disk space is not freed up.**
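
(A minimal sketch of undoing this override once disk space has been freed: setting the watermarks to `null` restores their defaults, the same reset-to-`null` pattern this document uses for `cluster.routing.allocation.enable`.)

[source,console]
----
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": null,
    "cluster.routing.allocation.disk.watermark.high": null
  }
}
----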
+
+[discrete]
+[[fix-cluster-status-reenable-allocation]]
+===== Re-enable shard allocation
+
+You typically disable allocation during a <<restart-cluster,restart>> or other
+cluster maintenance. If you forgot to re-enable allocation afterward, {es} will
+be unable to assign shards. To re-enable allocation, reset the
+`cluster.routing.allocation.enable` cluster setting.
+
+[source,console]
+----
+PUT _cluster/settings
+{
+  "persistent" : {
+    "cluster.routing.allocation.enable" : null
+  }
+}
+----
+
+See https://www.youtube.com/watch?v=MiKKUdZvwnI[this video] for a walkthrough of troubleshooting "no allocations are allowed".

[discrete]
[[fix-cluster-status-jvm]]
@@ -267,4 +287,4 @@ POST _cluster/reroute?metric=none
// TEST[s/^/PUT my-index\n/]
// TEST[catch:bad_request]

-See https://www.youtube.com/watch?v=6OAg9IyXFO4[this video] for a walkthrough of troubleshooting `no_valid_shard_copy`.
+See https://www.youtube.com/watch?v=6OAg9IyXFO4[this video] for a walkthrough of troubleshooting `no_valid_shard_copy`.
