Commit 0426e1f

(API) Cluster Health report unassigned_primary_shards (#111727) (#112024)
This PR adds a count of currently unassigned primary shards to both the `/_cat/health` and `/_cluster/health` endpoints. This helps cluster administrators estimate the time remaining for a cluster to go from RED to YELLOW status, as per enhancement request #111727. Tests and doc updates are included in this PR, and the endpoints have been manually tested with `./gradlew run` to ensure correct output.

## Known Limitations

* Testing
  * Due to limitations in the YAML REST test framework's skip functionality, YAML REST tests for this endpoint are disabled when running a mixed-version cluster. A cluster version number synthetic feature is used to skip them when any member of the cluster is not at a version greater than the one in which this change is due to be introduced.
1 parent 1acba13 commit 0426e1f
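The commit message says the new count is meant to help estimate how long a cluster will take to go from RED to YELLOW. A minimal sketch of one way to do that is shown below, assuming a simple linear recovery rate between the first and last sample. The function name `estimate_seconds_to_yellow` and the sampling approach are illustrative assumptions, not part of this change; the sample values are taken from the `_cat/health` example in the documentation diff.

```python
from typing import Optional, Sequence, Tuple

# One observation: (epoch seconds, unassigned_primary_shards).
Sample = Tuple[float, int]


def estimate_seconds_to_yellow(samples: Sequence[Sample]) -> Optional[float]:
    """Linearly extrapolate when unassigned primaries reach zero.

    Hypothetical helper: returns None when there is not enough data
    or when the count is not decreasing, since no estimate is possible.
    """
    if len(samples) < 2:
        return None
    (t0, n0), (t1, n1) = samples[0], samples[-1]
    if n1 == 0:
        return 0.0  # no unassigned primaries left, so no longer RED
    elapsed = t1 - t0
    recovered = n0 - n1
    if elapsed <= 0 or recovered <= 0:
        return None  # stalled or regressing
    rate = recovered / elapsed  # primaries assigned per second
    return n1 / rate


# Values read from the `unassign.pri` column in the docs example.
samples = [(1384309446, 1121), (1384309566, 421), (1384309686, 301)]
eta = estimate_seconds_to_yellow(samples)
```

Real recoveries are rarely linear, so an estimate like this is only a rough guide.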

18 files changed, with 365 additions and 111 deletions

docs/changelog/112024.yaml

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+pr: 112024
+summary: (API) Cluster Health report `unassigned_primary_shards`
+area: Health
+type: enhancement
+issues: []

docs/reference/cat/health.asciidoc

Lines changed: 16 additions & 14 deletions
@@ -6,8 +6,8 @@
 
 [IMPORTANT]
 ====
-cat APIs are only intended for human consumption using the command line or {kib}
-console. They are _not_ intended for use by applications. For application
+cat APIs are only intended for human consumption using the command line or {kib}
+console. They are _not_ intended for use by applications. For application
 consumption, use the <<cluster-health,cluster health API>>.
 ====
 
@@ -87,8 +87,8 @@ The API returns the following response:
 
 [source,txt]
 --------------------------------------------------
-epoch timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
-1475871424 16:17:04 elasticsearch green 1 1 1 1 0 0 0 0 - 100.0%
+epoch timestamp cluster status node.total node.data shards pri relo init unassign unassign.pri pending_tasks max_task_wait_time active_shards_percent
+1475871424 16:17:04 elasticsearch green 1 1 1 1 0 0 0 0 0 - 100.0%
 --------------------------------------------------
 // TESTRESPONSE[s/1475871424 16:17:04/\\d+ \\d+:\\d+:\\d+/]
 // TESTRESPONSE[s/elasticsearch/[^ ]+/ s/0 -/\\d+ (-|\\d+(\\.\\d+)?[ms]+)/ non_json]
@@ -107,11 +107,13 @@ The API returns the following response:
 
 [source,txt]
 --------------------------------------------------
-cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
-elasticsearch green 1 1 1 1 0 0 0 0 - 100.0%
+cluster status node.total node.data shards pri relo init unassign unassign.pri pending_tasks max_task_wait_time active_shards_percent
+elasticsearch green 1 1 1 1 0 0 0 0 0 - 100.0%
 --------------------------------------------------
 // TESTRESPONSE[s/elasticsearch/[^ ]+/ s/0 -/\\d+ (-|\\d+(\\.\\d+)?[ms]+)/ non_json]
 
+**Note**: The reported number of unassigned primary shards may be lower than the true value if your cluster contains nodes running a version below 8.16. For a more accurate count in this scenario, please use the <<cluster-health,cluster health API>>.
+
 [[cat-health-api-example-across-nodes]]
 ===== Example across nodes
 You can use the cat health API to verify the health of a cluster across nodes.
@@ -121,11 +123,11 @@ For example:
 --------------------------------------------------
 % pssh -i -h list.of.cluster.hosts curl -s localhost:9200/_cat/health
 [1] 20:20:52 [SUCCESS] es3.vm
-1384309218 18:20:18 foo green 3 3 3 3 0 0 0 0
+1384309218 18:20:18 foo green 3 3 3 3 0 0 0 0 0
 [2] 20:20:52 [SUCCESS] es1.vm
-1384309218 18:20:18 foo green 3 3 3 3 0 0 0 0
+1384309218 18:20:18 foo green 3 3 3 3 0 0 0 0 0
 [3] 20:20:52 [SUCCESS] es2.vm
-1384309218 18:20:18 foo green 3 3 3 3 0 0 0 0
+1384309218 18:20:18 foo green 3 3 3 3 0 0 0 0 0
 --------------------------------------------------
 // NOTCONSOLE
 
@@ -138,15 +140,15 @@ in a delayed loop. For example:
 [source,sh]
 --------------------------------------------------
 % while true; do curl localhost:9200/_cat/health; sleep 120; done
-1384309446 18:24:06 foo red 3 3 20 20 0 0 1812 0
-1384309566 18:26:06 foo yellow 3 3 950 916 0 12 870 0
-1384309686 18:28:06 foo yellow 3 3 1328 916 0 12 492 0
-1384309806 18:30:06 foo green 3 3 1832 916 4 0 0
+1384309446 18:24:06 foo red 3 3 20 20 0 0 1812 1121 0
+1384309566 18:26:06 foo yellow 3 3 950 916 0 12 870 421 0
+1384309686 18:28:06 foo yellow 3 3 1328 916 0 12 492 301 0
+1384309806 18:30:06 foo green 3 3 1832 916 4 0 0 0
 ^C
 --------------------------------------------------
 // NOTCONSOLE
 
 In this example, the recovery took roughly six minutes, from `18:24:06` to
 `18:30:06`. If this recovery took hours, you could continue to monitor the
 number of `UNASSIGNED` shards, which should drop. If the number of `UNASSIGNED`
-shards remains static, it would indicate an issue with the cluster recovery.
+shards remains static, it would indicate an issue with the cluster recovery.
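
The documentation text above notes that a static number of `UNASSIGNED` shards indicates a problem with recovery. As a rough sketch of that check, assuming the counts have already been collected at regular intervals (for example from the cluster health API), one could flag a stall when the most recent samples are identical and non-zero. The helper name `recovery_stalled` and the `window` parameter are hypothetical and not part of this commit.

```python
from typing import Sequence


def recovery_stalled(unassigned_counts: Sequence[int], window: int = 3) -> bool:
    """Return True if the last `window` samples are equal and non-zero.

    Hypothetical helper: `unassigned_counts` is an ordered series of
    unassigned shard counts, oldest first.
    """
    if window < 2 or len(unassigned_counts) < window:
        return False  # not enough data to judge
    recent = unassigned_counts[-window:]
    # Stalled means no progress across the window while shards remain unassigned.
    return recent[-1] > 0 and len(set(recent)) == 1


# Counts from the `unassign` column in the docs example: steady progress.
progressing = recovery_stalled([1812, 870, 492, 0])
```

The window size trades off sensitivity against false alarms; a larger window is more tolerant of slow but healthy recoveries.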

docs/reference/cluster/health.asciidoc

Lines changed: 33 additions & 29 deletions
@@ -20,22 +20,22 @@ Returns the health status of a cluster.
 [[cluster-health-api-desc]]
 ==== {api-description-title}
 
-The cluster health API returns a simple status on the health of the
+The cluster health API returns a simple status on the health of the
 cluster. You can also use the API to get the health status of only specified
 data streams and indices. For data streams, the API retrieves the health status
 of the stream's backing indices.
 
-The cluster health status is: `green`, `yellow` or `red`. On the shard level, a
-`red` status indicates that the specific shard is not allocated in the cluster,
-`yellow` means that the primary shard is allocated but replicas are not, and
-`green` means that all shards are allocated. The index level status is
-controlled by the worst shard status. The cluster status is controlled by the
+The cluster health status is: `green`, `yellow` or `red`. On the shard level, a
+`red` status indicates that the specific shard is not allocated in the cluster,
+`yellow` means that the primary shard is allocated but replicas are not, and
+`green` means that all shards are allocated. The index level status is
+controlled by the worst shard status. The cluster status is controlled by the
 worst index status.
 
-One of the main benefits of the API is the ability to wait until the cluster
-reaches a certain high water-mark health level. For example, the following will
-wait for 50 seconds for the cluster to reach the `yellow` level (if it reaches
-the `green` or `yellow` status before 50 seconds elapse, it will return at that
+One of the main benefits of the API is the ability to wait until the cluster
+reaches a certain high water-mark health level. For example, the following will
+wait for 50 seconds for the cluster to reach the `yellow` level (if it reaches
+the `green` or `yellow` status before 50 seconds elapse, it will return at that
 point):
 
 [source,console]
@@ -58,31 +58,31 @@ To target all data streams and indices in a cluster, omit this parameter or use
 ==== {api-query-parms-title}
 
 `level`::
-(Optional, string) Can be one of `cluster`, `indices` or `shards`. Controls
+(Optional, string) Can be one of `cluster`, `indices` or `shards`. Controls
 the details level of the health information returned. Defaults to `cluster`.
-
+
 include::{es-ref-dir}/rest-api/common-parms.asciidoc[tag=local]
-
+
 include::{es-ref-dir}/rest-api/common-parms.asciidoc[tag=timeoutparms]
 
 `wait_for_active_shards`::
-(Optional, string) A number controlling to how many active shards to wait
-for, `all` to wait for all shards in the cluster to be active, or `0` to not
+(Optional, string) A number controlling to how many active shards to wait
+for, `all` to wait for all shards in the cluster to be active, or `0` to not
 wait. Defaults to `0`.
-
+
 `wait_for_events`::
-(Optional, string) Can be one of `immediate`, `urgent`, `high`, `normal`,
-`low`, `languid`. Wait until all currently queued events with the given
+(Optional, string) Can be one of `immediate`, `urgent`, `high`, `normal`,
+`low`, `languid`. Wait until all currently queued events with the given
 priority are processed.
 
 `wait_for_no_initializing_shards`::
-(Optional, Boolean) A boolean value which controls whether to wait (until
-the timeout provided) for the cluster to have no shard initializations.
+(Optional, Boolean) A boolean value which controls whether to wait (until
+the timeout provided) for the cluster to have no shard initializations.
 Defaults to false, which means it will not wait for initializing shards.
 
 `wait_for_no_relocating_shards`::
-(Optional, Boolean) A boolean value which controls whether to wait (until
-the timeout provided) for the cluster to have no shard relocations. Defaults
+(Optional, Boolean) A boolean value which controls whether to wait (until
+the timeout provided) for the cluster to have no shard relocations. Defaults
 to false, which means it will not wait for relocating shards.
 
 `wait_for_nodes`::
@@ -92,7 +92,7 @@ include::{es-ref-dir}/rest-api/common-parms.asciidoc[tag=timeoutparms]
 `lt(N)` notation.
 
 `wait_for_status`::
-(Optional, string) One of `green`, `yellow` or `red`. Will wait (until the
+(Optional, string) One of `green`, `yellow` or `red`. Will wait (until the
 timeout provided) until the status of the cluster changes to the one
 provided or better, i.e. `green` > `yellow` > `red`. By default, will not
 wait for any status.
@@ -107,7 +107,7 @@ include::{es-ref-dir}/rest-api/common-parms.asciidoc[tag=timeoutparms]
 include::{es-ref-dir}/rest-api/common-parms.asciidoc[tag=cluster-health-status]
 
 `timed_out`::
-(Boolean) If `false` the response returned within the period of
+(Boolean) If `false` the response returned within the period of
 time that is specified by the `timeout` parameter (`30s` by default).
 
 `number_of_nodes`::
@@ -131,23 +131,26 @@ include::{es-ref-dir}/rest-api/common-parms.asciidoc[tag=cluster-health-status]
 `unassigned_shards`::
 (integer) The number of shards that are not allocated.
 
+`unassigned_primary_shards`::
+(integer) The number of shards that are primary but not allocated. **Note**: This number may be lower than the true value if your cluster contains nodes running a version below 8.16. For a more accurate count in this scenario, please use the <<cluster-health,cluster health API>>.
+
 `delayed_unassigned_shards`::
-(integer) The number of shards whose allocation has been delayed by the
+(integer) The number of shards whose allocation has been delayed by the
 timeout settings.
 
 `number_of_pending_tasks`::
-(integer) The number of cluster-level changes that have not yet been
+(integer) The number of cluster-level changes that have not yet been
 executed.
 
 `number_of_in_flight_fetch`::
 (integer) The number of unfinished fetches.
 
 `task_max_waiting_in_queue_millis`::
-(integer) The time expressed in milliseconds since the earliest initiated task
+(integer) The time expressed in milliseconds since the earliest initiated task
 is waiting for being performed.
 
 `active_shards_percent_as_number`::
-(float) The ratio of active shards in the cluster expressed as a percentage.
+(float) The ratio of active shards in the cluster expressed as a percentage.
 
 [[cluster-health-api-example]]
 ==== {api-examples-title}
@@ -158,7 +161,7 @@ GET _cluster/health
 --------------------------------------------------
 // TEST[s/^/PUT test1\n/]
 
-The API returns the following response in case of a quiet single node cluster
+The API returns the following response in case of a quiet single node cluster
 with a single index with one shard and one replica:
 
 [source,console-result]
@@ -174,6 +177,7 @@ with a single index with one shard and one replica:
 "relocating_shards" : 0,
 "initializing_shards" : 0,
 "unassigned_shards" : 1,
+"unassigned_primary_shards" : 0,
 "delayed_unassigned_shards": 0,
 "number_of_pending_tasks" : 0,
 "number_of_in_flight_fetch": 0,

rest-api-spec/src/yamlRestTest/resources/rest-api-spec/test/cat.health/10_basic.yml

Lines changed: 38 additions & 17 deletions
@@ -1,32 +1,45 @@
 ---
 "Help":
+  - requires:
+      capabilities:
+        - method: GET
+          path: /_cluster/health
+          capabilities: [ unassigned_pri_shard_count ]
+      test_runner_features: capabilities
+      reason: Capability required to run test
   - do:
       cat.health:
         help: true
 
   - match:
       $body: |
-        /^ epoch .+ \n
-        timestamp .+ \n
-        cluster .+ \n
-        status .+ \n
-        node.total .+ \n
-        node.data .+ \n
-        shards .+ \n
-        pri .+ \n
-        relo .+ \n
-        init .+ \n
-        unassign .+ \n
-        pending_tasks .+ \n
-        max_task_wait_time .+ \n
-        active_shards_percent .+ \n
-
+        /^ epoch .+\n
+        timestamp .+\n
+        cluster .+\n
+        status .+\n
+        node.total .+\n
+        node.data .+\n
+        shards .+\n
+        pri .+\n
+        relo .+\n
+        init .+\n
+        unassign .+\n
+        unassign.pri .+\n
+        pending_tasks .+\n
+        max_task_wait_time .+\n
+        active_shards_percent .+\n
         $/
 
 
 ---
 "Empty cluster":
-
+  - requires:
+      capabilities:
+        - method: GET
+          path: /_cluster/health
+          capabilities: [ unassigned_pri_shard_count ]
+      test_runner_features: capabilities
+      reason: Capability required to run test
   - do:
       cat.health: {}
 
@@ -44,6 +57,7 @@
         \d+ \s+ # relo
         \d+ \s+ # init
         \d+ \s+ # unassign
+        \d+ \s+ # unassign.pri
         \d+ \s+ # pending_tasks
         (-|\d+(?:[.]\d+)?m?s) \s+ # max task waiting time
         \d+\.\d+% # active shards percent
@@ -54,7 +68,13 @@
 
 ---
 "With ts parameter":
-
+  - requires:
+      capabilities:
+        - method: GET
+          path: /_cluster/health
+          capabilities: [ unassigned_pri_shard_count ]
+      test_runner_features: capabilities
+      reason: Capability required to run test
   - do:
       cat.health:
         ts: false
@@ -71,6 +91,7 @@
         \d+ \s+ # relo
         \d+ \s+ # init
         \d+ \s+ # unassign
+        \d+ \s+ # unassign.pri
         \d+ \s+ # pending_tasks
         (-|\d+(?:[.]\d+)?m?s) \s+ # max task waiting time
         \d+\.\d+% # active shards percent
