Commit cc5235a

Split common cluster issues page into separate pages (#88495) (#88612)

Adam Locke and abdonpijpelink authored
(cherry picked from commit 26cc873) Co-authored-by: Abdon Pijpelink <[email protected]>
1 parent 25a8439 commit cc5235a

8 files changed: +747 −723 lines
Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@

[[circuit-breaker-errors]]
=== Circuit breaker errors

{es} uses <<circuit-breaker,circuit breakers>> to prevent nodes from running out
of JVM heap memory. If {es} estimates an operation would exceed a circuit
breaker, it stops the operation and returns an error.

By default, the <<parent-circuit-breaker,parent circuit breaker>> triggers at
95% JVM memory usage. To prevent errors, we recommend taking steps to reduce
memory pressure if usage consistently exceeds 85%.
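
To see the limit the parent breaker is enforcing, you can read it back from the
cluster settings API. A quick inspection sketch; `indices.breaker.total.limit`
is the setting behind the parent breaker, and `filter_path` only trims the
response.

[source,console]
----
GET _cluster/settings?include_defaults=true&filter_path=defaults.indices.breaker.total.limit
----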

[discrete]
[[diagnose-circuit-breaker-errors]]
==== Diagnose circuit breaker errors

**Error messages**

If a request triggers a circuit breaker, {es} returns an error with a `429` HTTP
status code.

[source,js]
----
{
  "error": {
    "type": "circuit_breaking_exception",
    "reason": "[parent] Data too large, data for [<http_request>] would be [123848638/118.1mb], which is larger than the limit of [123273216/117.5mb], real usage: [120182112/114.6mb], new bytes reserved: [3666526/3.4mb]",
    "bytes_wanted": 123848638,
    "bytes_limit": 123273216,
    "durability": "TRANSIENT"
  },
  "status": 429
}
----
// NOTCONSOLE

{es} also writes circuit breaker errors to <<logging,`elasticsearch.log`>>. This
is helpful when automated processes, such as allocation, trigger a circuit
breaker.

[source,txt]
----
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [<transport_request>] would be [num/numGB], which is larger than the limit of [num/numGB], usages [request=0/0b, fielddata=num/numKB, in_flight_requests=num/numGB, accounting=num/numGB]
----

**Check JVM memory usage**

If you've enabled Stack Monitoring, you can view JVM memory usage in {kib}. In
the main menu, click **Stack Monitoring**. On the Stack Monitoring **Overview**
page, click **Nodes**. The **JVM Heap** column lists the current memory usage
for each node.

You can also use the <<cat-nodes,cat nodes API>> to get the current
`heap.percent` for each node.

[source,console]
----
GET _cat/nodes?v=true&h=name,node*,heap*
----

To get the JVM memory usage for each circuit breaker, use the
<<cluster-nodes-stats,node stats API>>.

[source,console]
----
GET _nodes/stats/breaker
----
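
If you only need one breaker's figures, a response filter keeps the output
small. A sketch that returns just the parent breaker statistics for each node,
using the standard `filter_path` parameter.

[source,console]
----
GET _nodes/stats/breaker?filter_path=nodes.*.breakers.parent
----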

[discrete]
[[prevent-circuit-breaker-errors]]
==== Prevent circuit breaker errors

**Reduce JVM memory pressure**

High JVM memory pressure often causes circuit breaker errors. See
<<high-jvm-memory-pressure>>.

**Avoid using fielddata on `text` fields**

For high-cardinality `text` fields, fielddata can use a large amount of JVM
memory. To avoid this, {es} disables fielddata on `text` fields by default. If
you've enabled fielddata and triggered the <<fielddata-circuit-breaker,fielddata
circuit breaker>>, consider disabling it and using a `keyword` field instead.
See <<fielddata>>.
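
For example, instead of aggregating on a fielddata-enabled `text` field, you can
map the field with a `keyword` multi-field and aggregate on that. A minimal
sketch; `my-index-000001` and `my-field` are placeholder names.

[source,console]
----
PUT my-index-000001
{
  "mappings": {
    "properties": {
      "my-field": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
----

Aggregations can then target `my-field.keyword`, which uses doc values on disk
rather than fielddata in the JVM heap.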

**Clear the fielddata cache**

If you've triggered the fielddata circuit breaker and can't disable fielddata,
use the <<indices-clearcache,clear cache API>> to clear the fielddata cache.
This may disrupt any in-flight searches that use fielddata.

[source,console]
----
POST _cache/clear?fielddata=true
----
// TEST[s/^/PUT my-index\n/]
Lines changed: 84 additions & 0 deletions
@@ -0,0 +1,84 @@

[[disk-usage-exceeded]]
=== Error: disk usage exceeded flood-stage watermark, index has read-only-allow-delete block

This error indicates a data node is critically low on disk space and has reached
the <<cluster-routing-flood-stage,flood-stage disk usage watermark>>. To prevent
a full disk, when a node reaches this watermark, {es} blocks writes to any index
with a shard on the node. If the block affects related system indices, {kib} and
other {stack} features may become unavailable.
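
To check how much disk space remains on each node, you can use the
<<cat-allocation,cat allocation API>>. A quick inspection sketch; the columns
requested here are standard `_cat/allocation` headers.

[source,console]
----
GET _cat/allocation?v=true&h=node,shards,disk.percent,disk.used,disk.avail
----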

{es} will automatically remove the write block when the affected node's disk
usage goes below the <<cluster-routing-watermark-high,high disk watermark>>. To
achieve this, {es} automatically moves some of the affected node's shards to
other nodes in the same data tier.

To verify that shards are moving off the affected node, use the <<cat-shards,cat
shards API>>.

[source,console]
----
GET _cat/shards?v=true
----

If shards remain on the node, use the <<cluster-allocation-explain,cluster
allocation explanation API>> to get an explanation for their allocation status.

[source,console]
----
GET _cluster/allocation/explain
{
  "index": "my-index",
  "shard": 0,
  "primary": false,
  "current_node": "my-node"
}
----
// TEST[s/^/PUT my-index\n/]
// TEST[s/"primary": false,/"primary": false/]
// TEST[s/"current_node": "my-node"//]

To immediately restore write operations, you can temporarily increase the disk
watermarks and remove the write block.

[source,console]
----
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "90%",
    "cluster.routing.allocation.disk.watermark.high": "95%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "97%"
  }
}

PUT */_settings?expand_wildcards=all
{
  "index.blocks.read_only_allow_delete": null
}
----
// TEST[s/^/PUT my-index\n/]

As a long-term solution, we recommend you add nodes to the affected data tiers
or upgrade existing nodes to increase disk space. To free up additional disk
space, you can delete unneeded indices using the <<indices-delete-index,delete
index API>>.

[source,console]
----
DELETE my-index
----
// TEST[s/^/PUT my-index\n/]

When a long-term solution is in place, reset or reconfigure the disk watermarks.

[source,console]
----
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": null,
    "cluster.routing.allocation.disk.watermark.high": null,
    "cluster.routing.allocation.disk.watermark.flood_stage": null
  }
}
----
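
Afterwards, you can re-read the cluster settings to confirm the overrides are
gone. A quick verification sketch; `flat_settings` only flattens the response
keys.

[source,console]
----
GET _cluster/settings?flat_settings=true
----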
Lines changed: 100 additions & 0 deletions
@@ -0,0 +1,100 @@

[[high-cpu-usage]]
=== High CPU usage

{es} uses <<modules-threadpool,thread pools>> to manage CPU resources for
concurrent operations. High CPU usage typically means one or more thread pools
are running low.

If a thread pool is depleted, {es} will <<rejected-requests,reject requests>>
related to the thread pool. For example, if the `search` thread pool is
depleted, {es} will reject search requests until more threads are available.

[discrete]
[[diagnose-high-cpu-usage]]
==== Diagnose high CPU usage

**Check CPU usage**

include::{es-repo-dir}/tab-widgets/cpu-usage-widget.asciidoc[]
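
If you don't have access to the monitoring UI covered by the widget above, the
<<cat-nodes,cat nodes API>> reports per-node CPU directly. A sketch; `cpu`,
`load_1m`, and `load_5m` are standard cat nodes columns, sorted here by CPU.

[source,console]
----
GET _cat/nodes?v=true&s=cpu:desc&h=name,cpu,load_1m,load_5m
----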

**Check hot threads**

If a node has high CPU usage, use the <<cluster-nodes-hot-threads,nodes hot
threads API>> to check for resource-intensive threads running on the node.

[source,console]
----
GET _nodes/my-node,my-other-node/hot_threads
----
// TEST[s/\/my-node,my-other-node//]

This API returns a breakdown of any hot threads in plain text.

[discrete]
[[reduce-cpu-usage]]
==== Reduce CPU usage

The following tips outline the most common causes of high CPU usage and their
solutions.

**Scale your cluster**

Heavy indexing and search loads can deplete smaller thread pools. To better
handle heavy workloads, add more nodes to your cluster or upgrade your existing
nodes to increase capacity.

**Spread out bulk requests**

While more efficient than individual requests, large <<docs-bulk,bulk indexing>>
or <<search-multi-search,multi-search>> requests still require CPU resources. If
possible, submit smaller requests and allow more time between them.
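
For example, rather than one huge request, you might send a series of modest
bulk requests. A minimal sketch; `my-index` and the documents are placeholders.

[source,console]
----
PUT my-index/_bulk
{ "index": {} }
{ "message": "event 1" }
{ "index": {} }
{ "message": "event 2" }
----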

**Cancel long-running searches**

Long-running searches can block threads in the `search` thread pool. To check
for these searches, use the <<tasks,task management API>>.

[source,console]
----
GET _tasks?actions=*search&detailed
----

The response's `description` contains the search request and its queries.
`running_time_in_nanos` shows how long the search has been running.

[source,console-result]
----
{
  "nodes" : {
    "oTUltX4IQMOUUVeiohTt8A" : {
      "name" : "my-node",
      "transport_address" : "127.0.0.1:9300",
      "host" : "127.0.0.1",
      "ip" : "127.0.0.1:9300",
      "tasks" : {
        "oTUltX4IQMOUUVeiohTt8A:464" : {
          "node" : "oTUltX4IQMOUUVeiohTt8A",
          "id" : 464,
          "type" : "transport",
          "action" : "indices:data/read/search",
          "description" : "indices[my-index], search_type[QUERY_THEN_FETCH], source[{\"query\":...}]",
          "start_time_in_millis" : 4081771730000,
          "running_time_in_nanos" : 13991383,
          "cancellable" : true
        }
      }
    }
  }
}
----
// TESTRESPONSE[skip: no way to get tasks]

To cancel a search and free up resources, use the API's `_cancel` endpoint.

[source,console]
----
POST _tasks/oTUltX4IQMOUUVeiohTt8A:464/_cancel
----

For additional tips on how to track and avoid resource-intensive searches, see
<<avoid-expensive-searches,Avoid expensive searches>>.
Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@

[[high-jvm-memory-pressure]]
=== High JVM memory pressure

High JVM memory usage can degrade cluster performance and trigger
<<circuit-breaker-errors,circuit breaker errors>>. To prevent this, we recommend
taking steps to reduce memory pressure if a node's JVM memory usage consistently
exceeds 85%.

[discrete]
[[diagnose-high-jvm-memory-pressure]]
==== Diagnose high JVM memory pressure

**Check JVM memory pressure**

include::{es-repo-dir}/tab-widgets/jvm-memory-pressure-widget.asciidoc[]
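
If the tab widget above isn't available to you, the node stats API exposes heap
usage directly. A sketch using the standard `filter_path` response filter.

[source,console]
----
GET _nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.mem.heap_used_percent
----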

**Check garbage collection logs**

As memory usage increases, garbage collection becomes more frequent and takes
longer. You can track the frequency and length of garbage collection events in
<<logging,`elasticsearch.log`>>. For example, the following event states {es}
spent more than 50% (21 seconds) of the last 40 seconds performing garbage
collection.

[source,log]
----
[timestamp_short_interval_from_last][INFO ][o.e.m.j.JvmGcMonitorService] [node_id] [gc][number] overhead, spent [21s] collecting in the last [40s]
----

[discrete]
[[reduce-jvm-memory-pressure]]
==== Reduce JVM memory pressure

**Reduce your shard count**

Every shard uses memory. In most cases, a small set of large shards uses fewer
resources than many small shards. For tips on reducing your shard count, see
<<size-your-shards>>.
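
For a quick total, the cluster health API reports the number of active shards
across the cluster. A sketch using the standard `filter_path` parameter.

[source,console]
----
GET _cluster/health?filter_path=status,active_shards,active_primary_shards
----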

[[avoid-expensive-searches]]
**Avoid expensive searches**

Expensive searches can use large amounts of memory. To better track expensive
searches on your cluster, enable <<index-modules-slowlog,slow logs>>.

Expensive searches may have a large <<paginate-search-results,`size` argument>>,
use aggregations with a large number of buckets, or include
<<query-dsl-allow-expensive-queries,expensive queries>>. To prevent expensive
searches, consider the following setting changes:

* Lower the `size` limit using the
<<index-max-result-window,`index.max_result_window`>> index setting.

* Decrease the maximum number of allowed aggregation buckets using the
<<search-settings-max-buckets,`search.max_buckets`>> cluster setting.

* Disable expensive queries using the
<<query-dsl-allow-expensive-queries,`search.allow_expensive_queries`>> cluster
setting.

[source,console]
----
PUT _settings
{
  "index.max_result_window": 5000
}

PUT _cluster/settings
{
  "persistent": {
    "search.max_buckets": 20000,
    "search.allow_expensive_queries": false
  }
}
----
// TEST[s/^/PUT my-index\n/]

**Prevent mapping explosions**

Defining too many fields or nesting fields too deeply can lead to
<<mapping-limit-settings,mapping explosions>> that use large amounts of memory.
To prevent mapping explosions, use the <<mapping-settings-limit,mapping limit
settings>> to limit the number of field mappings.
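
For example, you can lower the field and nesting limits on a per-index basis. A
minimal sketch; the index name is a placeholder and the values are illustrative,
not recommendations.

[source,console]
----
PUT my-index-000002
{
  "settings": {
    "index.mapping.total_fields.limit": 500,
    "index.mapping.depth.limit": 5
  }
}
----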

**Spread out bulk requests**

While more efficient than individual requests, large <<docs-bulk,bulk indexing>>
or <<search-multi-search,multi-search>> requests can still create high JVM
memory pressure. If possible, submit smaller requests and allow more time
between them.

**Upgrade node memory**

Heavy indexing and search loads can cause high JVM memory pressure. To better
handle heavy workloads, upgrade your nodes to increase their memory capacity.