
Commit 42bd3ea

stefnestor, shainaraskas, and DaveCTurner committed
(Docs+) Flush out Resource+Task troubleshooting (elastic#111773)
* (Docs+) Flush out Resource+Task troubleshooting

Co-authored-by: shainaraskas <[email protected]>
Co-authored-by: David Turner <[email protected]>
1 parent 5b2d861 commit 42bd3ea

5 files changed: 135 additions and 49 deletions


docs/reference/modules/indices/circuit_breaker.asciidoc

Lines changed: 2 additions & 1 deletion
@@ -175,7 +175,8 @@ an `OutOfMemory` exception which would bring down the node.
 To prevent this from happening, a special <<circuit-breaker, circuit breaker>> is used,
 which limits the memory allocation during the execution of a <<eql-sequences, sequence>>
 query. When the breaker is triggered, an `org.elasticsearch.common.breaker.CircuitBreakingException`
-is thrown and a descriptive error message is returned to the user.
+is thrown and a descriptive error message including `circuit_breaking_exception`
+is returned to the user.
 
 This <<circuit-breaker, circuit breaker>> can be configured using the following settings:
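For reference (this example is not part of the commit), a tripped breaker reaches the client as an HTTP 429 response whose body names the `circuit_breaking_exception` type. The breaker name, `reason` text, and byte values below are placeholders and vary by cluster and by which breaker trips:

[source,js]
----
{
  "error": {
    "type": "circuit_breaking_exception",
    "reason": "[eql_sequence] Data too large, data for the sequence query would exceed the configured limit",
    "bytes_wanted": 52428800,
    "bytes_limit": 41943040,
    "durability": "TRANSIENT"
  },
  "status": 429
}
----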

Lines changed: 12 additions & 22 deletions
@@ -1,30 +1,20 @@
 // tag::cloud[]
-From your deployment menu, click **Performance**. The page's **CPU Usage** chart
-shows your deployment's CPU usage as a percentage.
+* (Recommended) Enable {cloud}/ec-monitoring-setup.html[logs and metrics]. When logs and metrics are enabled, monitoring information is visible on {kib}'s {kibana-ref}/xpack-monitoring.html[Stack Monitoring] page.
++
+You can also enable the {kibana-ref}/kibana-alerts.html[CPU usage threshold alert] to be notified about potential issues through email.
 
-High CPU usage can also deplete your CPU credits. CPU credits let {ess} provide
-smaller clusters with a performance boost when needed. The **CPU credits**
-chart shows your remaining CPU credits, measured in seconds of CPU time.
+* From your deployment menu, view the {cloud}/ec-saas-metrics-accessing.html[**Performance**] page. On this page, you can view two key metrics:
+** **CPU usage**: Your deployment's CPU usage, represented as a percentage.
+** **CPU credits**: Your remaining CPU credits, measured in seconds of CPU time.
 
-You can also use the <<cat-nodes,cat nodes API>> to get the current CPU usage
-for each node.
-
-// tag::cpu-usage-cat-nodes[]
-[source,console]
-----
-GET _cat/nodes?v=true&s=cpu:desc
-----
-
-The response's `cpu` column contains the current CPU usage as a percentage. The
-`name` column contains the node's name.
-// end::cpu-usage-cat-nodes[]
+{ess} grants {cloud}/ec-vcpu-boost-instance.html[CPU credits] per deployment
+to provide smaller clusters with performance boosts when needed. High CPU
+usage can deplete these credits, which might lead to {cloud}/ec-scenario_why_is_performance_degrading_over_time.html[performance degradation] and {cloud}/ec-scenario_why_are_my_cluster_response_times_suddenly_so_much_worse.html[increased cluster response times].
 
 // end::cloud[]
 
 // tag::self-managed[]
-
-Use the <<cat-nodes,cat nodes API>> to get the current CPU usage for each node.
-
-include::cpu-usage.asciidoc[tag=cpu-usage-cat-nodes]
-
+* Enable <<monitoring-overview,{es} monitoring>>. When logs and metrics are enabled, monitoring information is visible on {kib}'s {kibana-ref}/xpack-monitoring.html[Stack Monitoring] page.
++
+You can also enable the {kibana-ref}/kibana-alerts.html[CPU usage threshold alert] to be notified about potential issues through email.
 // end::self-managed[]

docs/reference/troubleshooting/common-issues/high-cpu-usage.asciidoc

Lines changed: 21 additions & 2 deletions
@@ -9,12 +9,29 @@ If a thread pool is depleted, {es} will <<rejected-requests,reject requests>>
 related to the thread pool. For example, if the `search` thread pool is
 depleted, {es} will reject search requests until more threads are available.
 
+You might experience high CPU usage if a <<data-tiers,data tier>>, and therefore the nodes assigned to that tier, is experiencing more traffic than other tiers. This imbalance in resource utilization is also known as <<hotspotting,hot spotting>>.
+
 [discrete]
 [[diagnose-high-cpu-usage]]
 ==== Diagnose high CPU usage
 
 **Check CPU usage**
 
+You can check the CPU usage per node using the <<cat-nodes,cat nodes API>>:
+
+// tag::cpu-usage-cat-nodes[]
+[source,console]
+----
+GET _cat/nodes?v=true&s=cpu:desc
+----
+
+The response's `cpu` column contains the current CPU usage as a percentage.
+The `name` column contains the node's name. Elevated but transient CPU usage is
+normal. However, if CPU usage is elevated for an extended duration, it should be
+investigated.
+
+To track CPU usage over time, we recommend enabling monitoring:
+
 include::{es-ref-dir}/tab-widgets/cpu-usage-widget.asciidoc[]
 
 **Check hot threads**
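As an illustration (not part of the commit), the response to the request above might look like the following, abridged to the `name` and `cpu` columns and using hypothetical node names; the busiest nodes sort first because of `s=cpu:desc`:

[source,txt]
----
name         cpu
node-hot-1    94
node-hot-2    71
node-warm-1    8
----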
@@ -24,11 +41,13 @@ threads API>> to check for resource-intensive threads running on the node.
 
 [source,console]
 ----
-GET _nodes/my-node,my-other-node/hot_threads
+GET _nodes/hot_threads
 ----
 // TEST[s/\/my-node,my-other-node//]
 
-This API returns a breakdown of any hot threads in plain text.
+This API returns a breakdown of any hot threads in plain text. High CPU usage
+frequently correlates to <<task-queue-backlog,a long-running task, or a
+backlog of tasks>>.
 
 [discrete]
 [[reduce-cpu-usage]]
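If the cluster is large, the output of the request above can be narrowed. The variant below (an illustration, not part of the commit) samples only a single node and reports the top three threads per sampling interval; the node name `my-node` is a placeholder:

[source,console]
----
GET _nodes/my-node/hot_threads?threads=3&interval=500ms
----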

docs/reference/troubleshooting/common-issues/rejected-requests.asciidoc

Lines changed: 45 additions & 7 deletions
@@ -23,9 +23,52 @@ To check the number of rejected tasks for each thread pool, use the
 
 [source,console]
 ----
-GET /_cat/thread_pool?v=true&h=id,name,active,rejected,completed
+GET /_cat/thread_pool?v=true&h=id,name,queue,active,rejected,completed
 ----
 
+`write` thread pool rejections frequently appear in the failing API response
+and in the correlating node log as `EsRejectedExecutionException` with either
+`QueueResizingEsThreadPoolExecutor` or `queue capacity`.
+
+These errors are often related to <<task-queue-backlog,backlogged tasks>>.
+
+[discrete]
+[[check-circuit-breakers]]
+==== Check circuit breakers
+
+To check the number of tripped <<circuit-breaker,circuit breakers>>, use the
+<<cluster-nodes-stats,node stats API>>.
+
+[source,console]
+----
+GET /_nodes/stats/breaker
+----
+
+These statistics are cumulative from node startup. For more information, see
+<<circuit-breaker,circuit breaker errors>>.
+
+[discrete]
+[[check-indexing-pressure]]
+==== Check indexing pressure
+
+To check the number of <<index-modules-indexing-pressure,indexing pressure>>
+rejections, use the <<cluster-nodes-stats,node stats API>>.
+
+[source,console]
+----
+GET _nodes/stats?human&filter_path=nodes.*.indexing_pressure
+----
+
+These stats are cumulative from node startup.
+
+Indexing pressure rejections appear as an
+`EsRejectedExecutionException` and indicate that the request was rejected due
+to `coordinating_and_primary_bytes`, `coordinating`, `primary`, or `replica`.
+
+These errors are often related to <<task-queue-backlog,backlogged tasks>>,
+<<docs-bulk,bulk index>> sizing, or the ingest target's
+<<index-modules,`refresh_interval` setting>>.
+
 [discrete]
 [[prevent-rejected-requests]]
 ==== Prevent rejected requests
@@ -34,9 +77,4 @@ GET /_cat/thread_pool?v=true&h=id,name,active,rejected,completed
 
 If {es} regularly rejects requests and other tasks, your cluster likely has high
 CPU usage or high JVM memory pressure. For tips, see <<high-cpu-usage>> and
-<<high-jvm-memory-pressure>>.
-
-**Prevent circuit breaker errors**
-
-If you regularly trigger circuit breaker errors, see <<circuit-breaker-errors>>
-for tips on diagnosing and preventing them.
+<<high-jvm-memory-pressure>>.
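As a convenience (not part of the commit), both of the checks above can be narrowed with `filter_path` so that only the rejection-related counters are returned. The field paths below assume the usual node stats response layout and may differ between versions:

[source,console]
----
GET /_nodes/stats/breaker?filter_path=nodes.*.breakers.*.tripped

GET /_nodes/stats?filter_path=nodes.*.indexing_pressure.memory.total.*_rejections
----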

docs/reference/troubleshooting/common-issues/task-queue-backlog.asciidoc

Lines changed: 55 additions & 17 deletions
@@ -1,50 +1,88 @@
 [[task-queue-backlog]]
 === Task queue backlog
 
-A backlogged task queue can prevent tasks from completing and
-put the cluster into an unhealthy state.
-Resource constraints, a large number of tasks being triggered at once,
-and long running tasks can all contribute to a backlogged task queue.
+A backlogged task queue can prevent tasks from completing and put the cluster
+into an unhealthy state. Resource constraints, a large number of tasks being
+triggered at once, and long running tasks can all contribute to a backlogged
+task queue.
 
 [discrete]
 [[diagnose-task-queue-backlog]]
 ==== Diagnose a task queue backlog
 
 **Check the thread pool status**
 
-A <<high-cpu-usage,depleted thread pool>> can result in <<rejected-requests,rejected requests>>.
+A <<high-cpu-usage,depleted thread pool>> can result in
+<<rejected-requests,rejected requests>>.
 
-You can use the <<cat-thread-pool,cat thread pool API>> to
-see the number of active threads in each thread pool and
-how many tasks are queued, how many have been rejected, and how many have completed.
+Thread pool depletion might be restricted to a specific <<data-tiers,data tier>>. If <<hotspotting,hot spotting>> is occurring, one node might experience depletion faster than other nodes, leading to performance issues and a growing task backlog.
+
+You can use the <<cat-thread-pool,cat thread pool API>> to see the number of
+active threads in each thread pool and how many tasks are queued, how many
+have been rejected, and how many have completed.
 
 [source,console]
 ----
 GET /_cat/thread_pool?v&s=t,n&h=type,name,node_name,active,queue,rejected,completed
 ----
 
+The `active` and `queue` statistics are instantaneous while the `rejected` and
+`completed` statistics are cumulative from node startup.
+
 **Inspect the hot threads on each node**
 
-If a particular thread pool queue is backed up,
-you can periodically poll the <<cluster-nodes-hot-threads,Nodes hot threads>> API
-to determine if the thread has sufficient
-resources to progress and gauge how quickly it is progressing.
+If a particular thread pool queue is backed up, you can periodically poll the
+<<cluster-nodes-hot-threads,Nodes hot threads>> API to determine if the thread
+has sufficient resources to progress and gauge how quickly it is progressing.
 
 [source,console]
 ----
 GET /_nodes/hot_threads
 ----
 
-**Look for long running tasks**
+**Look for long running node tasks**
+
+Long-running tasks can also cause a backlog. You can use the <<tasks,task
+management>> API to get information about the node tasks that are running.
+Check the `running_time_in_nanos` to identify tasks that are taking an
+excessive amount of time to complete.
+
+[source,console]
+----
+GET /_tasks?pretty=true&human=true&detailed=true
+----
 
-Long-running tasks can also cause a backlog.
-You can use the <<tasks,task management>> API to get information about the tasks that are running.
-Check the `running_time_in_nanos` to identify tasks that are taking an excessive amount of time to complete.
+If a particular `action` is suspected, you can filter the tasks further. The most common long-running tasks are <<docs-bulk,bulk index>>- or search-related.
 
+* Filter for <<docs-bulk,bulk index>> actions:
++
 [source,console]
 ----
-GET /_tasks?filter_path=nodes.*.tasks
+GET /_tasks?human&detailed&actions=indices:data/write/bulk
+----
+
+* Filter for search actions:
++
+[source,console]
 ----
+GET /_tasks?human&detailed&actions=indices:data/read/search
+----
+
+The API response may contain additional tasks columns, including `description` and `header`, which provide the task parameters, target, and requestor. You can use this information to perform further diagnosis.
+
+**Look for long running cluster tasks**
+
+A task backlog might also appear as a delay in synchronizing the cluster state. You
+can use the <<cluster-pending,cluster pending tasks API>> to get information
+about the pending cluster state sync tasks that are running.
+
+[source,console]
+----
+GET /_cluster/pending_tasks
+----
+
+Check the `timeInQueue` to identify tasks that are taking an excessive amount
+of time to complete.
 
 [discrete]
 [[resolve-task-queue-backlog]]
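One further optional refinement (not part of the commit): the detailed `_tasks` output above can be trimmed with `filter_path` so that only each task's action and running time remain, which makes long-running tasks easier to spot. The field paths assume the standard task management response layout:

[source,console]
----
GET /_tasks?detailed&filter_path=nodes.*.tasks.*.action,nodes.*.tasks.*.running_time_in_nanos
----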
