Commit 696ee80

marciw and shainaraskas authored
Revise content to match new troubleshooting guidelines (#118033)
* Revise to match new guidelines
* Address review suggestions and comments
* Apply suggestions from review
  Co-authored-by: shainaraskas <[email protected]>
* Apply suggestions from review
  Co-authored-by: shainaraskas <[email protected]>
* Apply suggestions from review
  Co-authored-by: shainaraskas <[email protected]>
* Apply suggestions from review

---------

Co-authored-by: shainaraskas <[email protected]>
1 parent e43cdf7 commit 696ee80

1 file changed

+88
-42
lines changed

Original file line number | Diff line number | Diff line change
@@ -1,103 +1,149 @@
11
[[task-queue-backlog]]
2-
=== Task queue backlog
2+
=== Backlogged task queue
33

4-
A backlogged task queue can prevent tasks from completing and put the cluster
5-
into an unhealthy state. Resource constraints, a large number of tasks being
6-
triggered at once, and long running tasks can all contribute to a backlogged
7-
task queue.
4+
*******************************
5+
*Product:* Elasticsearch +
6+
*Deployment type:* Elastic Cloud Enterprise, Elastic Cloud Hosted, Elastic Cloud on Kubernetes, Elastic Self-Managed +
7+
*Versions:* All
8+
*******************************
9+
10+
A backlogged task queue can prevent tasks from completing and lead to an
11+
unhealthy cluster state. Contributing factors include resource constraints,
12+
a large number of tasks triggered at once, and long-running tasks.
813

914
[discrete]
1015
[[diagnose-task-queue-backlog]]
11-
==== Diagnose a task queue backlog
16+
==== Diagnose a backlogged task queue
17+
18+
To identify the cause of the backlog, try these diagnostic actions.
1219

13-
**Check the thread pool status**
20+
* <<diagnose-task-queue-thread-pool>>
21+
* <<diagnose-task-queue-hot-thread>>
22+
* <<diagnose-task-queue-long-running-node-tasks>>
23+
* <<diagnose-task-queue-long-running-cluster-tasks>>
24+
25+
[discrete]
26+
[[diagnose-task-queue-thread-pool]]
27+
===== Check the thread pool status
1428

1529
A <<high-cpu-usage,depleted thread pool>> can result in
1630
<<rejected-requests,rejected requests>>.
1731

18-
Thread pool depletion might be restricted to a specific <<data-tiers,data tier>>. If <<hotspotting,hot spotting>> is occuring, one node might experience depletion faster than other nodes, leading to performance issues and a growing task backlog.
19-
20-
You can use the <<cat-thread-pool,cat thread pool API>> to see the number of
21-
active threads in each thread pool and how many tasks are queued, how many
22-
have been rejected, and how many have completed.
32+
Use the <<cat-thread-pool,cat thread pool API>> to monitor
33+
active threads, queued tasks, rejections, and completed tasks:
2334

2435
[source,console]
2536
----
2637
GET /_cat/thread_pool?v&s=t,n&h=type,name,node_name,active,queue,rejected,completed
2738
----
2839

29-
The `active` and `queue` statistics are instantaneous while the `rejected` and
30-
`completed` statistics are cumulative from node startup.
40+
* Look for high `active` and `queue` metrics, which indicate potential bottlenecks
41+
and opportunities to <<reduce-cpu-usage,reduce CPU usage>>.
42+
* Determine whether thread pool issues are specific to a <<data-tiers,data tier>>.
43+
* Check whether a specific node's thread pool is depleting faster than others. This
44+
might indicate <<resolve-task-queue-backlog-hotspotting, hot spotting>>.
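The checks in the list above can be sketched in Python. This is a minimal illustration, not part of the commit: it assumes the cat thread pool request was made with `format=json` and the same `h=` columns, so each row is a dictionary whose values arrive as strings. The function names and the sample rows are hypothetical.

```python
from collections import defaultdict


def find_backed_up_pools(rows, queue_threshold=1):
    """Return (node, pool, queue, rejected) tuples for pools whose queue
    is at or above the threshold, largest queue first."""
    flagged = []
    for row in rows:
        queue = int(row["queue"])  # cat APIs return numbers as strings
        if queue >= queue_threshold:
            flagged.append(
                (row["node_name"], row["name"], queue, int(row["rejected"]))
            )
    return sorted(flagged, key=lambda item: item[2], reverse=True)


def queued_per_node(rows):
    """Sum queued tasks per node to spot one node depleting faster than others."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["node_name"]] += int(row["queue"])
    return dict(totals)


# Hypothetical sample rows, shaped like `GET /_cat/thread_pool?format=json&h=...`
sample_rows = [
    {"type": "fixed", "name": "search", "node_name": "node-a",
     "active": "13", "queue": "240", "rejected": "12", "completed": "9000"},
    {"type": "fixed", "name": "write", "node_name": "node-a",
     "active": "8", "queue": "35", "rejected": "0", "completed": "4000"},
    {"type": "fixed", "name": "search", "node_name": "node-b",
     "active": "2", "queue": "0", "rejected": "0", "completed": "8800"},
    {"type": "fixed", "name": "write", "node_name": "node-b",
     "active": "1", "queue": "3", "rejected": "0", "completed": "4100"},
]

backed_up = find_backed_up_pools(sample_rows, queue_threshold=10)
per_node = queued_per_node(sample_rows)
print(backed_up)
print(per_node)
```

In this sample, `node-a` holds nearly all queued work, which is the pattern that might point to hot spotting. Any threshold is a judgment call: `active` and `queue` are instantaneous values, so polling several times gives a more reliable picture than a single reading.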
3145

32-
**Inspect the hot threads on each node**
46+
[discrete]
47+
[[diagnose-task-queue-hot-thread]]
48+
===== Inspect hot threads on each node
3349

34-
If a particular thread pool queue is backed up, you can periodically poll the
35-
<<cluster-nodes-hot-threads,Nodes hot threads>> API to determine if the thread
36-
has sufficient resources to progress and gauge how quickly it is progressing.
50+
If a particular thread pool queue is backed up, periodically poll the
51+
<<cluster-nodes-hot-threads,nodes hot threads API>> to gauge the thread's
52+
progression and ensure it has sufficient resources:
3753

3854
[source,console]
3955
----
4056
GET /_nodes/hot_threads
4157
----
4258

43-
**Look for long running node tasks**
59+
Although the hot threads API response does not list the specific tasks running on a thread,
60+
it provides a summary of the thread's activities. You can correlate a hot threads response
61+
with a <<tasks,task management API response>> to identify any overlap with specific tasks. For
62+
example, if the hot threads response indicates the thread is `performing a search query`, you can
63+
<<diagnose-task-queue-long-running-node-tasks,check for long-running search tasks>> using the task management API.
64+
65+
[discrete]
66+
[[diagnose-task-queue-long-running-node-tasks]]
67+
===== Identify long-running node tasks
4468

45-
Long-running tasks can also cause a backlog. You can use the <<tasks,task
46-
management>> API to get information about the node tasks that are running.
47-
Check the `running_time_in_nanos` to identify tasks that are taking an
48-
excessive amount of time to complete.
69+
Long-running tasks can also cause a backlog. Use the <<tasks,task
70+
management API>> to check for excessive `running_time_in_nanos` values:
4971

5072
[source,console]
5173
----
5274
GET /_tasks?pretty=true&human=true&detailed=true
5375
----
5476

55-
If a particular `action` is suspected, you can filter the tasks further. The most common long-running tasks are <<docs-bulk,bulk index>>- or search-related.
77+
You can filter on a specific `action`, such as <<docs-bulk,bulk indexing>> or search-related tasks.
78+
These tend to be long-running.
5679

57-
* Filter for <<docs-bulk,bulk index>> actions:
80+
* Filter on <<docs-bulk,bulk index>> actions:
5881
+
5982
[source,console]
6083
----
6184
GET /_tasks?human&detailed&actions=indices:data/write/bulk
6285
----
6386

64-
* Filter for search actions:
87+
* Filter on search actions:
6588
+
6689
[source,console]
6790
----
6891
GET /_tasks?human&detailed&actions=indices:data/read/search
6992
----
7093

71-
The API response may contain additional tasks columns, including `description` and `header`, which provides the task parameters, target, and requestor. You can use this information to perform further diagnosis.
94+
Long-running tasks might need to be <<resolve-task-queue-backlog-stuck-tasks,canceled>>.
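As a rough sketch of how the `running_time_in_nanos` check could be automated, the Python snippet below filters a task management API response that has already been parsed into a dictionary. It assumes the default response shape, with tasks grouped by node under `nodes`. The function name, the 30-second threshold, and the sample data are hypothetical.

```python
def long_running_tasks(tasks_response, min_seconds=30.0):
    """Return tasks running at least `min_seconds`, longest first."""
    threshold_nanos = int(min_seconds * 1_000_000_000)
    found = []
    for node_id, node in tasks_response.get("nodes", {}).items():
        for task_id, task in node.get("tasks", {}).items():
            running = task.get("running_time_in_nanos", 0)
            if running >= threshold_nanos:
                found.append({
                    "task_id": task_id,
                    "node": node.get("name", node_id),
                    "action": task.get("action"),
                    "seconds": running / 1_000_000_000,
                    # `description` is only populated when detailed=true
                    "description": task.get("description", ""),
                })
    return sorted(found, key=lambda t: t["seconds"], reverse=True)


# Hypothetical sample shaped like `GET /_tasks?detailed=true`
sample_response = {
    "nodes": {
        "node-id-1": {
            "name": "node-a",
            "tasks": {
                "node-id-1:101": {
                    "action": "indices:data/write/bulk",
                    "running_time_in_nanos": 95_000_000_000,
                    "description": "requests[500], indices[logs]",
                },
                "node-id-1:102": {
                    "action": "indices:data/read/search",
                    "running_time_in_nanos": 2_000_000_000,
                },
            },
        },
        "node-id-2": {
            "name": "node-b",
            "tasks": {
                "node-id-2:7": {
                    "action": "indices:data/read/search",
                    "running_time_in_nanos": 45_000_000_000,
                },
            },
        },
    }
}

slow = long_running_tasks(sample_response, min_seconds=30)
for task in slow:
    print(task["task_id"], task["node"], task["action"], task["seconds"])
```

The returned `task_id` values are the identifiers you would pass to the task cancellation endpoint if a task turns out to be stuck.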
7295

73-
**Look for long running cluster tasks**
96+
[discrete]
97+
[[diagnose-task-queue-long-running-cluster-tasks]]
98+
===== Look for long-running cluster tasks
7499

75-
A task backlog might also appear as a delay in synchronizing the cluster state. You
76-
can use the <<cluster-pending,cluster pending tasks API>> to get information
77-
about the pending cluster state sync tasks that are running.
100+
Use the <<cluster-pending,cluster pending tasks API>> to identify delays
101+
in cluster state synchronization:
78102

79103
[source,console]
80104
----
81105
GET /_cluster/pending_tasks
82106
----
83107

84-
Check the `timeInQueue` to identify tasks that are taking an excessive amount
85-
of time to complete.
108+
Tasks with a high `timeInQueue` value are likely contributing to the backlog and might
109+
need to be <<resolve-task-queue-backlog-stuck-tasks,canceled>>.
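A similar filter can be sketched for pending cluster tasks. This example assumes the JSON response of the cluster pending tasks API, which as far as I know reports the wait per task as `time_in_queue_millis` inside a top-level `tasks` array. The function name, the one-second threshold, and the sample data are hypothetical.

```python
def slow_pending_tasks(pending_response, min_millis=1000):
    """Return pending cluster tasks queued for at least `min_millis`, longest first."""
    slow = [
        task for task in pending_response.get("tasks", [])
        if task.get("time_in_queue_millis", 0) >= min_millis
    ]
    return sorted(slow, key=lambda t: t["time_in_queue_millis"], reverse=True)


# Hypothetical sample shaped like `GET /_cluster/pending_tasks`
sample_pending = {
    "tasks": [
        {"insert_order": 101, "priority": "URGENT",
         "source": "create-index [logs-1], cause [api]",
         "time_in_queue_millis": 86},
        {"insert_order": 46, "priority": "HIGH",
         "source": "shard-started",
         "time_in_queue_millis": 4200},
        {"insert_order": 45, "priority": "HIGH",
         "source": "shard-started",
         "time_in_queue_millis": 15300},
    ]
}

stalled = slow_pending_tasks(sample_pending, min_millis=1000)
for task in stalled:
    print(task["insert_order"], task["priority"], task["time_in_queue_millis"])
```

Many entries sharing the same `source` with steadily growing wait times might suggest that the elected master node is struggling to keep up with cluster state updates.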
86110

87111
[discrete]
88112
[[resolve-task-queue-backlog]]
89-
==== Resolve a task queue backlog
113+
==== Recommendations
114+
115+
After identifying problematic threads and tasks, resolve the issue by increasing resources or canceling tasks.
90116

91-
**Increase available resources**
117+
[discrete]
118+
[[resolve-task-queue-backlog-resources]]
119+
===== Increase available resources
92120

93-
If tasks are progressing slowly and the queue is backing up,
94-
you might need to take steps to <<reduce-cpu-usage>>.
121+
If tasks are progressing slowly, try <<reduce-cpu-usage,reducing CPU usage>>.
95122

96-
In some cases, increasing the thread pool size might help.
97-
For example, the `force_merge` thread pool defaults to a single thread.
123+
In some cases, you might need to increase the thread pool size. For example, the `force_merge` thread pool defaults to a single thread.
98124
Increasing the size to 2 might help reduce a backlog of force merge requests.
99125

100-
**Cancel stuck tasks**
126+
[discrete]
127+
[[resolve-task-queue-backlog-stuck-tasks]]
128+
===== Cancel stuck tasks
129+
130+
If an active task's <<diagnose-task-queue-hot-thread,hot thread>> shows no progress, consider <<task-cancellation,canceling the task>>.
131+
132+
[discrete]
133+
[[resolve-task-queue-backlog-hotspotting]]
134+
===== Address hot spotting
135+
136+
If a specific node's thread pool is depleting faster than others, try addressing
137+
uneven node resource utilization, also known as hot spotting.
138+
For details on actions you can take, such as rebalancing shards, see <<hotspotting>>.
139+
140+
[discrete]
141+
==== Resources
142+
143+
Related symptoms:
144+
145+
* <<high-cpu-usage>>
146+
* <<rejected-requests>>
147+
* <<hotspotting>>
101148

102-
If you find the active task's hot thread isn't progressing and there's a backlog,
103-
consider canceling the task.
149+
// TODO add link to standard Additional resources when that topic exists
