|
1 | 1 | [[task-queue-backlog]] |
2 | | -=== Task queue backlog |
| 2 | +=== Backlogged task queue |
3 | 3 |
|
4 | | -A backlogged task queue can prevent tasks from completing and put the cluster |
5 | | -into an unhealthy state. Resource constraints, a large number of tasks being |
6 | | -triggered at once, and long running tasks can all contribute to a backlogged |
7 | | -task queue. |
| 4 | +******************************* |
| 5 | +*Product:* Elasticsearch + |
| 6 | +*Deployment type:* Elastic Cloud Enterprise, Elastic Cloud Hosted, Elastic Cloud on Kubernetes, Elastic Self-Managed + |
| 7 | +*Versions:* All |
| 8 | +******************************* |
| 9 | + |
| 10 | +A backlogged task queue can prevent tasks from completing and lead to an |
| 11 | +unhealthy cluster state. Contributing factors include resource constraints, |
| 12 | +a large number of tasks triggered at once, and long-running tasks. |
8 | 13 |
|
9 | 14 | [discrete] |
10 | 15 | [[diagnose-task-queue-backlog]] |
11 | | -==== Diagnose a task queue backlog |
| 16 | +==== Diagnose a backlogged task queue |
| 17 | + |
| 18 | +To identify the cause of the backlog, try these diagnostic actions. |
12 | 19 |
|
13 | | -**Check the thread pool status** |
| 20 | +* <<diagnose-task-queue-thread-pool>> |
| 21 | +* <<diagnose-task-queue-hot-thread>> |
| 22 | +* <<diagnose-task-queue-long-running-node-tasks>> |
| 23 | +* <<diagnose-task-queue-long-running-cluster-tasks>> |
| 24 | + |
| 25 | +[discrete] |
| 26 | +[[diagnose-task-queue-thread-pool]] |
| 27 | +===== Check the thread pool status |
14 | 28 |
|
15 | 29 | A <<high-cpu-usage,depleted thread pool>> can result in |
16 | 30 | <<rejected-requests,rejected requests>>. |
17 | 31 |
|
18 | | -Thread pool depletion might be restricted to a specific <<data-tiers,data tier>>. If <<hotspotting,hot spotting>> is occurring, one node might experience depletion faster than other nodes, leading to performance issues and a growing task backlog. |
19 | | - |
20 | | -You can use the <<cat-thread-pool,cat thread pool API>> to see the number of |
21 | | -active threads in each thread pool and how many tasks are queued, how many |
22 | | -have been rejected, and how many have completed. |
| 32 | +Use the <<cat-thread-pool,cat thread pool API>> to monitor |
| 33 | +active threads, queued tasks, rejections, and completed tasks: |
23 | 34 |
|
24 | 35 | [source,console] |
25 | 36 | ---- |
26 | 37 | GET /_cat/thread_pool?v&s=t,n&h=type,name,node_name,active,queue,rejected,completed |
27 | 38 | ---- |
28 | 39 |
|
29 | | -The `active` and `queue` statistics are instantaneous while the `rejected` and |
30 | | -`completed` statistics are cumulative from node startup. |
| 40 | +* Look for high `active` and `queue` metrics, which indicate potential bottlenecks |
| 41 | +and opportunities to <<reduce-cpu-usage,reduce CPU usage>>. |
| 42 | +* Determine whether thread pool issues are specific to a <<data-tiers,data tier>>. |
| 43 | +* Check whether a specific node's thread pool is depleting faster than others. This |
| 44 | +might indicate <<resolve-task-queue-backlog-hotspotting, hot spotting>>. |
31 | 45 |
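The checks in this list can be scripted. The following is a minimal Python sketch, assuming the cat thread pool API was called with `format=json` so that each row arrives as an object whose values are strings keyed by the requested column names. The function name `find_backed_up_pools` and the `queue_threshold` parameter are hypothetical and not part of Elasticsearch.

```python
def find_backed_up_pools(rows, queue_threshold=1):
    """Return (node_name, pool_name, queue, rejected) for pools showing a backlog.

    rows: parsed JSON from a request such as
    GET /_cat/thread_pool?format=json&h=name,node_name,active,queue,rejected
    (assumed shape: a list of objects with string values).
    """
    flagged = []
    for row in rows:
        queue = int(row["queue"])        # instantaneous value
        rejected = int(row["rejected"])  # cumulative since node startup
        if queue >= queue_threshold or rejected > 0:
            flagged.append((row["node_name"], row["name"], queue, rejected))
    # Largest queues first, so the most backed-up pool is at the top.
    flagged.sort(key=lambda item: item[2], reverse=True)
    return flagged
```

If the flagged rows repeatedly name the same node, that node might be depleting faster than the others. Because `rejected` is cumulative, compare two polls before treating a nonzero value as a current problem.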
|
32 | | -**Inspect the hot threads on each node** |
| 46 | +[discrete] |
| 47 | +[[diagnose-task-queue-hot-thread]] |
| 48 | +===== Inspect hot threads on each node |
33 | 49 |
|
34 | | -If a particular thread pool queue is backed up, you can periodically poll the |
35 | | -<<cluster-nodes-hot-threads,Nodes hot threads>> API to determine if the thread |
36 | | -has sufficient resources to progress and gauge how quickly it is progressing. |
| 50 | +If a particular thread pool queue is backed up, periodically poll the |
| 51 | +<<cluster-nodes-hot-threads,nodes hot threads API>> to gauge the thread's |
| 52 | +progression and ensure it has sufficient resources: |
37 | 53 |
|
38 | 54 | [source,console] |
39 | 55 | ---- |
40 | 56 | GET /_nodes/hot_threads |
41 | 57 | ---- |
42 | 58 |
|
43 | | -**Look for long running node tasks** |
| 59 | +Although the hot threads API response does not list the specific tasks running on a thread, |
| 60 | +it provides a summary of the thread's activities. You can correlate a hot threads response |
| 61 | +with a <<tasks,task management API response>> to identify any overlap with specific tasks. For |
| 62 | +example, if the hot threads response indicates the thread is `performing a search query`, you can |
| 63 | +<<diagnose-task-queue-long-running-node-tasks,check for long-running search tasks>> using the task management API. |
| 64 | + |
| 65 | +[discrete] |
| 66 | +[[diagnose-task-queue-long-running-node-tasks]] |
| 67 | +===== Identify long-running node tasks |
44 | 68 |
|
45 | | -Long-running tasks can also cause a backlog. You can use the <<tasks,task |
46 | | -management>> API to get information about the node tasks that are running. |
47 | | -Check the `running_time_in_nanos` to identify tasks that are taking an |
48 | | -excessive amount of time to complete. |
| 69 | +Long-running tasks can also cause a backlog. Use the <<tasks,task |
| 70 | +management API>> to check for excessive `running_time_in_nanos` values: |
49 | 71 |
|
50 | 72 | [source,console] |
51 | 73 | ---- |
52 | 74 | GET /_tasks?pretty=true&human=true&detailed=true |
53 | 75 | ---- |
54 | 76 |
|
55 | | -If a particular `action` is suspected, you can filter the tasks further. The most common long-running tasks are <<docs-bulk,bulk index>>- or search-related. |
| 77 | +You can filter on a specific `action`. <<docs-bulk,Bulk indexing>> and search-related |
| 78 | +actions are the most common long-running tasks. |
56 | 79 |
|
57 | | -* Filter for <<docs-bulk,bulk index>> actions: |
| 80 | +* Filter on <<docs-bulk,bulk index>> actions: |
58 | 81 | + |
59 | 82 | [source,console] |
60 | 83 | ---- |
61 | 84 | GET /_tasks?human&detailed&actions=indices:data/write/bulk |
62 | 85 | ---- |
63 | 86 |
|
64 | | -* Filter for search actions: |
| 87 | +* Filter on search actions: |
65 | 88 | + |
66 | 89 | [source,console] |
67 | 90 | ---- |
68 | 91 | GET /_tasks?human&detailed&actions=indices:data/read/search |
69 | 92 | ---- |
70 | 93 |
|
71 | | -The API response may contain additional task columns, including `description` and `header`, which provide the task parameters, target, and requestor. You can use this information to perform further diagnosis. |
| 94 | +Long-running tasks might need to be <<resolve-task-queue-backlog-stuck-tasks,canceled>>. |
72 | 95 |
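To avoid reading `running_time_in_nanos` by eye, you can post-process the task management API response. This is a minimal Python sketch, assuming the response groups tasks as `nodes` -> node ID -> `tasks` -> task ID, with `action` and `running_time_in_nanos` on each task. The function name `long_running_tasks` and the 60-second default are hypothetical.

```python
def long_running_tasks(tasks_response, min_seconds=60):
    """Return tasks that have run for at least min_seconds, slowest first.

    tasks_response: parsed JSON from GET /_tasks?detailed=true
    (assumed shape: {"nodes": {node_id: {"name": ..., "tasks": {task_id: {...}}}}}).
    """
    threshold_nanos = min_seconds * 1_000_000_000
    found = []
    for node_id, node in tasks_response.get("nodes", {}).items():
        for task_id, task in node.get("tasks", {}).items():
            if task["running_time_in_nanos"] >= threshold_nanos:
                found.append({
                    "task_id": task_id,
                    "node": node.get("name", node_id),
                    "action": task["action"],
                    "running_seconds": task["running_time_in_nanos"] / 1e9,
                    # Only present when the request used detailed=true.
                    "description": task.get("description", ""),
                })
    found.sort(key=lambda t: t["running_seconds"], reverse=True)
    return found
```

The threshold is a judgment call: a value that is excessive for a search might be normal for a force merge, so pick `min_seconds` per action.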
|
73 | | -**Look for long running cluster tasks** |
| 96 | +[discrete] |
| 97 | +[[diagnose-task-queue-long-running-cluster-tasks]] |
| 98 | +===== Look for long-running cluster tasks |
74 | 99 |
|
75 | | -A task backlog might also appear as a delay in synchronizing the cluster state. You |
76 | | -can use the <<cluster-pending,cluster pending tasks API>> to get information |
77 | | -about the pending cluster state sync tasks that are running. |
| 100 | +Use the <<cluster-pending,cluster pending tasks API>> to identify delays |
| 101 | +in cluster state synchronization: |
78 | 102 |
|
79 | 103 | [source,console] |
80 | 104 | ---- |
81 | 105 | GET /_cluster/pending_tasks |
82 | 106 | ---- |
83 | 107 |
|
84 | | -Check the `timeInQueue` to identify tasks that are taking an excessive amount |
85 | | -of time to complete. |
| 108 | +Tasks with a high `timeInQueue` value are likely contributing to the backlog and might |
| 109 | +need to be <<resolve-task-queue-backlog-stuck-tasks,canceled>>. |
86 | 110 |
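The same filtering can be applied to pending cluster tasks. This is a minimal Python sketch, assuming the JSON response contains a `tasks` array whose entries report the queue time as `time_in_queue_millis` alongside `priority` and `source`. The function name `slow_pending_tasks` and the one-second default are hypothetical.

```python
def slow_pending_tasks(pending_response, min_millis=1000):
    """Return pending cluster tasks queued for at least min_millis, longest first.

    pending_response: parsed JSON from GET /_cluster/pending_tasks
    (assumed shape: {"tasks": [{"source": ..., "priority": ..., "time_in_queue_millis": ...}]}).
    """
    slow = [
        task
        for task in pending_response.get("tasks", [])
        if task["time_in_queue_millis"] >= min_millis
    ]
    return sorted(slow, key=lambda task: task["time_in_queue_millis"], reverse=True)
```

An empty result on a healthy cluster is expected, because cluster state changes are usually processed quickly.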
|
87 | 111 | [discrete] |
88 | 112 | [[resolve-task-queue-backlog]] |
89 | | -==== Resolve a task queue backlog |
| 113 | +==== Recommendations |
| 114 | + |
| 115 | +After identifying problematic threads and tasks, resolve the issue by increasing resources or canceling tasks. |
90 | 116 |
|
91 | | -**Increase available resources** |
| 117 | +[discrete] |
| 118 | +[[resolve-task-queue-backlog-resources]] |
| 119 | +===== Increase available resources |
92 | 120 |
|
93 | | -If tasks are progressing slowly and the queue is backing up, |
94 | | -you might need to take steps to <<reduce-cpu-usage>>. |
| 121 | +If tasks are progressing slowly, try <<reduce-cpu-usage,reducing CPU usage>>. |
95 | 122 |
|
96 | | -In some cases, increasing the thread pool size might help. |
97 | | -For example, the `force_merge` thread pool defaults to a single thread. |
| 123 | +In some cases, you might need to increase the thread pool size. For example, the `force_merge` thread pool defaults to a single thread. |
98 | 124 | Increasing the size to 2 might help reduce a backlog of force merge requests. |
99 | 125 |
|
100 | | -**Cancel stuck tasks** |
| 126 | +[discrete] |
| 127 | +[[resolve-task-queue-backlog-stuck-tasks]] |
| 128 | +===== Cancel stuck tasks |
| 129 | + |
| 130 | +If an active task's <<diagnose-task-queue-hot-thread,hot thread>> shows no progress, consider <<task-cancellation,canceling the task>>. |
| 131 | + |
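Once you have decided which tasks to cancel, the request paths can be generated from a task management API response. This is a minimal Python sketch, assuming each task entry carries a `cancellable` flag and that a task is canceled with `POST /_tasks/<task_id>/_cancel`. The function name `cancel_requests` is hypothetical, and the sketch only builds paths; it sends nothing.

```python
def cancel_requests(tasks_response, task_ids):
    """Build cancel paths for the selected tasks that report cancellable=true.

    tasks_response: parsed JSON from GET /_tasks
    task_ids: IDs chosen for cancellation, in "<node_id>:<number>" form.
    """
    wanted = set(task_ids)
    paths = []
    for node in tasks_response.get("nodes", {}).values():
        for task_id, task in node.get("tasks", {}).items():
            # Skip tasks that were not selected or cannot be canceled.
            if task_id in wanted and task.get("cancellable", False):
                paths.append(f"/_tasks/{task_id}/_cancel")
    return paths
```

Keeping selection separate from sending leaves a point at which to review the list, which matters because canceling a task discards its work in progress.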
| 132 | +[discrete] |
| 133 | +[[resolve-task-queue-backlog-hotspotting]] |
| 134 | +===== Address hot spotting |
| 135 | + |
| 136 | +If a specific node's thread pool is depleting faster than others, try addressing |
| 137 | +uneven node resource utilization, also known as hot spotting. |
| 138 | +For details on actions you can take, such as rebalancing shards, see <<hotspotting>>. |
| 139 | + |
| 140 | +[discrete] |
| 141 | +==== Resources |
| 142 | + |
| 143 | +Related symptoms: |
| 144 | + |
| 145 | +* <<high-cpu-usage>> |
| 146 | +* <<rejected-requests>> |
| 147 | +* <<hotspotting>> |
101 | 148 |
|
102 | | -If you find the active task's hot thread isn't progressing and there's a backlog, |
103 | | -consider canceling the task. |
| 149 | +// TODO add link to standard Additional resources when that topic exists |