diff --git a/troubleshoot/elasticsearch/hotspotting.md b/troubleshoot/elasticsearch/hotspotting.md
index 7b09e6e4a7..01c5a980cc 100644
--- a/troubleshoot/elasticsearch/hotspotting.md
+++ b/troubleshoot/elasticsearch/hotspotting.md
@@ -23,6 +23,10 @@ Watch [this video](https://www.youtube.com/watch?v=Q5ODJ5nIKAM) for a walkthroug
## Detect hot spotting [detect]
+To check for hot spotting, you can investigate both active utilization levels and historical node statistics.
+
+### Active [detect-active]
+
Hot spotting most commonly surfaces as significantly elevated resource utilization (of `disk.percent`, `heap.percent`, or `cpu`) among a subset of nodes as reported via [cat nodes](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cat-nodes). Individual spikes aren’t necessarily problematic, but if utilization repeatedly spikes or consistently remains high over time (for example longer than 30 seconds), the resource may be experiencing problematic hot spotting.
For example, let’s show case two separate plausible issues using cat nodes:
@@ -42,6 +46,38 @@ node_3 - hirstmv 25 90 10
Here we see two significantly unique utilizations: where the master node is at `cpu: 95` and a hot node is at `disk.used_percent: 90%`. This would indicate hot spotting was occurring on these two nodes, and not necessarily from the same root cause.
+### Historical [detect-historical]
+
+A secondary way to detect hot spotting as it builds up is to poll the [node statistics API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-nodes-stats) for index-related performance metrics.
+
+```console
+GET _nodes/stats?pretty=true&filter_path=nodes.*.name,nodes.*.roles,nodes.*.indices
+```
+
+This request returns node operational metrics such as `query`, `refresh`, and `index`. It allows you to gauge:
+
+* the total events attempted per node
+* the node's average processing time per event type
+
+These metrics accumulate over each individual node's uptime. To make the output easier to review, you can parse the response using a third-party tool such as [JQ](https://jqlang.github.io/jq/):
+
+```bash
+cat nodes_stats.json | jq -rc '.nodes[]|.name as $n|.roles as $r|.indices|to_entries[]|.key as $m|.value|select(.total and .total_time_in_millis)|select(.total>0)|{node:$n, roles:$r, metric:$m, total:.total, avg_millis:(.total_time_in_millis?/.total|round)}'
+```
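+
+A hypothetical line of the parsed output (the node name, roles, and numbers are illustrative only, not taken from a real cluster) might look like:
+
+```json
+{"node":"node_1","roles":["data_warm"],"metric":"refresh","total":85093,"avg_millis":14}
+```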
+
+If the results indicate that multiple major operations are non-performant across nodes, the cluster is likely under-provisioned. If a particular operation type or node stands out, it likely indicates [shard distribution issues](#causes-shards), which you can compare against [indices stats](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-indices-stats).
+
+```console
+GET /_stats?level=shards&human&expand_wildcards=all&ignore_unavailable=true
+```
+
+These metrics accumulate over each individual shard's history. You can parse this response with a tool like JQ and compare it against the earlier node-level output:
+
+```bash
+cat indices_stats.json | jq -rc '.indices|to_entries[]|.key as $i|.value.shards[]|to_entries[]|.key as $sh|.value|.routing.primary as $p|.routing.node[:4] as $n|to_entries[]|.key as $m|.value|select(.total and .total_time_in_millis)|select(.total>0)|{index:$i, shard:$sh, primary:$p, node:$n, metric:$m, total:.total, avg_millis:(.total_time_in_millis/.total|round)}'
+```
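+
+As a sketch, a parsed shard-level line might look like the following, where the index name, shard number, node prefix, and timings are all hypothetical:
+
+```json
+{"index":"my-index-000001","shard":"0","primary":true,"node":"xIKj","metric":"refresh","total":4019,"avg_millis":12}
+```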
+
+
## Causes [causes]
@@ -149,36 +185,4 @@ cat shard_stats.json | jq -rc 'sort_by(-.avg_indexing)[]' | head
### Task loads [causes-tasks]
-Shard distribution problems will most-likely surface as task load as seen above in the [cat thread pool](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cat-thread-pool) example. It is also possible for tasks to hot spot a node either due to individual qualitative expensiveness or overall quantitative traffic loads.
-
-For example, if [cat thread pool](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cat-thread-pool) reported a high queue on the `warmer` [thread pool](elasticsearch://reference/elasticsearch/configuration-reference/thread-pool-settings.md), you would look-up the effected node’s [hot threads](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-nodes-hot-threads). Let’s say it reported `warmer` threads at `100% cpu` related to `GlobalOrdinalsBuilder`. This would let you know to inspect [field data’s global ordinals](elasticsearch://reference/elasticsearch/mapping-reference/eager-global-ordinals.md).
-
-Alternatively, let’s say [cat nodes](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cat-nodes) shows a hot spotted master node and [cat thread pool](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cat-thread-pool) shows general queuing across nodes. This would suggest the master node is overwhelmed. To resolve this, first ensure [hardware high availability](../../deploy-manage/production-guidance/availability-and-resilience/resilience-in-small-clusters.md) setup and then look to ephemeral causes. In this example, [the nodes hot threads API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-nodes-hot-threads) reports multiple threads in `other` which indicates they’re waiting on or blocked by either garbage collection or I/O.
-
-For either of these example situations, a good way to confirm the problematic tasks is to look at longest running non-continuous (designated `[c]`) tasks via [cat task management](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cat-tasks). This can be supplemented checking longest running cluster sync tasks via [cat pending tasks](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cat-pending-tasks). Using a third example,
-
-```console
-GET _cat/tasks?v&s=time:desc&h=type,action,running_time,node,cancellable
-```
-
-This could return:
-
-```console-result
-type action running_time node cancellable
-direct indices:data/read/eql 10m node_1 true
-...
-```
-
-This surfaces a problematic [EQL query](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-eql-search). We can gain further insight on it via [the task management API](https://www.elastic.co/docs/api/doc/elasticsearch/group/endpoint-tasks),
-
-```console
-GET _tasks?human&detailed
-```
-
-Its response contains a `description` that reports this query:
-
-```eql
-indices[winlogbeat-*,logs-window*], sequence by winlog.computer_name with maxspan=1m\n\n[authentication where host.os.type == "windows" and event.action:"logged-in" and\n event.outcome == "success" and process.name == "svchost.exe" ] by winlog.event_data.TargetLogonId
-```
-
-This lets you know which indices to check (`winlogbeat-*,logs-window*`), as well as the [EQL search](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-eql-search) request body. Most likely this is [SIEM related](/solutions/security.md). You can combine this with [audit logging](../../deploy-manage/security/logging-configuration/enabling-audit-logs.md) as needed to trace the request source.
+Shard distribution problems will most likely surface as task load, as seen above in the [cat thread pool](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cat-thread-pool) example. It is also possible for tasks to hot spot a node, either because individual tasks are qualitatively expensive or because the overall traffic volume is too high; both surface as [backlogged tasks](/troubleshoot/elasticsearch/task-queue-backlog.md).
diff --git a/troubleshoot/elasticsearch/task-queue-backlog.md b/troubleshoot/elasticsearch/task-queue-backlog.md
index de98edcbab..3d3fcf8e14 100644
--- a/troubleshoot/elasticsearch/task-queue-backlog.md
+++ b/troubleshoot/elasticsearch/task-queue-backlog.md
@@ -12,44 +12,61 @@ products:
% **Product:** Elasticsearch
**Deployment type:** Elastic Cloud % Enterprise, Elastic Cloud Hosted, Elastic Cloud on Kubernetes, Elastic
% Self-Managed
**Versions:** All
-A backlogged task queue can prevent tasks from completing and lead to an unhealthy cluster state. Contributing factors include resource constraints, a large number of tasks triggered at once, and long-running tasks.
+A backlogged task queue can lead to [rejected requests](/troubleshoot/elasticsearch/rejected-requests.md) or an [unhealthy cluster state](/troubleshoot/elasticsearch/red-yellow-cluster-status.md). Contributing factors can include [uneven or resource-constrained hardware](/troubleshoot/elasticsearch/hotspotting.md#causes-hardware), a large number of tasks triggered at the same time, expensive tasks that use [high CPU](/troubleshoot/elasticsearch/high-cpu-usage.md) or induce [high JVM memory pressure](/troubleshoot/elasticsearch/high-jvm-memory-pressure.md), and long-running tasks.
## Diagnose a backlogged task queue [diagnose-task-queue-backlog]
To identify the cause of the backlog, try these diagnostic actions.
-* [Check the thread pool status](#diagnose-task-queue-thread-pool)
-* [Inspect hot threads on each node](#diagnose-task-queue-hot-thread)
+* [Check thread pool status](#diagnose-task-queue-thread-pool)
+* [Inspect node hot threads](#diagnose-task-queue-hot-thread)
* [Identify long-running node tasks](#diagnose-task-queue-long-running-node-tasks)
* [Look for long-running cluster tasks](#diagnose-task-queue-long-running-cluster-tasks)
* [Monitor slow logs](#diagnose-task-slow-logs)
-### Check the thread pool status [diagnose-task-queue-thread-pool]
-
-A [depleted thread pool](high-cpu-usage.md) can result in [rejected requests](rejected-requests.md).
+### Check thread pool status [diagnose-task-queue-thread-pool]
Use the [cat thread pool API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cat-thread-pool) to monitor active threads, queued tasks, rejections, and completed tasks:
```console
-GET /_cat/thread_pool?v&s=t,n&h=type,name,node_name,active,queue,rejected,completed
+GET /_cat/thread_pool?v&s=t,n&h=type,name,node_name,pool_size,active,queue_size,queue,rejected,completed
```
-* Look for high `active` and `queue` metrics, which indicate potential bottlenecks and opportunities to [reduce CPU usage](high-cpu-usage.md#reduce-cpu-usage).
-* Determine whether thread pool issues are specific to a [data tier](../../manage-data/lifecycle/data-tiers.md).
-* Check whether a specific node’s thread pool is depleting faster than others. This might indicate [hot spotting](#resolve-task-queue-backlog-hotspotting).
+To interpret these [thread pool](elasticsearch://reference/elasticsearch/configuration-reference/thread-pool-settings.md) metrics:
+* the `active` and `queue` statistics are point-in-time
+* the `rejected` and `completed` statistics are cumulative from node start-up
+* the thread pool fills `active` until it reaches `pool_size`, then fills `queue` until it reaches `queue_size`, after which it starts [rejecting requests](/troubleshoot/elasticsearch/rejected-requests.md), as illustrated in the hypothetical output below
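+
+For example, a hypothetical response (node name and numbers are illustrative only) where the `search` pool has exhausted its threads and is queuing might look like:
+
+```console-result
+type  name   node_name pool_size active queue_size queue rejected completed
+fixed search node_1            5      5       1000   400       13    143800
+fixed write  node_1            4      1      10000     0        0     97035
+```
+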
-### Inspect hot threads on each node [diagnose-task-queue-hot-thread]
+There are a number of things that you can check as potential causes for the queue backlog:
-If a particular thread pool queue is backed up, periodically poll the [nodes hot threads API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-nodes-hot-threads) to gauge the thread’s progression and ensure it has sufficient resources:
+* Look for continually high `queue` metrics, which indicate long-running tasks or [CPU-expensive tasks](high-cpu-usage.md).
+* Look for bursts of elevated `queue` metrics, which indicate opportunities to spread traffic volume.
+* Determine whether thread pool issues are specific to a [node role](/deploy-manage/distributed-architecture/clusters-nodes-shards/node-roles.md).
+* Check whether a specific node is depleting faster than others within a [data tier](/manage-data/lifecycle/data-tiers.md). This might indicate [hot spotting](/troubleshoot/elasticsearch/hotspotting.md); the sorted request below can help confirm it.
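+
+As a quick way to surface the node that's depleting fastest, you can re-run the earlier request restricted to the busiest pools and sorted by queue depth (the pool selection here is only a suggestion; adjust it to your workload):
+
+```console
+GET /_cat/thread_pool/search,write?v=true&h=node_name,name,active,queue,rejected,completed&s=queue:desc
+```
+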
-```console
-GET /_nodes/hot_threads
-```
-Although the hot threads API response does not list the specific tasks running on a thread, it provides a summary of the thread’s activities. You can correlate a hot threads response with a [task management API response](https://www.elastic.co/docs/api/doc/elasticsearch/group/endpoint-tasks) to identify any overlap with specific tasks. For example, if the hot threads response indicates the thread is `performing a search query`, you can [check for long-running search tasks](#diagnose-task-queue-long-running-node-tasks) using the task management API.
+### Inspect node hot threads [diagnose-task-queue-hot-thread]
+
+If a particular thread pool queue is backed up, periodically poll the CPU-related APIs to gauge task progression versus resource constraints:
+
+* the [nodes hot threads API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-nodes-hot-threads)
+
+ ```console
+ GET /_nodes/hot_threads
+ ```
+
+* the [cat nodes API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cat-nodes)
+
+ ```console
+ GET _cat/nodes?v=true&s=cpu:desc
+ ```
+
+If `cpu` is consistently elevated, or a hot thread's stack trace stays the same across polls over an extended period, investigate [high CPU usage](high-cpu-usage.md#check-hot-threads).
+
+Although the hot threads API response does not list the specific tasks running on a thread, it provides a summary of the thread’s activities. You can correlate a hot threads response with a [task management API response](https://www.elastic.co/docs/api/doc/elasticsearch/group/endpoint-tasks) to identify any overlap with specific tasks. For example, if hot threads suggest the node is spending time in `search`, filter the [task management API for search tasks](#diagnose-task-queue-long-running-node-tasks).
### Identify long-running node tasks [diagnose-task-queue-long-running-node-tasks]
@@ -60,34 +77,46 @@ Long-running tasks can also cause a backlog. Use the [task management API](https
GET /_tasks?pretty=true&human=true&detailed=true
```
-You can filter on a specific `action`, such as [bulk indexing](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-bulk) or search-related tasks. These tend to be long-running.
+You can filter on a specific `action`, such as [bulk indexing](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-bulk) or search-related tasks. If you're investigating particular nodes, you can also filter the API by the `nodes` parameter (a comma-separated list of node names or IDs).
* Filter on [bulk index](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-bulk) actions:
```console
GET /_tasks?human&detailed&actions=indices:*write*
+
+ GET /_tasks?human&detailed&actions=indices:*write*&nodes=
```
* Filter on search actions:
```console
GET /_tasks?human&detailed&actions=indices:*search*
+
+ GET /_tasks?human&detailed&actions=indices:*search*&nodes=
```
Long-running tasks might need to be [canceled](#resolve-task-queue-backlog-stuck-tasks).
-See this [this video](https://www.youtube.com/watch?v=lzw6Wla92NY) for a walkthrough of troubleshooting the [task management API](https://www.elastic.co/docs/api/doc/elasticsearch/group/endpoint-tasks) output.
+Refer to [this video](https://www.youtube.com/watch?v=lzw6Wla92NY) for a walkthrough of how to troubleshoot the [task management API](https://www.elastic.co/docs/api/doc/elasticsearch/group/endpoint-tasks) output.
+
+You can also check the [Tune for search speed](/deploy-manage/production-guidance/optimize-performance/search-speed.md) and [Tune for indexing speed](/deploy-manage/production-guidance/optimize-performance/indexing-speed.md) pages for more information.
### Look for long-running cluster tasks [diagnose-task-queue-long-running-cluster-tasks]
-Use the [cluster pending tasks API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cluster-pending-tasks) to identify delays in cluster state synchronization:
+Use the [cat pending tasks API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cat-pending-tasks) to identify delays in cluster state synchronization:
```console
-GET /_cluster/pending_tasks
+GET /_cat/pending_tasks?v=true
```
-Tasks with a high `timeInQueue` value are likely contributing to the backlog and might need to be [canceled](#resolve-task-queue-backlog-stuck-tasks).
+Cluster state synchronization can be expected to fall behind while a [cluster is unstable](/troubleshoot/elasticsearch/troubleshooting-unstable-cluster.md), but otherwise a persistent backlog usually indicates an unworkable [cluster setting](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cluster-get-settings) override or traffic pattern.
+
+There are a few common `source` values to check for:
+
+* `ilm-`: [{{ilm}} ({{ilm-init}})](/manage-data/lifecycle/index-lifecycle-management.md) polls every `10m` by default, as determined by the [`indices.lifecycle.poll_interval`](elasticsearch://reference/elasticsearch/configuration-reference/index-lifecycle-management-settings.md) setting. Each poll starts asynchronous tasks that are then executed as node tasks. If {{ilm-init}} continually reports as a cluster pending task, this setting has likely been overridden; the settings check after this list can help confirm that. Otherwise, the cluster likely has a misconfigured [index count relative to master heap size](/deploy-manage/production-guidance/optimize-performance/size-shards.md#shard-count-recommendation).
+* `put-mapping`: {{es}} enables [dynamic mapping](/manage-data/data-store/mapping/dynamic-mapping.md) by default. This, or the [update mapping API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-indices-put-mapping), triggers a mapping update. In this case, the corresponding cluster log will contain an `update_mapping` entry with the name of the affected index.
+* `shard-started`: Indicates [active shard recoveries](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cat-recovery). Overriding [`cluster.routing.allocation.*` settings](elasticsearch://reference/elasticsearch/configuration-reference/cluster-level-shard-allocation-routing-settings.md#cluster-shard-allocation-settings) can cause pending tasks and recoveries to back up.
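+
+To rule out an unintended override behind any of these `source` values, you can review the non-default cluster settings (a minimal check; add `include_defaults=true` if you also need to inspect default values):
+
+```console
+GET _cluster/settings?flat_settings=true
+```
+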
### Monitor slow logs [diagnose-task-slow-logs]
@@ -97,19 +126,35 @@ For example, you can review slow search logs later using the [search profiler](e
## Recommendations [resolve-task-queue-backlog]
-After identifying problematic threads and tasks, resolve the issue by increasing resources or canceling tasks.
+As noted above, task backlogs frequently stem from:
+* a traffic volume spike
+* [expensive tasks](#diagnose-task-queue-hot-thread) that are causing [high CPU](/troubleshoot/elasticsearch/high-cpu-usage.md)
+* [long-running tasks](#diagnose-task-queue-long-running-node-tasks)
+* [hot spotting](hotspotting.md), particularly from [uneven or resource constrained hardware](/troubleshoot/elasticsearch/hotspotting.md#causes-hardware)
-### Increase available resources [resolve-task-queue-backlog-resources]
+Many of these can be investigated in isolation as unintended traffic-pattern or configuration changes. Refer to the following recommendations to address repeated or long-standing symptoms.
+
+### Address CPU-intensive tasks [resolve-task-queue-backlog-cpu]
-If tasks are progressing slowly, try [reducing CPU usage](high-cpu-usage.md#reduce-cpu-usage).
+If an individual task is backing up a [thread pool `queue`](#diagnose-task-queue-thread-pool) because of [high CPU usage](high-cpu-usage.md), try [cancelling the task](#resolve-task-queue-backlog-stuck-tasks) and then optimizing it before retrying.
-In some cases, you might need to increase the thread pool size. For example, the `force_merge` thread pool defaults to a single thread. Increasing the size to 2 might help reduce a backlog of force merge requests.
+This problem can surface due to a number of possible causes:
+
+* Creating new tasks or modifying scheduled tasks which either run frequently or are broad in their effect, such as [{{ilm}}](/manage-data/lifecycle/index-lifecycle-management.md) policies or [rules](/explore-analyze/alerts-cases.md)
+* Performing traffic load testing
+* Doing extended look-backs, especially across [data tiers](/manage-data/lifecycle/data-tiers.md)
+* [Searching](https://www.elastic.co/docs/api/doc/elasticsearch/group/endpoint-search) or performing [bulk updates](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-bulk) to a high number of indices in a single request
### Cancel stuck tasks [resolve-task-queue-backlog-stuck-tasks]
-If an active task’s [hot thread](#diagnose-task-queue-hot-thread) shows no progress, consider [canceling the task](https://www.elastic.co/docs/api/doc/elasticsearch/group/endpoint-tasks#task-cancellation).
+If an active task’s [hot thread](#diagnose-task-queue-hot-thread) shows no progress, consider [canceling the task](https://www.elastic.co/docs/api/doc/elasticsearch/group/endpoint-tasks#task-cancellation) if it's flagged as `cancellable`.
+
+If you consistently encounter `cancellable` tasks running longer than expected, you might consider reviewing:
+
+* setting a [`search.default_search_timeout`](/solutions/search/the-search-api.md#search-timeout), as sketched after this list
+* ensuring [scroll requests are cleared](elasticsearch://reference/elasticsearch/rest-apis/paginate-search-results.md#clear-scroll) in a timely manner
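+
+As a sketch of the first option, you could apply a cluster-wide search timeout with the cluster update settings API; the `30s` value is only an illustration and should be sized to your own workloads:
+
+```console
+PUT _cluster/settings
+{
+  "persistent": {
+    "search.default_search_timeout": "30s"
+  }
+}
+```
+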
For example, you can use the [task management API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-tasks-list) to identify and cancel searches that consume excessive CPU time.
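+
+A minimal sketch of that flow follows; the task ID is a placeholder that you would copy from your own task list output:
+
+```console
+GET _tasks?detailed=true&actions=*search*
+
+POST _tasks/oTUltX4IQMOUUVeiohTt8A:12345/_cancel
+```
+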
@@ -155,14 +200,14 @@ For additional tips on how to track and avoid resource-intensive searches, see [
### Address hot spotting [resolve-task-queue-backlog-hotspotting]
-If a specific node’s thread pool is depleting faster than others, try addressing uneven node resource utilization, also known as hot spotting. For details on actions you can take, such as rebalancing shards, see [Hot spotting](hotspotting.md).
+If a specific node’s thread pool is depleting faster than its [data tier](/manage-data/lifecycle/data-tiers.md) peers, try addressing uneven node resource utilization, also known as "hot spotting". For details about corrective actions you can take, such as rebalancing shards, refer to the [Hot spotting](hotspotting.md) troubleshooting documentation.
+### Increase available resources [resolve-task-queue-backlog-resources]
-## Resources [_resources]
+By default, {{es}} allocates processors equal to the number reported available by the operating system. You can override this behavior by adjusting the value of [`node.processors`](elasticsearch://reference/elasticsearch/configuration-reference/thread-pool-settings.md#node.processors), but this advanced setting should be configured only after you've performed load testing.
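+
+A minimal `elasticsearch.yml` sketch, assuming load testing showed that four processors is the right ceiling for the node (the value here is purely illustrative):
+
+```yaml
+node.processors: 4
+```
+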
-Related symptoms:
+In some cases, you might need to increase the problematic thread pool's `size`. For example, it might help to increase the size of a backed-up [`force_merge` thread pool](elasticsearch://reference/elasticsearch/configuration-reference/thread-pool-settings.md). If the size is automatically calculated as `1` based on the available CPU processors, you can increase it to `2` in `elasticsearch.yml`, for example:
-* [High CPU usage](high-cpu-usage.md)
-* [Rejected requests](rejected-requests.md)
-* [Hot spotting](hotspotting.md)
-* [Troubleshooting overview](/troubleshoot/index.md)
+```yaml
+thread_pool.force_merge.size: 2
+```