diff --git a/docs/reference/troubleshooting/common-issues/task-queue-backlog.asciidoc b/docs/reference/troubleshooting/common-issues/task-queue-backlog.asciidoc
index 5aa6a0129c2d4..f233f22cb3fbe 100644
--- a/docs/reference/troubleshooting/common-issues/task-queue-backlog.asciidoc
+++ b/docs/reference/troubleshooting/common-issues/task-queue-backlog.asciidoc
@@ -1,103 +1,149 @@
 [[task-queue-backlog]]
-=== Task queue backlog
+=== Backlogged task queue
 
-A backlogged task queue can prevent tasks from completing and put the cluster
-into an unhealthy state. Resource constraints, a large number of tasks being
-triggered at once, and long running tasks can all contribute to a backlogged
-task queue.
+*******************************
+*Product:* Elasticsearch
+
+*Deployment type:* Elastic Cloud Enterprise, Elastic Cloud Hosted, Elastic Cloud on Kubernetes, Elastic Self-Managed
+
+*Versions:* All
+*******************************
+
+A backlogged task queue can prevent tasks from completing and lead to an
+unhealthy cluster state. Contributing factors include resource constraints,
+a large number of tasks triggered at once, and long-running tasks.
 
 [discrete]
 [[diagnose-task-queue-backlog]]
-==== Diagnose a task queue backlog
+==== Diagnose a backlogged task queue
+
+To identify the cause of the backlog, try these diagnostic actions.
 
-**Check the thread pool status**
+* <>
+* <>
+* <>
+* <>
+
+[discrete]
+[[diagnose-task-queue-thread-pool]]
+===== Check the thread pool status
 
 A <> can result in
 <>.
 
-Thread pool depletion might be restricted to a specific <>. If <> is occuring, one node might experience depletion faster than other nodes, leading to performance issues and a growing task backlog.
-
-You can use the <> to see the number of
-active threads in each thread pool and how many tasks are queued, how many
-have been rejected, and how many have completed.
+Use the <> to monitor
+active threads, queued tasks, rejections, and completed tasks:
 
 [source,console]
 ----
 GET /_cat/thread_pool?v&s=t,n&h=type,name,node_name,active,queue,rejected,completed
 ----
 
-The `active` and `queue` statistics are instantaneous while the `rejected` and
-`completed` statistics are cumulative from node startup.
+* Look for high `active` and `queue` metrics, which indicate potential bottlenecks
+and opportunities to <>.
+* Determine whether thread pool issues are specific to a <>.
+* Check whether a specific node's thread pool is depleting faster than others. This
+might indicate <>.
 
-**Inspect the hot threads on each node**
+[discrete]
+[[diagnose-task-queue-hot-thread]]
+===== Inspect hot threads on each node
 
-If a particular thread pool queue is backed up, you can periodically poll the
-<> API to determine if the thread
-has sufficient resources to progress and gauge how quickly it is progressing.
+If a particular thread pool queue is backed up, periodically poll the
+<> to gauge the thread's
+progression and ensure it has sufficient resources:
 
 [source,console]
 ----
 GET /_nodes/hot_threads
 ----
 
-**Look for long running node tasks**
+Although the hot threads API response does not list the specific tasks running on a thread,
+it provides a summary of the thread's activities. You can correlate a hot threads response
+with a <> to identify any overlap with specific tasks. For
+example, if the hot threads response indicates the thread is `performing a search query`, you can
+<> using the task management API.
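+
+For example, here is a minimal follow-up sketch that narrows the task list to search tasks on the
+node that reported the hot thread (the node name `my-node-1` is a placeholder; `actions`, `nodes`,
+and `detailed` are standard task management API parameters):
+
+[source,console]
+----
+GET /_tasks?actions=*search&detailed&nodes=my-node-1
+----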
+
+[discrete]
+[[diagnose-task-queue-long-running-node-tasks]]
+===== Identify long-running node tasks
 
-Long-running tasks can also cause a backlog. You can use the <> API to get information about the node tasks that are running.
-Check the `running_time_in_nanos` to identify tasks that are taking an
-excessive amount of time to complete.
+Long-running tasks can also cause a backlog. Use the <> to check for excessive `running_time_in_nanos` values:
 
 [source,console]
 ----
 GET /_tasks?pretty=true&human=true&detailed=true
 ----
 
-If a particular `action` is suspected, you can filter the tasks further. The most common long-running tasks are <>- or search-related.
+You can filter on a specific `action`, such as <> or search-related tasks.
+These tend to be long-running.
 
-* Filter for <> actions:
+* Filter on <> actions:
 +
 [source,console]
 ----
 GET /_tasks?human&detailed&actions=indices:data/write/bulk
 ----
 
-* Filter for search actions:
+* Filter on search actions:
 +
 [source,console]
 ----
-GET /_tasks?human&detailed&actions=indices:data/write/search
+GET /_tasks?human&detailed&actions=indices:data/read/search
 ----
 
-The API response may contain additional tasks columns, including `description` and `header`, which provides the task parameters, target, and requestor. You can use this information to perform further diagnosis.
+Long-running tasks might need to be <>.
 
-**Look for long running cluster tasks**
+[discrete]
+[[diagnose-task-queue-long-running-cluster-tasks]]
+===== Look for long-running cluster tasks
 
-A task backlog might also appear as a delay in synchronizing the cluster state. You
-can use the <> to get information
-about the pending cluster state sync tasks that are running.
+Use the <> to identify delays
+in cluster state synchronization:
 
 [source,console]
 ----
 GET /_cluster/pending_tasks
 ----
 
-Check the `timeInQueue` to identify tasks that are taking an excessive amount
-of time to complete.
+Tasks with a high `timeInQueue` value are likely contributing to the backlog and might
+need to be <>.
 
 [discrete]
 [[resolve-task-queue-backlog]]
-==== Resolve a task queue backlog
+==== Recommendations
+
+After identifying problematic threads and tasks, resolve the issue by increasing resources or canceling tasks.
+Illustrative examples of both appear at the end of this page.
 
-**Increase available resources**
+[discrete]
+[[resolve-task-queue-backlog-resources]]
+===== Increase available resources
 
-If tasks are progressing slowly and the queue is backing up,
-you might need to take steps to <>.
+If tasks are progressing slowly, try <>.
 
-In some cases, increasing the thread pool size might help.
-For example, the `force_merge` thread pool defaults to a single thread.
+In some cases, you might need to increase the thread pool size. For example, the `force_merge` thread pool defaults to a single thread.
 Increasing the size to 2 might help reduce a backlog of force merge requests.
 
-**Cancel stuck tasks**
+[discrete]
+[[resolve-task-queue-backlog-stuck-tasks]]
+===== Cancel stuck tasks
+
+If an active task's <> shows no progress, consider <>.
+
+[discrete]
+[[resolve-task-queue-backlog-hotspotting]]
+===== Address hot spotting
+
+If a specific node's thread pool is depleting faster than others, try addressing
+uneven node resource utilization, also known as hot spotting.
+For details on actions you can take, such as rebalancing shards, see <>.
+
+[discrete]
+==== Resources
+
+Related symptoms:
+
+* <>
+* <>
+* <>
 
-If you find the active task's hot thread isn't progressing and there's a backlog,
-consider canceling the task.
\ No newline at end of file
+// TODO add link to standard Additional resources when that topic exists
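+
+[discrete]
+==== Examples
+
+The following sketches illustrate the recommendations above. They are illustrative only: the task
+ID shown is a placeholder, and the thread pool setting assumes a self-managed node where you can
+edit `elasticsearch.yml`.
+
+To cancel a stuck task, pass its task ID (reported by the task management API in the form
+`node_id:task_number`) to the cancel endpoint:
+
+[source,console]
+----
+POST /_tasks/oTUltX4IQMOUUVeiohTt8A:12345/_cancel
+----
+
+To give the `force_merge` thread pool a second thread, set its size statically and restart the
+node:
+
+[source,yaml]
+----
+thread_pool.force_merge.size: 2
+----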