diff --git a/troubleshoot/elasticsearch/high-cpu-usage.md b/troubleshoot/elasticsearch/high-cpu-usage.md
index a9ad328d07..577082feee 100644
--- a/troubleshoot/elasticsearch/high-cpu-usage.md
+++ b/troubleshoot/elasticsearch/high-cpu-usage.md
@@ -10,159 +10,169 @@ products:
 
 # Symptom: High CPU usage [high-cpu-usage]
 
-{{es}} uses [thread pools](elasticsearch://reference/elasticsearch/configuration-reference/thread-pool-settings.md) to manage CPU resources for concurrent operations. High CPU usage typically means one or more thread pools are running low.
+{{es}} uses [thread pools](elasticsearch://reference/elasticsearch/configuration-reference/thread-pool-settings.md) to manage node CPU and JVM resources for concurrent operations. Each thread pool is allotted a different number of threads, frequently sized based on the total number of processors allocated to the node. This helps the node remain responsive while processing either [expensive tasks or a task queue backlog](task-queue-backlog.md). {{es}} [rejects requests](rejected-requests.md) related to a thread pool while its queue is saturated.
 
-If a thread pool is depleted, {{es}} will [reject requests](rejected-requests.md) related to the thread pool. For example, if the `search` thread pool is depleted, {{es}} will reject search requests until more threads are available.
+An individual task can spawn work on multiple node threads, frequently within these designated thread pools. It is normal for an [individual thread](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-nodes-hot-threads) to saturate its CPU usage. A thread reporting CPU saturation could reflect either a single expensive task keeping the thread busy or the thread processing work from many different tasks. The hot threads report shows a snapshot of Java threads across a time interval, so hot threads cannot be directly mapped to any given [node task](https://www.elastic.co/docs/api/doc/elasticsearch/group/endpoint-tasks).
 
-:::{include} /deploy-manage/_snippets/autoops-callout-with-ech.md
-:::
+A node can temporarily saturate all of the CPU threads allocated to it. It's unusual for this state to persist for an extended period. If it does, it might suggest that the node is:
+* sized disproportionately to its [data tier](/manage-data/lifecycle/data-tiers.md) peers.
+* receiving a volume of requests above its workload capability, for example if the node is sized below the [minimum recommendations](/deploy-manage/deploy/elastic-cloud/elastic-cloud-hosted-planning.md#ec-minimum-recommendations).
+* processing an [expensive task](task-queue-backlog.md).
+
+To mitigate performance outages, we recommend capturing an [{{es}} diagnostic](/troubleshoot/elasticsearch/diagnostic.md) for post-mortem analysis while attempting to resolve the issue through [scaling](/deploy-manage/production-guidance/scaling-considerations.md).
 
-## Diagnose high CPU usage [diagnose-high-cpu-usage]
+Refer to the sections below to troubleshoot degraded CPU performance.
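+
+Before diving into those sections, a quick way to confirm the rejection behavior described above is to list per-pool activity, queue depth, and rejections (a minimal sketch; adjust the `h` columns to your needs):
+
+```console
+GET _cat/thread_pool?v=true&h=node_name,name,active,queue,rejected&s=rejected:desc
+```
+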
+
+## Diagnose high CPU usage [diagnose]
 
 ### Check CPU usage [check-cpu-usage]
 
-You can check the CPU usage per node using the [cat nodes API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cat-nodes):
+To check the CPU usage per node, use the [cat nodes API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cat-nodes):
 
 ```console
-GET _cat/nodes?v=true&s=cpu:desc
+GET _cat/nodes?v=true&s=cpu:desc&h=name,role,master,cpu,load*,allocated_processors
 ```
 
-The response’s `cpu` column contains the current CPU usage as a percentage. The `name` column contains the node’s name. Elevated but transient CPU usage is normal. However, if CPU usage is elevated for an extended duration, it should be investigated.
+The reported metrics are:
 
-To track CPU usage over time, we recommend enabling monitoring:
+* `cpu`: the instantaneous percentage of system CPU usage
+* `load_1m`, `load_5m`, and `load_15m`: the average number of processes waiting to run over the designated time interval
+* `allocated_processors`: the number of processors allocated to the node {applies_to}`stack: ga 9.3`
 
-:::::::{applies-switch}
+For more detail, refer to the [node statistics](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-nodes-stats) API documentation.
 
-::::::{applies-item} { ess:, ece: }
-* (Recommended) Enable [logs and metrics](../../deploy-manage/monitor/stack-monitoring/ece-ech-stack-monitoring.md). When logs and metrics are enabled, monitoring information is visible on {{kib}}'s [Stack Monitoring](../../deploy-manage/monitor/monitoring-data/visualizing-monitoring-data.md) page.
+Alerting thresholds for these metrics depend on your team's workload and duration requirements. However, as a general starting baseline, you might consider investigating if:
 
-    You can also enable the [CPU usage threshold alert](../../deploy-manage/monitor/monitoring-data/configure-stack-monitoring-alerts.md) to be notified about potential issues through email.
+* (Recommended) CPU usage remains elevated above 95% for an extended interval.
+* Load average divided by the node's allocated processors is elevated. This metric is insufficient as a gauge by itself and should be considered alongside elevated response times, because it might otherwise reflect normal background I/O.
 
-* From your deployment menu, view the [**Performance**](../../deploy-manage/monitor/access-performance-metrics-on-elastic-cloud.md) page. On this page, you can view two key metrics:
+If CPU usage is concerning, we recommend using the `role` and `master` columns in this output to check whether the pattern is segmented by node role or [hot spotted](/troubleshoot/elasticsearch/hotspotting.md) on particular nodes. CPU issues spanning an entire data tier suggest a configuration issue or an undersized tier. CPU issues spanning only a subset of nodes within one or more data tiers suggest tasks [hot spotting](/troubleshoot/elasticsearch/hotspotting.md) on those nodes.
 
-  * **CPU usage**: Your deployment’s CPU usage, represented as a percentage.
-  * **CPU credits**: Your remaining CPU credits, measured in seconds of CPU time.
+### Check hot threads [check-hot-threads]
 
-{{ech}} grants [CPU credits](/deploy-manage/deploy/elastic-cloud/ec-vcpu-boost-instance.md) per deployment to provide smaller clusters with performance boosts when needed. High CPU usage can deplete these credits, which might lead to [performance degradation](../monitoring/performance.md) and [increased cluster response times](../monitoring/cluster-response-time.md).
-::::::
+High CPU usage frequently correlates to [a long-running task or a backlog of tasks](task-queue-backlog.md). When a node reports elevated CPU usage, use the [nodes hot threads API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-nodes-hot-threads) to check for resource-intensive threads running on it and to correlate those threads to tasks.
 
-::::::{applies-item} { self:, eck: }
-* Enable [{{es}} monitoring](../../deploy-manage/monitor/stack-monitoring.md). When logs and metrics are enabled, monitoring information is visible on {{kib}}'s [Stack Monitoring](../../deploy-manage/monitor/monitoring-data/visualizing-monitoring-data.md) page.
+```console
+GET _nodes/hot_threads
+```
 
-    You can also enable the [CPU usage threshold alert](../../deploy-manage/monitor/monitoring-data/configure-stack-monitoring-alerts.md) to be notified about potential issues through email.
-::::::
+This API returns a snapshot of hot Java threads. As a simplified example, the response output might appear like the following:
+
+```text
+::: {instance-0000000001}{9fVI1XoXQJCgHwsOPlVEig}{RrJGwEaESRmNs75Gjs1SOg}{instance-0000000001}{10.42.9.84}{10.42.9.84:19058}{himrst}{9.3.0}{7000099-8525000}{region=unknown-region, server_name=instance-0000000001.b84ab96b481f43d791a1a73477a10d40, xpack.installed=true, transform.config_version=10.0.0, ml.config_version=12.0.0, data=hot, logical_availability_zone=zone-1, availability_zone=us-central1-a, instance_configuration=gcp.es.datahot.n2.68x10x45}
+   Hot threads at 2025-05-14T17:59:30.199Z, interval=500ms, busiestThreads=10000, ignoreIdleThreads=true:
+
+   88.5% [cpu=88.5%, other=0.0%] (442.5ms out of 500ms) cpu usage by thread '[write]'
+     8/10 snapshots sharing following 29 elements
+       com.fasterxml.jackson.dataformat.smile@2.17.2/com.fasterxml.jackson.dataformat.smile.SmileParser.nextToken(SmileParser.java:434)
+       org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.doAdd(LocalBulk.java:69)
+       # ...
+     2/10 snapshots sharing following 37 elements
+       app/org.elasticsearch.xcontent/org.elasticsearch.xcontent.support.filtering.FilterPath$FilterPathBuilder.insertNode(FilterPath.java:172)
+       # ...
+```
 
-:::::::
+The response output is formatted as follows:
+
+```text
+::: {NAME}{ID}{...}{HOST_NAME}{ADDRESS}{...}{ROLES}{VERSION}{...}{ATTRIBUTES}
+   Hot threads at TIMESTAMP, interval=INTERVAL_FROM_API, busiestThreads=THREADS_FROM_API, ignoreIdleThreads=IDLE_FROM_API:
+
+   TOTAL_CPU% [cpu=ELASTIC_CPU%, other=OTHER_CPU%] (Xms out of INTERVAL_FROM_API) cpu usage by thread 'THREAD'
+     X/... snapshots sharing following X elements
+       STACKTRACE_SAMPLE
+       # ...
+     X/... snapshots sharing following X elements
+       STACKTRACE_SAMPLE
+       # ...
+```
 
-### Check hot threads [check-hot-threads]
+Three measures of CPU time are reported in the API output:
 
-If a node has high CPU usage, use the [nodes hot threads API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-nodes-hot-threads) to check for resource-intensive threads running on the node.
+* `TOTAL_CPU`: the total CPU used by the thread (either by {{es}} or the operating system)
+* `ELASTIC_CPU`: the CPU available to {{es}} and used by {{es}}
+* `OTHER_CPU`: a miscellaneous bucket for disk/network I/O or garbage collection (GC)
 
-```console
-GET _nodes/hot_threads
-```
+Although `ELASTIC_CPU` is usually the main driver of elevated `TOTAL_CPU`, you should also investigate the `STACKTRACE_SAMPLE` lines. These lines frequently reference {{es}} [loggers](/deploy-manage/monitor/logging-configuration.md) but might also surface non-{{es}} processes. Common examples of performance-related entries include:
 
-This API returns a breakdown of any hot threads in plain text. High CPU usage frequently correlates to [a long-running task, or a backlog of tasks](task-queue-backlog.md).
+* `org.elasticsearch.action.search` or `org.elasticsearch.search` indicates a [running search](/explore-analyze/index.md)
+* `org.elasticsearch.cluster.metadata.Metadata.findAliases` indicates an [alias](/manage-data/data-store/aliases.md) look-up or resolution
+* `org.elasticsearch.common.regex` indicates [custom Regex code](/explore-analyze/scripting/modules-scripting-regular-expressions-tutorial.md)
+* `org.elasticsearch.grok` indicates [custom Grok code](/explore-analyze/scripting/grok.md)
+* `org.elasticsearch.index.fielddata.ordinals.GlobalOrdinalsBuilder.build` indicates [building global ordinals](elasticsearch://reference/elasticsearch/mapping-reference/eager-global-ordinals.md)
+* `org.elasticsearch.ingest.Pipeline` or `org.elasticsearch.ingest.CompoundProcessor` indicates an [ingest pipeline](/manage-data/ingest/transform-enrich/ingest-pipelines.md)
+* `org.elasticsearch.xpack.core.esql` or `org.elasticsearch.xpack.esql` indicates a [running ES|QL](/explore-analyze/query-filter/languages/esql-kibana.md) query
+
+If your team would like assistance correlating hot threads and node tasks, [capture your {{es}} diagnostics](/troubleshoot/elasticsearch/diagnostic.md) when you [contact us](/troubleshoot/index.md#contact-us).
 
-## Reduce CPU usage [reduce-cpu-usage]
+### Check garbage collection [check-garbage-collection]
 
-The following tips outline the most common causes of high CPU usage and their solutions.
+High CPU usage is often caused by excessive JVM garbage collection (GC) activity. This excessive GC typically arises from configuration problems or inefficient queries causing increased heap memory usage.
 
-### Check JVM garbage collection [check-jvm-garbage-collection]
+For troubleshooting information, refer to [high JVM memory pressure](/troubleshoot/elasticsearch/high-jvm-memory-pressure.md).
 
-High CPU usage is often caused by excessive JVM garbage collection (GC) activity. This excessive GC typically arises from configuration problems or inefficient queries causing increased heap memory usage.
-For optimal JVM performance, garbage collection should meet these criteria:
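+
+As a quick first check before following that link, you can list heap usage per node with the cat nodes API (a sketch; `heap.percent` and `ram.percent` are standard cat nodes columns):
+
+```console
+GET _cat/nodes?v=true&h=name,heap.percent,ram.percent&s=heap.percent:desc
+```
+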
+## Monitor CPU usage [monitor]
 
-| GC type | Completion time | Frequency |
-|---------|----------------|---------------------|
-| Young GC | <50ms | ~once per 10 seconds |
-| Old GC | <1s | ≤once per 10 minutes |
+:::{include} /deploy-manage/_snippets/autoops-callout-with-ech.md
+:::
 
-Excessive JVM garbage collection usually indicates high heap memory usage. Common potential reasons for increased heap memory usage include:
+To track CPU usage over time, we recommend enabling monitoring:
 
-* Oversharding of indices
-* Very large aggregation queries
-* Excessively large bulk indexing requests
-* Inefficient or incorrect mapping definitions
-* Improper heap size configuration
-* Misconfiguration of JVM new generation ratio (`-XX:NewRatio`)
+:::::::{applies-switch}
 
-### Hot spotting [high-cpu-usage-hot-spotting]
+::::::{applies-item} { ess:, ece: }
+* (Recommended) Enable [AutoOps](/deploy-manage/monitor/autoops.md).
+* Enable [logs and metrics](/deploy-manage/monitor/stack-monitoring/ece-ech-stack-monitoring.md). When logs and metrics are enabled, monitoring information is visible on {{kib}}'s [Stack Monitoring](../../deploy-manage/monitor/monitoring-data/visualizing-monitoring-data.md) page.
 
-You might experience high CPU usage on specific data nodes or an entire [data tier](/manage-data/lifecycle/data-tiers.md) if traffic isn’t evenly distributed. This is known as [hot spotting](hotspotting.md). Hot spotting commonly occurs when read or write applications don’t evenly distribute requests across nodes, or when indices receiving heavy write activity, such as indices in the hot tier, have their shards concentrated on just one or a few nodes.
+  You can also enable the [CPU usage threshold alert](../../deploy-manage/monitor/monitoring-data/configure-stack-monitoring-alerts.md) to be notified about potential issues through email.
 
-For details on diagnosing and resolving these issues, refer to [](hotspotting.md).
+* From your deployment menu, view the [**Performance**](../../deploy-manage/monitor/access-performance-metrics-on-elastic-cloud.md) page. On this page, you can view two key metrics:
 
-### Oversharding [high-cpu-usage-oversharding]
+  * **CPU usage**: Your deployment’s CPU usage, represented as a percentage.
+  * **CPU credits**: Your remaining CPU credits, measured in seconds of CPU time.
 
-Oversharding occurs when a cluster has too many shards, often times caused by shards being smaller than optimal. While {{es}} doesn’t have a strict minimum shard size, an excessive number of small shards can negatively impact performance. Each shard consumes cluster resources because {{es}} must maintain metadata and manage shard states across all nodes.
-If you have too many small shards, you can address this by doing the following:
+{{ech}} grants [CPU credits](/deploy-manage/deploy/elastic-cloud/ec-vcpu-boost-instance.md) per deployment to provide smaller clusters with performance boosts when needed. High CPU usage can deplete these credits, which might lead to [performance degradation](../monitoring/performance.md) and [increased cluster response times](../monitoring/cluster-response-time.md).
+::::::
 
-* Removing empty or unused indices.
-* Deleting or closing indices containing outdated or unnecessary data.
-* Reindexing smaller shards into fewer, larger shards to optimize cluster performance.
+::::::{applies-item} { self:, eck: }
+* (Recommended) Enable [AutoOps](/deploy-manage/monitor/autoops.md).
+* Enable [{{es}} monitoring](/deploy-manage/monitor/stack-monitoring.md). When logs and metrics are enabled, monitoring information is visible on {{kib}}'s [Stack Monitoring](../../deploy-manage/monitor/monitoring-data/visualizing-monitoring-data.md) page.
 
-If your shards are sized correctly but you are still experiencing oversharding, creating a more aggressive [index lifecycle management strategy](/manage-data/lifecycle/index-lifecycle-management.md) or deleting old indices can help reduce the number of shards.
+  You can also enable the [CPU usage threshold alert](../../deploy-manage/monitor/monitoring-data/configure-stack-monitoring-alerts.md) to be notified about potential issues through email.
+::::::
 
-For more information, refer to [](/deploy-manage/production-guidance/optimize-performance/size-shards.md).
+:::::::
 
-### Additional recommendations
+You might also consider enabling [slow logs](elasticsearch://reference/elasticsearch/index-settings/slow-log.md) so that slow operations can be reviewed as part of investigating a [task backlog](task-queue-backlog.md).
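+
+For example, the following sketch enables search slow logs on a hypothetical index named `my-index` (the index name and thresholds are placeholders; tune them to your workload):
+
+```console
+PUT my-index/_settings
+{
+  "index.search.slowlog.threshold.query.warn": "10s",
+  "index.search.slowlog.threshold.query.info": "5s",
+  "index.search.slowlog.threshold.fetch.warn": "1s"
+}
+```
+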
-To further reduce CPU load or mitigate temporary spikes in resource usage, consider these steps:
 
-#### Scale your cluster [scale-your-cluster]
+## Reduce CPU usage [reduce-cpu-usage]
 
-Heavy indexing and search loads can deplete smaller thread pools. Add nodes or upgrade existing ones to handle increased indexing and search loads more effectively.
+High CPU usage usually correlates to live [expensive tasks or backlogged tasks](task-queue-backlog.md) running against the node. The following tips outline common causes of, and solutions for, elevated CPU usage that persists even during periods of low or no traffic.
 
-#### Spread out bulk requests [spread-out-bulk-requests]
+### Oversharding [high-cpu-usage-oversharding]
 
-Submit smaller [bulk indexing](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-bulk-1) or multi-search requests, and space them out to avoid overwhelming thread pools.
+Oversharding occurs when a cluster has too many shards, often caused by shards that are smaller than optimal. We recommend the following best practices:
 
-#### Cancel long-running searches [cancel-long-running-searches]
+* [Aim for shards of up to 200M documents, or with sizes between 10GB and 50GB](/deploy-manage/production-guidance/optimize-performance/size-shards.md#shard-size-recommendation).
+* [Master-eligible nodes should have at least 1GB of heap per 3000 indices](/deploy-manage/production-guidance/optimize-performance/size-shards.md#shard-count-recommendation).
 
-Regularly use the [task management API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-tasks-list) to identify and cancel searches that consume excessive CPU time.
+While {{es}} doesn’t have a strict minimum shard size, an excessive number of small shards can negatively impact performance. Each shard consumes cluster resources because {{es}} must maintain metadata and manage shard states across all nodes.
 
-```console
-GET _tasks?actions=*search&detailed
-```
+If you have too many small shards, you can address this by doing the following:
 
-The response’s `description` contains the search request and its queries. `running_time_in_nanos` shows how long the search has been running.
-
-```console-result
-{
-  "nodes" : {
-    "oTUltX4IQMOUUVeiohTt8A" : {
-      "name" : "my-node",
-      "transport_address" : "127.0.0.1:9300",
-      "host" : "127.0.0.1",
-      "ip" : "127.0.0.1:9300",
-      "tasks" : {
-        "oTUltX4IQMOUUVeiohTt8A:464" : {
-          "node" : "oTUltX4IQMOUUVeiohTt8A",
-          "id" : 464,
-          "type" : "transport",
-          "action" : "indices:data/read/search",
-          "description" : "indices[my-index], search_type[QUERY_THEN_FETCH], source[{\"query\":...}]",
-          "start_time_in_millis" : 4081771730000,
-          "running_time_in_nanos" : 13991383,
-          "cancellable" : true
-        }
-      }
-    }
-  }
-}
-```
+* Removing empty or unused indices.
+* Deleting or closing indices containing outdated or unnecessary data.
+* Reindexing smaller shards into fewer, larger shards to optimize cluster performance.
 
-To cancel a search and free up resources, use the API’s `_cancel` endpoint.
+If your shards are sized correctly but you are still experiencing oversharding, creating a more aggressive [index lifecycle management strategy](/manage-data/lifecycle/index-lifecycle-management.md) or deleting old indices can help reduce the number of shards.
 
-```console
-POST _tasks/oTUltX4IQMOUUVeiohTt8A:464/_cancel
-```
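+
+To find candidates, you can list the smallest shards in the cluster (a sketch; adjust the columns and sorting to your needs):
+
+```console
+GET _cat/shards?v=true&h=index,shard,prirep,docs,store,node&s=store:asc
+```
+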
+### Overridden allocated processors [high-cpu-usage-allocated]
+
+By default, {{es}} allocates processors equal to the number reported available by the operating system. You can override this behavior by adjusting the value of [`node.processors`](elasticsearch://reference/elasticsearch/configuration-reference/thread-pool-settings.md#node.processors), but this advanced setting should be configured only after you've performed load testing.
 
-For additional tips on how to track and avoid resource-intensive searches, see [Avoid expensive searches](high-jvm-memory-pressure.md#avoid-expensive-searches).
+{{ech}} supports [vCPU boosting](/deploy-manage/deploy/elastic-cloud/ec-vcpu-boost-instance.md), which should be relied on only for short traffic bursts and not for normal workload traffic.
diff --git a/troubleshoot/elasticsearch/high-jvm-memory-pressure.md b/troubleshoot/elasticsearch/high-jvm-memory-pressure.md
index 8f5bcd9dc1..9173081aa2 100644
--- a/troubleshoot/elasticsearch/high-jvm-memory-pressure.md
+++ b/troubleshoot/elasticsearch/high-jvm-memory-pressure.md
@@ -18,7 +18,7 @@ High JVM memory usage can degrade cluster performance and trigger [circuit break
 
 ## Diagnose high JVM memory pressure [diagnose-high-jvm-memory-pressure]
 
-**Check JVM memory pressure**
+### Check JVM memory pressure [diagnose-check-pressure]
 
 :::::::{applies-switch}
 
@@ -49,7 +49,8 @@ JVM Memory Pressure = `used_in_bytes` / `max_in_bytes`
 ::::::
 :::::::
 
-**Check garbage collection logs**
+
+### Check garbage collection logs [diagnose-check-gc]
 
 As memory usage increases, garbage collection becomes more frequent and takes longer. You can track the frequency and length of garbage collection events in [`elasticsearch.log`](../../deploy-manage/monitor/logging-configuration/elasticsearch-log4j-configuration-self-managed.md). For example, the following event states {{es}} spent more than 50% (21 seconds) of the last 40 seconds performing garbage collection.
 
@@ -57,10 +58,24 @@ As memory usage increases, garbage collection becomes more frequent and takes lo
 [timestamp_short_interval_from_last][INFO ][o.e.m.j.JvmGcMonitorService] [node_id] [gc][number] overhead, spent [21s] collecting in the last [40s]
 ```
 
-**Capture a JVM heap dump**
+For optimal JVM performance, garbage collection (GC) should meet these criteria:
+
+| GC type | Completion time | Frequency |
+|---------|----------------|---------------------|
+| Young GC | <50ms | ~once per 10 seconds |
+| Old GC | <1s | ≤once per 10 minutes |
+
+
+### Capture a JVM heap dump [diagnose-check-dump]
+
+To determine the exact reason for the high JVM memory pressure, capture and review a heap dump of the JVM while its memory usage is high.
 
-To determine the exact reason for the high JVM memory pressure, capture a heap dump of the JVM while its memory usage is high, and also capture the [garbage collector logs](elasticsearch://reference/elasticsearch/jvm-settings.md#gc-logging) covering the same time period.
+If you have an [Elastic subscription](https://www.elastic.co/pricing), you can [request Elastic's assistance](/troubleshoot.md#contact-us) reviewing this output. When doing so:
+* Grant written permission for Elastic to review your uploaded heap dumps within the support case.
+* Share this file only after receiving any necessary business approvals, as it might contain private information. Files are handled according to [Elastic's privacy statement](https://www.elastic.co/legal/privacy-statement).
+* Share heap dumps through our secure [Support Portal](https://support.elastic.co/). If your files are too large to upload, you can request a secure URL in the support case.
+* Share the [garbage collector logs](elasticsearch://reference/elasticsearch/jvm-settings.md#gc-logging) covering the same time period.
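+
+Alongside the heap dump, it can also help to capture the JVM statistics that feed the memory pressure calculation described above. As a sketch, the following request returns only the old generation pool and collector statistics (trim or expand the `filter_path` as needed):
+
+```console
+GET _nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.mem.pools.old,nodes.*.jvm.gc.collectors.old
+```
+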
 
 ## Reduce JVM memory pressure [reduce-jvm-memory-pressure]
 
diff --git a/troubleshoot/elasticsearch/hotspotting.md b/troubleshoot/elasticsearch/hotspotting.md
index d448e1690a..7b09e6e4a7 100644
--- a/troubleshoot/elasticsearch/hotspotting.md
+++ b/troubleshoot/elasticsearch/hotspotting.md
@@ -50,9 +50,12 @@ Historically, clusters experience hot spotting mainly as an effect of hardware,
 
 ### Hardware [causes-hardware]
 
-Here are some common improper hardware setups which may contribute to hot spotting:
+Here are some common improper hardware setups which might contribute to hot spotting:
 
-* Resources are allocated non-uniformly. For example, if one hot node is given half the CPU of its peers. {{es}} expects all nodes on a [data tier](../../manage-data/lifecycle/data-tiers.md) to share the same hardware profiles or specifications.
+* Resources are allocated non-uniformly. For example, if one hot node is given half the CPU of its peers. {{es}} expects all nodes on a [data tier](../../manage-data/lifecycle/data-tiers.md) to share the same hardware profiles or specifications. To check this, use the [cat nodes API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cat-nodes):
+  ```console
+  GET _cat/nodes?v=true&s=name&h=name,role,disk.total,heap.max,allocated_processors
+  ```
 * Resources are consumed by another service on the host, including other {{es}} nodes. Refer to our [dedicated host](../../deploy-manage/deploy/self-managed/installing-elasticsearch.md#dedicated-host) recommendation.
 * Resources experience different network or disk throughputs. For example, if one node’s I/O is lower than its peers. Refer to [Use faster hardware](../../deploy-manage/production-guidance/optimize-performance/indexing-speed.md#indexing-use-faster-hardware) for more information.
 * A JVM that has been configured with a heap larger than 31GB. Refer to [Set the JVM heap size](elasticsearch://reference/elasticsearch/jvm-settings.md#set-jvm-heap-size) for more information.
diff --git a/troubleshoot/elasticsearch/task-queue-backlog.md b/troubleshoot/elasticsearch/task-queue-backlog.md
index da2fbfe873..de98edcbab 100644
--- a/troubleshoot/elasticsearch/task-queue-backlog.md
+++ b/troubleshoot/elasticsearch/task-queue-backlog.md
@@ -23,6 +23,7 @@ To identify the cause of the backlog, try these diagnostic actions.
 * [Inspect hot threads on each node](#diagnose-task-queue-hot-thread)
 * [Identify long-running node tasks](#diagnose-task-queue-long-running-node-tasks)
 * [Look for long-running cluster tasks](#diagnose-task-queue-long-running-cluster-tasks)
+* [Monitor slow logs](#diagnose-task-slow-logs)
 
 ### Check the thread pool status [diagnose-task-queue-thread-pool]
 
@@ -88,6 +89,11 @@ GET /_cluster/pending_tasks
 
 Tasks with a high `timeInQueue` value are likely contributing to the backlog and might need to be [canceled](#resolve-task-queue-backlog-stuck-tasks).
 
+### Monitor slow logs [diagnose-task-slow-logs]
+
+If you're not present during an incident to investigate backlogged tasks in real time, consider enabling [slow logs](elasticsearch://reference/elasticsearch/index-settings/slow-log.md) so that you can review them later.
+
+For example, you can review slow search logs later using the [search profiler](elasticsearch://reference/elasticsearch/rest-apis/search-profile.md), so that time-consuming requests can be optimized.
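+
+To profile a query taken from the slow log, you can rerun it with profiling enabled (a sketch; `my-index` and the query body are placeholders):
+
+```console
+GET my-index/_search
+{
+  "profile": true,
+  "query": {
+    "match": { "message": "error" }
+  }
+}
+```
+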
 
 ## Recommendations [resolve-task-queue-backlog]
 
@@ -105,6 +111,47 @@ In some cases, you might need to increase the thread pool size. For example, the
 
 If an active task’s [hot thread](#diagnose-task-queue-hot-thread) shows no progress, consider [canceling the task](https://www.elastic.co/docs/api/doc/elasticsearch/group/endpoint-tasks#task-cancellation).
 
+For example, you can use the [task management API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-tasks-list) to identify and cancel searches that consume excessive CPU time.
+
+```console
+GET _tasks?actions=*search&detailed
+```
+
+The response’s `description` field contains the search request and its queries. The `running_time_in_nanos` field shows how long the search has been running.
+
+```console-result
+{
+  "nodes" : {
+    "oTUltX4IQMOUUVeiohTt8A" : {
+      "name" : "my-node",
+      "transport_address" : "127.0.0.1:9300",
+      "host" : "127.0.0.1",
+      "ip" : "127.0.0.1:9300",
+      "tasks" : {
+        "oTUltX4IQMOUUVeiohTt8A:464" : {
+          "node" : "oTUltX4IQMOUUVeiohTt8A",
+          "id" : 464,
+          "type" : "transport",
+          "action" : "indices:data/read/search",
+          "description" : "indices[my-index], search_type[QUERY_THEN_FETCH], source[{\"query\":...}]",
+          "start_time_in_millis" : 4081771730000,
+          "running_time_in_nanos" : 13991383,
+          "cancellable" : true
+        }
+      }
+    }
+  }
+}
+```
+
+To cancel this example search and free up resources, you would run:
+
+```console
+POST _tasks/oTUltX4IQMOUUVeiohTt8A:464/_cancel
+```
+
+For additional tips on how to track and avoid resource-intensive searches, see [Avoid expensive searches](high-jvm-memory-pressure.md#avoid-expensive-searches).
+
 ### Address hot spotting [resolve-task-queue-backlog-hotspotting]