2 changes: 1 addition & 1 deletion docs/reference/esql/task-management.asciidoc
@@ -9,7 +9,7 @@ You can list running {esql} queries with the <<tasks,task management API>>:

[source,console,id=esql-task-management-get-all]
----
GET /_tasks?pretty&detailed&group_by=parents&human&actions=*data/read/esql
GET /_tasks?pretty=true&human=true&detailed=true&group_by=parents&actions=*data/read/esql
----

Which returns a list of statuses like this:
3 changes: 2 additions & 1 deletion docs/reference/modules/indices/circuit_breaker.asciidoc
@@ -175,7 +175,8 @@ an `OutOfMemory` exception which would bring down the node.
To prevent this from happening, a special <<circuit-breaker, circuit breaker>> is used,
which limits the memory allocation during the execution of a <<eql-sequences, sequence>>
query. When the breaker is triggered, an `org.elasticsearch.common.breaker.CircuitBreakingException`
is thrown and a descriptive error message is returned to the user.
is thrown and a descriptive error message including `circuit_breaking_exception`
is returned to the user.
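
To see how much memory this breaker currently tracks and how often it has
tripped, you can query breaker statistics through the <<cluster-nodes-stats,node stats API>>.
This is a minimal illustration; the `eql_sequence` breaker name is assumed here
and may vary between versions, so check the unfiltered response if it is missing.

[source,console]
----
// The eql_sequence breaker name is an assumption; inspect the full breaker list if absent
GET /_nodes/stats/breaker?filter_path=nodes.*.breakers.eql_sequence
----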

This <<circuit-breaker, circuit breaker>> can be configured using the following settings:

49 changes: 24 additions & 25 deletions docs/reference/tab-widgets/cpu-usage.asciidoc
@@ -1,30 +1,29 @@
// tag::cloud[]
From your deployment menu, click **Performance**. The page's **CPU Usage** chart
shows your deployment's CPU usage as a percentage.

High CPU usage can also deplete your CPU credits. CPU credits let {ess} provide
smaller clusters with a performance boost when needed. The **CPU credits**
chart shows your remaining CPU credits, measured in seconds of CPU time.

You can also use the <<cat-nodes,cat nodes API>> to get the current CPU usage
for each node.

// tag::cpu-usage-cat-nodes[]
[source,console]
----
GET _cat/nodes?v=true&s=cpu:desc
----

The response's `cpu` column contains the current CPU usage as a percentage. The
`name` column contains the node's name.
// end::cpu-usage-cat-nodes[]

* (Recommended) Enabling {cloud}/ec-monitoring-setup.html[Logs and Metrics]. Data will then
be reported under {kib}'s {kibana-ref}/xpack-monitoring.html[Stack Monitoring]. We
recommend enabling its {kibana-ref}/kibana-alerts.html[CPU Usage Threshold Alert]
to be proactively notified about potential issues.
* From your deployment menu, clicking into
{cloud}/ec-saas-metrics-accessing.html[**Performance**]. This page's **CPU
Usage** chart shows your deployment's CPU usage as a percentage. The page's
**CPU credits** chart shows your remaining CPU credits, measured in seconds of
CPU time.
{ess} grants {cloud}/ec-vcpu-boost-instance.html[CPU credits] per deployment
to provide smaller clusters with performance boosts when needed. High CPU
usage can deplete these credits, which may lead to symptoms like:
* {cloud}/ec-scenario_why_is_performance_degrading_over_time.html[Why is
performance degrading over time?].
* {cloud}/ec-scenario_why_are_my_cluster_response_times_suddenly_so_much_worse.html[Why
are my cluster response times suddenly so much worse?]
// end::cloud[]
// tag::self-managed[]

Use the <<cat-nodes,cat nodes API>> to get the current CPU usage for each node.

include::cpu-usage.asciidoc[tag=cpu-usage-cat-nodes]

* Enabling <<monitoring-overview,{es} Monitoring>>. Data will then
be reported under {kib}'s {kibana-ref}/xpack-monitoring.html[Stack Monitoring]. We
recommend enabling its {kibana-ref}/kibana-alerts.html[CPU Usage Threshold Alert]
to be proactively notified about potential issues.
// end::self-managed[]
2 changes: 1 addition & 1 deletion docs/reference/transform/troubleshooting.asciidoc
@@ -20,7 +20,7 @@ by your `transform_id`.
information about the {transform} status and failures.
* If the {transform} exists as a task, you can use the
<<tasks,task management API>> to gather task information. For example:
`GET _tasks?actions=data_frame/transforms*&detailed`. Typically, the task exists
`GET _tasks?actions=data_frame/transforms*&detailed=true`. Typically, the task exists
when the {transform} is in a started or failed state.
* The {es} logs from the node that was running the {transform} might
also contain useful information. You can identify the node from the notification
@@ -9,12 +9,30 @@ If a thread pool is depleted, {es} will <<rejected-requests,reject requests>>
related to the thread pool. For example, if the `search` thread pool is
depleted, {es} will reject search requests until more threads are available.
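
For example, to spot-check whether the `search` thread pool is queuing or
rejecting work, you can poll the <<cat-thread-pool,cat thread pool API>> (an
illustrative request; any thread pool name can be substituted):

[source,console]
----
GET /_cat/thread_pool/search?v=true&h=node_name,name,active,queue,rejected,completed
----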

High CPU usage frequently relates to a single <<data-tiers,data tier>>'s traffic
and can be a symptom of <<hotspotting,hot spotting>>.

[discrete]
[[diagnose-high-cpu-usage]]
==== Diagnose high CPU usage

**Check CPU usage**

Current CPU usage per node can be polled from the <<cat-nodes,cat nodes API>>:

// tag::cpu-usage-cat-nodes[]
[source,console]
----
GET _cat/nodes?v=true&s=cpu:desc
----

The response's `cpu` column contains the current CPU usage as a percentage.
The `name` column contains the node's name. A brief spike in `cpu` is normal,
but `cpu` that stays elevated for an extended period should be investigated.
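
If you want a narrower view, the same request can be limited to the CPU-related
columns and load averages (an optional variant):

[source,console]
----
GET _cat/nodes?v=true&s=cpu:desc&h=name,cpu,load_1m,load_5m,load_15m
----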

To track CPU usage over time, we recommend enabling monitoring:

include::{es-ref-dir}/tab-widgets/cpu-usage-widget.asciidoc[]

**Check hot threads**
@@ -24,11 +42,13 @@ threads API>> to check for resource-intensive threads running on the node.

[source,console]
----
GET _nodes/my-node,my-other-node/hot_threads
GET _nodes/hot_threads
----
// TEST[s/\/my-node,my-other-node//]

This API returns a breakdown of any hot threads in plain text.
This API returns a breakdown of any hot threads in plain text. High CPU usage
frequently correlates with <<task-queue-backlog,particular tasks or a task
backlog>>.
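
The hot threads API also accepts parameters that tune the sample, for example
restricting the report to CPU time and sampling more threads. The values below
are illustrative; the defaults are usually sufficient.

[source,console]
----
GET _nodes/hot_threads?type=cpu&threads=5
----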

[discrete]
[[reduce-cpu-usage]]
@@ -56,7 +76,7 @@ for these searches, use the <<tasks,task management API>>.

[source,console]
----
GET _tasks?actions=*search&detailed
GET _tasks?actions=*search&detailed=true
----

The response's `description` contains the search request and its queries.
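
If a long-running search is no longer needed, you can free up resources by
cancelling its task. The task ID below is a placeholder; substitute an ID from
the previous response.

[source,console]
----
// Placeholder task ID; use a value taken from the GET _tasks response
POST _tasks/oTUltX4IQMOUUVeiohTt8A:12345/_cancel
----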
@@ -263,7 +263,7 @@ further insight on it via <<tasks,the task management API>>,

[source,console]
----
GET _tasks?human&detailed
GET _tasks?pretty=true&human=true&detailed=true
----

Its response contains a `description` that reports this query:
@@ -23,9 +23,50 @@ To check the number of rejected tasks for each thread pool, use the

[source,console]
----
GET /_cat/thread_pool?v=true&h=id,name,active,rejected,completed
GET /_cat/thread_pool?v=true&h=id,name,queue,active,rejected,completed
----

`write` thread pool rejections frequently surface in the failing API's response
and in the correlated logs as an `EsRejectedExecutionException` that mentions
either `QueueResizingEsThreadPoolExecutor` or `queue capacity`.

This frequently relates to <<task-queue-backlog,backlogged tasks>>.
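
To see the same counters per node for a specific pool, for example the `write`
pool, you can also use the <<cluster-nodes-stats,node stats API>> (a narrowed
illustration using `filter_path`):

[source,console]
----
GET _nodes/stats/thread_pool?filter_path=nodes.*.thread_pool.write
----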

[discrete]
[[check-circuit-breakers]]
==== Check circuit breakers

To check the number of tripped <<circuit-breaker,circuit breakers>>, use the
<<cluster-nodes-stats,node stats API>>.

[source,console]
----
GET /_nodes/stats/breaker
----

These statistics are cumulative from node start up. For more information, see
<<circuit_breaker,circuit breaker errors>>.
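
To focus on how many times each breaker has tripped since node start up, you
can narrow the response (an optional variant of the request above):

[source,console]
----
GET /_nodes/stats/breaker?filter_path=nodes.*.breakers.*.tripped
----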

[discrete]
[[check-indexing-pressure]]
==== Check indexing pressure

To check the number of <<index-modules-indexing-pressure,indexing pressure>>
rejections, use the <<cluster-nodes-stats,node stats API>>.

[source,console]
----
GET _nodes/stats?human=true&filter_path=nodes.*.indexing_pressure
----

These statistics are cumulative from node start up. Related API errors
include an `EsRejectedExecutionException` that calls out the rejection as due
to `coordinating_and_primary_bytes`, `coordinating`, `primary`, or `replica`.

This frequently relates to <<task-queue-backlog,backlogged tasks>>,
<<docs-bulk,bulk index>> sizing, and/or the ingest target's
<<index-modules,`refresh_interval` setting>>.
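
For example, if the ingest target refreshes very frequently, relaxing its
`refresh_interval` can reduce indexing pressure. The index name and interval
below are illustrative only; choose values that match your search freshness
requirements.

[source,console]
----
// Illustrative index name and interval
PUT /my-index-000001/_settings
{
  "index": {
    "refresh_interval": "30s"
  }
}
----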

[discrete]
[[prevent-rejected-requests]]
==== Prevent rejected requests
@@ -34,9 +75,4 @@ GET /_cat/thread_pool?v=true&h=id,name,active,rejected,completed

If {es} regularly rejects requests and other tasks, your cluster likely has high
CPU usage or high JVM memory pressure. For tips, see <<high-cpu-usage>> and
<<high-jvm-memory-pressure>>.

**Prevent circuit breaker errors**

If you regularly trigger circuit breaker errors, see <<circuit-breaker-errors>>
for tips on diagnosing and preventing them.
<<high-jvm-memory-pressure>>.
@@ -1,51 +1,90 @@
[[task-queue-backlog]]
=== Task queue backlog

A backlogged task queue can prevent tasks from completing and
put the cluster into an unhealthy state.
Resource constraints, a large number of tasks being triggered at once,
and long running tasks can all contribute to a backlogged task queue.
A backlogged task queue can prevent tasks from completing and put the cluster
into an unhealthy state. Resource constraints, a large number of tasks being
triggered at once, and long running tasks can all contribute to a backlogged
task queue.

[discrete]
[[diagnose-task-queue-backlog]]
==== Diagnose a task queue backlog

**Check the thread pool status**

A <<high-cpu-usage,depleted thread pool>> can result in <<rejected-requests,rejected requests>>.
A <<high-cpu-usage,depleted thread pool>> can result in
<<rejected-requests,rejected requests>>. This may be restricted to a single
<<data-tiers,data tier>>'s traffic and can be a symptom of
<<hotspotting,hot spotting>>.

You can use the <<cat-thread-pool,cat thread pool API>> to
see the number of active threads in each thread pool and
how many tasks are queued, how many have been rejected, and how many have completed.
You can use the <<cat-thread-pool,cat thread pool API>> to see the number of
active threads in each thread pool and how many tasks are queued, how many
have been rejected, and how many have completed.

[source,console]
----
GET /_cat/thread_pool?v&s=t,n&h=type,name,node_name,active,queue,rejected,completed
----

The `active` and `queue` statistics are instantaneous while the `rejected` and
`completed` statistics are cumulative from node start up.

**Inspect the hot threads on each node**

If a particular thread pool queue is backed up,
you can periodically poll the <<cluster-nodes-hot-threads,Nodes hot threads>> API
to determine if the thread has sufficient
resources to progress and gauge how quickly it is progressing.
If a particular thread pool queue is backed up, you can periodically poll the
<<cluster-nodes-hot-threads,Nodes hot threads>> API to determine if the thread
has sufficient resources to progress and gauge how quickly it is progressing.

[source,console]
----
GET /_nodes/hot_threads
----

**Look for long running tasks**
**Look for long running node tasks**

Long-running tasks can also cause a backlog. You can use the <<tasks,task
management>> API to get information about the node tasks that are running.
Check the `running_time_in_nanos` to identify tasks that are taking an
excessive amount of time to complete.

[source,console]
----
GET /_tasks?pretty=true&human=true&detailed=true
----

If a particular `action` is suspected, you can filter the tasks further. The most common are:

* <<docs-bulk,bulk index>> related
+
[source,console]
----
GET /_tasks?pretty=true&human=true&detailed=true&actions=indices:data/write/bulk
----

* search related
+
[source,console]
----
GET /_tasks?pretty=true&human=true&detailed=true&actions=indices:data/read/search
----

Note that the API response may contain the task fields `description` and `header`,
which enable further diagnosis of a task's parameters, target, and requestor.
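
To trim the response down to the fields that matter when hunting for
long-running tasks, you can add a `filter_path` (an optional variant of the
requests above):

[source,console]
----
GET /_tasks?detailed=true&filter_path=nodes.*.tasks.*.action,nodes.*.tasks.*.description,nodes.*.tasks.*.running_time_in_nanos
----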

**Look for long running cluster tasks**

A backlog may also surface as a delay in synchronizing the cluster state. You
can use the <<cat-pending-tasks,cat pending tasks API>> to get information
about the pending cluster state update tasks that are queued.

[source,console]
----
GET /_tasks?filter_path=nodes.*.tasks
GET /_cat/pending_tasks?v=true
----

Check the `timeInQueue` to identify tasks that are taking an excessive amount
of time to complete.
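
The same information is available in JSON form from the cluster pending tasks
API, where the equivalent field is `time_in_queue_millis` (or `time_in_queue`
when `human=true`):

[source,console]
----
GET /_cluster/pending_tasks?human=true
----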

[discrete]
[[resolve-task-queue-backlog]]
==== Resolve a task queue backlog