2 changes: 1 addition & 1 deletion docs/reference/esql/task-management.asciidoc
@@ -9,7 +9,7 @@ You can list running {esql} queries with the <<tasks,task management API>>:

[source,console,id=esql-task-management-get-all]
----
GET /_tasks?pretty&detailed&group_by=parents&human&actions=*data/read/esql
GET /_tasks?pretty=true&human=true&detailed=true&group_by=parents&actions=*data/read/esql
----

Which returns a list of statuses like this:
3 changes: 2 additions & 1 deletion docs/reference/modules/indices/circuit_breaker.asciidoc
@@ -175,7 +175,8 @@ an `OutOfMemory` exception which would bring down the node.
To prevent this from happening, a special <<circuit-breaker, circuit breaker>> is used,
which limits the memory allocation during the execution of a <<eql-sequences, sequence>>
query. When the breaker is triggered, an `org.elasticsearch.common.breaker.CircuitBreakingException`
is thrown and a descriptive error message is returned to the user.
is thrown and a descriptive error message including `circuit_breaking_exception`
is returned to the user.
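
To see how much memory this breaker currently tracks and how often it has
tripped, you can query breaker statistics through the <<cluster-nodes-stats,node stats API>>.
This is a minimal illustration; the `eql_sequence` breaker name is assumed here
and may vary between versions, so check the unfiltered response if it is missing.

[source,console]
----
// The eql_sequence breaker name is an assumption; inspect the full breaker list if absent
GET /_nodes/stats/breaker?filter_path=nodes.*.breakers.eql_sequence
----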

This <<circuit-breaker, circuit breaker>> can be configured using the following settings:

49 changes: 24 additions & 25 deletions docs/reference/tab-widgets/cpu-usage.asciidoc
@@ -1,30 +1,29 @@
// tag::cloud[]
From your deployment menu, click **Performance**. The page's **CPU Usage** chart
shows your deployment's CPU usage as a percentage.

High CPU usage can also deplete your CPU credits. CPU credits let {ess} provide
smaller clusters with a performance boost when needed. The **CPU credits**
chart shows your remaining CPU credits, measured in seconds of CPU time.

You can also use the <<cat-nodes,cat nodes API>> to get the current CPU usage
for each node.

// tag::cpu-usage-cat-nodes[]
[source,console]
----
GET _cat/nodes?v=true&s=cpu:desc
----

The response's `cpu` column contains the current CPU usage as a percentage. The
`name` column contains the node's name.
// end::cpu-usage-cat-nodes[]

* (Recommended) Enabling {cloud}/ec-monitoring-setup.html[Logs and Metrics]. Data will then
be reported under {kib}'s {kibana-ref}/xpack-monitoring.html[Stack Monitoring]. We
recommend enabling its {kibana-ref}/kibana-alerts.html[CPU Usage Threshold Alert]
to be proactively notified about potential issues.
* From your deployment menu, clicking into
{cloud}/ec-saas-metrics-accessing.html[**Performance**]. This page's **CPU
Usage** chart shows your deployment's CPU usage as a percentage. The page's
**CPU credits** chart shows your remaining CPU credits, measured in seconds of
CPU time.
{ess} grants {cloud}/ec-vcpu-boost-instance.html[CPU credits] per deployment
to provide smaller clusters with performance boosts when needed. High CPU
usage can deplete these credits, which may lead to symptoms like:
* {cloud}/ec-scenario_why_is_performance_degrading_over_time.html[Why is
performance degrading over time?].
* {cloud}/ec-scenario_why_are_my_cluster_response_times_suddenly_so_much_worse.html[Why
are my cluster response times suddenly so much worse?]
// end::cloud[]
// tag::self-managed[]

Use the <<cat-nodes,cat nodes API>> to get the current CPU usage for each node.

include::cpu-usage.asciidoc[tag=cpu-usage-cat-nodes]

* Enabling <<monitoring-overview,{es} Monitoring>>. Data will then
be reported under {kib}'s {kibana-ref}/xpack-monitoring.html[Stack Monitoring]. We
recommend enabling its {kibana-ref}/kibana-alerts.html[CPU Usage Threshold Alert]
to be proactively notified about potential issues.
// end::self-managed[]
2 changes: 1 addition & 1 deletion docs/reference/transform/troubleshooting.asciidoc
@@ -20,7 +20,7 @@ by your `transform_id`.
information about the {transform} status and failures.
* If the {transform} exists as a task, you can use the
<<tasks,task management API>> to gather task information. For example:
`GET _tasks?actions=data_frame/transforms*&detailed`. Typically, the task exists
`GET _tasks?actions=data_frame/transforms*&detailed=true`. Typically, the task exists
when the {transform} is in a started or failed state.
* The {es} logs from the node that was running the {transform} might
also contain useful information. You can identify the node from the notification
@@ -9,12 +9,30 @@ If a thread pool is depleted, {es} will <<rejected-requests,reject requests>>
related to the thread pool. For example, if the `search` thread pool is
depleted, {es} will reject search requests until more threads are available.
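
For example, to spot-check whether the `search` thread pool is queuing or
rejecting work, you can poll the <<cat-thread-pool,cat thread pool API>> (an
illustrative request; any thread pool name can be substituted):

[source,console]
----
GET /_cat/thread_pool/search?v=true&h=node_name,name,active,queue,rejected,completed
----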

High CPU usage frequently relates to a single <<data-tiers,data tier>>'s traffic
and can be a symptom of <<hotspotting,hot spotting>>.

[discrete]
[[diagnose-high-cpu-usage]]
==== Diagnose high CPU usage

**Check CPU usage**

Current CPU usage per node can be polled from the <<cat-nodes,cat nodes API>>:

// tag::cpu-usage-cat-nodes[]
[source,console]
----
GET _cat/nodes?v=true&s=cpu:desc
----

The response's `cpu` column contains the current CPU usage as a percentage.
The `name` column contains the node's name. A brief spike in `cpu` is normal,
but `cpu` that stays elevated for an extended period should be investigated.
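
If you want a narrower view, the same request can be limited to the CPU-related
columns and load averages (an optional variant):

[source,console]
----
GET _cat/nodes?v=true&s=cpu:desc&h=name,cpu,load_1m,load_5m,load_15m
----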

To track CPU usage over time, we recommend enabling monitoring:

include::{es-ref-dir}/tab-widgets/cpu-usage-widget.asciidoc[]

**Check hot threads**
@@ -24,11 +42,13 @@ threads API>> to check for resource-intensive threads running on the node.

[source,console]
----
GET _nodes/my-node,my-other-node/hot_threads
GET _nodes/hot_threads
----
// TEST[s/\/my-node,my-other-node//]

This API returns a breakdown of any hot threads in plain text.
This API returns a breakdown of any hot threads in plain text. High CPU usage
frequently correlates with <<task-queue-backlog,particular tasks or a task
backlog>>.
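
The hot threads API also accepts parameters that tune the sample, for example
restricting the report to CPU time and sampling more threads. The values below
are illustrative; the defaults are usually sufficient.

[source,console]
----
GET _nodes/hot_threads?type=cpu&threads=5
----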

[discrete]
[[reduce-cpu-usage]]
@@ -56,7 +76,7 @@ for these searches, use the <<tasks,task management API>>.

[source,console]
----
GET _tasks?actions=*search&detailed
GET _tasks?actions=*search&detailed=true
----

The response's `description` contains the search request and its queries.
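
If a long-running search is no longer needed, you can free up resources by
cancelling its task. The task ID below is a placeholder; substitute an ID from
the previous response.

[source,console]
----
// Placeholder task ID; use a value taken from the GET _tasks response
POST _tasks/oTUltX4IQMOUUVeiohTt8A:12345/_cancel
----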
@@ -263,7 +263,7 @@ further insight on it via <<tasks,the task management API>>,

[source,console]
----
GET _tasks?human&detailed
GET _tasks?pretty=true&human=true&detailed=true
----

Its response contains a `description` that reports this query:
@@ -23,9 +23,50 @@ To check the number of rejected tasks for each thread pool, use the

[source,console]
----
GET /_cat/thread_pool?v=true&h=id,name,active,rejected,completed
GET /_cat/thread_pool?v=true&h=id,name,queue,active,rejected,completed
----

`write` thread pool rejections frequently surface in the failing API's response
and in the correlated logs as an `EsRejectedExecutionException` that mentions
either `QueueResizingEsThreadPoolExecutor` or `queue capacity`.

This frequently relates to <<task-queue-backlog,backlogged tasks>>.
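
To see the same counters per node for a specific pool, for example the `write`
pool, you can also use the <<cluster-nodes-stats,node stats API>> (a narrowed
illustration using `filter_path`):

[source,console]
----
GET _nodes/stats/thread_pool?filter_path=nodes.*.thread_pool.write
----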

[discrete]
[[check-circuit-breakers]]
==== Check circuit breakers

To check the number of tripped <<circuit-breaker,circuit breakers>>, use the
<<cluster-nodes-stats,node stats API>>.

[source,console]
----
GET /_nodes/stats/breaker
----

These statistics are cumulative from node start up. For more information, see
<<circuit_breaker,circuit breaker errors>>.
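
To focus on how many times each breaker has tripped since node start up, you
can narrow the response (an optional variant of the request above):

[source,console]
----
GET /_nodes/stats/breaker?filter_path=nodes.*.breakers.*.tripped
----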

[discrete]
[[check-indexing-pressure]]
==== Check indexing pressure

To check the number of <<index-modules-indexing-pressure,indexing pressure>>
rejections, use the <<cluster-nodes-stats,node stats API>>.

[source,console]
----
GET _nodes/stats?human=true&filter_path=nodes.*.indexing_pressure
----

These statistics are cumulative from node start up. Related API errors
include an `EsRejectedExecutionException` that calls out the rejection as due
to `coordinating_and_primary_bytes`, `coordinating`, `primary`, or `replica`.

This frequently relates to <<task-queue-backlog,backlogged tasks>>,
<<docs-bulk,bulk index>> sizing, and/or the ingest target's
<<index-modules,`refresh_interval` setting>>.
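
For example, if the ingest target refreshes very frequently, relaxing its
`refresh_interval` can reduce indexing pressure. The index name and interval
below are illustrative only; choose values that match your search freshness
requirements.

[source,console]
----
// Illustrative index name and interval
PUT /my-index-000001/_settings
{
  "index": {
    "refresh_interval": "30s"
  }
}
----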

[discrete]
[[prevent-rejected-requests]]
==== Prevent rejected requests
@@ -34,9 +75,4 @@ GET /_cat/thread_pool?v=true&h=id,name,active,rejected,completed

If {es} regularly rejects requests and other tasks, your cluster likely has high
CPU usage or high JVM memory pressure. For tips, see <<high-cpu-usage>> and
<<high-jvm-memory-pressure>>.

**Prevent circuit breaker errors**

If you regularly trigger circuit breaker errors, see <<circuit-breaker-errors>>
for tips on diagnosing and preventing them.
<<high-jvm-memory-pressure>>.
@@ -1,51 +1,90 @@
[[task-queue-backlog]]
=== Task queue backlog

A backlogged task queue can prevent tasks from completing and
put the cluster into an unhealthy state.
Resource constraints, a large number of tasks being triggered at once,
and long running tasks can all contribute to a backlogged task queue.
A backlogged task queue can prevent tasks from completing and put the cluster
into an unhealthy state. Resource constraints, a large number of tasks being
triggered at once, and long running tasks can all contribute to a backlogged
task queue.

[discrete]
[[diagnose-task-queue-backlog]]
==== Diagnose a task queue backlog

**Check the thread pool status**

A <<high-cpu-usage,depleted thread pool>> can result in <<rejected-requests,rejected requests>>.
A <<high-cpu-usage,depleted thread pool>> can result in
<<rejected-requests,rejected requests>>. This may be restricted to a single
<<data-tiers,data tier>>'s traffic and can be a symptom of
<<hotspotting,hot spotting>>.

You can use the <<cat-thread-pool,cat thread pool API>> to
see the number of active threads in each thread pool and
how many tasks are queued, how many have been rejected, and how many have completed.
You can use the <<cat-thread-pool,cat thread pool API>> to see the number of
active threads in each thread pool and how many tasks are queued, how many
have been rejected, and how many have completed.

[source,console]
----
GET /_cat/thread_pool?v&s=t,n&h=type,name,node_name,active,queue,rejected,completed
----

The `active` and `queue` statistics are instantaneous while the `rejected` and
`completed` statistics are cumulative from node start up.

**Inspect the hot threads on each node**

If a particular thread pool queue is backed up,
you can periodically poll the <<cluster-nodes-hot-threads,Nodes hot threads>> API
to determine if the thread has sufficient
resources to progress and gauge how quickly it is progressing.
If a particular thread pool queue is backed up, you can periodically poll the
<<cluster-nodes-hot-threads,Nodes hot threads>> API to determine if the thread
has sufficient resources to progress and gauge how quickly it is progressing.

[source,console]
----
GET /_nodes/hot_threads
----

**Look for long running tasks**
**Look for long running node tasks**

Long-running tasks can also cause a backlog. You can use the <<tasks,task
management>> API to get information about the node tasks that are running.
Check the `running_time_in_nanos` to identify tasks that are taking an
excessive amount of time to complete.

[source,console]
----
GET /_tasks?pretty=true&human=true&detailed=true
----

If a particular `action` is suspected, you can filter the tasks further. The most common are:

* <<docs-bulk,bulk index>> related
+
[source,console]
----
GET /_tasks?pretty=true&human=true&detailed=true&actions=indices:data/write/bulk
----

* search related
+
[source,console]
----
GET /_tasks?pretty=true&human=true&detailed=true&actions=indices:data/read/search
----

Note that the API response may contain the task fields `description` and `header`,
which enable further diagnosis of a task's parameters, target, and requestor.
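
To trim the response down to the fields that matter when hunting for
long-running tasks, you can add a `filter_path` (an optional variant of the
requests above):

[source,console]
----
GET /_tasks?detailed=true&filter_path=nodes.*.tasks.*.action,nodes.*.tasks.*.description,nodes.*.tasks.*.running_time_in_nanos
----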

**Look for long running cluster tasks**

A backlog may also surface as a delay in synchronizing the cluster state. You
can use the <<cat-pending-tasks,cat pending tasks API>> to get information
about the pending cluster state update tasks that are queued.

[source,console]
----
GET /_tasks?filter_path=nodes.*.tasks
GET /_cat/pending_tasks?v=true
----

Check the `timeInQueue` to identify tasks that are taking an excessive amount
of time to complete.
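
The same information is available in JSON form from the cluster pending tasks
API, where the equivalent field is `time_in_queue_millis` (or `time_in_queue`
when `human=true`):

[source,console]
----
GET /_cluster/pending_tasks?human=true
----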

[discrete]
[[resolve-task-queue-backlog]]
==== Resolve a task queue backlog