
Commit f55a75b

stefnestor and kilfoyle authored
Distinguish+expand Hot Spotting from Task Backlog (#4657)
## Summary

Follow-up to #4592. When I originally wrote [hot spotting](https://www.elastic.co/guide/en/elasticsearch/reference/current/hotspotting.html) (elastic/elasticsearch#95429), we didn't yet have the task queue backlog page, so some content of that kind ended up in the hot spotting page instead. This undoes that and adds in content related to:

- (@mlliarm) https://support.elastic.co/knowledge/21401798
- (@rodrigomadalozzo) https://support.elastic.co/knowledge/d6496673

## Generative AI disclosure

1. Did you use a generative AI (GenAI) tool to assist in creating this contribution?
   - [ ] Yes
   - [x] No

---------

Co-authored-by: David Kilfoyle <41695641+kilfoyle@users.noreply.github.com>
1 parent 871a5f6 commit f55a75b

2 files changed: +115 -66 lines changed

troubleshoot/elasticsearch/hotspotting.md

Lines changed: 37 additions & 33 deletions
@@ -23,6 +23,10 @@ Watch [this video](https://www.youtube.com/watch?v=Q5ODJ5nIKAM) for a walkthroug

## Detect hot spotting [detect]

To check for hot spotting, you can investigate both active utilization levels and historical node statistics.

### Active [detect-active]

Hot spotting most commonly surfaces as significantly elevated resource utilization (of `disk.percent`, `heap.percent`, or `cpu`) among a subset of nodes as reported via [cat nodes](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cat-nodes). Individual spikes aren’t necessarily problematic, but if utilization repeatedly spikes or consistently remains high over time (for example longer than 30 seconds), the resource may be experiencing problematic hot spotting.

For example, let’s showcase two separate plausible issues using cat nodes:
@@ -42,6 +46,38 @@ node_3 - hirstmv 25 90 10

Here we see two significantly elevated utilizations: the master node is at `cpu: 95` and a hot node is at `disk.used_percent: 90%`. This would indicate that hot spotting was occurring on these two nodes, and not necessarily from the same root cause.
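
For reference, a request along the following lines pulls just these columns with the busiest nodes first; the exact column list and sort order here are illustrative rather than the request used in the example above:

```console
GET _cat/nodes?v&h=name,master,node.role,heap.percent,disk.used_percent,cpu&s=cpu:desc,disk.used_percent:desc
```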

### Historical [detect-historical]

A secondary method for catching hot spotting as it builds up is to poll the [node statistics API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-nodes-stats) for index-related performance metrics.

```console
GET _nodes/stats?pretty=true&filter_path=nodes.*.name,nodes.*.roles,nodes.*.indices
```

This request returns node operational metrics such as `query`, `refresh`, and `index`. It allows you to gauge:

* the total events attempted per node
* the node's average processing time per event type
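
For example, assuming a node's `indices.refresh` stats reported `total: 50000` and `total_time_in_millis: 250000` (hypothetical numbers), that node averaged 250000 / 50000 = 5 ms per refresh. This is the same `avg_millis` calculation that the JQ command below performs for every metric reporting `total` and `total_time_in_millis` fields.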

These metrics accumulate over each individual node's uptime. To help view the output, you can parse the response using a third-party tool such as [JQ](https://jqlang.github.io/jq/):

```bash
cat nodes_stats.json | jq -rc '.nodes[]|.name as $n|.roles as $r|.indices|to_entries[]|.key as $m|.value|select(.total and .total_time_in_millis)|select(.total>0)|{node:$n, roles:$r, metric:$m, total:.total, avg_millis:(.total_time_in_millis?/.total|round)}'
```
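
The JQ command assumes the API response has been saved to a local `nodes_stats.json` file. A minimal sketch of capturing it with `curl`, assuming a hypothetical `$ES_URL` endpoint variable and an API key in `$ES_API_KEY` (both placeholders, not part of the original example):

```bash
# Save the node statistics response locally so it can be re-parsed and compared over time.
curl -s -H "Authorization: ApiKey $ES_API_KEY" \
  "$ES_URL/_nodes/stats?filter_path=nodes.*.name,nodes.*.roles,nodes.*.indices" \
  > nodes_stats.json
```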

If the results indicate that multiple major operations are non-performant across nodes, the cluster is likely under-provisioned. If instead a particular operation type or node stands out, this likely indicates [shard distribution issues](#causes-shards), which you might compare against [indices stats](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-indices-stats).

```console
GET /_stats?level=shards&human&expand_wildcards=all&ignore_unavailable=true
```

These metrics accumulate over each individual shard's history. You can parse this response with a tool like JQ to compare with earlier performance:

```bash
cat indices_stats.json | jq -rc '.indices|to_entries[]|.key as $i|.value.shards[]|to_entries[]|.key as $sh|.value|.routing.primary as $p|.routing.node[:4] as $n|to_entries[]|.key as $m|.value|select(.total and .total_time_in_millis)|select(.total>0)|{index:$i, shard:$sh, primary:$p, node:$n, metric:$m, total:.total, avg_millis:(.total_time_in_millis/.total|round)}'
```
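
To surface the slowest shards first, you can re-sort the stream of per-shard records emitted by the previous command. A minimal sketch, assuming that output was saved to a hypothetical `shard_avgs.json` file:

```bash
# Slurp the per-shard records, sort by descending average latency, and keep the ten slowest.
jq -s -rc 'sort_by(-.avg_millis)[:10][]' shard_avgs.json
```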

## Causes [causes]

@@ -149,36 +185,4 @@ cat shard_stats.json | jq -rc 'sort_by(-.avg_indexing)[]' | head

### Task loads [causes-tasks]

-Shard distribution problems will most-likely surface as task load as seen above in the [cat thread pool](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cat-thread-pool) example. It is also possible for tasks to hot spot a node either due to individual qualitative expensiveness or overall quantitative traffic loads.
-
-For example, if [cat thread pool](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cat-thread-pool) reported a high queue on the `warmer` [thread pool](elasticsearch://reference/elasticsearch/configuration-reference/thread-pool-settings.md), you would look-up the effected node’s [hot threads](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-nodes-hot-threads). Let’s say it reported `warmer` threads at `100% cpu` related to `GlobalOrdinalsBuilder`. This would let you know to inspect [field data’s global ordinals](elasticsearch://reference/elasticsearch/mapping-reference/eager-global-ordinals.md).
-
-Alternatively, let’s say [cat nodes](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cat-nodes) shows a hot spotted master node and [cat thread pool](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cat-thread-pool) shows general queuing across nodes. This would suggest the master node is overwhelmed. To resolve this, first ensure [hardware high availability](../../deploy-manage/production-guidance/availability-and-resilience/resilience-in-small-clusters.md) setup and then look to ephemeral causes. In this example, [the nodes hot threads API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-nodes-hot-threads) reports multiple threads in `other` which indicates they’re waiting on or blocked by either garbage collection or I/O.
-
-For either of these example situations, a good way to confirm the problematic tasks is to look at longest running non-continuous (designated `[c]`) tasks via [cat task management](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cat-tasks). This can be supplemented checking longest running cluster sync tasks via [cat pending tasks](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cat-pending-tasks). Using a third example,
-
-```console
-GET _cat/tasks?v&s=time:desc&h=type,action,running_time,node,cancellable
-```
-
-This could return:
-
-```console-result
-type action running_time node cancellable
-direct indices:data/read/eql 10m node_1 true
-...
-```
-
-This surfaces a problematic [EQL query](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-eql-search). We can gain further insight on it via [the task management API](https://www.elastic.co/docs/api/doc/elasticsearch/group/endpoint-tasks),
-
-```console
-GET _tasks?human&detailed
-```
-
-Its response contains a `description` that reports this query:
-
-```eql
-indices[winlogbeat-*,logs-window*], sequence by winlog.computer_name with maxspan=1m\n\n[authentication where host.os.type == "windows" and event.action:"logged-in" and\n event.outcome == "success" and process.name == "svchost.exe" ] by winlog.event_data.TargetLogonId
-```
-
-This lets you know which indices to check (`winlogbeat-*,logs-window*`), as well as the [EQL search](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-eql-search) request body. Most likely this is [SIEM related](/solutions/security.md). You can combine this with [audit logging](../../deploy-manage/security/logging-configuration/enabling-audit-logs.md) as needed to trace the request source.
+Shard distribution problems will most likely surface as task load, as seen above in the [cat thread pool](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cat-thread-pool) example. It is also possible for tasks to hot spot a node due either to individual qualitative expensiveness or to overall quantitative traffic loads, which will surface as [backlogged tasks](/troubleshoot/elasticsearch/task-queue-backlog.md).
