Watch [this video](https://www.youtube.com/watch?v=Q5ODJ5nIKAM) for a walkthrough of troubleshooting hot spotting.
## Detect hot spotting [detect]
To check for hot spotting, you can investigate both active utilization levels and historical node statistics.
### Active [detect-active]
Hot spotting most commonly surfaces as significantly elevated resource utilization (of `disk.percent`, `heap.percent`, or `cpu`) among a subset of nodes as reported via [cat nodes](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cat-nodes). Individual spikes aren’t necessarily problematic, but if utilization repeatedly spikes or consistently remains high over time (for example longer than 30 seconds), the resource may be experiencing problematic hot spotting.
For example, let's showcase two separate plausible issues using cat nodes:
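A minimal sketch of such a check follows. The request's column selection and most of the values are illustrative assumptions; only the two utilizations called out in the next paragraph come from this walkthrough:

```console
GET _cat/nodes?v&h=name,master,node.role,heap.percent,disk.used_percent,cpu
```

This might return something like:

```console-result
name   master node.role heap.percent disk.used_percent cpu
node_1 *      hirstm              24                20  95
node_2 -      hirstm              23                18  18
node_3 -      hirstmv             25                90  10
```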
Here we see two significantly divergent utilizations: the master node is at `cpu: 95` and a hot node is at `disk.used_percent: 90%`. This indicates hot spotting on these two nodes, and not necessarily from the same root cause.
### Historical [detect-historical]
A secondary method to detect hot spotting as it builds up is to poll the [node statistics API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-nodes-stats) for index-related performance metrics.
```console
GET _nodes/stats?pretty=true&filter_path=nodes.*.name,nodes.*.roles,nodes.*.indices
```
This request returns node operational metrics such as `query`, `refresh`, and `index`. It allows you to gauge:
* the total events attempted per node
* the node's average processing time per event type
These metrics accumulate over each individual node's uptime. To help view the output, you can parse the response using a third-party tool such as [JQ](https://jqlang.github.io/jq/):
```bash
cat nodes_stats.json | jq -rc '.nodes[]|.name as $n|.roles as $r|.indices|to_entries[]|.key as $m|.value|select(.total and .total_time_in_millis)|select(.total>0)|{node:$n, roles:$r, metric:$m, total:.total, avg_millis:(.total_time_in_millis?/.total|round)}'
```
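If many metrics stand out, it can help to rank them. As a hedged refinement that is not part of the original example, you can pipe the same objects through a second `jq` pass that sorts by average latency, slowest first:

```bash
# Same filter as above, followed by a slurp-and-sort pass that lists the slowest metrics first
cat nodes_stats.json | jq -rc '.nodes[]|.name as $n|.roles as $r|.indices|to_entries[]|.key as $m|.value|select(.total and .total_time_in_millis)|select(.total>0)|{node:$n, roles:$r, metric:$m, total:.total, avg_millis:(.total_time_in_millis?/.total|round)}' | jq -s -c 'sort_by(-.avg_millis)[]'
```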
If the results show that multiple major operations are non-performant across nodes, the cluster is likely under-provisioned. If instead a particular operation type or node stands out, this likely indicates [shard distribution issues](#causes-shards), which you can compare against [indices stats](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-indices-stats).
```console
GET /_stats?level=shards&human&expand_wildcards=all&ignore_unavailable=true
```
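The shard-level response can be very large. As an optional, hedged refinement, you can trim it with the standard `filter_path` response filter down to the per-shard sections that the later parsing step uses:

```console
GET /_stats?level=shards&human&expand_wildcards=all&ignore_unavailable=true&filter_path=indices.*.shards
```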
These metrics accumulate over each individual shard's history. You can parse this response with a tool like JQ to compare against the earlier results:
```bash
cat indices_stats.json | jq -rc '.indices|to_entries[]|.key as $i|.value.shards[]|to_entries[]|.key as $sh|.value|.routing.primary as $p|.routing.node[:4] as $n|to_entries[]|.key as $m|.value|select(.total and .total_time_in_millis)|select(.total>0)|{index:$i, shard:$sh, primary:$p, node:$n, metric:$m, total:.total, avg_millis:(.total_time_in_millis/.total|round)}'
```
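To spotlight outlier shards, a hedged follow-up is to slurp the emitted objects and keep the slowest entries for a single metric; the `refresh` metric below is only an example choice:

```bash
# Same filter as above; the second pass keeps one metric and ranks the ten slowest shard entries
cat indices_stats.json | jq -rc '.indices|to_entries[]|.key as $i|.value.shards[]|to_entries[]|.key as $sh|.value|.routing.primary as $p|.routing.node[:4] as $n|to_entries[]|.key as $m|.value|select(.total and .total_time_in_millis)|select(.total>0)|{index:$i, shard:$sh, primary:$p, node:$n, metric:$m, total:.total, avg_millis:(.total_time_in_millis/.total|round)}' | jq -s -c 'map(select(.metric=="refresh"))|sort_by(-.avg_millis)[:10][]'
```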
Shard distribution problems will most likely surface as task load, as seen above in the [cat thread pool](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cat-thread-pool) example. It is also possible for tasks to hot spot a node due either to individual qualitative expensiveness or to overall quantitative traffic loads, which will surface in [backlogged tasks](/troubleshoot/elasticsearch/task-queue-backlog.md).
For example, if [cat thread pool](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cat-thread-pool) reported a high queue on the `warmer` [thread pool](elasticsearch://reference/elasticsearch/configuration-reference/thread-pool-settings.md), you would look up the affected node's [hot threads](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-nodes-hot-threads). Let's say it reported `warmer` threads at `100% cpu` related to `GlobalOrdinalsBuilder`. This would let you know to inspect [field data's global ordinals](elasticsearch://reference/elasticsearch/mapping-reference/eager-global-ordinals.md).
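A hedged sketch of that lookup follows; `node_1` is a placeholder for whichever node reported the queue:

```console
GET _cat/thread_pool/warmer?v&h=node_name,name,active,queue,rejected,completed

GET _nodes/node_1/hot_threads
```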
Alternatively, let's say [cat nodes](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cat-nodes) shows a hot spotted master node and [cat thread pool](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cat-thread-pool) shows general queuing across nodes. This would suggest the master node is overwhelmed. To resolve this, first ensure a [hardware high availability](../../deploy-manage/production-guidance/availability-and-resilience/resilience-in-small-clusters.md) setup and then look to ephemeral causes. In this example, [the nodes hot threads API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-nodes-hot-threads) reports multiple threads in `other`, which indicates they're waiting on or blocked by either garbage collection or I/O.
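To confirm that, a hedged follow-up is to ask hot threads for wait states specifically; the `_master` node filter targets the elected master:

```console
GET _nodes/_master/hot_threads?type=wait
```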
For either of these example situations, a good way to confirm the problematic tasks is to look at the longest-running non-continuous (designated `[c]`) tasks via [cat task management](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cat-tasks). This can be supplemented by checking the longest-running cluster sync tasks via [cat pending tasks](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-cat-pending-tasks). Using a third example,
```console
GET _cat/tasks?v&s=time:desc&h=type,action,running_time,node,cancellable
```
This could return:
```console-result
type   action                running_time node   cancellable
direct indices:data/read/eql 10m          node_1 true
...
```
This surfaces a problematic [EQL query](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-eql-search). We can gain further insight into it via [the task management API](https://www.elastic.co/docs/api/doc/elasticsearch/group/endpoint-tasks),
```console
GET _tasks?human&detailed
```
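On a busy cluster the full listing can be long. As a hedged refinement, you can restrict the task management API to the action type that cat tasks reported:

```console
GET _tasks?human&detailed&actions=indices:data/read/eql
```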
Its response contains a `description` that reports this query:
```eql
indices[winlogbeat-*,logs-window*], sequence by winlog.computer_name with maxspan=1m\n\n[authentication where host.os.type == "windows" and event.action:"logged-in" and\n event.outcome == "success" and process.name == "svchost.exe" ] by winlog.event_data.TargetLogonId
```
This lets you know which indices to check (`winlogbeat-*,logs-window*`), as well as the [EQL search](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-eql-search) request body. Most likely this is [SIEM related](/solutions/security.md). You can combine this with [audit logging](../../deploy-manage/security/logging-configuration/enabling-audit-logs.md) as needed to trace the request source.