Use the last good NodeUsageStatsForThreadPools when a node returns an error #133896
Conversation
    if (nodeUsageStatsForThreadPools != null) {
        cachedValuesForFailed.put(failedNodeException.nodeId(), nodeUsageStatsForThreadPools);
    }
}
Not sure whether it makes sense to cache these things forever, or to put some limit on how long we consider them better than nothing. I can't imagine that being part of the cluster while returning errors for node usage stats requests is a situation that persists for very long.
Only the last seen value for each node is cached. That doesn't seem expensive to save and it'll be refreshed frequently.
A WARN log message for each node that fails to respond would be good, along with the error cause/message. It shouldn't happen often, so I don't expect it'll be noisy.
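A minimal sketch of what that could look like in the failure-handling loop shown above (the log message wording is an assumption, not the actual implementation):

    for (FailedNodeException failedNodeException : response.failures()) {
        // Hedged sketch: surface the failure and its cause before falling back to the cached value
        logger.warn("failed to collect thread pool usage stats from node [" + failedNodeException.nodeId() + "]", failedNodeException);
        final var nodeUsageStatsForThreadPools = lastNodeUsageStatsPerNode.get(failedNodeException.nodeId());
        if (nodeUsageStatsForThreadPools != null) {
            cachedValuesForFailed.put(failedNodeException.nodeId(), nodeUsageStatsForThreadPools);
        }
    }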
Yeah I was more worried about whether there's a point where the cached value is so stale it's not useful, but I think it's probably always better than nothing.
Yes, always better than nothing 👍
A cluster where a node is repeatedly failing to return stats probably has much bigger problems than this stale value.
server/src/main/java/org/elasticsearch/cluster/NodeUsageStatsForThreadPoolsCollector.java (resolved)
These changes seem on track to me 👍
Have you had a chance to explore the ClusterInfoService and ClusterState updates, and whether it's possible for those two pieces of state passed into a balancing computation to be out of sync with regard to newly added nodes? I didn't investigate, but I was wondering whether that is possible. That could leave the balancer looking up a node ID that doesn't exist in the nodeUsageStats (it was never fetched). A newly removed node probably can't do any harm, since the nodeID would never be looked up in the nodeUsageStats. Ah, I see you've opened #133901, too. I missed that initially. All set.
for (FailedNodeException failedNodeException : response.failures()) {
    final var nodeUsageStatsForThreadPools = lastNodeUsageStatsPerNode.get(failedNodeException.nodeId());
    if (nodeUsageStatsForThreadPools != null) {
        cachedValuesForFailed.put(failedNodeException.nodeId(), nodeUsageStatsForThreadPools);
Can lastNodeUsageStatsPerNode be returned directly instead? The putAll above adds the new values for the nodeId keys, so whatever nodes are missing from the new response will not be overridden in lastNodeUsageStatsPerNode.
Yes, perhaps... I wonder if there's some way it could include values for nodes we didn't request? Probably not, if things happen in the sequence we expect them to. I will come back to this.
I think I'd prefer not to, because lastNodeUsageStatsPerNode is internal state; it's mutable and it will be mutated by the collector (to expire values for nodes no longer in the cluster). I think the current approach is more explicit: we take the cached value for any node in response.failures() and we return a static map.
Ah I see, good point about immutability 👍
Would it be sufficient to return a copy of lastNodeUsageStatsPerNode and skip this whole for-loop? No longer present nodes have already been filtered out in a prior stage, and the successful node responses were applied above.
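For illustration, the two shapes being discussed might look roughly like this (the method names and the NodesResponse parameter are placeholders, not the PR's actual code):

    // Alternative raised above: return an immutable snapshot of the internal cache, relying on
    // departed nodes having been filtered out earlier and successful responses applied via putAll.
    private Map<String, NodeUsageStatsForThreadPools> snapshotOfInternalCache() {
        return Map.copyOf(lastNodeUsageStatsPerNode);
    }

    // Shape kept in the PR (per the snippets above): build a fresh map from the successful
    // responses plus the last-seen value for each failed node, so the mutable internal
    // lastNodeUsageStatsPerNode is never exposed to callers.
    private Map<String, NodeUsageStatsForThreadPools> explicitFallback(
        Map<String, NodeUsageStatsForThreadPools> returnedUsageStats,
        NodesResponse response
    ) {
        final Map<String, NodeUsageStatsForThreadPools> cachedValuesForFailed = new HashMap<>(returnedUsageStats);
        for (FailedNodeException failure : response.failures()) {
            final var lastSeen = lastNodeUsageStatsPerNode.get(failure.nodeId());
            if (lastSeen != null) {
                cachedValuesForFailed.put(failure.nodeId(), lastSeen);
            }
        }
        return Map.copyOf(cachedValuesForFailed);
    }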
server/src/main/java/org/elasticsearch/cluster/NodeUsageStatsForThreadPoolsCollector.java (resolved)
  return new NodeUsageStatsForThreadPoolsAction.NodeResponse(
      localNode,
-     new NodeUsageStatsForThreadPools(localNode.getId(), perThreadPool)
+     new NodeUsageStatsForThreadPools(localNode.getId(), Map.of(ThreadPool.Names.WRITE, threadPoolUsageStats))
Needed to do this to use Maps.copyMapWithAddedOrReplacedEntry; it seems like it should be immutable anyhow.
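For context, a hedged sketch of what that utility enables, assuming org.elasticsearch.common.util.Maps#copyMapWithAddedOrReplacedEntry returns a new immutable map with one entry added or replaced (fakeWriteStats is a hypothetical value used only for illustration):

    // The response now carries an immutable single-entry map...
    Map<String, NodeUsageStatsForThreadPools.ThreadPoolUsageStats> perThreadPool =
        Map.of(ThreadPool.Names.WRITE, threadPoolUsageStats);

    // ...so an override (e.g. fake stats in a test) copies rather than mutates:
    Map<String, NodeUsageStatsForThreadPools.ThreadPoolUsageStats> overridden =
        Maps.copyMapWithAddedOrReplacedEntry(perThreadPool, ThreadPool.Names.WRITE, fakeWriteStats);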
        return ClusterInfoServiceUtils.refresh(((InternalClusterInfoService) clusterInfoService));
    }
    return null;
}
Seemed useful to return the actual ClusterInfo here?
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)
I think it's an ok solution, but tolerating stale metrics is the consumer's concern, not the producer's job. The producer does not know what degree of staleness is acceptable. So the producer should produce a fresh set of metrics and not fake missing data, and every consumer should cache (or share a cache) when needed. And I would make it explicit in the consumer's code (Decider/Monitor) when we use cached or fresh values, otherwise it's all blurry.
I guess it depends on whether you consider the collector a producer or a consumer. It would be annoying to have to implement this in multiple places, but I see your point. Perhaps we could add a timestamp to indicate the age of the metrics if it becomes important?
Maybe have a different data structure inside ClusterInfo for the last known measurement, in case the latest is missing. Attaching a timestamp sounds good.
  return new NodeUsageStatsForThreadPoolsAction.NodeResponse(
      localNode,
-     new NodeUsageStatsForThreadPools(localNode.getId(), perThreadPool)
+     new NodeUsageStatsForThreadPools(localNode.getId(), Map.of(ThreadPool.Names.WRITE, threadPoolUsageStats), Instant.now())
This timestamp being added on the source node could be problematic if there were clock skew in the cluster. I wonder if it should be recorded on the client side instead.
Or should any "don't trust this if it's older than X" threshold be of a magnitude where we don't need to worry about clock skew?
I'm not enthusiastic about the timestamp here. There's no use for the timestamp in the code and it's not obviously logged anyplace. It's invasive to add and seems to be trying to solve a problem we don't have.
Logging a WARN message whenever we fail to get fresh stats from a node would be sufficient to convey the time when that happens -- very rarely -- and that an issue occurred. It's reasonable to log a WARN message because the cluster is going to be in distress if there are repeated failures to fetch stats from a single or multiple nodes.
Yeah, fair, it also requires the addition of a transport version. @mhl-b wdyt? I can easily revert it. I think I'm with @DiannaHohensee, it's something we can add when we need it.
I'm ok with that. I still think unbounded staleness is useless, even harmful: utilization from 5 minutes ago has no meaning. Maybe we should allow only one missing measurement, without a timestamp, but if it's missed twice we don't report anything.
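(A rough, hypothetical sketch of that idea; the class and the names are purely illustrative, not anything in the PR:)

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical sketch of "allow one missed measurement, then report nothing":
    // the collector would keep a per-node counter of consecutive misses.
    class ConsecutiveMissTracker {
        private final Map<String, Integer> consecutiveMisses = new HashMap<>();

        void recordSuccess(String nodeId) {
            consecutiveMisses.remove(nodeId);
        }

        // Returns true if the cached value may still be used for this node (at most one miss so far).
        boolean recordMissAndAllowFallback(String nodeId) {
            return consecutiveMisses.merge(nodeId, 1, Integer::sum) <= 1;
        }
    }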
"it's something we can add when we need it."
I don't think we can tell for sure once we blend fresh and stale metrics together. It would be some lagging node that starts to impact allocation decisions.
So you would track count of misses then?
No, I mean use the timestamp. Once it's over a certain age we log a warning (I think we do something similar for autoscaling metrics)
Then if we see it happening a lot or implicated in issues we can decide what to do about it
Does that mean you will keep the current version with the Instant, but track the time on the client side rather than on the source node?
It probably makes more sense to track it on the client (in the collector); that way we can probably avoid transport version changes and don't have to worry about clock skew.
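A minimal sketch of what client-side tracking in the collector could look like, under the assumption that the collector timestamps values as it receives them; the class, names, and threshold are illustrative, not the actual implementation:

    import java.time.Duration;
    import java.time.Instant;
    import java.util.HashMap;
    import java.util.Map;

    // Illustrative only: pairs the last good stats with the time the collector received them,
    // so staleness is measured against the collector's own clock and clock skew doesn't matter.
    class LastGoodStatsCache<T> {
        record Entry<S>(S stats, Instant receivedAt) {}

        private final Map<String, Entry<T>> byNodeId = new HashMap<>();
        private final Duration maxStaleness;

        LastGoodStatsCache(Duration maxStaleness) {
            this.maxStaleness = maxStaleness; // e.g. a few polling intervals
        }

        void recordSuccess(String nodeId, T stats) {
            byNodeId.put(nodeId, new Entry<>(stats, Instant.now()));
        }

        // Cached value for a node that failed to respond, or null if we have none.
        T lastGood(String nodeId) {
            final Entry<T> entry = byNodeId.get(nodeId);
            return entry == null ? null : entry.stats();
        }

        // True when the cached value is older than the threshold; the caller could log a WARN here.
        boolean isStale(String nodeId) {
            final Entry<T> entry = byNodeId.get(nodeId);
            return entry != null && Duration.between(entry.receivedAt(), Instant.now()).compareTo(maxStaleness) > 0;
        }
    }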
        }
    }
    return true;
}
Redundant because it's a record
I added a timestamp to the records. I don't think it's necessary to expose details about successful fetches to the consumer; if someone cares about the age of a record, they should be able to determine that from the timestamp?
I took another look. If there's a good reason to add a timestamp, then that'd be fine. I can't currently see one, though, so that needs explanation.
...ernalClusterTest/java/org/elasticsearch/cluster/NodeUsageStatsForThreadPoolsCollectorIT.java (outdated, resolved)
// Add in the last-seen usage stats for any nodes that failed to respond
final Map<String, NodeUsageStatsForThreadPools> cachedValuesForFailed = new HashMap<>(returnedUsageStats);
for (FailedNodeException failedNodeException : response.failures()) {
    final var nodeUsageStatsForThreadPools = lastNodeUsageStatsPerNode.get(failedNodeException.nodeId());
I wonder, instead of using the last value as a fallback, can we have a specific NodeUsageStatsForThreadPools object representing failure? Other parts of the code would have to check it explicitly to make a decision, e.g. the write load decider potentially rejecting allocation?
My thinking is that we probably don't want to fall back more than a few times, i.e. the last value needs to expire at a certain point. I guess that's probably the reason you added the received timestamp? In that case, we still have to address what we use to indicate a "failed and expired" entry. If a node fails to respond to ClusterInfo polling, it is likely overloaded, e.g. CBE. So it seems safer to assume rejection or overall no movement for the node?
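As a purely hypothetical illustration of that suggestion (not something the PR implements), the collector could publish a dedicated sentinel value that consumers would have to check explicitly; the node id and empty map below are placeholders:

    // Hypothetical sentinel marking a node whose stats could not be fetched. Consumers such as a
    // write-load decider would branch on it explicitly (e.g. reject new allocations to the node,
    // or assume no movement) instead of silently reusing a stale value.
    public static final NodeUsageStatsForThreadPools FAILED_TO_COLLECT =
        new NodeUsageStatsForThreadPools("_unknown_", Map.of());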
LGTM
Had only minor comments.
server/src/main/java/org/elasticsearch/cluster/NodeUsageStatsForThreadPoolsCollector.java (outdated, resolved)
import static org.hamcrest.Matchers.equalTo;
import static org.hamcrest.Matchers.hasKey;

@ESIntegTestCase.ClusterScope(scope = ESIntegTestCase.Scope.TEST)
Is this annotation necessary?
No, fixed in d5b17ab
// The next response should also contain our fake values
refreshClusterInfoAndAssertThreadPoolHasStats(
    dataNodeClusterService.localNode().getId(),
    threadPoolName,
    totalThreadPoolThreads,
    averageThreadPoolUtilization,
    maxThreadPoolQueueLatencyMillis
);
I'd add one more step to ensure the new value is used once the node has recovered from the error.
Good call, added in ee9acc5
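Presumably that step looks something like the following sketch, built from the helpers already shown in this test (the expected "recovered" values are hypothetical, not the exact code in ee9acc5):

    // Sketch: restore normal request handling, then verify the next refresh picks up the
    // node's real stats again rather than the cached fake values.
    dataNodeTransportService.clearInboundRules();
    refreshClusterInfoAndAssertThreadPoolHasStats(
        dataNodeClusterService.localNode().getId(),
        threadPoolName,
        recoveredTotalThreadPoolThreads,
        recoveredAverageThreadPoolUtilization,
        recoveredMaxThreadPoolQueueLatencyMillis
    );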
LGTM!
// Now simulate an error
dataNodeTransportService.clearInboundRules();
dataNodeTransportService.addRequestHandlingBehavior(
    TransportNodeUsageStatsForThreadPoolsAction.NAME + "[n]",
    (handler, request, channel, task) -> {
        channel.sendResponse(new Exception("simulated error"));
    }
);
neat
LGTM
When a node returns an error response to the NodeUsageStatsForThreadPoolsCollector, use the most recent good value we've seen for that node, rather than returning nothing.
I believe this is the more important scenario to cover. I don't think we need to do anything special for nodes with no NodeUsageStatsForThreadPools value in the ClusterInfo, because that situation should be very brief: we refresh our ClusterInfo eagerly when a new node joins the cluster.
Relates: ES-12621