Model movements to nodes with no existing node stats #133901
Conversation
    .map(NodeUsageStatsForThreadPools::threadPoolUsageStatsMap)
    .map(m -> m.get(ThreadPool.Names.WRITE))
    .mapToInt(NodeUsageStatsForThreadPools.ThreadPoolUsageStats::totalThreadPoolThreads)
    .max();
Assuming the max thread pool size is probably "optimistic", we could also be pessimistic and assume the minimum pool size
Since we don't really know what would be best, can we pick the first node in the map? Then we avoid any performance slowdowns with streams or iteration, since this path is going to be hit a lot.
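For illustration, a minimal sketch of the "pick the first node" idea, assuming the map is non-empty and that the chosen node reports a WRITE pool (both would need guarding in real code); variable names other than those quoted from the diff are hypothetical:

```java
// Hypothetical sketch: take an arbitrary (first) node's WRITE pool size instead of
// streaming over every node for the maximum. Assumes originalNodeUsageStatsForThreadPools
// is non-empty and that the node has a WRITE entry; real code would need to check both.
NodeUsageStatsForThreadPools anyNodeStats = originalNodeUsageStatsForThreadPools.values().iterator().next();
int assumedWriteThreads = anyNodeStats.threadPoolUsageStatsMap()
    .get(ThreadPool.Names.WRITE)
    .totalThreadPoolThreads();
```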
    );
    });
    } else {
    logger.debug("No nodes found to estimate write thread pool size, skipping");
If we get here, there were no other nodes in the ClusterInfo to base our estimate off of. In this case we could perhaps assume the same pool size as the local node, but we don't have that information in here. This also seems very unlikely to actually occur.
    largestWriteThreadPool.getAsInt(),
    updateNodeUtilizationWithShardMovements(0.0f, (float) writeLoadDelta, largestWriteThreadPool.getAsInt()),
    0
    )
This will fudge a NodeUsageStatsForThreadPools that is good for our usage (i.e. it has a single entry for the WRITE pool). Nobody else relies on these stats at the moment, so that's probably OK, but if another decider starts using these stats we'd need to be careful about doing this maybe.
We can also probably do better in the event a node with no stats has a shard moved off of it. In that case it doesn't make sense to assume the node was empty before the move, because clearly it was not, but then it seems very unlikely we'd find ourselves in that situation so I don't know how much effort we want to put into improving that estimate.
> This will fudge a NodeUsageStatsForThreadPools that is good for our usage (i.e. it has a single entry for the WRITE pool). Nobody else relies on these stats at the moment, so that's probably OK, but if another decider starts using these stats we'd need to be careful about doing this maybe.

Just FYI, the Decider does take some action based on the ClusterInfo it receives.
> We can also probably do better in the event a node with no stats has a shard moved off of it. In that case it doesn't make sense to assume the node was empty before the move, because clearly it was not, but then it seems very unlikely we'd find ourselves in that situation so I don't know how much effort we want to put into improving that estimate.

I agree, not worth the effort. Not sure the scenario can even happen, actually. A new node joins, we'd have to assign shards to the node first, which means simulation begins. The simulation might be a little off because of the thread pool thread count guess, but otherwise fine.
Left some notes and suggestions.
We could also reconsider saving originalNodeUsageStatsForThreadPools and maintaining a diff, and instead immediately apply the shard movement change to the stored values. I think allowing negative utilization values would preserve precision -- IIRC, that was the argument for keeping a diff. Just a note in case that simplifies the work, or could lead to fewer loops / better performance.
I think I noticed recently in a test that we fetch the ClusterInfo from the simulator a LOT. Hard to say without exploring further / performance testing. Not a problem to solve here, but to explain my interest in minimizing the work in simulatedNodeUsageStatsForThreadPools().
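As a rough sketch of the "apply immediately" alternative described above (everything here, including the names simulatedWriteLoadByNode and applyShardMove, is hypothetical and not taken from the PR; java.util.Map and java.util.HashMap assumed imported):

```java
// Hypothetical alternative: keep a single mutable per-node write-load value and adjust it as
// shards move, instead of storing the original stats plus a delta map that is merged on read.
// Negative intermediate values are tolerated so that moving a shard off and back on again
// does not lose precision.
private final Map<String, Double> simulatedWriteLoadByNode = new HashMap<>();

void applyShardMove(String fromNodeId, String toNodeId, double shardWriteLoad) {
    simulatedWriteLoadByNode.merge(fromNodeId, -shardWriteLoad, Double::sum); // may go negative
    simulatedWriteLoadByNode.merge(toNodeId, shardWriteLoad, Double::sum);
}
```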
    Map<String, NodeUsageStatsForThreadPools> nodeUsageStatsForThreadPools
    ) {
        // Assume the new node has the same size thread pool as the largest existing node
        final OptionalInt largestWriteThreadPool = originalNodeUsageStatsForThreadPools.values()
Right now we have to build the new node basis every time we calculate an updated ClusterInfo. Could we instead immediately add a new 0 utilization node to originalNodeUsageStatsForThreadPools? Then we can move forward always expecting an entry in originalNodeUsageStatsForThreadPools for every node. Might simplify the code.
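A sketch of that suggestion, reusing the constructor argument order visible in the diff above (thread count, utilization, 0); the exact NodeUsageStatsForThreadPools/ThreadPoolUsageStats constructors, assumedWriteThreads, and iterating clusterState.nodes() are assumptions for illustration only:

```java
// Hypothetical: seed a zero-utilization WRITE-only entry for any node missing from the
// original stats, so downstream code can always expect an entry per node.
for (DiscoveryNode node : clusterState.nodes()) {
    originalNodeUsageStatsForThreadPools.computeIfAbsent(
        node.getId(),
        nodeId -> new NodeUsageStatsForThreadPools(
            nodeId,
            Map.of(
                ThreadPool.Names.WRITE,
                new NodeUsageStatsForThreadPools.ThreadPoolUsageStats(assumedWriteThreads, 0.0f, 0)
            )
        )
    );
}
```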
    adjustedNodeUsageStatsForThreadPools.put(entry.getKey(), entry.getValue());
    }
    }
Instead of adding a second non-trivial for-loop in addUsageStatsForAnyNodesNotPresentInOriginalNodeUsageStatsForThreadPools, could we replace the above original for-loop with iteration of the simulatedNodeWriteLoadDeltas data structure to begin with?
Then use something like originalNodeUsageStatsForThreadPools.forEach(adjustedNodeUsageStatsForThreadPools::putIfAbsent) to straight copy the remainder.
This would be in combination with my other suggestion to add the new node to originalNodeUsageStatsForThreadPools beforehand for simplicity.
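A hypothetical shape of that restructuring; applyWriteLoadDelta and the exact map types are stand-ins, and only the identifiers quoted from the diff are real:

```java
// Hypothetical restructuring: drive the first loop off the simulated deltas, then bulk-copy
// every node that had no simulated movements. applyWriteLoadDelta stands in for whatever the
// PR uses to fold a write-load delta into a node's stats.
simulatedNodeWriteLoadDeltas.forEach((nodeId, writeLoadDelta) -> {
    NodeUsageStatsForThreadPools original = originalNodeUsageStatsForThreadPools.get(nodeId);
    adjustedNodeUsageStatsForThreadPools.put(nodeId, applyWriteLoadDelta(original, writeLoadDelta));
});
// Nodes untouched by the simulation keep their original stats.
originalNodeUsageStatsForThreadPools.forEach(adjustedNodeUsageStatsForThreadPools::putIfAbsent);
```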
As discussed, I think we can avoid making this change, I added a test to confirm that we do trigger a  If we agree I will close this? @DiannaHohensee @ywangd @mhl-b
Sounds like it's not possible for a reroute request to use ClusterState containing a node not present in the ClusterInfo loaded for the balancing round? If so, agreed 👍
First attempt at modelling movements to nodes that have no existing node stats.
Still a few questions lingering. Put up for discussion.
In #133896, we cover the case where a node returned an error for node stats but had previously returned values; this PR should address the case where a node was added since the last ClusterInfo was returned and a new one is yet to be generated.
Relates: ES-12621