Conversation
@nicktindall nicktindall commented Aug 14, 2025

Implement WriteLoadConstraintMonitor

It will call reroute only when all of the following hold:

  • The decider is enabled
  • The cluster state is recovered
  • At least one node is above the configured queue latency threshold
  • At least one other node is
    • Below the configured utilisation threshold
    • AND below the configured queue latency threshold
    • AND currently has no shards relocating off of it
  • We haven't called reroute within the configured minimum reroute interval

Relates ES-11992
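The conditions above can be sketched as a small standalone predicate. All names and the signature here are illustrative, not the actual WriteLoadConstraintMonitor API:

```java
import java.util.Set;

// Standalone sketch of the gating conditions listed above.
// All names are illustrative; this is not the actual Elasticsearch code.
class RerouteGate {
    static boolean shouldReroute(
        boolean deciderEnabled,
        boolean clusterStateRecovered,
        Set<String> nodesAboveLatencyThreshold,   // hot-spotted nodes
        Set<String> viableTargetNodes,            // below both thresholds, no relocations off them
        long nowMillis,
        long lastRerouteMillis,
        long minRerouteIntervalMillis
    ) {
        if (deciderEnabled == false || clusterStateRecovered == false) {
            return false;
        }
        if (nodesAboveLatencyThreshold.isEmpty() || viableTargetNodes.isEmpty()) {
            return false;
        }
        // Respect the configured minimum reroute interval
        return nowMillis - lastRerouteMillis >= minRerouteIntervalMillis;
    }
}
```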

@nicktindall nicktindall added :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) >non-issue labels Aug 14, 2025
@nicktindall nicktindall marked this pull request as ready for review August 18, 2025 06:23
@elasticsearchmachine

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination Meta label for Distributed Coordination team label Aug 18, 2025
&& maxThreadPoolQueueLatencyMillis == other.maxThreadPoolQueueLatencyMillis;
}

} // ThreadPoolUsageStats
nicktindall (author):

This is a record and these looked like the default equals/hashCode/toString

@nicktindall nicktindall requested a review from mhl-b August 18, 2025 06:25
@mhl-b mhl-b left a comment


LGTM with nits


private void callReroute(Set<String> hotSpottedNodes) {
    final String reason = Strings.format(
        "write load constraint monitor: Found %d node(s) exceeding the write thread pool queue latency threshold",
Reviewer:

nit: can you add the total and below-threshold node counts, please? Maybe also inline callReroute.

Reviewer:

Would it be reasonable to list all of the node IDs for actively hot-spotting nodes? That'd make it quite clear which nodes caused the rebalancing work, giving a lead where to investigate further.

The only risk I can think of is that a very large cluster could end up listing a lot of nodes. That'd be in a very unhappy large cluster, but we could put an upper limit on how many nodes we'll list.

nicktindall (author):

Updated to include a limited number of hot-spotting and under-threshold node IDs, and added the total number of nodes. Also in-lined callReroute.

See b816a15
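For illustration, a capped node-ID listing along the lines described might look like this. This is a hypothetical helper, not the code from b816a15:

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: list at most `limit` node IDs and note how many were
// omitted, handling the empty case. Not the actual implementation from this PR.
class NodeIdSummary {
    static String summarize(List<String> nodeIds, int limit) {
        if (nodeIds.isEmpty()) {
            return "none";
        }
        String shown = nodeIds.stream().limit(limit).collect(Collectors.joining(", "));
        int omitted = nodeIds.size() - Math.min(limit, nodeIds.size());
        return omitted > 0 ? shown + " ... (" + omitted + " more)" : shown;
    }
}
```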

Reviewer:

Now I'm thinking the reroute message may not need all of this; instead, a debug log could include everything mentioned above.

nicktindall (author):

Fixed that method to handle no nodes and the limit properly in 8ad414b

nicktindall (author):

I made the reason string simpler and added the more detailed string as a debug log 996ac06


@DiannaHohensee DiannaHohensee left a comment


Reviewed everything but the test. I'm a little rushed at the end of my day today so apologies if a comment isn't quite clear.

My main suggestion is that we could use a time period after which we'd call reroute again for the same hot-spot. Rather than looking at shard movement activity on a data node. Not sure I thought through everything, maybe there are some counter-arguments.


return;
}

// Remove any over-threshold nodes that already have shards relocating away
Reviewer:

This seems okay to me because the shard started/failed cluster state update provokes a reroute() call. Not sure if that's what you were aiming at? Otherwise, I'd be worried that a cluster that is doing a lot of rebalancing for a significant amount of time may not reconsider shard allocation decisions when there's a hot-spot. This and this bit of code are responsible for the reroute on cluster state update post shard state change, IIUC. Could you add that argument to the comment, if you agree? I think there should be an explanation of why here.

Reviewer:

Additionally, I was originally thinking that the monitor would maintain both that a node is hot-spotting and the timestamp when the node's hot-spot began (and then update the timestamp if/when reroute is called again for the same node). That way, after a reasonable amount of time in which we'd expect the hot-spot to have been addressed, the monitor can instigate reroute again for the same hot-spotting node.

// Remove any over-threshold nodes that already have shards relocating away
final RoutingNodes routingNodes = state.getRoutingNodes();
nodeIdsExceedingLatencyThreshold.removeIf(
    nodeId -> routingNodes.node(nodeId).numberOfShardsWithState(ShardRoutingState.RELOCATING) > 0
Reviewer:

Is this reliable? I haven't followed around how RELOCATING is used, but my understanding is that reconciliation will select a small subset of shard moves from the DesiredBalance and update the cluster state to start those moves. So there could be 100 shard moves queued for nodeA in the DesiredBalance, but maybe reconciliation fulfilled the shard move quota with other nodes. Or nodeA can't move shards until some other target node moves some shards off first. Etc.

Reviewer:

If I do understand correctly, I'd be inclined to move away from this solution, since it'd be difficult to get right, and instead track the timestamp start of a node hot-spot, or the last reroute call because of that node hot-spot, and call reroute again if enough time passes for a particular node hot-spot.

Probably still obey haveCalledRerouteRecently first, but if haven't called recently, allow reroute for the same hot-spot.

@nicktindall nicktindall Aug 19, 2025


This was just to address the following from the ticket

> Wait to call reroute if shards are already scheduled to move away from the hot node, until fresh node write load data is received after those moves have completed. The balancer may already be resolving the hot-spot.

As long as the hot-spotted node has some shards RELOCATING (this is the status for a shard that's moving on the source side, the target side will be INITIALIZING), we won't call reroute for that node/shard. If there are other hot-spotted nodes with no relocations ongoing this won't prevent reroute being called.

You're right this won't take into account undesired allocations. Perhaps a better solution would expand the condition to node has shards with state = RELOCATING || node-has-undesired-allocations?

nicktindall (author):

I don't think we currently have information about undesired allocations. So we'd need to do some additional work to get that into the cluster info if we wanted to implement the above.


@DiannaHohensee DiannaHohensee Aug 19, 2025


> Wait to call reroute if shards are already scheduled to move away from the hot node, until fresh node write load data is received after those moves have completed. The balancer may already be resolving the hot-spot.

Ah yes, my mistake, I didn't consider all the interpretations.

Right, I don't think we have an undesired count saved any place. Even the DesiredBalance is a list of final assignments, and reconciliation looks for nodes missing a shard assignment. There's no running count. Thus the timestamp idea, keeping track of when reroute was last called for a hot-spot, and recalling reroute if X (5 mins?) time has passed and the hot-spot hasn't been resolved. I don't know of much harm in re-calling reroute, as opposed to the risk of not calling it.
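The per-node timestamp idea suggested here could look roughly like the sketch below. Names are hypothetical, and the merged PR ended up keeping a single reroute interval rather than per-node tracking:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the reviewer's suggestion: remember, per node, when reroute was last
// triggered for that node's hot-spot, and allow another reroute for the same node
// once the retry interval has elapsed. Hypothetical, not the merged implementation.
class HotSpotTracker {
    private final Map<String, Long> lastRerouteForNode = new HashMap<>();
    private final long retryIntervalMillis;

    HotSpotTracker(long retryIntervalMillis) {
        this.retryIntervalMillis = retryIntervalMillis;
    }

    boolean shouldRerouteFor(String nodeId, long nowMillis) {
        Long last = lastRerouteForNode.get(nodeId);
        if (last == null || nowMillis - last >= retryIntervalMillis) {
            lastRerouteForNode.put(nodeId, nowMillis);  // record this reroute
            return true;
        }
        return false;
    }
}
```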

nicktindall (author):

> Thus the timestamp idea, keeping track of when reroute was last called for a hot-spot, and recalling reroute if X (5 mins?) time has passed and the hot-spot hasn't been resolved. I don't know of much harm in re-calling reroute, as opposed to the risk of not calling it.

I think that's how this will already work. If the hot-spot is not resolved and haveCalledRerouteRecently == false then we'll call reroute again.

I think it's probably worthwhile to wait until we see no movement away from a hot-spotted node before we decide to intervene. As you pointed out there will potentially be queued moves in the desired balancer that we're not privy to, but this condition seems better than nothing.

Reviewer:

I'd rather we removed this check. We don't know how long ago the DesiredBalance being executed was calculated, and on what data it made allocation choices, which could delay addressing a new hot-spot. We'll wait 30 seconds before calling reroute again, and there is no harm in calling reroute again: if nothing has changed, then the new DesiredBalance will be the same.

> I think that's how this will already work. If the hot-spot is not resolved and haveCalledRerouteRecently == false then we'll call reroute again.

I meant a timestamp per node hot-spot, as opposed to a single timestamp for all nodes (current implementation). But the timestamp handling works as is.

Reviewer:

I think I'd also prefer removing this check and leave it to the deciders/simulator. If simulation says that a shard leaves the node then that will already handle it.

I would prefer not to add further delays though. I think the ClusterInfo poll is enough. We can keep the reroute_interval, but as an operational tool (defaulting to 0s).

nicktindall (author):

Removed this check in 1789a51

logger.debug("rerouting shards: [{}]", explanation);
rerouteService.reroute("disk threshold monitor", Priority.NORMAL, ActionListener.wrap(ignored -> {
final var reroutedClusterState = clusterStateSupplier.get();
if (Sets.difference(nodeIdsBelowUtilizationThreshold, nodeIdsExceedingLatencyThreshold).isEmpty()) {
Reviewer:

Could we make this have no overlap? I'd assume in the filtering code above to populate these two sets that a hot-spotting node is at 100% utilization. It doesn't make a lot of sense to have a hot spot but the node is below 90% utilization (or whatever we set the default to).

Then we'd have a check such as

if (nodeIdsBelowUtilizationThreshold.isEmpty() || nodeIdsExceedingLatencyThreshold.isEmpty()) {
    // Do nothing, because either there aren't any target nodes or there aren't any source hot-spotting nodes.
}

I think we'll have to do some magic in the ES-12623 and ES-12634 to always supply 100% node utilization in the ClusterInfo when a node is hot spotted (in case of strange stat number reports), but that's different work.

nicktindall (author):

Done in ee54418

> I'd assume in the filtering code above to populate these two sets that a hot-spotting node is at 100% utilization.

That's not a safe assumption. It'll probably be very high, but because it's an average there's a good chance it won't be 100%. I don't think we should necessarily fudge the numbers either, especially if we are going to use those numbers for simulation, we'd just be throwing information away.
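The disjoint classification the two of them converge on can be sketched as follows, mirroring the quoted diff but with simplified, illustrative types and thresholds:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Simplified sketch of the disjoint classification discussed above: a node over
// the queue-latency threshold counts as hot-spotted regardless of its utilisation
// (which, being an average, may well be below 100%); only non-hot nodes at or
// below the utilisation threshold are relocation targets. Types are illustrative.
record ThreadPoolStats(long maxQueueLatencyMillis, double avgUtilization) {}

class NodeClassifier {
    static void classify(
        Map<String, ThreadPoolStats> statsByNode,
        long latencyThresholdMillis,
        double utilizationThreshold,
        Set<String> hotSpotted,
        Set<String> relocationTargets
    ) {
        statsByNode.forEach((nodeId, stats) -> {
            if (stats.maxQueueLatencyMillis() > latencyThresholdMillis) {
                hotSpotted.add(nodeId);           // hot even if average utilisation looks low
            } else if (stats.avgUtilization() <= utilizationThreshold) {
                relocationTargets.add(nodeId);    // eligible to receive shards
            }
        });
    }
}
```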

Reviewer:

Ah I see. I was implicitly thinking any node without the queue latency -- so nodes between the low and high thresholds -- would be eligible to receive more shards. But the current implementation is not to do shard movements unless there are nodes below the low threshold (90% cpu usage).

Would we want that behavior? Suppose one node is queueing, and 5 other nodes are at 92% CPU utilization. It seems like it would still be better to initiate rebalancing.

nicktindall (author):

I think the above falls outside our definition of a hot-spot. If we find ourselves in that situation for a long period then I would argue autoscaling is broken.


@mhl-b mhl-b left a comment


still LGTM, thanks


@DiannaHohensee DiannaHohensee left a comment


Only one comment about the monitor logic.

I read the test file, but haven't dug into the test cases. I'll get that turned around tomorrow: first thing on my todo list.



@DiannaHohensee DiannaHohensee left a comment


My last set of comments are just test rename nits.

The only substantial change I'd like is in the comment here. Checking ongoing shard moves is too unpredictable, and detrimental if a DesiredBalance has not been computed on fresh data.


@henningandersen henningandersen left a comment


Left a few comments.

if (writeThreadPoolStats.maxThreadPoolQueueLatencyMillis() > writeLoadConstraintSettings.getQueueLatencyThreshold().millis()) {
    nodeIdsExceedingLatencyThreshold.add(nodeId);
} else if (writeThreadPoolStats.averageThreadPoolUtilization() <= writeLoadConstraintSettings.getHighUtilizationThreshold()) {
    potentialRelocationTargets.add(nodeId);
Reviewer:

I wonder if we need this. I would be fine with calling reroute once for every onNewInfo call with a node having a queue latency above the threshold. Then leave it to the deciders to figure out if anything can move rather than be too smart about it here.

nicktindall (author):

I'm happy to remove it if we don't mind calling reroute potentially more frequently. I thought we were trying to identify hot-spotting in this logic, and our working definition of hot-spotting includes that there are nodes with spare capacity, if I'm not mistaken. But I don't have strong feelings about it. I believe there's work scheduled to do the determination of "hot-spotting" elsewhere, which I'm not 100% clear on.

nicktindall (author):

I removed this additional check; we can put it back in if it's needed later.


public void testRerouteIsNotCalledAgainBeforeMinimumIntervalHasPassed() {
    final TestState testState = createRandomTestStateThatWillTriggerReroute();
    final TimeValue minimumInterval = testState.clusterSettings.get(
        WriteLoadConstraintSettings.WRITE_LOAD_DECIDER_REROUTE_INTERVAL_SETTING

@DiannaHohensee DiannaHohensee Aug 28, 2025


I expect this test will need a non default INTERVAL setting to be meaningful now. Perhaps randomize it.
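Randomizing the interval could be sketched as below. This is a plain-Java stand-in; the real test would use the ES test framework's randomization helpers and a TimeValue setting, and the bounds here are made up:

```java
import java.util.Random;
import java.util.concurrent.TimeUnit;

// Sketch of the suggestion: pick a random, non-zero reroute interval so the test
// meaningfully exercises the interval gate. Bounds and names are hypothetical.
class RandomInterval {
    static long randomNonZeroIntervalMillis(Random random) {
        // between 1 and 60 seconds inclusive
        return TimeUnit.SECONDS.toMillis(1 + random.nextInt(60));
    }
}
```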

nicktindall (author):

Done in 2abd946

    value = "org.elasticsearch.cluster.routing.allocation.WriteLoadConstraintMonitor:DEBUG",
    reason = "ensure we're skipping reroute for the right reason"
)
public void testRerouteIsCalledBeforeMinimumIntervalHasPassedIfNewNodesBecomeHotSpotted() {
Reviewer:

Same as above, maybe a high non-default INTERVAL setting so we're sure it's not applicable here.

nicktindall (author):

Also done in 2abd946


@DiannaHohensee DiannaHohensee left a comment


LGTM. There are a couple of test fixes to set a non-zero interval setting to be realistic again, but that's straightforward.

@nicktindall nicktindall merged commit 31e3c55 into elastic:main Aug 29, 2025
33 checks passed
JeremyDahlgren pushed a commit to JeremyDahlgren/elasticsearch that referenced this pull request Aug 29, 2025
@nicktindall nicktindall deleted the ES-119922_implement_WriteLoadConstraintMontitor branch September 3, 2025 04:19
