Conversation
@nicktindall nicktindall commented Aug 14, 2025

Implement WriteLoadConstraintMonitor

It will call reroute only when all of the following hold:

  • The decider is enabled
  • The cluster state is recovered
  • At least one node is above the configured queue latency threshold
  • At least one other node is
    • Below the configured utilisation threshold
    • AND below the configured queue latency threshold
    • AND currently has no shards relocating off of it
  • We haven't called reroute within the configured minimum reroute interval

Relates ES-11992
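The conditions above can be sketched as a small standalone predicate. All names and the signature here are illustrative, not the actual WriteLoadConstraintMonitor API:

```java
import java.util.Set;

// Standalone sketch of the gating conditions listed above.
// All names are illustrative; this is not the actual Elasticsearch code.
class RerouteGate {
    static boolean shouldReroute(
        boolean deciderEnabled,
        boolean clusterStateRecovered,
        Set<String> nodesAboveLatencyThreshold,   // hot-spotted nodes
        Set<String> viableTargetNodes,            // below both thresholds, no relocations off them
        long nowMillis,
        long lastRerouteMillis,
        long minRerouteIntervalMillis
    ) {
        if (deciderEnabled == false || clusterStateRecovered == false) {
            return false;
        }
        if (nodesAboveLatencyThreshold.isEmpty() || viableTargetNodes.isEmpty()) {
            return false;
        }
        // Respect the configured minimum reroute interval
        return nowMillis - lastRerouteMillis >= minRerouteIntervalMillis;
    }
}
```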

@nicktindall nicktindall added :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) >non-issue labels Aug 14, 2025
@nicktindall nicktindall marked this pull request as ready for review August 18, 2025 06:23
@elasticsearchmachine

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination Meta label for Distributed Coordination team label Aug 18, 2025
&& maxThreadPoolQueueLatencyMillis == other.maxThreadPoolQueueLatencyMillis;
}

} // ThreadPoolUsageStats
nicktindall (author):

This is a record and these looked like the default equals/hashCode/toString

@nicktindall nicktindall requested a review from mhl-b August 18, 2025 06:25
@mhl-b mhl-b left a comment


LGTM with nits


private void callReroute(Set<String> hotSpottedNodes) {
    final String reason = Strings.format(
        "write load constraint monitor: Found %d node(s) exceeding the write thread pool queue latency threshold",
Reviewer:

nit: can you add the total and below-threshold node counts, please? Maybe also inline callReroute.

Reviewer:

Would it be reasonable to list all of the node IDs for actively hot-spotting nodes? That'd make it quite clear which nodes caused the rebalancing work, giving a lead where to investigate further.

The only risk I can think of is that a very large cluster could end up listing a lot of nodes. That'd be in a very unhappy large cluster, but we could put an upper limit on how many nodes we'll list.

nicktindall (author):

Updated to include a limited number of hot-spotting and under-threshold node IDs, and added the total number of nodes. Also in-lined callReroute.

See b816a15
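For illustration, a capped node-ID listing along the lines described might look like this. This is a hypothetical helper, not the code from b816a15:

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: list at most `limit` node IDs and note how many were
// omitted, handling the empty case. Not the actual implementation from this PR.
class NodeIdSummary {
    static String summarize(List<String> nodeIds, int limit) {
        if (nodeIds.isEmpty()) {
            return "none";
        }
        String shown = nodeIds.stream().limit(limit).collect(Collectors.joining(", "));
        int omitted = nodeIds.size() - Math.min(limit, nodeIds.size());
        return omitted > 0 ? shown + " ... (" + omitted + " more)" : shown;
    }
}
```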

Reviewer:

Now I'm thinking the reroute message may not need all of this; instead, a debug log could include everything mentioned above.

nicktindall (author):

Fixed that method to handle no nodes and the limit properly in 8ad414b

nicktindall (author):

I made the reason string simpler and added the more detailed string as a debug log 996ac06


@DiannaHohensee DiannaHohensee left a comment


Reviewed everything but the test. I'm a little rushed at the end of my day today so apologies if a comment isn't quite clear.

My main suggestion is that we could use a time period after which we'd call reroute again for the same hot-spot. Rather than looking at shard movement activity on a data node. Not sure I thought through everything, maybe there are some counter-arguments.


return;
}

// Remove any over-threshold nodes that already have shards relocating away
Reviewer:

This seems okay to me because the shard started/failed cluster state update provokes a reroute() call. Not sure if that's what you were aiming at? Otherwise, I'd be worried that a cluster that is doing a lot of rebalancing for a significant amount of time may not reconsider shard allocation decisions when there's a hot-spot. This and this bit of code are responsible for the reroute on cluster state update post shard state change, IIUC. Could you add that argument to the comment, if you agree? I think there should be an explanation of why here.

Reviewer:

Additionally, I was originally thinking that the monitor would maintain both that a node is hot-spotting and the timestamp when the node's hot-spot began (and then update the timestamp if/when reroute is called again for the same node). That way, after a reasonable amount of time in which we'd expect the hot-spot to have been addressed, the monitor can instigate reroute again for the same hot-spotting node.

// Remove any over-threshold nodes that already have shards relocating away
final RoutingNodes routingNodes = state.getRoutingNodes();
nodeIdsExceedingLatencyThreshold.removeIf(
    nodeId -> routingNodes.node(nodeId).numberOfShardsWithState(ShardRoutingState.RELOCATING) > 0
Reviewer:

Is this reliable? I haven't followed around how RELOCATING is used, but my understanding is that reconciliation will select a small subset of shard moves from the DesiredBalance and update the cluster state to start those moves. So there could be 100 shard moves queued for nodeA in the DesiredBalance, but maybe reconciliation fulfilled the shard move quota with other nodes. Or nodeA can't move shards until some other target node moves some shards off first. Etc.

Reviewer:

If I do understand correctly, I'd be inclined to move away from this solution, since it'd be difficult to get right, and instead track the timestamp start of a node hot-spot, or the last reroute call because of that node hot-spot, and call reroute again if enough time passes for a particular node hot-spot.

Probably still obey haveCalledRerouteRecently first, but if haven't called recently, allow reroute for the same hot-spot.

@nicktindall nicktindall Aug 19, 2025


This was just to address the following from the ticket

> Wait to call reroute if shards are already scheduled to move away from the hot node, until fresh node write load data is received after those moves have completed. The balancer may already be resolving the hot-spot.

As long as the hot-spotted node has some shards RELOCATING (this is the status for a shard that's moving on the source side, the target side will be INITIALIZING), we won't call reroute for that node/shard. If there are other hot-spotted nodes with no relocations ongoing this won't prevent reroute being called.

You're right this won't take into account undesired allocations. Perhaps a better solution would expand the condition to node has shards with state = RELOCATING || node-has-undesired-allocations?

nicktindall (author):

I don't think we currently have information about undesired allocations. So we'd need to do some additional work to get that into the cluster info if we wanted to implement the above.


@DiannaHohensee DiannaHohensee Aug 19, 2025


> Wait to call reroute if shards are already scheduled to move away from the hot node, until fresh node write load data is received after those moves have completed. The balancer may already be resolving the hot-spot.

Ah yes, my mistake, I didn't consider all the interpretations.

Right, I don't think we have an undesired count saved any place. Even the DesiredBalance is a list of final assignments, and reconciliation looks for nodes missing a shard assignment. There's no running count. Thus the timestamp idea, keeping track of when reroute was last called for a hot-spot, and recalling reroute if X (5 mins?) time has passed and the hot-spot hasn't been resolved. I don't know of much harm in re-calling reroute, as opposed to the risk of not calling it.
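The per-node timestamp idea suggested here could look roughly like the sketch below. Names are hypothetical, and the merged PR ended up keeping a single reroute interval rather than per-node tracking:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the reviewer's suggestion: remember, per node, when reroute was last
// triggered for that node's hot-spot, and allow another reroute for the same node
// once the retry interval has elapsed. Hypothetical, not the merged implementation.
class HotSpotTracker {
    private final Map<String, Long> lastRerouteForNode = new HashMap<>();
    private final long retryIntervalMillis;

    HotSpotTracker(long retryIntervalMillis) {
        this.retryIntervalMillis = retryIntervalMillis;
    }

    boolean shouldRerouteFor(String nodeId, long nowMillis) {
        Long last = lastRerouteForNode.get(nodeId);
        if (last == null || nowMillis - last >= retryIntervalMillis) {
            lastRerouteForNode.put(nodeId, nowMillis);  // record this reroute
            return true;
        }
        return false;
    }
}
```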

nicktindall (author):

> Thus the timestamp idea, keeping track of when reroute was last called for a hot-spot, and recalling reroute if X (5 mins?) time has passed and the hot-spot hasn't been resolved. I don't know of much harm in re-calling reroute, as opposed to the risk of not calling it.

I think that's how this will already work. If the hot-spot is not resolved and haveCalledRerouteRecently == false then we'll call reroute again.

I think it's probably worthwhile to wait until we see no movement away from a hot-spotted node before we decide to intervene. As you pointed out there will potentially be queued moves in the desired balancer that we're not privy to, but this condition seems better than nothing.

Reviewer:

I'd rather we removed this check. We don't know how long ago the DesiredBalance being executed was calculated, and on what data it made allocation choices, which could delay addressing a new hot-spot. We'll wait 30 seconds before calling reroute again, and there is no harm in calling reroute again: if nothing has changed, then the new DesiredBalance will be the same.

> I think that's how this will already work. If the hot-spot is not resolved and haveCalledRerouteRecently == false then we'll call reroute again.

I meant a timestamp per node hot-spot, as opposed to a single timestamp for all nodes (current implementation). But the timestamp handling works as is.

Reviewer:

I think I'd also prefer removing this check and leave it to the deciders/simulator. If simulation says that a shard leaves the node then that will already handle it.

I would prefer not to add further delays though. I think the ClusterInfo poll is enough. We can keep the reroute_interval, but as an operational tool (defaulting to 0s).

nicktindall (author):

Removed this check in 1789a51

logger.debug("rerouting shards: [{}]", explanation);
rerouteService.reroute("disk threshold monitor", Priority.NORMAL, ActionListener.wrap(ignored -> {
final var reroutedClusterState = clusterStateSupplier.get();
if (Sets.difference(nodeIdsBelowUtilizationThreshold, nodeIdsExceedingLatencyThreshold).isEmpty()) {
Reviewer:

Could we make this have no overlap? I'd assume in the filtering code above to populate these two sets that a hot-spotting node is at 100% utilization. It doesn't make a lot of sense to have a hot spot but the node is below 90% utilization (or whatever we set the default to).

Then we'd have a check such as

if (nodeIdsBelowUtilizationThreshold.isEmpty() || nodeIdsExceedingLatencyThreshold.isEmpty()) {
    // Do nothing, because either there aren't any target nodes or there aren't any source hot-spotting nodes.
}

I think we'll have to do some magic in the ES-12623 and ES-12634 to always supply 100% node utilization in the ClusterInfo when a node is hot spotted (in case of strange stat number reports), but that's different work.

nicktindall (author):

Done in ee54418

> I'd assume in the filtering code above to populate these two sets that a hot-spotting node is at 100% utilization.

That's not a safe assumption. It'll probably be very high, but because it's an average there's a good chance it won't be 100%. I don't think we should necessarily fudge the numbers either, especially if we are going to use those numbers for simulation, we'd just be throwing information away.
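The disjoint classification the two of them converge on can be sketched as follows, mirroring the quoted diff but with simplified, illustrative types and thresholds:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Simplified sketch of the disjoint classification discussed above: a node over
// the queue-latency threshold counts as hot-spotted regardless of its utilisation
// (which, being an average, may well be below 100%); only non-hot nodes at or
// below the utilisation threshold are relocation targets. Types are illustrative.
record ThreadPoolStats(long maxQueueLatencyMillis, double avgUtilization) {}

class NodeClassifier {
    static void classify(
        Map<String, ThreadPoolStats> statsByNode,
        long latencyThresholdMillis,
        double utilizationThreshold,
        Set<String> hotSpotted,
        Set<String> relocationTargets
    ) {
        statsByNode.forEach((nodeId, stats) -> {
            if (stats.maxQueueLatencyMillis() > latencyThresholdMillis) {
                hotSpotted.add(nodeId);           // hot even if average utilisation looks low
            } else if (stats.avgUtilization() <= utilizationThreshold) {
                relocationTargets.add(nodeId);    // eligible to receive shards
            }
        });
    }
}
```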

Reviewer:

Ah I see. I was implicitly thinking any node without the queue latency -- so nodes between the low and high thresholds -- would be eligible to receive more shards. But the current implementation is not to do shard movements unless there are nodes below the low threshold (90% cpu usage).

Would we want that behavior? Suppose one node is queueing, and 5 other nodes are at 92% CPU utilization. It seems like it would still be better to initiate rebalancing.

nicktindall (author):

I think the above falls outside our definition of a hot-spot. If we find ourselves in that situation for a long period then I would argue autoscaling is broken.


@mhl-b mhl-b left a comment


still LGTM, thanks


@DiannaHohensee DiannaHohensee left a comment


Only one comment about the monitor logic.

I read the test file, but haven't dug into the test cases. I'll get that turned around tomorrow: first thing on my todo list.



@DiannaHohensee DiannaHohensee left a comment


My last set of comments are just test rename nits.

The only substantial change I'd like is in the comment here. Checking ongoing shard moves is too unpredictable, and detrimental if a DesiredBalance has not been computed on fresh data.


@henningandersen henningandersen left a comment


Left a few comments.

if (writeThreadPoolStats.maxThreadPoolQueueLatencyMillis() > writeLoadConstraintSettings.getQueueLatencyThreshold().millis()) {
    nodeIdsExceedingLatencyThreshold.add(nodeId);
} else if (writeThreadPoolStats.averageThreadPoolUtilization() <= writeLoadConstraintSettings.getHighUtilizationThreshold()) {
    potentialRelocationTargets.add(nodeId);
Reviewer:

I wonder if we need this. I would be fine with calling reroute once for every onNewInfo call with a node having a queue latency above the threshold. Then leave it to the deciders to figure out if anything can move rather than be too smart about it here.

nicktindall (author):

I'm happy to remove it if we don't mind calling reroute potentially more frequently. I thought we were trying to identify hot-spotting in this logic, and our working definition of hot-spotting includes that there are nodes with spare capacity, if I'm not mistaken. But I don't have strong feelings about it. I believe there's work scheduled to do the determination of "hot-spotting" elsewhere, which I'm not 100% clear on.

nicktindall (author):

I removed this additional check; we can put it back in if it's needed later.


public void testRerouteIsNotCalledAgainBeforeMinimumIntervalHasPassed() {
    final TestState testState = createRandomTestStateThatWillTriggerReroute();
    final TimeValue minimumInterval = testState.clusterSettings.get(
        WriteLoadConstraintSettings.WRITE_LOAD_DECIDER_REROUTE_INTERVAL_SETTING

@DiannaHohensee DiannaHohensee Aug 28, 2025


I expect this test will need a non default INTERVAL setting to be meaningful now. Perhaps randomize it.
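Randomizing the interval could be sketched as below. This is a plain-Java stand-in; the real test would use the ES test framework's randomization helpers and a TimeValue setting, and the bounds here are made up:

```java
import java.util.Random;
import java.util.concurrent.TimeUnit;

// Sketch of the suggestion: pick a random, non-zero reroute interval so the test
// meaningfully exercises the interval gate. Bounds and names are hypothetical.
class RandomInterval {
    static long randomNonZeroIntervalMillis(Random random) {
        // between 1 and 60 seconds inclusive
        return TimeUnit.SECONDS.toMillis(1 + random.nextInt(60));
    }
}
```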

nicktindall (author):

Done in 2abd946

    value = "org.elasticsearch.cluster.routing.allocation.WriteLoadConstraintMonitor:DEBUG",
    reason = "ensure we're skipping reroute for the right reason"
)
public void testRerouteIsCalledBeforeMinimumIntervalHasPassedIfNewNodesBecomeHotSpotted() {
Reviewer:

Same as above, maybe a high non-default INTERVAL setting so we're sure it's not applicable here.

nicktindall (author):

Also done in 2abd946


@DiannaHohensee DiannaHohensee left a comment


LGTM. There are a couple of test fixes to set a non-zero interval setting to be realistic again, but that's straightforward.

@nicktindall nicktindall merged commit 31e3c55 into elastic:main Aug 29, 2025
33 checks passed
JeremyDahlgren pushed a commit to JeremyDahlgren/elasticsearch that referenced this pull request Aug 29, 2025
@nicktindall nicktindall deleted the ES-119922_implement_WriteLoadConstraintMontitor branch September 3, 2025 04:19
