Model movements to nodes with no existing node stats #133901
Conversation
    .map(NodeUsageStatsForThreadPools::threadPoolUsageStatsMap)
    .map(m -> m.get(ThreadPool.Names.WRITE))
    .mapToInt(NodeUsageStatsForThreadPools.ThreadPoolUsageStats::totalThreadPoolThreads)
    .max();
Assuming the max thread pool size is probably "optimistic", we could also be pessimistic and assume the minimum pool size
Since we don't really know what would be best, can we pick the first node in the map? Then we avoid any performance slowdowns with streams or iteration, since this path is going to be hit a lot.
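For illustration, a minimal sketch of the "pick the first node" idea, assuming the map is non-empty and that the chosen node reports a WRITE pool (both would need guarding in real code); variable names other than those quoted from the diff are hypothetical:

```java
// Hypothetical sketch: take an arbitrary (first) node's WRITE pool size instead of
// streaming over every node for the maximum. Assumes originalNodeUsageStatsForThreadPools
// is non-empty and that the node has a WRITE entry; real code would need to check both.
NodeUsageStatsForThreadPools anyNodeStats = originalNodeUsageStatsForThreadPools.values().iterator().next();
int assumedWriteThreads = anyNodeStats.threadPoolUsageStatsMap()
    .get(ThreadPool.Names.WRITE)
    .totalThreadPoolThreads();
```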
    );
    });
    } else {
    logger.debug("No nodes found to estimate write thread pool size, skipping");
If we get here, there were no other nodes in the ClusterInfo to base our estimate off of. In this case we could perhaps assume the same pool size as the local node, but we don't have that information in here. This also seems very unlikely to actually occur.
    largestWriteThreadPool.getAsInt(),
    updateNodeUtilizationWithShardMovements(0.0f, (float) writeLoadDelta, largestWriteThreadPool.getAsInt()),
    0
    )
This will fudge a NodeUsageStatsForThreadPools that is good for our usage (i.e. it has a single entry for the WRITE pool). Nobody else relies on these stats at the moment, so that's probably OK, but if another decider starts using these stats we'd need to be careful about doing this maybe.
We can also probably do better in the event a node with no stats has a shard moved off of it. In that case it doesn't make sense to assume the node was empty before the move, because clearly it was not, but then it seems very unlikely we'd find ourselves in that situation so I don't know how much effort we want to put into improving that estimate.
> This will fudge a NodeUsageStatsForThreadPools that is good for our usage (i.e. it has a single entry for the WRITE pool). Nobody else relies on these stats at the moment, so that's probably OK, but if another decider starts using these stats we'd need to be careful about doing this maybe.

Just FYI, the Decider does take some action based on the ClusterInfo it receives.
> We can also probably do better in the event a node with no stats has a shard moved off of it. In that case it doesn't make sense to assume the node was empty before the move, because clearly it was not, but then it seems very unlikely we'd find ourselves in that situation so I don't know how much effort we want to put into improving that estimate.

I agree, not worth the effort. Not sure the scenario can even happen, actually. A new node joins, we'd have to assign shards to the node first, which means simulation begins. The simulation might be a little off because of the thread pool thread count guess, but otherwise fine.
Left some notes and suggestions.
We could also reconsider saving originalNodeUsageStatsForThreadPools and maintaining a diff, and instead immediately apply the shard movement change to the stored values. I think allowing negative utilization values would preserve precision -- IIRC, that was the argument for keeping a diff. Just a note in case that simplifies the work, or could lead to fewer loops / better performance.
I think I noticed recently in a test that we fetch the ClusterInfo from the simulator a LOT. Hard to say without exploring further / performance testing. Not a problem to solve here, but to explain my interest in minimizing the work in simulatedNodeUsageStatsForThreadPools().
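As a rough sketch of the "apply immediately" alternative described above (everything here, including the names simulatedWriteLoadByNode and applyShardMove, is hypothetical and not taken from the PR; java.util.Map and java.util.HashMap assumed imported):

```java
// Hypothetical alternative: keep a single mutable per-node write-load value and adjust it as
// shards move, instead of storing the original stats plus a delta map that is merged on read.
// Negative intermediate values are tolerated so that moving a shard off and back on again
// does not lose precision.
private final Map<String, Double> simulatedWriteLoadByNode = new HashMap<>();

void applyShardMove(String fromNodeId, String toNodeId, double shardWriteLoad) {
    simulatedWriteLoadByNode.merge(fromNodeId, -shardWriteLoad, Double::sum); // may go negative
    simulatedWriteLoadByNode.merge(toNodeId, shardWriteLoad, Double::sum);
}
```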
    Map<String, NodeUsageStatsForThreadPools> nodeUsageStatsForThreadPools
    ) {
        // Assume the new node has the same size thread pool as the largest existing node
        final OptionalInt largestWriteThreadPool = originalNodeUsageStatsForThreadPools.values()
Right now we have to build the new node basis every time we calculate an updated ClusterInfo. Could we instead immediately add a new 0 utilization node to originalNodeUsageStatsForThreadPools? Then we can move forward always expecting an entry in originalNodeUsageStatsForThreadPools for every node. Might simplify the code.
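A sketch of that suggestion, reusing the constructor argument order visible in the diff above (thread count, utilization, 0); the exact NodeUsageStatsForThreadPools/ThreadPoolUsageStats constructors, assumedWriteThreads, and iterating clusterState.nodes() are assumptions for illustration only:

```java
// Hypothetical: seed a zero-utilization WRITE-only entry for any node missing from the
// original stats, so downstream code can always expect an entry per node.
for (DiscoveryNode node : clusterState.nodes()) {
    originalNodeUsageStatsForThreadPools.computeIfAbsent(
        node.getId(),
        nodeId -> new NodeUsageStatsForThreadPools(
            nodeId,
            Map.of(
                ThreadPool.Names.WRITE,
                new NodeUsageStatsForThreadPools.ThreadPoolUsageStats(assumedWriteThreads, 0.0f, 0)
            )
        )
    );
}
```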
    adjustedNodeUsageStatsForThreadPools.put(entry.getKey(), entry.getValue());
    }
    }
Instead of adding a second non-trivial for-loop in addUsageStatsForAnyNodesNotPresentInOriginalNodeUsageStatsForThreadPools, could we replace the above original for-loop with iteration of the simulatedNodeWriteLoadDeltas data structure to begin with?
Then use something like originalNodeUsageStatsForThreadPools.forEach(adjustedNodeUsageStatsForThreadPools::putIfAbsent) to straight copy the remainder.
This would be in combination with my other suggestion to add the new node to originalNodeUsageStatsForThreadPools beforehand for simplicity.
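A hypothetical shape of that restructuring; applyWriteLoadDelta and the exact map types are stand-ins, and only the identifiers quoted from the diff are real:

```java
// Hypothetical restructuring: drive the first loop off the simulated deltas, then bulk-copy
// every node that had no simulated movements. applyWriteLoadDelta stands in for whatever the
// PR uses to fold a write-load delta into a node's stats.
simulatedNodeWriteLoadDeltas.forEach((nodeId, writeLoadDelta) -> {
    NodeUsageStatsForThreadPools original = originalNodeUsageStatsForThreadPools.get(nodeId);
    adjustedNodeUsageStatsForThreadPools.put(nodeId, applyWriteLoadDelta(original, writeLoadDelta));
});
// Nodes untouched by the simulation keep their original stats.
originalNodeUsageStatsForThreadPools.forEach(adjustedNodeUsageStatsForThreadPools::putIfAbsent);
```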
As discussed, I think we can avoid making this change, I added a test to confirm that we do trigger a  If we agree I will close this? @DiannaHohensee @ywangd @mhl-b
Sounds like it's not possible for a reroute request to use ClusterState containing a node not present in the ClusterInfo loaded for the balancing round? If so, agreed 👍
First attempt at modelling movements to nodes that have no existing node stats.
Still a few questions lingering. Put up for discussion.
In #133896, we cover the case where a node returned an error for node stats but had previously returned values; this PR should address the case where a node was added since the last ClusterInfo was returned and a new one is yet to be generated.
Relates: ES-12621