Use the last good NodeUsageStatsForThreadPools when a node returns an error #133896
Conversation
    if (nodeUsageStatsForThreadPools != null) {
        cachedValuesForFailed.put(failedNodeException.nodeId(), nodeUsageStatsForThreadPools);
    }
}
Not sure whether it makes sense to cache these things forever, or to put some limit on how long we consider them better than nothing. I can't imagine that being part of the cluster while returning errors for node usage stats requests is a situation that persists for very long.
Only the last seen value for each node is cached. That doesn't seem expensive to save and it'll be refreshed frequently.
A WARN log message for each node that fails to respond would be good, along with the error cause/message. It shouldn't happen often, so I don't expect it'll be noisy.
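A minimal sketch of what that could look like in the failure-handling loop shown above (the log message wording is an assumption, not the actual implementation):

    for (FailedNodeException failedNodeException : response.failures()) {
        // Hedged sketch: surface the failure and its cause before falling back to the cached value
        logger.warn("failed to collect thread pool usage stats from node [" + failedNodeException.nodeId() + "]", failedNodeException);
        final var nodeUsageStatsForThreadPools = lastNodeUsageStatsPerNode.get(failedNodeException.nodeId());
        if (nodeUsageStatsForThreadPools != null) {
            cachedValuesForFailed.put(failedNodeException.nodeId(), nodeUsageStatsForThreadPools);
        }
    }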
Yeah I was more worried about whether there's a point where the cached value is so stale it's not useful, but I think it's probably always better than nothing.
Yes, always better than nothing 👍
A cluster where a node is repeatedly failing to return stats probably has much bigger problems than this stale value.
server/src/main/java/org/elasticsearch/cluster/NodeUsageStatsForThreadPoolsCollector.java (resolved)
These changes seem on track to me 👍
Have you had a chance to explore the ClusterInfoService and ClusterState updates, and whether it's possible for those two pieces of state passed into a balancing computation to be out of sync with regard to newly added nodes? I didn't investigate, but I was wondering whether that is possible. That could leave the balancer looking up a node ID that doesn't exist in the nodeUsageStats (it was never fetched). A newly removed node probably can't do any harm, since the nodeID would never be looked up in the nodeUsageStats. Ah, I see you've opened #133901, too. I missed that initially. All set.
for (FailedNodeException failedNodeException : response.failures()) {
    final var nodeUsageStatsForThreadPools = lastNodeUsageStatsPerNode.get(failedNodeException.nodeId());
    if (nodeUsageStatsForThreadPools != null) {
        cachedValuesForFailed.put(failedNodeException.nodeId(), nodeUsageStatsForThreadPools);
Can lastNodeUsageStatsPerNode be returned directly instead? The putAll above adds the new values for the nodeId keys, so whatever nodes are missing from the new response will not be overridden in lastNodeUsageStatsPerNode.
Yes, perhaps... I wonder if there's some way it could include values for nodes we didn't request? Probably not, if things happen in the sequence we expect them to. I will come back to this.
I think I'd prefer not to, because lastNodeUsageStatsPerNode is internal state; it's mutable and it will be mutated by the collector (to expire values for nodes no longer in the cluster). I think the current approach is more explicit: we take the cached value for any node in response.failures() and we return a static map.
Ah I see, good point about immutability 👍
Would it be sufficient to return a copy of lastNodeUsageStatsPerNode and skip this whole for-loop? No longer present nodes have already been filtered out in a prior stage, and the successful node responses were applied above.
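For illustration, the two shapes being discussed might look roughly like this (the method names and the NodesResponse parameter are placeholders, not the PR's actual code):

    // Alternative raised above: return an immutable snapshot of the internal cache, relying on
    // departed nodes having been filtered out earlier and successful responses applied via putAll.
    private Map<String, NodeUsageStatsForThreadPools> snapshotOfInternalCache() {
        return Map.copyOf(lastNodeUsageStatsPerNode);
    }

    // Shape kept in the PR (per the snippets above): build a fresh map from the successful
    // responses plus the last-seen value for each failed node, so the mutable internal
    // lastNodeUsageStatsPerNode is never exposed to callers.
    private Map<String, NodeUsageStatsForThreadPools> explicitFallback(
        Map<String, NodeUsageStatsForThreadPools> returnedUsageStats,
        NodesResponse response
    ) {
        final Map<String, NodeUsageStatsForThreadPools> cachedValuesForFailed = new HashMap<>(returnedUsageStats);
        for (FailedNodeException failure : response.failures()) {
            final var lastSeen = lastNodeUsageStatsPerNode.get(failure.nodeId());
            if (lastSeen != null) {
                cachedValuesForFailed.put(failure.nodeId(), lastSeen);
            }
        }
        return Map.copyOf(cachedValuesForFailed);
    }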
server/src/main/java/org/elasticsearch/cluster/NodeUsageStatsForThreadPoolsCollector.java (resolved)
  return new NodeUsageStatsForThreadPoolsAction.NodeResponse(
      localNode,
-     new NodeUsageStatsForThreadPools(localNode.getId(), perThreadPool)
+     new NodeUsageStatsForThreadPools(localNode.getId(), Map.of(ThreadPool.Names.WRITE, threadPoolUsageStats))
Needed to do this to use Maps.copyMapWithAddedOrReplacedEntry; it seems like it should be immutable anyhow.
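For context, a hedged sketch of what that utility enables, assuming org.elasticsearch.common.util.Maps#copyMapWithAddedOrReplacedEntry returns a new immutable map with one entry added or replaced (fakeWriteStats is a hypothetical value used only for illustration):

    // The response now carries an immutable single-entry map...
    Map<String, NodeUsageStatsForThreadPools.ThreadPoolUsageStats> perThreadPool =
        Map.of(ThreadPool.Names.WRITE, threadPoolUsageStats);

    // ...so an override (e.g. fake stats in a test) copies rather than mutates:
    Map<String, NodeUsageStatsForThreadPools.ThreadPoolUsageStats> overridden =
        Maps.copyMapWithAddedOrReplacedEntry(perThreadPool, ThreadPool.Names.WRITE, fakeWriteStats);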
        return ClusterInfoServiceUtils.refresh(((InternalClusterInfoService) clusterInfoService));
    }
    return null;
}
Seemed useful to return the actual ClusterInfo here?
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)
I think it's an ok solution, but tolerating stale metrics is the consumer's concern, not the producer's job. The producer does not know what degree of staleness is acceptable. So the producer should produce a fresh set of metrics and not fake missing data, and every consumer should cache (or share a cache) when needed. And I would make it explicit in the consumer's code (Decider/Monitor) when we use cached or fresh values, otherwise it's all blurry.
I guess it depends on whether you consider the collector a producer or a consumer. It would be annoying to have to implement this in multiple places, but I see your point. Perhaps we could add a timestamp to indicate the age of the metrics if it becomes important?
Maybe have a different data structure inside ClusterInfo for the last known measurement, in case the latest is missing. Attaching a timestamp sounds good.
  return new NodeUsageStatsForThreadPoolsAction.NodeResponse(
      localNode,
-     new NodeUsageStatsForThreadPools(localNode.getId(), perThreadPool)
+     new NodeUsageStatsForThreadPools(localNode.getId(), Map.of(ThreadPool.Names.WRITE, threadPoolUsageStats), Instant.now())
This timestamp being added on the source node could be problematic if there were clock skew in the cluster. I wonder if it should be recorded on the client side instead.
Or should any "don't trust this if it's older than X" threshold be of a magnitude where we don't need to worry about clock skew?
I'm not enthusiastic about the timestamp here. There's no use for the timestamp in the code and it's not obviously logged anyplace. It's invasive to add and seems to be trying to solve a problem we don't have.
Logging a WARN message whenever we fail to get fresh stats from a node would be sufficient to convey the time when that happens -- very rarely -- and that an issue occurred. It's reasonable to log a WARN message because the cluster is going to be in distress if there are repeated failures to fetch stats from a single or multiple nodes.
Yeah, fair, it also requires the addition of a transport version. @mhl-b wdyt? I can easily revert it. I think I'm with @DiannaHohensee, it's something we can add when we need it.
I'm ok with that. I still think unbounded staleness is useless, even harmful: utilization from 5 minutes ago has no meaning. Maybe we should allow only one missing measurement, without a timestamp, but if it's missed twice we don't report anything.
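(A rough, hypothetical sketch of that idea; the class and the names are purely illustrative, not anything in the PR:)

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical sketch of "allow one missed measurement, then report nothing":
    // the collector would keep a per-node counter of consecutive misses.
    class ConsecutiveMissTracker {
        private final Map<String, Integer> consecutiveMisses = new HashMap<>();

        void recordSuccess(String nodeId) {
            consecutiveMisses.remove(nodeId);
        }

        // Returns true if the cached value may still be used for this node (at most one miss so far).
        boolean recordMissAndAllowFallback(String nodeId) {
            return consecutiveMisses.merge(nodeId, 1, Integer::sum) <= 1;
        }
    }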
"it's something we can add when we need it."
I don't think we can tell for sure once we blend fresh and stale metrics together. It would be some lagging node that starts to impact allocation decisions.
So you would track count of misses then?
No, I mean use the timestamp. Once it's over a certain age we log a warning (I think we do something similar for autoscaling metrics)
Then if we see it happening a lot or implicated in issues we can decide what to do about it
Does that mean you will keep the current version with the Instant, but track the time on the client side rather than on the source node?
It probably makes more sense to track it on the client (in the collector); that way we can probably avoid transport version changes and don't have to worry about clock skew.
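A minimal sketch of what client-side tracking in the collector could look like, under the assumption that the collector timestamps values as it receives them; the class, names, and threshold are illustrative, not the actual implementation:

    import java.time.Duration;
    import java.time.Instant;
    import java.util.HashMap;
    import java.util.Map;

    // Illustrative only: pairs the last good stats with the time the collector received them,
    // so staleness is measured against the collector's own clock and clock skew doesn't matter.
    class LastGoodStatsCache<T> {
        record Entry<S>(S stats, Instant receivedAt) {}

        private final Map<String, Entry<T>> byNodeId = new HashMap<>();
        private final Duration maxStaleness;

        LastGoodStatsCache(Duration maxStaleness) {
            this.maxStaleness = maxStaleness; // e.g. a few polling intervals
        }

        void recordSuccess(String nodeId, T stats) {
            byNodeId.put(nodeId, new Entry<>(stats, Instant.now()));
        }

        // Cached value for a node that failed to respond, or null if we have none.
        T lastGood(String nodeId) {
            final Entry<T> entry = byNodeId.get(nodeId);
            return entry == null ? null : entry.stats();
        }

        // True when the cached value is older than the threshold; the caller could log a WARN here.
        boolean isStale(String nodeId) {
            final Entry<T> entry = byNodeId.get(nodeId);
            return entry != null && Duration.between(entry.receivedAt(), Instant.now()).compareTo(maxStaleness) > 0;
        }
    }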
        }
    }
    return true;
}
Redundant because it's a record
I added a timestamp to the records. I don't think it's necessary to expose details about successful fetches to the consumer; if someone cares about the age of a record, they should be able to determine that from the timestamp?
I took another look. If there's a good reason to add a timestamp, then that'd be fine. I can't currently see one, though, so that needs explanation.
...ernalClusterTest/java/org/elasticsearch/cluster/NodeUsageStatsForThreadPoolsCollectorIT.java (outdated, resolved)
// Add in the last-seen usage stats for any nodes that failed to respond
final Map<String, NodeUsageStatsForThreadPools> cachedValuesForFailed = new HashMap<>(returnedUsageStats);
for (FailedNodeException failedNodeException : response.failures()) {
    final var nodeUsageStatsForThreadPools = lastNodeUsageStatsPerNode.get(failedNodeException.nodeId());
I wonder, instead of using the last value as a fallback, can we have a specific NodeUsageStatsForThreadPools object representing failure? Other parts of the code would have to check it explicitly to make a decision, e.g. the write load decider potentially rejecting allocation?
My thinking is that we probably don't want to fall back more than a few times, i.e. the last value needs to expire at a certain point. I guess that's probably the reason you added the received timestamp? In that case, we still have to address what we use to indicate a "failed and expired" entry. If a node fails to respond to ClusterInfo polling, it is likely overloaded, e.g. CBE. So it seems safer to assume rejection or overall no movement for the node?
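As a purely hypothetical illustration of that suggestion (not something the PR implements), the collector could publish a dedicated sentinel value that consumers would have to check explicitly; the node id and empty map below are placeholders:

    // Hypothetical sentinel marking a node whose stats could not be fetched. Consumers such as a
    // write-load decider would branch on it explicitly (e.g. reject new allocations to the node,
    // or assume no movement) instead of silently reusing a stale value.
    public static final NodeUsageStatsForThreadPools FAILED_TO_COLLECT =
        new NodeUsageStatsForThreadPools("_unknown_", Map.of());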
LGTM
Had only minor comments.
server/src/main/java/org/elasticsearch/cluster/NodeUsageStatsForThreadPoolsCollector.java (outdated, resolved)
import static org.hamcrest.Matchers.equalTo;
import static org.hamcrest.Matchers.hasKey;

@ESIntegTestCase.ClusterScope(scope = ESIntegTestCase.Scope.TEST)
Is this annotation necessary?
No, fixed in d5b17ab
// The next response should also contain our fake values
refreshClusterInfoAndAssertThreadPoolHasStats(
    dataNodeClusterService.localNode().getId(),
    threadPoolName,
    totalThreadPoolThreads,
    averageThreadPoolUtilization,
    maxThreadPoolQueueLatencyMillis
);
I'd add one more step to ensure the new value is used once the node has recovered from the error.
Good call, added in ee9acc5
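Presumably that step looks something like the following sketch, built from the helpers already shown in this test (the expected "recovered" values are hypothetical, not the exact code in ee9acc5):

    // Sketch: restore normal request handling, then verify the next refresh picks up the
    // node's real stats again rather than the cached fake values.
    dataNodeTransportService.clearInboundRules();
    refreshClusterInfoAndAssertThreadPoolHasStats(
        dataNodeClusterService.localNode().getId(),
        threadPoolName,
        recoveredTotalThreadPoolThreads,
        recoveredAverageThreadPoolUtilization,
        recoveredMaxThreadPoolQueueLatencyMillis
    );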
LGTM!
// Now simulate an error
dataNodeTransportService.clearInboundRules();
dataNodeTransportService.addRequestHandlingBehavior(
    TransportNodeUsageStatsForThreadPoolsAction.NAME + "[n]",
    (handler, request, channel, task) -> {
        channel.sendResponse(new Exception("simulated error"));
    }
);
neat
LGTM
When a node returns an error response to the NodeUsageStatsForThreadPoolsCollector, use the most recent good value we've seen for that node, rather than returning nothing.
I believe this is the more important scenario to cover. I don't think we need to do anything special for nodes with no NodeUsageStatsForThreadPools value in the ClusterInfo, because that situation should be very brief: we refresh our ClusterInfo eagerly when a new node joins the cluster.
Relates: ES-12621