allocation: add balancer round summary as metrics #136043
base: main
This commit adds the BalancerRoundSummary as a collection of APM/OpenTelemetry metrics. The same data is already logged. The summary, collected roughly every ten seconds, is set as the current state in the telemetry metrics class (AllocationBalancingRoundMetrics). Whenever telemetry collection runs, each metric reads its current view of that state.
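The "set the current state, let each metric read its current view" pattern described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's actual code: the class `BalancingRoundGaugeSketch`, the record `RoundSummary`, and the two gauge methods are hypothetical stand-ins, and plain `LongSupplier` callbacks stand in for whatever the telemetry layer really registers.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.LongSupplier;

/** Hypothetical sketch; not the PR's AllocationBalancingRoundMetrics class. */
class BalancingRoundGaugeSketch {

    /** Hypothetical stand-in for the balancer round summary. */
    record RoundSummary(long balancingRounds, long shardMoves) {}

    // Latest summary published by the balancer; each update overwrites the previous one.
    private final AtomicReference<RoundSummary> latest = new AtomicReference<>(new RoundSummary(0, 0));

    /** Called by the producer whenever a new summary is ready. */
    void updateRoundMetrics(RoundSummary summary) {
        if (summary == null) {
            throw new IllegalArgumentException("balancing round summary cannot be null");
        }
        latest.set(summary);
    }

    /** Callback a telemetry layer could poll on its own schedule. */
    LongSupplier balancingRoundsGauge() {
        return () -> latest.get().balancingRounds();
    }

    /** Second callback reading from the same shared state. */
    LongSupplier shardMovesGauge() {
        return () -> latest.get().shardMoves();
    }

    public static void main(String[] args) {
        BalancingRoundGaugeSketch metrics = new BalancingRoundGaugeSketch();
        LongSupplier rounds = metrics.balancingRoundsGauge();

        metrics.updateRoundMetrics(new RoundSummary(3, 7));
        System.out.println(rounds.getAsLong()); // reads the latest summary: 3

        metrics.updateRoundMetrics(new RoundSummary(5, 11));
        System.out.println(rounds.getAsLong()); // reads the newer summary: 5
    }
}
```

The producer and the telemetry reader never coordinate directly here; a reader only ever sees the most recently stored summary, so any summaries overwritten between two telemetry runs are not reported.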
Hi @schase-es, I've created a changelog YAML for you.
Just some comments. I think we should discuss with Dianna what we're aiming for with the node deltas.
bind(AllocationStatsService.class).toInstance(allocationStatsService);
bind(TelemetryProvider.class).toInstance(telemetryProvider);
bind(DesiredBalanceMetrics.class).toInstance(desiredBalanceMetrics);
bind(AllocationBalancingRoundMetrics.class).toInstance(balancingRoundMetrics);
Binding like this is only necessary if we use the instance in a constructor annotated `@Inject`. I'm not sure whether we do with the `AllocationBalancingRoundMetrics`? It probably doesn't matter, but I think in general we don't do it unless we need to.
assert summary != null : "balancing round metrics cannot be null";

nodeNameToWeightChangesRef.set(summary.nodeNameToWeightChanges());
if (enableSending) {
Do we need to set the `nodeNameToWeightChangesRef` if `enableSending = false`?
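One way to read this question is that the reference could be updated only when sending is enabled. The sketch below shows that variant and nothing more. The class `GuardedUpdateSketch`, its constructor, and the `Map<String, Long>` payload are hypothetical simplifications; only the names `nodeNameToWeightChangesRef` and `enableSending` come from the diff above.

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical sketch of skipping the state update when sending is disabled. */
class GuardedUpdateSketch {

    private final boolean enableSending;

    // Simplified payload: node name -> weight change (the real type in the PR differs).
    private final AtomicReference<Map<String, Long>> nodeNameToWeightChangesRef = new AtomicReference<>(Map.of());

    GuardedUpdateSketch(boolean enableSending) {
        this.enableSending = enableSending;
    }

    void updateRoundMetrics(Map<String, Long> nodeNameToWeightChanges) {
        if (enableSending == false) {
            return; // nothing will be published, so no state is retained
        }
        // Defensive copy so later changes to the caller's map are not observed.
        nodeNameToWeightChangesRef.set(Map.copyOf(nodeNameToWeightChanges));
    }

    Map<String, Long> currentView() {
        return nodeNameToWeightChangesRef.get();
    }

    public static void main(String[] args) {
        GuardedUpdateSketch disabled = new GuardedUpdateSketch(false);
        disabled.updateRoundMetrics(Map.of("node-a", 4L));
        System.out.println(disabled.currentView().isEmpty()); // true: update was skipped

        GuardedUpdateSketch enabled = new GuardedUpdateSketch(true);
        enabled.updateRoundMetrics(Map.of("node-a", 4L));
        System.out.println(enabled.currentView().get("node-a")); // 4
    }
}
```

Whether the guard belongs before or after the `set` call depends on whether anything else reads the reference while sending is off, which the snippet in the diff does not show.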
long shardCount = nodeWeightChanges.baseWeights().shardCount() + nodeWeightChanges.weightsDiff().shardCountDiff();
metrics.add(new LongWithAttributes(shardCount, getNodeAttributes(nodeWeights.getKey())));
}
return metrics;
I don't think we want the current values of shard count, disk usage and write load, because we already have those in the cluster balance dashboard.
Perhaps the deltas are more interesting? I think the number of balancing rounds and shard movements are definitely valuable as you have them published now, but I'm less clear on the value of the specific per-node shard/weight/disk usage deltas or values that we don't already get from existing metrics.
Maybe the absolute amount of change in those values (e.g. if one node loses X weight and another gains X, the sum would be 2X) might be interesting?
If we plotted any of these values for the cluster by simply adding them together, they'd sum to zero, I think, because for every node gaining X shards there are other node(s) losing X shards. @DiannaHohensee might have more clarity on the direction here.
I think for serverless as well, disk usage will always be zero after this change?
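The arithmetic behind the "sum to zero" versus "2X" point can be checked with a small example. This is only an illustration: the class `NodeDeltaSumSketch`, its methods, and the node names and numbers are made up, and shard count diffs are simplified to one `long` per node.

```java
import java.util.Map;

/** Hypothetical sketch comparing signed and absolute sums of per-node deltas. */
class NodeDeltaSumSketch {

    /** Adds the deltas as-is; gains and losses cancel out across the cluster. */
    static long signedSum(Map<String, Long> shardCountDiffByNode) {
        return shardCountDiffByNode.values().stream().mapToLong(Long::longValue).sum();
    }

    /** Adds the magnitudes; every moved shard is counted once as a loss and once as a gain. */
    static long absoluteSum(Map<String, Long> shardCountDiffByNode) {
        return shardCountDiffByNode.values().stream().mapToLong(d -> Math.abs(d)).sum();
    }

    public static void main(String[] args) {
        // Made-up round: node-a loses 5 shards, node-b gains 3, node-c gains 2.
        Map<String, Long> diffs = Map.of("node-a", -5L, "node-b", 3L, "node-c", 2L);

        System.out.println(signedSum(diffs));   // 0
        System.out.println(absoluteSum(diffs)); // 10, i.e. 2X with X = 5
    }
}
```

Halving the absolute sum gives the number of shards that changed nodes in this simplified model, which overlaps with the shard movement count that is already published.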
As discussed, I think we should reduce the scope of this PR to just the two metrics we know how to publish, and do a second PR once we've discussed how we'd like to publish the write load/shard count/disk usage.
private Map<String, Object> getNodeAttributes(String nodeId) {
    return Map.of("node_id", nodeId);
}
I think metrics by default get a `node_id` and a corresponding `node_name` for the node they are emitted from. Perhaps we should use something other than `node_id` here? Or set `node_name` also?
Fixes: ES-10343