schase-es
Contributor

This commit exposes the BalancerRoundSummary as a collection of APM/OpenTelemetry metrics. These values are already logged. The summary, collected every ten seconds or so, is set as the current state in the telemetry metrics class (AllocationBalancingRoundMetrics). Whenever telemetry collection runs, each metric picks up its current view of that state.
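
For illustration, the "set the current state, let each metric read it at collection time" pattern described above might look roughly like the sketch below. `RoundSummary`, `BalancingRoundState`, and their fields are hypothetical stand-ins and are not the types in this PR; the suppliers only mimic what a gauge callback would call.

```java
import java.util.Objects;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.LongSupplier;

// Hypothetical stand-in for the real summary type; the field names are assumptions.
record RoundSummary(long balancingRounds, long shardMoves) {
    static final RoundSummary EMPTY = new RoundSummary(0, 0);
}

// Holds the most recent summary; readers see whatever was set last.
class BalancingRoundState {
    private final AtomicReference<RoundSummary> latest = new AtomicReference<>(RoundSummary.EMPTY);

    // Called by the producer (roughly every ten seconds) with a freshly built summary.
    void updateRoundMetrics(RoundSummary summary) {
        latest.set(Objects.requireNonNull(summary, "balancing round summary cannot be null"));
    }

    // Each supplier is what a gauge callback would invoke when telemetry collection runs.
    LongSupplier balancingRoundsGauge() {
        return () -> latest.get().balancingRounds();
    }

    LongSupplier shardMovesGauge() {
        return () -> latest.get().shardMoves();
    }
}
```

The suppliers read the reference lazily, so a value set between two collection runs is simply what the next run observes.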

Fixes: ES-10343

@elasticsearchmachine elasticsearchmachine added needs:triage Requires assignment of a team area label v9.3.0 labels Oct 6, 2025
@schase-es schase-es added :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) >enhancement and removed needs:triage Requires assignment of a team area label labels Oct 8, 2025
@schase-es schase-es marked this pull request as draft October 8, 2025 20:15
@elasticsearchmachine
Collaborator

Hi @schase-es, I've created a changelog YAML for you.

Contributor

@nicktindall nicktindall left a comment

Just some comments. I think we should discuss with Dianna what we're aiming for with the node deltas.

bind(AllocationStatsService.class).toInstance(allocationStatsService);
bind(TelemetryProvider.class).toInstance(telemetryProvider);
bind(DesiredBalanceMetrics.class).toInstance(desiredBalanceMetrics);
bind(AllocationBalancingRoundMetrics.class).toInstance(balancingRoundMetrics);
Contributor

Binding like this is only necessary if we use the instance in a constructor annotated with @Inject; I'm not sure whether we do with AllocationBalancingRoundMetrics. It probably doesn't matter, but in general I think we don't bind unless we need to.

assert summary != null : "balancing round metrics cannot be null";

nodeNameToWeightChangesRef.set(summary.nodeNameToWeightChanges());
if (enableSending) {
Contributor

Do we need to set nodeNameToWeightChangesRef if enableSending is false?
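
One way to read this suggestion is to move the assignment behind the flag, so nothing is stored when sending is disabled. The sketch below is an assumption about the shape of such a change, not the PR's code: `GuardedRoundMetrics` is a hypothetical class, and the map value type is simplified to `Long`.

```java
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical holder; the real class keeps richer per-node weight-change objects.
class GuardedRoundMetrics {
    private final boolean enableSending;
    private final AtomicReference<Map<String, Long>> nodeNameToWeightChangesRef =
        new AtomicReference<>(Map.of());

    GuardedRoundMetrics(boolean enableSending) {
        this.enableSending = enableSending;
    }

    void updateBalancingRoundMetrics(Map<String, Long> nodeNameToWeightChanges) {
        Objects.requireNonNull(nodeNameToWeightChanges, "balancing round metrics cannot be null");
        if (enableSending == false) {
            // Nothing will read the reference, so skip the update entirely.
            return;
        }
        nodeNameToWeightChangesRef.set(Map.copyOf(nodeNameToWeightChanges));
    }

    Map<String, Long> currentWeightChanges() {
        return nodeNameToWeightChangesRef.get();
    }
}
```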

long shardCount = nodeWeightChanges.baseWeights().shardCount() + nodeWeightChanges.weightsDiff().shardCountDiff();
metrics.add(new LongWithAttributes(shardCount, getNodeAttributes(nodeWeights.getKey())));
}
return metrics;
Contributor

I don't think we want the current values of shard count, disk usage and write load, because we already have those in the cluster balance dashboard.

Perhaps the deltas are more interesting? I think number of balancing rounds, and shard movements are definitely valuable as you have them published now, but I'm less clear on the value of the specific node shard/weight/disk usage deltas/values that we don't get already from existing metrics.

Maybe the absolute amount of change in those values (e.g. if one node loses X weight and another gains X, the sum would be 2X) would be interesting?

If we plotted any of these values for the cluster by simply adding them together, they'd sum to zero, I think, because for every node gaining X shards there are other node(s) losing X shards. @DiannaHohensee might have more clarity on the direction here.
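
To make the arithmetic concrete, here is a small sketch comparing the plain sum of per-node deltas with the sum of their absolute values. `ShardDeltaTotals` and the node names are hypothetical and exist only for this example.

```java
import java.util.Map;

// Hypothetical helper: contrasts the net change with the absolute amount of change.
class ShardDeltaTotals {
    // Signed sum: gains and losses cancel when shards only move between nodes.
    static long netChange(Map<String, Long> deltasByNode) {
        return deltasByNode.values().stream().mapToLong(Long::longValue).sum();
    }

    // Sum of magnitudes: a move of X shards is counted once as a loss and once as a gain.
    static long absoluteChange(Map<String, Long> deltasByNode) {
        return deltasByNode.values().stream().mapToLong(v -> Math.abs(v.longValue())).sum();
    }
}
```

With `Map.of("node-a", -3L, "node-b", 3L)`, `netChange` returns 0 and `absoluteChange` returns 6, which is the 2X case described above.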

I think for serverless as well, disk usage will always be zero after this change?

Contributor

As discussed, I think we should reduce the scope of this PR to just the two metrics we know how to publish, and do a second PR once we've discussed how we'd like to publish the write load, shard count and disk usage.


private Map<String, Object> getNodeAttributes(String nodeId) {
return Map.of("node_id", nodeId);
}
Contributor

I think metrics by default get a node_id and a corresponding node_name, for the node they are emitted from. Perhaps we should use something other than node_id here? Or set node_name also?
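
As a sketch of the second option, the attributes could carry both an ID and a name under keys that don't collide with the emitting node's defaults. The key names `target_node_id` and `target_node_name`, and the `NodeAttributes` class, are purely illustrative assumptions and not an agreed convention.

```java
import java.util.Map;

// Hypothetical helper: the attribute keys below are placeholders, not an established convention.
class NodeAttributes {
    static Map<String, Object> forTargetNode(String nodeId, String nodeName) {
        // Distinct keys avoid clashing with the node_id/node_name of the emitting node.
        return Map.of("target_node_id", nodeId, "target_node_name", nodeName);
    }
}
```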

Labels

:Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) >enhancement v9.3.0

3 participants