@DiannaHohensee
Contributor

Lays out the balancer summary information that we want to collect and
eventually report. This is only the class scaffolding; the
implementation is still to come.

Relates ES-10341


This is the outline of the balancer round summary stats I want to track. I'll need to go through the code to find where to collect the desired information: I expect I'll need to collect some metrics during reconcile(), and there's already AllocationStats, which tracks some information. ES-10260 outlines what I'm doing in this patch, as well as where I'm headed. Eventually I need to push the metrics from the *SummaryService into DesiredBalanceMetrics so that APM can pull them (that's a later ticket, ES-10343).
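
To make the shape of a per-round summary concrete, here is a minimal sketch. It is not the scaffolding in this patch: the record name `BalancingRoundSummary`, the `combine` helper, and the choice of fields are assumptions for illustration, with the counter names borrowed from the discussion on this PR.

```java
import java.util.List;

// Hypothetical sketch: this type is illustrative and not taken from the Elasticsearch code base.
record BalancingRoundSummary(long numAllocationDeciderForcedShardMoves, long numRebalancingShardMoves) {

    // Total shard movements counted in this round.
    long totalShardMoves() {
        return numAllocationDeciderForcedShardMoves + numRebalancingShardMoves;
    }

    // Fold several per-round summaries into one, e.g. to report over an interval.
    static BalancingRoundSummary combine(List<BalancingRoundSummary> rounds) {
        long forced = 0;
        long rebalancing = 0;
        for (BalancingRoundSummary round : rounds) {
            forced += round.numAllocationDeciderForcedShardMoves();
            rebalancing += round.numRebalancingShardMoves();
        }
        return new BalancingRoundSummary(forced, rebalancing);
    }
}
```

An immutable value per round would keep collection (during reconciliation) separate from reporting; whether the real summary ends up looking like this is still open.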

@DiannaHohensee DiannaHohensee added >non-issue :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) Team:Distributed Coordination Meta label for Distributed Coordination team labels Jan 10, 2025
@DiannaHohensee DiannaHohensee self-assigned this Jan 10, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@DiannaHohensee DiannaHohensee force-pushed the 2025/01/10/ES-10341-balancer-summaries branch from ebd4448 to 20ae391 Compare January 10, 2025 05:17
@nicktindall
Contributor

I'm interested in how we expose this to the outside world. We could probably publish metrics for individual measures, but I wonder if it's worth including a more structured summary in e.g. /_internal/desired_balance or similar.

Secondly, I'm still not sure I understand the desired balance allocator as well as I should.
In a single balancing round, are all movements considered to be because of the event that triggered it? e.g. a round triggered by event NodeShutdownAndRemoval, would it be possible for that round to include

  • numAllocationDeciderForcedShardMoves > 0
  • numRebalancingShardMoves > 0

or would all moves in that round contribute to numShutdownForcedShardMoves?

@DiannaHohensee
Contributor Author

> I'm interested in how we expose this to the outside world. We could probably publish metrics for individual measures, but I wonder if it's worth including a more structured summary in e.g. /_internal/desired_balance or similar.

Right now we don't poll APIs to create dashboards: we query log messages, or pull metrics via APM. So I think exposing the metrics via an API, a new one or enhancing /_internal/desired_balance, would be useful, but doesn't provide what we are set up to use right now.

> In a single balancing round, are all movements considered to be because of the event that triggered it? e.g. a round triggered by event NodeShutdownAndRemoval, would it be possible for that round to include
>
>   • numAllocationDeciderForcedShardMoves > 0
>   • numRebalancingShardMoves > 0
>
> or would all moves in that round contribute to numShutdownForcedShardMoves?

My thinking is that, in the case of a node shutdown, shards being moved away from the shutting-down node should be tracked in numAllocationDeciderForcedShardMoves, and any other shard movements (shards moving between nodes that are not shutting down) would be counted in numRebalancingShardMoves. I think it'd be interesting to see what other shuffling results from the necessary moves 🤔
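
As a rough sketch of that counting rule (purely illustrative: `ShardMoveCounter`, `recordMove`, and identifying nodes by string ID are assumptions, not existing Elasticsearch code), a move would be classified only by whether its source node is shutting down:

```java
import java.util.Set;

// Hypothetical sketch of the counting rule described above; not Elasticsearch code.
final class ShardMoveCounter {
    private final Set<String> shuttingDownNodeIds;
    private long numAllocationDeciderForcedShardMoves;
    private long numRebalancingShardMoves;

    ShardMoveCounter(Set<String> shuttingDownNodeIds) {
        this.shuttingDownNodeIds = Set.copyOf(shuttingDownNodeIds);
    }

    // Record one shard move. Only the source node decides the category in this sketch.
    void recordMove(String sourceNodeId) {
        if (shuttingDownNodeIds.contains(sourceNodeId)) {
            // The shard has to leave a shutting-down node: count it as a forced move.
            numAllocationDeciderForcedShardMoves++;
        } else {
            // Any other movement in the same round is counted as rebalancing.
            numRebalancingShardMoves++;
        }
    }

    long numAllocationDeciderForcedShardMoves() {
        return numAllocationDeciderForcedShardMoves;
    }

    long numRebalancingShardMoves() {
        return numRebalancingShardMoves;
    }
}
```

Under this rule a single round triggered by a node shutdown could report both counters as non-zero, which is the "other shuffling" I'd like to be able to see.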


I'm closing this PR and have replaced it with a new PR. After discussing with Pooya, we're going with an end-to-end patch with one simple metric implemented, to hopefully make things clearer.


Labels

:Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) >non-issue Team:Distributed Coordination Meta label for Distributed Coordination team v9.0.0

3 participants