Skip to content

Conversation

JeremyDahlgren
Copy link
Contributor

Add tracking of the number of compute() calls and total iterations between convergences in the DesiredBalanceComputer, along with the time since the last convergence. These are included in the log message when the computer doesn't converge.

Closes #100850.

@elasticsearchmachine elasticsearchmachine added needs:triage Requires assignment of a team area label v9.1.0 labels Apr 1, 2025
@JeremyDahlgren JeremyDahlgren added Team:Distributed Coordination Meta label for Distributed Coordination team :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) labels Apr 1, 2025
@elasticsearchmachine elasticsearchmachine removed the needs:triage Requires assignment of a team area label label Apr 1, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@JeremyDahlgren JeremyDahlgren added >enhancement needs:triage Requires assignment of a team area label labels Apr 1, 2025
@elasticsearchmachine elasticsearchmachine removed the needs:triage Requires assignment of a team area label label Apr 1, 2025
@elasticsearchmachine
Copy link
Collaborator

Hi @JeremyDahlgren, I've created a changelog YAML for you.

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good stuff, I left some comments

Comment on lines 360 to 361
"Desired balance computation for [{}] converged after [{}] and [{}] iterations, "
+ "[{}] compute() calls with [{}] total iterations since last convergence [{}] ago",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional nit, I'd use a multi-line string here, they're slightly easier to read and less hassle to modify in future:

Suggested change
"Desired balance computation for [{}] converged after [{}] and [{}] iterations, "
+ "[{}] compute() calls with [{}] total iterations since last convergence [{}] ago",
"""
Desired balance computation for [{}] converged after [{}] and [{}] iterations, \
[{}] compute() calls with [{}] total iterations since last convergence [{}] ago""",

Although I note that we use regular string concatenation in other places nearby already. Up to you.

Comment on lines 405 to 406
"Desired balance computation for [{}] is still not converged after [{}] and [{}] iterations, "
+ "[{}] compute() calls with [{}] total iterations since last convergence [{}] ago",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This message is user-visible and I think as worded here it might raise some support questions - what is compute() for instance? Could we instead rephrase it so that the numbers in still not converged after [{}] and [{}] iterations reflect the activity since the last convergence rather than the last interrupt? Then maybe say something like resumed computation [{}] times with [{}] iterations since the last resumption [{}] ago to expose the numbers about the current computation. And maybe only do that if numComputeCallsSinceLastConverged > 1 so the message can be simpler when possible?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored per your recommendation. I have an if/else here testing the number of computations, and also for the convergence message. There is some duplication going on in these two areas, but decided to leave as is. Please let me know what you think.

Add tracking of the number of compute() calls and total iterations
between convergences in the DesiredBalanceComputer, along with the
time since the last convergence.  These are included in the log
message when the computer doesn't converge.

Closes elastic#100850.
Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM great stuff

@JeremyDahlgren JeremyDahlgren merged commit 4c979aa into elastic:main Apr 4, 2025
17 checks passed
andreidan pushed a commit to andreidan/elasticsearch that referenced this pull request Apr 9, 2025
…iredBalanceComputer (elastic#126008)

Add tracking of the number of compute() calls and total iterations
between convergences in the DesiredBalanceComputer, along with the
time since the last convergence.  These are included in the log
message when the computer doesn't converge.

Closes elastic#100850.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) >enhancement Team:Distributed Coordination Meta label for Distributed Coordination team v9.1.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Desired-balance warn threshold logging should accumulate across restarts

3 participants