-
Notifications
You must be signed in to change notification settings - Fork 25.6k
[ML] Improve memory estimation methods accuracy in TrainedModelAssignmentRebalancer and related classes #133930
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ML] Improve memory estimation methods accuracy in TrainedModelAssignmentRebalancer and related classes #133930
Conversation
…mentRebalancer and related classes
|
Pinging @elastic/ml-core (Team:ML) |
|
Hi @valeriy42, I've created a changelog YAML for you. |
…eck' into bug/explain-assignment-memory-check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but one question
| int existingAllocationsOnNode = assignmentPlan.assignments(deployment) | ||
| .map( | ||
| assignments -> assignments.getOrDefault( | ||
| assignments.keySet().stream().filter(n -> n.id().equals(node.getId())).findFirst().orElse(null), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assignments is a Map, right?
So why not do assignment.getOrDefault(node, 0) instead of streaming/filtering the key set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assignments is a <Map<AssignmentPlan.Node, Integer>, while node is of type DiscoveryNode. That's why I need to compare both id's.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, thanks.
Just thinking out loud: shouldn't the return value of assignmentPlan.assignments be a Map<String, Integer> instead (the string being the node ID)? That sounds more useful. Is that a big refactoring?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AssignmentPlan.assignments(deployment) is used in 10 places in the main code and in 100 places in the test code. We can check if we can refactor it, but it should be in a different PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I agree with that. Then please add a comment here about this Node vs DiscoveryNode and that it could benefit from refactoring (to key string node ID) and it lgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I created #134030 so it won't get lost.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another refactor to consider is making the explainAssignment() function part of the AssignmentPlan class. The code here is trying to reverse engineer the planners decision making and it's easy to get out of sync.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
| weighedAllocationsScore += (1 + 0.1 * (m.currentAllocationsByNodeId().containsKey(n.id()) ? 1 : 0)) * modelAssignments | ||
| .get(n); | ||
| memoryScore -= (nodeAllocations.getValue() > 0 ? m.memoryBytes() : 0); | ||
| memoryScore -= (nodeAllocations.getValue() > 0 ? m.estimateMemoryUsageBytes(nodeAllocations.getValue()) : 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AssigmentPlan.Deployment::memoryBytes() is trappy as estimateMemoryUsageBytes() should always be used instead.
Because AssigmentPlan.Deployment is a record it will always have a public accessor for the memoryBytes field. The only way to stop people using it that I can think of is to override the accessor
@Override
public long memoryBytes() {
throw new UnsupportedOperationException("use estimateMemoryUsageBytes(int allocations) instead");
}
| int existingAllocationsOnNode = assignmentPlan.assignments(deployment) | ||
| .map( | ||
| assignments -> assignments.getOrDefault( | ||
| assignments.keySet().stream().filter(n -> n.id().equals(node.getId())).findFirst().orElse(null), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another refactor to consider is making the explainAssignment() function part of the AssignmentPlan class. The code here is trying to reverse engineer the planners decision making and it's easy to get out of sync.
BASE=647356e7d47d947e4deb37c402242dba009b5233 HEAD=05ab306852611b2a29c53d6646a8664fc7e93676 Branch=main
BASE=647356e7d47d947e4deb37c402242dba009b5233 HEAD=05ab306852611b2a29c53d6646a8664fc7e93676 Branch=main
BASE=647356e7d47d947e4deb37c402242dba009b5233 HEAD=05ab306852611b2a29c53d6646a8664fc7e93676 Branch=main
BASE=647356e7d47d947e4deb37c402242dba009b5233 HEAD=05ab306852611b2a29c53d6646a8664fc7e93676 Branch=main
BASE=647356e7d47d947e4deb37c402242dba009b5233 HEAD=05ab306852611b2a29c53d6646a8664fc7e93676 Branch=main
This PR improves the way the assignment explanation routine is created. Previously, the amount of insufficient memory available on the node was calculated incorrectly. It also replaces the usage of allocation-independent
memoryBytes()with allocation-dependentestimateMemoryUsageBytes()in several places.