[ML] Improve memory estimation methods accuracy in TrainedModelAssignmentRebalancer and related classes #133930

valeriy42 · 2025-09-01T12:13:50Z

This PR improves how the assignment explanation is generated. Previously, when a node had insufficient memory, the amount reported in the explanation was calculated incorrectly. The PR also replaces the allocation-independent memoryBytes() with the allocation-dependent estimateMemoryUsageBytes() in several places.

elasticsearchmachine · 2025-09-01T12:14:30Z

Pinging @elastic/ml-core (Team:ML)

elasticsearchmachine · 2025-09-01T12:14:53Z

Hi @valeriy42, I've created a changelog YAML for you.

jan-elastic

LGTM, but one question

jan-elastic · 2025-09-01T14:17:10Z

...n/java/org/elasticsearch/xpack/ml/inference/assignment/TrainedModelAssignmentRebalancer.java

+        int existingAllocationsOnNode = assignmentPlan.assignments(deployment)
+            .map(
+                assignments -> assignments.getOrDefault(
+                    assignments.keySet().stream().filter(n -> n.id().equals(node.getId())).findFirst().orElse(null),


assignments is a Map, right?

So why not do assignments.getOrDefault(node, 0) instead of streaming/filtering the key set?

assignments is a Map<AssignmentPlan.Node, Integer>, while node is of type DiscoveryNode. That's why I need to compare the IDs of both.

OK, thanks.

Just thinking out loud: shouldn't the return value of assignmentPlan.assignments be a Map<String, Integer> instead (the string being the node ID)? That sounds more useful. Is that a big refactoring?

AssignmentPlan.assignments(deployment) is used in 10 places in the main code and in 100 places in the test code. We can check if we can refactor it, but it should be in a different PR.

OK, I agree with that. Then please add a comment here about the Node vs DiscoveryNode mismatch, noting that it could benefit from refactoring (keying the map by the string node ID), and then it LGTM.

I created #134030 so it won't get lost.
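To illustrate the two lookup shapes discussed in this thread, here is a minimal sketch. PlanNode and ClusterNode are hypothetical stand-ins for AssignmentPlan.Node and DiscoveryNode, and the method names are invented for the example; the real classes carry more state and this is not the code from the PR.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for AssignmentPlan.Node (assumption, not the real class).
record PlanNode(String id, long availableMemoryBytes) {}

// Hypothetical stand-in for DiscoveryNode (assumption, not the real class).
record ClusterNode(String nodeId) {
    String getId() {
        return nodeId;
    }
}

final class NodeLookupSketch {

    // Current shape: the map is keyed by the planner's node type, so a lookup that
    // starts from a cluster-level node has to match entries on the node ID.
    static int allocationsByScan(Map<PlanNode, Integer> assignments, ClusterNode node) {
        return assignments.entrySet()
            .stream()
            .filter(e -> e.getKey().id().equals(node.getId()))
            .map(Map.Entry::getValue)
            .findFirst()
            .orElse(0);
    }

    // Suggested shape: keyed by the string node ID, so a plain map lookup is enough.
    static int allocationsById(Map<String, Integer> assignmentsByNodeId, ClusterNode node) {
        return assignmentsByNodeId.getOrDefault(node.getId(), 0);
    }

    public static void main(String[] args) {
        ClusterNode node = new ClusterNode("node-1");

        Map<PlanNode, Integer> byPlanNode = new LinkedHashMap<>();
        byPlanNode.put(new PlanNode("node-1", 8L << 30), 2);
        byPlanNode.put(new PlanNode("node-2", 4L << 30), 1);

        Map<String, Integer> byId = Map.of("node-1", 2, "node-2", 1);

        System.out.println(allocationsByScan(byPlanNode, node));
        System.out.println(allocationsById(byId, node));
    }
}
```

Both calls return the same count for the same node; the ID-keyed map just avoids the linear scan over the key set and the type mismatch that motivated it.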

jan-elastic

LGTM

davidkyle

LGTM

davidkyle · 2025-09-03T10:25:31Z

...l/src/main/java/org/elasticsearch/xpack/ml/inference/assignment/planning/AssignmentPlan.java

                    weighedAllocationsScore += (1 + 0.1 * (m.currentAllocationsByNodeId().containsKey(n.id()) ? 1 : 0)) * modelAssignments
                        .get(n);
-                    memoryScore -= (nodeAllocations.getValue() > 0 ? m.memoryBytes() : 0);
+                    memoryScore -= (nodeAllocations.getValue() > 0 ? m.estimateMemoryUsageBytes(nodeAllocations.getValue()) : 0);


AssignmentPlan.Deployment::memoryBytes() is trappy, as estimateMemoryUsageBytes() should always be used instead.

Because AssignmentPlan.Deployment is a record, it will always have a public accessor for the memoryBytes field. The only way I can think of to stop people from using it is to override the accessor:

    @Override
    public long memoryBytes() {
        throw new UnsupportedOperationException("use estimateMemoryUsageBytes(int allocations) instead");
    }
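A minimal sketch of that idea follows, assuming a trimmed-down record. DeploymentSketch, its perAllocationMemoryBytes component, and the fixed-plus-per-allocation formula are hypothetical and only serve the illustration; the real AssignmentPlan.Deployment has different components and its own estimation logic.

```java
// Hypothetical, trimmed-down stand-in for AssignmentPlan.Deployment (assumption).
record DeploymentSketch(String id, long memoryBytes, long perAllocationMemoryBytes) {

    // A record always exposes a public accessor per component, but the accessor can
    // be declared explicitly. Throwing here makes accidental use fail fast.
    @Override
    public long memoryBytes() {
        throw new UnsupportedOperationException("use estimateMemoryUsageBytes(int allocations) instead");
    }

    // Assumed formula, for illustration only: a fixed cost plus a per-allocation cost.
    // The private field is read directly, so the throwing accessor is not involved.
    long estimateMemoryUsageBytes(int allocations) {
        return memoryBytes + (long) allocations * perAllocationMemoryBytes;
    }
}

final class DeploymentSketchDemo {
    public static void main(String[] args) {
        DeploymentSketch deployment = new DeploymentSketch("model-a", 1000L, 200L);

        // Allocation-dependent estimate: 1000 + 3 * 200.
        System.out.println(deployment.estimateMemoryUsageBytes(3));

        try {
            deployment.memoryBytes();
        } catch (UnsupportedOperationException e) {
            System.out.println("accessor blocked: " + e.getMessage());
        }
    }
}
```

One trade-off to keep in mind: anything that reaches the component through its accessor, reflective callers included, would hit the exception too, so this only suits a record whose raw value really should never be read from outside.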

davidkyle · 2025-09-03T10:29:00Z

...n/java/org/elasticsearch/xpack/ml/inference/assignment/TrainedModelAssignmentRebalancer.java

+        int existingAllocationsOnNode = assignmentPlan.assignments(deployment)
+            .map(
+                assignments -> assignments.getOrDefault(
+                    assignments.keySet().stream().filter(n -> n.id().equals(node.getId())).findFirst().orElse(null),


Another refactor to consider is making the explainAssignment() function part of the AssignmentPlan class. The code here is trying to reverse engineer the planner's decision making, and it's easy for the two to get out of sync.

BASE=647356e7d47d947e4deb37c402242dba009b5233 HEAD=05ab306852611b2a29c53d6646a8664fc7e93676 Branch=main

[ML] Improve memory estimation methods accuracy in TrainedModelAssignmentRebalancer and related classes

7099062

valeriy42 added :ml Machine learning Team:ML Meta label for the ML team >bug labels Sep 1, 2025

elasticsearchmachine added the v9.2.0 label Sep 1, 2025

valeriy42 added v9.1.4 v9.0.7 v8.18.7 v8.19.4 labels Sep 1, 2025

Update docs/changelog/133930.yaml

53a3262

valeriy42 self-assigned this Sep 1, 2025

valeriy42 requested a review from davidkyle September 1, 2025 12:15

valeriy42 added 2 commits September 1, 2025 14:50

clean up

726bc98

Merge remote-tracking branch 'origin/bug/explain-assignment-memory-check' into bug/explain-assignment-memory-check

c2d6e10

jan-elastic requested changes Sep 1, 2025

valeriy42 requested a review from jan-elastic September 1, 2025 14:42

jan-elastic approved these changes Sep 2, 2025

valeriy42 added 2 commits September 2, 2025 16:12

add implementation node

b7686c3

Merge branch 'main' into bug/explain-assignment-memory-check

05ab306

davidkyle approved these changes Sep 3, 2025

valeriy42 merged commit 7d3903f into elastic:main Sep 3, 2025
33 checks passed

phananh1010 added a commit to phananh1010/elasticsearch that referenced this pull request Sep 11, 2025

Mirror upstream elastic#133930 as single snapshot commit for AI review

86339ec

phananh1010 added a commit to phananh1010/elasticsearch that referenced this pull request Sep 16, 2025

Mirror upstream elastic#133930 as single snapshot commit for AI review

ff2cde3

phananh1010 added a commit to phananh1010/elasticsearch that referenced this pull request Oct 8, 2025

Mirror upstream elastic#133930 as single snapshot commit for AI review

23fbfef

phananh1010 added a commit to phananh1010/elasticsearch that referenced this pull request Oct 16, 2025

Mirror upstream elastic#133930 as single snapshot commit for AI review

501728b

phananh1010 added a commit to phananh1010/elasticsearch that referenced this pull request Oct 24, 2025

Mirror upstream elastic#133930 as single snapshot commit for AI review

305718c
