@valeriy42
Contributor

@valeriy42 valeriy42 commented Sep 1, 2025

This PR improves how the assignment explanation is generated. Previously, when a node had insufficient memory, the amount of memory available on that node was calculated incorrectly. The PR also replaces the allocation-independent memoryBytes() with the allocation-dependent estimateMemoryUsageBytes() in several places.

@valeriy42 valeriy42 added :ml Machine learning Team:ML Meta label for the ML team >bug labels Sep 1, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@elasticsearchmachine
Collaborator

Hi @valeriy42, I've created a changelog YAML for you.

@valeriy42 valeriy42 self-assigned this Sep 1, 2025
@valeriy42 valeriy42 requested a review from davidkyle September 1, 2025 12:15
Contributor

@jan-elastic jan-elastic left a comment

LGTM, but one question

int existingAllocationsOnNode = assignmentPlan.assignments(deployment)
    .map(
        assignments -> assignments.getOrDefault(
            assignments.keySet().stream().filter(n -> n.id().equals(node.getId())).findFirst().orElse(null),
Contributor

assignments is a Map, right?

So why not do assignments.getOrDefault(node, 0) instead of streaming/filtering the key set?

Contributor Author

assignments is a Map<AssignmentPlan.Node, Integer>, while node is of type DiscoveryNode. That's why I need to compare the IDs of both.
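
To make the type mismatch concrete, here is a minimal, self-contained sketch. `PlanNode` and `ClusterNode` are hypothetical stand-ins for AssignmentPlan.Node and DiscoveryNode, not the real Elasticsearch classes; only the match-by-ID idea mirrors the snippet under review. This variant streams the map entries rather than resolving a key first, so no null key is ever passed to getOrDefault.

```java
import java.util.Map;

public class NodeLookupSketch {
    // Hypothetical stand-in for AssignmentPlan.Node (the planner's view of a node).
    public record PlanNode(String id, long availableMemoryBytes) {}

    // Hypothetical stand-in for DiscoveryNode (the cluster's view of a node).
    public record ClusterNode(String id) {
        public String getId() {
            return id;
        }
    }

    // Returns the number of allocations assigned to the given cluster node,
    // matching plan nodes by ID; returns 0 when no plan node has that ID.
    public static int allocationsOnNode(Map<PlanNode, Integer> assignments, ClusterNode node) {
        return assignments.entrySet()
            .stream()
            .filter(e -> e.getKey().id().equals(node.getId()))
            .map(Map.Entry::getValue)
            .findFirst()
            .orElse(0);
    }

    public static void main(String[] args) {
        Map<PlanNode, Integer> assignments = Map.of(new PlanNode("node-1", 1024L), 2);
        System.out.println(allocationsOnNode(assignments, new ClusterNode("node-1"))); // prints 2
        System.out.println(allocationsOnNode(assignments, new ClusterNode("node-2"))); // prints 0
    }
}
```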

Contributor

OK, thanks.

Just thinking out loud: shouldn't the return value of assignmentPlan.assignments be a Map<String, Integer> instead (the string being the node ID)? That sounds more useful. Is that a big refactoring?
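
A rough sketch of what a map keyed by node ID could look like, for illustration only. `PlanNode` and `byNodeId` are hypothetical names introduced here, not part of AssignmentPlan, and the sketch assumes node IDs are unique within one assignment map.

```java
import java.util.Map;
import java.util.stream.Collectors;

public class AssignmentsByIdSketch {
    // Hypothetical stand-in for AssignmentPlan.Node.
    public record PlanNode(String id, long availableMemoryBytes) {}

    // Re-keys the assignments by node ID so callers holding only an ID can
    // look up allocations directly. Collectors.toMap throws
    // IllegalStateException on duplicate keys, hence the uniqueness assumption.
    public static Map<String, Integer> byNodeId(Map<PlanNode, Integer> assignments) {
        return assignments.entrySet()
            .stream()
            .collect(Collectors.toMap(e -> e.getKey().id(), Map.Entry::getValue));
    }

    public static void main(String[] args) {
        Map<PlanNode, Integer> assignments = Map.of(new PlanNode("node-1", 1024L), 2);
        Map<String, Integer> byId = byNodeId(assignments);
        System.out.println(byId.getOrDefault("node-1", 0)); // prints 2
        System.out.println(byId.getOrDefault("node-2", 0)); // prints 0
    }
}
```

With such a map, the lookup in the reviewed snippet would reduce to a single getOrDefault call on the node ID.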

Contributor Author

AssignmentPlan.assignments(deployment) is used in 10 places in the main code and in 100 places in the test code. We can check if we can refactor it, but it should be in a different PR.

Contributor

OK, I agree with that. Then please add a comment here explaining the Node vs. DiscoveryNode mismatch and noting that it could benefit from refactoring (keying the map by the string node ID), and then it LGTM.

Contributor Author

I created #134030 so it won't get lost.

Member

Another refactor to consider is making the explainAssignment() function part of the AssignmentPlan class. The code here is trying to reverse-engineer the planner's decision-making, and it's easy for the two to get out of sync.

Contributor

@jan-elastic jan-elastic left a comment

LGTM

Member

@davidkyle davidkyle left a comment

LGTM

  weighedAllocationsScore += (1 + 0.1 * (m.currentAllocationsByNodeId().containsKey(n.id()) ? 1 : 0)) * modelAssignments
      .get(n);
- memoryScore -= (nodeAllocations.getValue() > 0 ? m.memoryBytes() : 0);
+ memoryScore -= (nodeAllocations.getValue() > 0 ? m.estimateMemoryUsageBytes(nodeAllocations.getValue()) : 0);
Member

AssignmentPlan.Deployment::memoryBytes() is trappy, as estimateMemoryUsageBytes() should always be used instead.

Because AssignmentPlan.Deployment is a record, it will always have a public accessor for the memoryBytes field. The only way I can think of to stop people from using it is to override the accessor:

        @Override
        public long memoryBytes() {
            throw new UnsupportedOperationException("use estimateMemoryUsageBytes(int allocations) instead");
        }
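
As a self-contained illustration of that idea, the sketch below uses a simplified, hypothetical `Deployment` record. The `perAllocationMemoryBytes` component and the estimate formula are assumptions made up for the example and are not the real AssignmentPlan.Deployment logic; the point is only that a record may declare its accessor explicitly and that code inside the record can still read the field directly.

```java
public class DeploymentSketch {
    // Hypothetical, simplified stand-in for AssignmentPlan.Deployment.
    public record Deployment(String id, long memoryBytes, long perAllocationMemoryBytes) {

        // Explicitly declared accessor: callers are steered to the
        // allocation-dependent estimate instead of the raw component.
        @Override
        public long memoryBytes() {
            throw new UnsupportedOperationException("use estimateMemoryUsageBytes(int allocations) instead");
        }

        // Assumed formula, for illustration only: a fixed base cost plus a
        // per-allocation cost. It reads the private field, not the accessor.
        public long estimateMemoryUsageBytes(int allocations) {
            return memoryBytes + perAllocationMemoryBytes * allocations;
        }
    }

    public static void main(String[] args) {
        Deployment d = new Deployment("model-a", 100L, 10L);
        System.out.println(d.estimateMemoryUsageBytes(3)); // prints 130
        try {
            d.memoryBytes();
        } catch (UnsupportedOperationException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```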

@valeriy42 valeriy42 merged commit 7d3903f into elastic:main Sep 3, 2025
33 checks passed
phananh1010 added a commit to phananh1010/elasticsearch that referenced this pull request Sep 11, 2025
BASE=647356e7d47d947e4deb37c402242dba009b5233
HEAD=05ab306852611b2a29c53d6646a8664fc7e93676
Branch=main