@davidkyle
Member

@davidkyle davidkyle commented May 30, 2025

When updating a model deployment (for example, changing the number of allocations), calculating the new deployment is an expensive operation, so it is done in a separate thread outside of the ClusterStateUpdateTask. However, if there is another cluster state update while the new deployment is being computed, submitting the cluster state fails because the version of the cluster state used to calculate the update is now lower than the version of the latest state.

The fix is quite easy: compute the model deployment update outside of the ClusterStateUpdateTask, then merge it with the latest state when executing the task. The code already has a check that the deployment update is compatible with the new state (areClusterStatesCompatibleForRebalance(...)), which makes the merge safe.
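
As a rough illustration of the pattern described above (not the actual Elasticsearch code), the sketch below computes only the new allocations outside the update task and merges them into whatever state is latest when the task runs. The `State`, `Master`, `compatible`, and `updateDeployment` names are hypothetical stand-ins invented for this example; `compatible` only loosely mirrors the role of areClusterStatesCompatibleForRebalance(...).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

public class MergeOnSubmitSketch {
    // Hypothetical stand-in for a cluster state: a version, the ML allocations,
    // and some unrelated part of the state that other tasks may change.
    record State(long version, Map<String, Integer> allocations, String unrelated) {}

    // Hypothetical stand-in for the master service: applies tasks one at a time
    // to the latest state and bumps the version on every applied task.
    static final class Master {
        private State state;

        Master(State initial) {
            this.state = initial;
        }

        synchronized State state() {
            return state;
        }

        synchronized State submit(UnaryOperator<State> task) {
            State next = task.apply(state);
            state = new State(state.version() + 1, next.allocations(), next.unrelated());
            return state;
        }
    }

    // The expensive part, done outside the update task. It returns only the
    // new allocations rather than a whole state tied to an old version.
    static Map<String, Integer> computeNewAllocations(State snapshot, String model, int count) {
        Map<String, Integer> updated = new HashMap<>(snapshot.allocations());
        updated.put(model, count);
        return Map.copyOf(updated);
    }

    // Cheap check run inside the task: the allocations we computed from must be unchanged.
    static boolean compatible(State source, State target) {
        return source.allocations().equals(target.allocations());
    }

    static State updateDeployment(Master master, String model, int count, Runnable concurrentChange) {
        State snapshot = master.state();
        Map<String, Integer> updated = computeNewAllocations(snapshot, model, count);
        concurrentChange.run(); // simulates another state update arriving while computing
        return master.submit(latest -> {
            if (!compatible(snapshot, latest)) {
                throw new IllegalStateException("allocations changed while computing the update");
            }
            // Merge into the latest state instead of submitting a stale one.
            return new State(latest.version(), updated, latest.unrelated());
        });
    }

    public static void main(String[] args) {
        Master master = new Master(new State(1, Map.of("model-a", 1), "x"));
        State result = updateDeployment(master, "model-a", 4,
            () -> master.submit(s -> new State(s.version(), s.allocations(), "y")));
        System.out.println(result);
    }
}
```

In this sketch the unrelated change made during the computation survives, and the update still applies because the allocations it was computed from are unchanged.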

Closes #121165

@davidkyle davidkyle added >test Issues or PRs that are addressing/adding tests :ml Machine learning auto-backport Automatically create backport pull requests when merged v8.19.0 v9.1.0 v9.0.3 v8.18.3 labels May 30, 2025
@elasticsearchmachine elasticsearchmachine added the Team:ML Meta label for the ML team label May 30, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)


private static final Logger logger = LogManager.getLogger(TrainedModelAssignmentClusterService.class);

private static final TransportVersion RENAME_ALLOCATION_TO_ASSIGNMENT_TRANSPORT_VERSION = TransportVersions.V_8_3_0;
Member Author

These version checks are redundant in 9.0 and 9.1. The 8.x backports will need to keep them however.

-        ActionListener<ClusterState> updatedStateListener = ActionListener.wrap(
-            updatedState -> submitUnbatchedTask("update model deployment", new ClusterStateUpdateTask() {
+        ActionListener<TrainedModelAssignmentMetadata.Builder> updatedAssignmentListener = ActionListener.wrap(
+            updatedAssignment -> submitUnbatchedTask("update model deployment", new ClusterStateUpdateTask() {
Member Author

This is the fix, here the new assignment state is passed rather than the updated cluster state.

Contributor

What if the TrainedModelAssignmentMetadata changed in the meantime?

That's a similar bug to the existing one, right? But much smaller, because the TrainedModelAssignmentMetadata changes less often than the ClusterState.

Should we protect against that? For example, only replace the TrainedModelAssignmentMetadata in the ClusterState if it's identical to the one we started the update computation with? And if it has changed, try this process again. More or less this paradigm:
https://github.com/elastic/elasticsearch/blob/main/test/framework/src/main/java/org/elasticsearch/common/util/MockBigArrays.java#L728-L742.

Maybe that's overkill and too complicated though. WDYT?

Contributor

@jan-elastic jan-elastic Jul 22, 2025

If we're deciding not to fix this, let's leave a comment about this small issue with this implementation and call it a day.

Member Author

Looking at the areClusterStatesCompatibleForRebalance function it also checks that the TrainedModelAssignmentMetadata has not changed.

&& TrainedModelAssignmentMetadata.fromState(source).equals(TrainedModelAssignmentMetadata.fromState(target));

By comparing the starting state with the latest state before applying the updated TrainedModelAssignmentMetadata (a lightweight operation that can be done in the ClusterStateUpdateTask), the code is effectively following the "compare and swap" paradigm linked above.
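
For reference, a generic compare-and-swap loop with retry looks roughly like the sketch below. It uses the standard `java.util.concurrent.atomic.AtomicReference` and is only an illustration of the paradigm, not the code in this PR: `updateWithRetry` is a hypothetical helper, and `compareAndSet` compares by reference, whereas the check quoted above uses `equals`.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.UnaryOperator;

public class CompareAndSwapSketch {
    // Computes an update from the value read at the start, then swaps it in only
    // if that value is still the current one. Otherwise the whole attempt is retried.
    static <T> T updateWithRetry(AtomicReference<T> ref, UnaryOperator<T> expensiveCompute) {
        while (true) {
            T start = ref.get();
            T updated = expensiveCompute.apply(start);
            // compareAndSet compares by reference, so "start" must be the exact
            // object returned by ref.get() above.
            if (ref.compareAndSet(start, updated)) {
                return updated;
            }
        }
    }

    public static void main(String[] args) {
        AtomicReference<Integer> allocations = new AtomicReference<>(1);
        int[] attempts = {0};
        Integer result = updateWithRetry(allocations, current -> {
            if (attempts[0]++ == 0) {
                allocations.set(2); // simulate a concurrent change during the first attempt
            }
            return current + 10;
        });
        System.out.println(result + " after " + attempts[0] + " attempts");
    }
}
```

The first attempt is discarded because the value changed underneath it; the second attempt starts from the changed value and succeeds.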

Contributor

OK, cool! I missed that part of the code. Guess this all works then as is

final boolean isResetMode = MlMetadata.getMlMetadata(event.state()).isResetMode();
TrainedModelAssignmentMetadata modelAssignmentMetadata = TrainedModelAssignmentMetadata.fromState(event.state());
final String currentNode = event.state().nodes().getLocalNodeId();
final boolean isNewAllocationSupported = event.state()
Member Author

Another version change that is irrelevant for 9

# Conflicts:
#	muted-tests.yml
#	x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/inference/assignment/TrainedModelAssignmentClusterService.java
@davidkyle davidkyle added auto-backport Automatically create backport pull requests when merged v9.1.1 and removed auto-backport Automatically create backport pull requests when merged labels Jul 22, 2025
Contributor

@jan-elastic jan-elastic left a comment

LGTM. Left one comment. Certainly a big improvement!

Contributor

@jan-elastic jan-elastic left a comment

LGTM

@davidkyle davidkyle merged commit 989f72b into elastic:main Jul 23, 2025
33 checks passed
elasticsearchmachine pushed a commit that referenced this pull request Jul 30, 2025
…ate (#128667) (#132159)

(cherry picked from commit 989f72b)

# Conflicts:
#	muted-tests.yml
#	x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/inference/assignment/TrainedModelAssignmentClusterService.java
elasticsearchmachine pushed a commit that referenced this pull request Aug 1, 2025

Labels

:ml Machine learning Team:ML Meta label for the ML team >test Issues or PRs that are addressing/adding tests v9.0.5 v9.1.1 v9.2.0

Development

Successfully merging this pull request may close these issues.

PyTorchModelIT testUpdateDeployment_GivenAllocationsAreIncreased failure

3 participants