[ML] Fix test failure updating model deployment with stale cluster state. #128667

davidkyle · 2025-05-30T11:16:36Z

When updating a model deployment (for example, changing the number of allocations), calculating the new deployment is an expensive operation, so it is done in a separate thread outside of the ClusterStateUpdateTask. However, if another cluster state update happens while the new deployment is being computed, submitting the cluster state fails because the version of the cluster state used to calculate the update is now lower than the version of the latest state.

The fix is straightforward: compute the model deployment update outside of the ClusterStateUpdateTask, then merge it with the latest state when executing the task. The code already checks that the deployment update is compatible with the new state (areClusterStatesCompatibleForRebalance(...)), which makes it safe to merge the update into the latest state.
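As a rough illustration of the approach described above, here is a minimal Java sketch. It does not use the Elasticsearch classes: `State`, `computeUpdate`, `compatible`, and `execute` are hypothetical stand-ins, and throwing `IllegalStateException` on an incompatible state is an assumption made only for this example. The point it tries to show is that the expensive step returns just the new assignments, and the cheap step applies them to whatever state is the latest when the task runs.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, simplified stand-ins; these are NOT the Elasticsearch classes.
public class MergeIntoLatestStateSketch {

    /** Toy cluster state: a version, the model assignment section, and one unrelated section. */
    record State(long version, Map<String, Integer> assignments, String unrelated) {}

    /** Expensive step, run outside the update task: compute only the new assignments. */
    static Map<String, Integer> computeUpdate(State start, String model, int allocations) {
        Map<String, Integer> updated = new TreeMap<>(start.assignments());
        updated.put(model, allocations);
        return updated;
    }

    /** Cheap check inside the task: the assignment section must be unchanged since `start`. */
    static boolean compatible(State start, State latest) {
        return start.assignments().equals(latest.assignments());
    }

    /** Cheap step inside the task: merge the precomputed assignments into the latest state. */
    static State execute(State start, State latest, Map<String, Integer> updated) {
        if (!compatible(start, latest)) {
            // Assumption for this sketch: reject the update rather than overwrite newer assignments.
            throw new IllegalStateException("assignments changed while computing the update");
        }
        return new State(latest.version() + 1, updated, latest.unrelated());
    }

    public static void main(String[] args) {
        State start = new State(10, Map.of("model-a", 1), "x");
        Map<String, Integer> updated = computeUpdate(start, "model-a", 4);

        // Meanwhile an unrelated update bumps the version but leaves the assignments alone.
        State latest = new State(11, start.assignments(), "y");

        // The update is applied on top of `latest`, so the unrelated change is preserved.
        System.out.println(execute(start, latest, updated));
    }
}
```

In this sketch, an intervening change to the unrelated section no longer causes a failure, because the result is built from `latest` rather than from the state the computation started with.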

Closes #121165

elasticsearchmachine · 2025-05-30T11:17:01Z

Pinging @elastic/ml-core (Team:ML)

davidkyle · 2025-05-30T11:17:54Z

...va/org/elasticsearch/xpack/ml/inference/assignment/TrainedModelAssignmentClusterService.java


    private static final Logger logger = LogManager.getLogger(TrainedModelAssignmentClusterService.class);

-    private static final TransportVersion RENAME_ALLOCATION_TO_ASSIGNMENT_TRANSPORT_VERSION = TransportVersions.V_8_3_0;


These version checks are redundant in 9.0 and 9.1. The 8.x backports will need to keep them, however.

davidkyle · 2025-05-30T11:19:00Z

...va/org/elasticsearch/xpack/ml/inference/assignment/TrainedModelAssignmentClusterService.java

-        ActionListener<ClusterState> updatedStateListener = ActionListener.wrap(
-            updatedState -> submitUnbatchedTask("update model deployment", new ClusterStateUpdateTask() {
+        ActionListener<TrainedModelAssignmentMetadata.Builder> updatedAssignmentListener = ActionListener.wrap(
+            updatedAssignment -> submitUnbatchedTask("update model deployment", new ClusterStateUpdateTask() {


This is the fix: the new assignment state is passed here rather than the updated cluster state.

What if the TrainedModelAssignmentMetadata changed in the meantime?

That's a similar bug to the existing one, right? But much smaller, because the TrainedModelAssignmentMetadata changes less often than the ClusterState.

Should we protect against that? For example, only replace the TrainedModelAssignmentMetadata in the ClusterState if it's identical to the one we started the update computation with? And if it has changed, try this process again. More or less this paradigm:
https://github.com/elastic/elasticsearch/blob/main/test/framework/src/main/java/org/elasticsearch/common/util/MockBigArrays.java#L728-L742.

Maybe that's overkill and too complicated though. WDYT?

If we're deciding not to fix this, let's leave a comment about this small issue with this implementation and call it a day.

The areClusterStatesCompatibleForRebalance function also checks that the TrainedModelAssignmentMetadata has not changed:

elasticsearch/x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/inference/assignment/TrainedModelAssignmentClusterService.java

Line 610 in ffd02c5

&& TrainedModelAssignmentMetadata.fromState(source).equals(TrainedModelAssignmentMetadata.fromState(target));

By comparing the starting state with the latest state before applying the updated TrainedModelAssignmentMetadata (a lightweight operation that can be done in the ClusterStateUpdateTask), the code is effectively performing the "compare and swap" paradigm linked above.
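For readers unfamiliar with the general "compare and swap" pattern mentioned here, the following is a minimal Java sketch using `java.util.concurrent.atomic.AtomicReference`. It is only an illustration of the pattern, including the retry step suggested earlier in this thread; it is not the code in this PR. The class name `CompareAndSwapSketch` and the helper `updateWithRetry` are hypothetical. Note that `compareAndSet` compares by reference identity, whereas the check quoted above uses `equals`.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.UnaryOperator;

// Hypothetical illustration of compare-and-swap with retry; not taken from the PR.
public class CompareAndSwapSketch {

    /**
     * Computes a new value from the one currently stored and installs it only if the
     * stored value is still the same object that the computation started from.
     * Retries on interference and returns the number of attempts made.
     */
    static <T> int updateWithRetry(AtomicReference<T> ref, UnaryOperator<T> expensive) {
        int attempts = 0;
        while (true) {
            attempts++;
            T seen = ref.get();             // value the computation is based on
            T next = expensive.apply(seen); // potentially slow computation
            // compareAndSet succeeds only if ref still holds `seen` (reference comparison).
            if (ref.compareAndSet(seen, next)) {
                return attempts;
            }
        }
    }

    public static void main(String[] args) {
        AtomicReference<String> assignment = new AtomicReference<>("allocations=1");
        boolean[] interfered = {false};

        int attempts = updateWithRetry(assignment, seen -> {
            // Simulate a concurrent change during the first computation only.
            if (!interfered[0]) {
                interfered[0] = true;
                assignment.set("allocations=2");
            }
            return seen + ",updated";
        });

        System.out.println(attempts + " attempt(s), final value: " + assignment.get());
    }
}
```

The first attempt is discarded because the stored value changed while it was being computed; the second attempt starts from the newer value and succeeds.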

OK, cool! I missed that part of the code. Guess this all works then as is

davidkyle · 2025-05-30T11:19:37Z

.../java/org/elasticsearch/xpack/ml/inference/assignment/TrainedModelAssignmentNodeService.java

        final boolean isResetMode = MlMetadata.getMlMetadata(event.state()).isResetMode();
        TrainedModelAssignmentMetadata modelAssignmentMetadata = TrainedModelAssignmentMetadata.fromState(event.state());
        final String currentNode = event.state().nodes().getLocalNodeId();
-        final boolean isNewAllocationSupported = event.state()


Another version change that is irrelevant for 9


jan-elastic

LGTM. Left one comment. Certainly a big improvement!

jan-elastic

LGTM

…ate (#128667) (#132159) (cherry picked from commit 989f72b) # Conflicts: # muted-tests.yml # x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/inference/assignment/TrainedModelAssignmentClusterService.java

…ate (#128667) (#132160) (cherry picked from commit 989f72b)

Use latest state

06bec76

davidkyle added the >test, :ml, auto-backport, v8.19.0, v9.1.0, v9.0.3, v8.18.3 labels May 30, 2025

elasticsearchmachine added the Team:ML label May 30, 2025

davidkyle removed v8.19.0 v8.18.3 labels May 30, 2025

davidkyle commented May 30, 2025


elasticsearchmachine added v9.0.4 and removed v9.0.3 labels Jun 19, 2025

elasticsearchmachine added v9.2.0 and removed v9.1.0 labels Jun 26, 2025

elasticsearchmachine added v9.0.5 and removed v9.0.4 labels Jul 10, 2025

Merge branch 'main' into fix-cs-state-update

ffd02c5

# Conflicts: # muted-tests.yml # x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/inference/assignment/TrainedModelAssignmentClusterService.java

davidkyle added the auto-backport, v9.1.1 labels and removed the auto-backport label Jul 22, 2025

jan-elastic requested changes Jul 22, 2025


jan-elastic approved these changes Jul 23, 2025


Merge branch 'main' into fix-cs-state-update

3080e0b

davidkyle added the backport pending label Jul 23, 2025

davidkyle merged commit 989f72b into elastic:main Jul 23, 2025
33 checks passed

davidkyle mentioned this pull request Jul 30, 2025

[ML] Fix test failure updating model deployment with stale cluster st… #132159

Merged

davidkyle mentioned this pull request Jul 30, 2025

[ML] Fix test failure updating model deployment with stale cluster st… #132160

Merged

davidkyle removed the backport pending label Jul 30, 2025

elasticsearchmachine pushed a commit that referenced this pull request Aug 1, 2025

[ML] Fix test failure updating model deployment with stale cluster st…

043ddc3

…ate (#128667) (#132160) (cherry picked from commit 989f72b)


3 participants

