Skip to content

Conversation

@valeriy42
Copy link
Contributor

Backports the following commits to 9.0:

… node is temoporarily unavailable (elastic#129391)

During cluster upgrade, the anomaly detection jobs must be reassigned from one ML node to another. During this reassignment, the jobs transition through several states, including "opening" and "opened". If, during this transition, the master node becomes temporarily unavailable, e.g., due to reassignment, the new job state is not successfully committed to the cluster state. Therefore, once the new master became available, the cluster state was inconsistent: some anomaly detection jobs were opened, but their state got stuck as "opening".

This PR introduces a retryable action for updating the job state to ensure that the job state is successfully updated and the cluster state remains consistent during the upgrade.

Fixes elastic#126148
@valeriy42 valeriy42 added :ml Machine learning >bug auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) backport cloud-deploy Publish cloud docker image for Cloud-First-Testing Team:ML Meta label for the ML team labels Jun 13, 2025
@valeriy42 valeriy42 removed the cloud-deploy Publish cloud docker image for Cloud-First-Testing label Jun 13, 2025
@elasticsearchmachine elasticsearchmachine merged commit d4692d2 into elastic:9.0 Jun 16, 2025
17 checks passed
@valeriy42 valeriy42 deleted the backport/9.0/pr-129391 branch June 16, 2025 09:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) backport >bug :ml Machine learning Team:ML Meta label for the ML team v9.0.3

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants