
Conversation

@jimczi (Contributor) commented Jan 28, 2025

This commit integrates `MinimalServiceSettings` (introduced in #120560) into the cluster state for all registered models in the `ModelRegistry`. These settings allow consumers to access configuration details without requiring asynchronous calls to retrieve full model configurations.

To ensure consistency, the cluster state metadata must remain synchronized with the models in the inference index. If a mismatch is detected during startup, the master node performs an upgrade to load all model settings from the index.
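
For readers less familiar with the registry internals, here is a minimal, self-contained sketch of the idea, using made-up stand-in types rather than the actual Elasticsearch classes: the cluster state carries a small per-model settings snapshot that consumers can read synchronously, instead of fetching the full configuration from the inference index asynchronously.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Simplified illustration only; the real MinimalServiceSettings and ModelRegistry
 * live in the inference plugin and have a richer API than this sketch.
 */
public final class MinimalSettingsSketch {

    /** Minimal per-model settings, mirroring the kind of data described above. */
    record MinimalServiceSettings(String service, String taskType, Integer dimensions) {}

    /** Stand-in for the cluster-state metadata, keyed by inference endpoint id. */
    private final Map<String, MinimalServiceSettings> metadata = new ConcurrentHashMap<>();

    void register(String inferenceId, MinimalServiceSettings settings) {
        metadata.put(inferenceId, settings);
    }

    /** Synchronous lookup: safe to call from code paths that must not block on I/O. */
    Optional<MinimalServiceSettings> getMinimalSettings(String inferenceId) {
        return Optional.ofNullable(metadata.get(inferenceId));
    }

    public static void main(String[] args) {
        MinimalSettingsSketch registry = new MinimalSettingsSketch();
        // The endpoint id and settings below are invented for the example.
        registry.register("my-elser-endpoint",
            new MinimalServiceSettings("elasticsearch", "sparse_embedding", null));
        // A consumer can now inspect the task type without an asynchronous round trip.
        registry.getMinimalSettings("my-elser-endpoint")
            .ifPresent(settings -> System.out.println(settings.taskType()));
    }
}
```
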
@elasticsearchmachine (Collaborator)

Hi @jimczi, I've created a changelog YAML for you.

@jonathan-buttner (Contributor)

Just thinking out loud while looking at the PR:

> These settings allow consumers to access configuration details without requiring asynchronous calls to retrieve full model configurations.

Do consumers have to access this information frequently?

Do you think there'd be any benefit to using a cache specific to these settings instead of the cluster state metadata? I guess we'd run into the same issue of needing it synchronized across all the nodes 🤔

The Elastic Inference Service makes an asynchronous authorization call when the node boots up to determine which default inference endpoints are enabled. Does that cause any issues with this solution? Basically it means we don't know immediately after the node boots up what all the default inference endpoints are.

Here's where that call happens:

https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/services/elastic/ElasticInferenceService.java#L186
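
To make the timing concern concrete, here is a rough sketch of the window being described, with invented names rather than the actual `ElasticInferenceService` code: until the asynchronous authorization call completes, the node simply does not know which default endpoints exist.

```java
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;

/** Illustrative model of the startup window only; not the real authorization code. */
public final class DefaultEndpointAuthSketch {

    private final AtomicReference<Set<String>> enabledDefaults = new AtomicReference<>(Set.of());

    /** Simulates the asynchronous authorization call issued when the node boots. */
    void onNodeStart(CompletableFuture<Set<String>> authorizationCall) {
        authorizationCall.thenAccept(enabledDefaults::set);
    }

    /** Returns false until the authorization response has arrived. */
    boolean isDefaultEndpointEnabled(String inferenceId) {
        return enabledDefaults.get().contains(inferenceId);
    }

    public static void main(String[] args) {
        DefaultEndpointAuthSketch sketch = new DefaultEndpointAuthSketch();
        CompletableFuture<Set<String>> auth = new CompletableFuture<>();
        sketch.onNodeStart(auth);
        // The endpoint id is made up; right after boot the answer is still "unknown".
        System.out.println(sketch.isDefaultEndpointEnabled(".some-default-endpoint")); // false
        auth.complete(Set.of(".some-default-endpoint"));
        System.out.println(sketch.isDefaultEndpointEnabled(".some-default-endpoint")); // true
    }
}
```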

@jimczi (Contributor, Author) commented Jan 29, 2025

> Do consumers have to access this information frequently?

For `semantic_text`, only when creating a new field, so very infrequently.

> The Elastic Inference Service makes an asynchronous authorization call when the node boots up to determine which default inference endpoints are enabled. Does that cause any issues with this solution? Basically it means we don't know immediately after the node boots up what all the default inference endpoints are.

I think that's fine. That would mean that creating a new `semantic_text` field that depends on these models could fail before the default endpoints are added.
On a separate note, do we handle the case where the default models change? With the current system, if the authorisation fails, the default models won't be loaded, but they might already be stored in the index, right?

@jonathan-buttner (Contributor)

> For `semantic_text`, only when creating a new field, so very infrequently.

If it's very infrequent, what's the benefit of moving it into the cluster state? Is it because the call to get the whole model is expensive?

> I think that's fine. That would mean that creating a new `semantic_text` field that depends on these models could fail before the default endpoints are added.

Yeah, probably not super likely, since for that to happen the `semantic_text` field would have to be created pretty much immediately after a node finishes booting 🤷‍♂️

> On a separate note, do we handle the case where the default models change? With the current system, if the authorisation fails, the default models won't be loaded, but they might already be stored in the index, right?

Oh, that's an interesting point I hadn't thought of. I think what you're saying is: what happens if we revoke access to a model after it was previously granted?

Let me ping the EIS team on how we should handle that.

@jimczi (Contributor, Author) commented Jan 29, 2025

> If it's very infrequent, what's the benefit of moving it into the cluster state? Is it because the call to get the whole model is expensive?

The call would happen on the master node when updating or creating a mapping, so we cannot block the thread to get the model from the index. Today we are lenient and fetch the model definition at a later stage, but now that we want to add options to set up the inner fields, we have to know the model early.
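
To illustrate the constraint, here is a small hypothetical sketch (the names are invented for the example, not the actual `semantic_text` mapper code): validation during mapping creation stays non-blocking because it only consults the minimal settings available in memory via the cluster state, rather than fetching the model document from the inference index.

```java
import java.util.Optional;

/** Illustrative sketch only; the real mapper and registry APIs differ. */
public final class SemanticFieldValidationSketch {

    /** Backed by cluster-state metadata in this sketch, so lookups return immediately. */
    interface MinimalSettingsLookup {
        Optional<String> taskType(String inferenceId);
    }

    /** Called while building the field mapping; must not block the master thread. */
    static void validateSemanticField(String inferenceId, MinimalSettingsLookup lookup) {
        String taskType = lookup.taskType(inferenceId)
            .orElseThrow(() -> new IllegalArgumentException(
                "unknown inference endpoint [" + inferenceId + "]"));
        if (!taskType.equals("sparse_embedding") && !taskType.equals("text_embedding")) {
            throw new IllegalArgumentException(
                "inference endpoint [" + inferenceId + "] has unsupported task type [" + taskType + "]");
        }
    }

    public static void main(String[] args) {
        // Pretend cluster-state metadata; the endpoint id is made up.
        MinimalSettingsLookup lookup = id -> Optional.of("text_embedding");
        validateSemanticField("my-embedding-endpoint", lookup);
        System.out.println("mapping validated without blocking on the inference index");
    }
}
```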

@jonathan-buttner (Contributor)

> The call would happen on the master node when updating or creating a mapping, so we cannot block the thread to get the model from the index. Today we are lenient and fetch the model definition at a later stage, but now that we want to add options to set up the inner fields, we have to know the model early.

Ah I see.

@jonathan-buttner (Contributor) left a review comment:

The approach makes sense to me. I left a couple questions for things that weren't immediately clear to me.

@davidkyle (Member) left a review comment:

Everything looks great if you can remove the `return_minimal_config` parameter from the REST API pls.

@davidkyle (Member) left a review comment:

LGTM

@jimczi jimczi merged commit 270ec53 into elastic:main Mar 18, 2025
17 checks passed
@jimczi jimczi deleted the model_registry_cluster_state branch March 18, 2025 10:12
jimczi added a commit to jimczi/elasticsearch that referenced this pull request Mar 19, 2025
When retrieving a default inference endpoint for the first time, the system automatically creates the endpoint.
However, unlike the `put inference model` action, the `get` action does not redirect the request to the master node.

Since elastic#121106, we rely on the assumption that every model creation (`put model`) must run on the master node, as it modifies the cluster state. However, this assumption led to a bug where the get action tries to store default inference endpoints from a different node.

This change resolves the issue by preventing default inference endpoints from being added to the cluster state. These endpoints are not strictly needed there, as they are already reported by inference services upon startup.

**Note:** This bug did not prevent the default endpoints from being used, but it caused repeated attempts to store them in the index, resulting in logging errors on every usage.
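
As a rough sketch of the shape of this fix, with stand-in types rather than the real registry code: default endpoints are filtered out before the metadata destined for the cluster state is built, so only user-created endpoints are persisted there.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Illustrative sketch only; the real ModelRegistry metadata is richer than a flat map. */
public final class ExcludeDefaultsSketch {

    record Endpoint(String inferenceId, boolean isDefault, String taskType) {}

    /** Only user-created endpoints end up in the cluster-state metadata. */
    static Map<String, String> toClusterStateMetadata(List<Endpoint> endpoints) {
        return endpoints.stream()
            .filter(endpoint -> !endpoint.isDefault())
            .collect(Collectors.toMap(Endpoint::inferenceId, Endpoint::taskType));
    }

    public static void main(String[] args) {
        // Both endpoint ids are invented for the example.
        List<Endpoint> endpoints = List.of(
            new Endpoint(".some-default-endpoint", true, "sparse_embedding"),   // default: skipped
            new Endpoint("my-openai-embeddings", false, "text_embedding"));     // user-created: kept
        System.out.println(toClusterStateMetadata(endpoints)); // {my-openai-embeddings=text_embedding}
    }
}
```
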
jimczi added a commit that referenced this pull request Mar 19, 2025

Exclude Default Inference Endpoints from Cluster State Storage (#125242)
jimczi added a commit to jimczi/elasticsearch that referenced this pull request Mar 19, 2025

Exclude Default Inference Endpoints from Cluster State Storage (elastic#125242)
elasticsearchmachine pushed a commit that referenced this pull request Mar 20, 2025

* Add ModelRegistryMetadata to Cluster State (#121106)

* fix test compil

* fix serialisation

* Exclude Default Inference Endpoints from Cluster State Storage (#125242)
smalyshev pushed a commit to smalyshev/elasticsearch that referenced this pull request Mar 21, 2025

Exclude Default Inference Endpoints from Cluster State Storage (elastic#125242)
omricohenn pushed a commit to omricohenn/elasticsearch that referenced this pull request Mar 28, 2025

Add ModelRegistryMetadata to Cluster State (#121106)
omricohenn pushed a commit to omricohenn/elasticsearch that referenced this pull request Mar 28, 2025
…ic#125242)

When retrieving a default inference endpoint for the first time, the system automatically creates the endpoint.
However, unlike the `put inference model` action, the `get` action does not redirect the request to the master node.

Since elastic#121106, we rely on the assumption that every model creation (`put model`) must run on the master node, as it modifies the cluster state. However, this assumption led to a bug where the get action tries to store default inference endpoints from a different node.

This change resolves the issue by preventing default inference endpoints from being added to the cluster state. These endpoints are not strictly needed there, as they are already reported by inference services upon startup.

**Note:** This bug did not prevent the default endpoints from being used, but it caused repeated attempts to store them in the index, resulting in logging errors on every usage.
@jimczi jimczi mentioned this pull request Apr 1, 2025