Skip to content

Conversation

@jonathan-buttner
Copy link
Contributor

This PR implements the ability to revoke authorization for default endpoints. When a node boots up, it makes a call the EIS gateway. If the gateway returns a list of model ids that does not include one of the default endpoint models, we will revoke authorization. To revoke authorization, we instruct the ModelRegistry to remove the default endpoints that were absent from the authorization response from its in memory cache. The ModelRegistry then performs a delete-by-query to remove the inference endpoint ids.

I also added a check in the deletion code path to prevent users from deleting default endpoint ids.

@jonathan-buttner jonathan-buttner added >non-issue :ml Machine learning Team:ML Meta label for the ML team auto-backport Automatically create backport pull requests when merged Feature:GenAI Features around GenAI v9.0.0 v8.18.0 v8.19.0 labels Jan 30, 2025
ClusterState state,
ActionListener<DeleteInferenceEndpointAction.Response> masterListener
) {
if (modelRegistry.containsDefaultConfigId(request.getInferenceEndpointId())) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change will prevent REST calls from deleting default inference endpoints.

}));
}

// TODO should we add a lock on the default model id so we can't attempt to delete it while we're adding it?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add the default model ids to the lock map while they're being persisted?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with leaving it as is? I'm thinking the worst case scenario is:

  • Thread 1 starts persist
  • Thread 2 starts delete
  • Thread 2 finishes delete
  • Thread 1 finishes persist

But the next call will just redo the delete? And we won't get stuck in some sort of a delete -> put loop? Maybe? The only risk that I see is that the storeDefaultEndpoint is potentially called as part of _xpack/usage, so it is running frequently in some environments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah now that I think about it more I'm not even sure that's possible at least at the moment. The only way we would add a new default endpoint that has a possibility of being deleted would be when a node boots up for EIS. A single node wouldn't try to add the EIS default endpoints and delete them.

Two separate nodes could run into a situation though: Where node A gets the authorization response and begins persisting the default endpoints. Node B boots up but the authorization request fails for some reason, it would then attempt to delete the default endpoints.

It looks like we're not aborting conflicts though so I'd expect one of those calls to succeed and then when another node restarts it would remove the default endpoint eventually.

DeleteByQueryRequest request = new DeleteByQueryRequest().setAbortOnVersionConflict(false);
request.indices(InferenceIndex.INDEX_PATTERN, InferenceSecretsIndex.INDEX_PATTERN);
request.setQuery(documentIdQuery(inferenceEntityId));
request.setQuery(documentIdsQuery(inferenceEntityIds));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting side not, delete by query will not fail if it can't find an id. Which is helpful here because for the revoking of authorization situation, we don't really care if the model exists already or not, just that it is removed if it did exist.

authorizationCompletedLatch.countDown();
});

getServiceComponents().threadPool()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is being called from the onResponse of an action listener. Which is probably already on a utility thread but just in case we'll kick off a new one.

* @param waitTime the max time to wait
* @throws IllegalStateException if the wait time is exceeded or the call receives an {@link InterruptedException}
*/
public void waitForAuthorizationToComplete(TimeValue waitTime) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making this public so I can access it in the integration tests that were created.

@jonathan-buttner
Copy link
Contributor Author

@elasticmachine merge upstream

@elasticmachine
Copy link
Collaborator

There are no new commits on the base branch.

@jonathan-buttner jonathan-buttner marked this pull request as ready for review January 31, 2025 15:08
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/ml-core (Team:ML)

Comment on lines 229 to 230
@SuppressWarnings("unchecked")
var listener = (ActionListener<List<Model>>) invocation.getArguments()[0];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's another API that does the implicit typecasting magic, if you prefer:

ActionListener<List<Model>> listener = invocation.getArgument(0);

}));
}

// TODO should we add a lock on the default model id so we can't attempt to delete it while we're adding it?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with leaving it as is? I'm thinking the worst case scenario is:

  • Thread 1 starts persist
  • Thread 2 starts delete
  • Thread 2 finishes delete
  • Thread 1 finishes persist

But the next call will just redo the delete? And we won't get stuck in some sort of a delete -> put loop? Maybe? The only risk that I see is that the storeDefaultEndpoint is potentially called as part of _xpack/usage, so it is running frequently in some environments.

@jonathan-buttner jonathan-buttner merged commit 671ecd0 into elastic:main Feb 6, 2025
17 checks passed
@jonathan-buttner jonathan-buttner deleted the ml-handle-auth-revoke branch February 6, 2025 13:28
jonathan-buttner added a commit to jonathan-buttner/elasticsearch that referenced this pull request Feb 6, 2025
…ic#121326)

* Starting revoke

* Adding integration tests

* More integration tests

* Adding test for deleting default inference endpoint via rest call

* Removing task type any

* Addressing feedback and adding test
@elasticsearchmachine
Copy link
Collaborator

💚 Backport successful

Status Branch Result
9.0
8.18
8.x

jonathan-buttner added a commit to jonathan-buttner/elasticsearch that referenced this pull request Feb 6, 2025
…ic#121326)

* Starting revoke

* Adding integration tests

* More integration tests

* Adding test for deleting default inference endpoint via rest call

* Removing task type any

* Addressing feedback and adding test
jonathan-buttner added a commit to jonathan-buttner/elasticsearch that referenced this pull request Feb 6, 2025
…ic#121326)

* Starting revoke

* Adding integration tests

* More integration tests

* Adding test for deleting default inference endpoint via rest call

* Removing task type any

* Addressing feedback and adding test
elasticsearchmachine pushed a commit that referenced this pull request Feb 6, 2025
…121326) (#121906)

* [ML] Support revoking inference default endpoint authorization (#121326)

* Starting revoke

* Adding integration tests

* More integration tests

* Adding test for deleting default inference endpoint via rest call

* Removing task type any

* Addressing feedback and adding test

* Fixing tests
elasticsearchmachine pushed a commit that referenced this pull request Feb 6, 2025
…#121326) (#121907)

* [ML] Support revoking inference default endpoint authorization (#121326)

* Starting revoke

* Adding integration tests

* More integration tests

* Adding test for deleting default inference endpoint via rest call

* Removing task type any

* Addressing feedback and adding test

* Fixing tests
elasticsearchmachine pushed a commit that referenced this pull request Feb 6, 2025
…121326) (#121908)

* [ML] Support revoking inference default endpoint authorization (#121326)

* Starting revoke

* Adding integration tests

* More integration tests

* Adding test for deleting default inference endpoint via rest call

* Removing task type any

* Addressing feedback and adding test

* Fixing tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-backport Automatically create backport pull requests when merged Feature:GenAI Features around GenAI :ml Machine learning >non-issue Team:ML Meta label for the ML team v8.18.0 v8.18.1 v8.19.0 v9.0.0 v9.0.1 v9.1.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants