Contributor

@jonathan-buttner jonathan-buttner commented Oct 16, 2025

This PR is based on #136569, which is already merged.

This PR moves the EIS authorization polling logic to a persistent task on a single node.

Notable changes:

  • The polling logic no longer runs on every node
  • A cluster state listener is registered that checks whether the task exists and creates it if it doesn't
  • If a node running the task shuts down, the persistent task framework handles moving the task to a new node
  • If the EIS URL is empty or null, the persistent task will not be created
  • If a cluster is no longer authorized to access certain preconfigured endpoints, the endpoints will remain instead of being removed
  • The polling logic compares the received authorized models with the preconfigured inference endpoints that are already stored in cluster state to determine whether any are new. Only new preconfigured inference endpoints are stored
  • The polling logic uses a new action to send the new inference endpoints to the master node to be stored. The master node must perform this step because it updates the cluster state
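
Two of the decisions in the list above can be sketched in plain Java: when the listener should create the task, and how the poller might work out which authorized endpoints are new. This is only an illustration under stated assumptions, not the code in this PR. The class name, both method names, and the use of plain string IDs are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class AuthorizationPollSketch {

    /**
     * Hypothetical helper: the task is created only when the EIS URL is
     * neither null nor empty and the task does not already exist.
     */
    public static boolean shouldCreateTask(String eisUrl, boolean taskExists) {
        return eisUrl != null && eisUrl.isEmpty() == false && taskExists == false;
    }

    /**
     * Hypothetical helper: returns the authorized endpoint IDs that are not
     * stored yet. Stored IDs that are no longer authorized are left untouched,
     * because nothing is ever removed here.
     */
    public static Set<String> findNewEndpointIds(Set<String> authorizedIds, Set<String> storedIds) {
        Set<String> newIds = new LinkedHashSet<>(authorizedIds);
        newIds.removeAll(storedIds);
        return newIds;
    }

    public static void main(String[] args) {
        System.out.println(shouldCreateTask("https://localhost:8443", false)); // prints "true"
        System.out.println(shouldCreateTask("", false)); // prints "false"

        // Made-up endpoint IDs, for illustration only.
        var authorized = Set.of("chat-endpoint", "sparse-endpoint");
        var stored = Set.of("sparse-endpoint", "revoked-endpoint");
        System.out.println(findNewEndpointIds(authorized, stored)); // prints "[chat-endpoint]"
    }
}
```

In this sketch only the difference would be sent to the master node, and "revoked-endpoint" stays in the stored set, which matches the behaviour described for endpoints that lose authorization.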

Testing

Start EIS

cd eis-gateway
make TLS_VERIFY_CLIENT_CERTS=false run

Start ES pointing at EIS

run-es -Dtests.es.xpack.inference.elastic.url=https://localhost:8443 -Dtests.es.xpack.inference.elastic.http.ssl.verification_mode=none -Dtests.es.xpack.inference.elastic.authorization_request_interval="5s" -Dtests.es.xpack.inference.elastic.max_authorization_request_jitter="1s"

Retrieving all the endpoints from the inference API should now return some EIS endpoints

GET _inference/_all

A task named eis-authorization-poller[c] should be present in the task list

GET _tasks

@jonathan-buttner jonathan-buttner added >enhancement :ml Machine learning Team:ML Meta label for the ML team v9.3.0 labels Oct 16, 2025
@elasticsearchmachine
Collaborator

Hi @jonathan-buttner, I've created a changelog YAML for you.

Contributor

@DaveCTurner DaveCTurner left a comment

Please don't use the master for admin tasks that don't actually need to run on the master. If you need a task to run approximately once in the cluster, use a persistent task instead.

@jonathan-buttner jonathan-buttner changed the title [ML] Transition EIS auth polling to master node [ML] Transition EIS auth polling to persistent task on a single node Oct 30, 2025
@jonathan-buttner jonathan-buttner marked this pull request as ready for review November 4, 2025 18:29
@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@jonathan-buttner jonathan-buttner dismissed DaveCTurner’s stale review November 4, 2025 18:43

I chatted with Dave offline and changed the implementation based on his feedback. Dave advised against forcing the polling logic to run on the master node and suggested running it within a persistent task instead, which I've done.

*/
public class AuthorizationTaskExecutorMultipleNodesIT extends ESIntegTestCase {

    private static final String AUTH_TASK_ACTION = AuthorizationPoller.TASK_NAME + "[c]";

var serviceToInferenceIds = new HashMap<String, Set<String>>();
for (var entry : modelMap.entrySet()) {
    var settings = entry.getValue();
    var serviceName = settings.service();
Member

settings.service() can return null, is that a problem? it looks like the map would not throw an error, so idk when this would happen or be a problem 🤷

Contributor Author

Yeah I'm not sure why we allow service to be null 🤔 I added a test for it. If it works correctly, it should just bucket the null ones together.

https://github.com/elastic/elasticsearch/pull/136713/files#diff-d4185ce634ada9ae507c764714a8049806cb0c4fdd576eedc467c46796256572R333-R360

Test case
    public void testGetServiceInferenceIds_AcceptsNullKeys() {
        var serviceA = "service_a";
        var endpointId1 = "endpointId1";
        var endpointId2 = "endpointId2";
        var nullEndpoint1 = "nullEndpoint1";
        var nullEndpoint2 = "nullEndpoint2";

        var settings1 = MinimalServiceSettings.chatCompletion(serviceA);
        var settings2 = MinimalServiceSettings.sparseEmbedding(serviceA);
        // I'm not sure why minimal service settings would have a null service name, but testing it anyway
        var nullServiceNameSettings1 = MinimalServiceSettings.sparseEmbedding(null);
        var nullServiceNameSettings2 = MinimalServiceSettings.sparseEmbedding(null);
        var models = Map.of(
            endpointId1,
            settings1,
            endpointId2,
            settings2,
            nullEndpoint1,
            nullServiceNameSettings1,
            nullEndpoint2,
            nullServiceNameSettings2
        );
        var metadata = new ModelRegistryMetadata(ImmutableOpenMap.builder(models).build());

        var serviceEndpoints = metadata.getServiceInferenceIds(serviceA);
        assertThat(serviceEndpoints, is(Set.of(endpointId1, endpointId2)));
        assertThat(metadata.getServiceInferenceIds(null), is(Set.of(nullEndpoint1, nullEndpoint2)));
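
The "bucket the null ones together" behaviour discussed above follows from how java.util.HashMap treats null keys. Here is a minimal standalone sketch of that grouping. The Endpoint record and the groupByService method are hypothetical stand-ins, not the ModelRegistryMetadata implementation.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class NullServiceBuckets {

    /** Hypothetical stand-in for an endpoint's settings; service may be null. */
    public record Endpoint(String inferenceId, String service) {}

    /** Groups inference IDs by service name, treating null as an ordinary key. */
    public static Map<String, Set<String>> groupByService(Iterable<Endpoint> endpoints) {
        // HashMap permits one null key, so every null service name lands in the same bucket.
        var serviceToInferenceIds = new HashMap<String, Set<String>>();
        for (var endpoint : endpoints) {
            serviceToInferenceIds.computeIfAbsent(endpoint.service(), key -> new HashSet<>())
                .add(endpoint.inferenceId());
        }
        return serviceToInferenceIds;
    }

    public static void main(String[] args) {
        var grouped = groupByService(
            List.of(
                new Endpoint("endpointId1", "service_a"),
                new Endpoint("nullEndpoint1", null),
                new Endpoint("nullEndpoint2", null)
            )
        );
        // Both null-service endpoints are retrievable under the null key.
        System.out.println(grouped.get(null).size()); // prints "2"
    }
}
```

This only holds for map types that accept null keys. ConcurrentHashMap and Map.of throw a NullPointerException for a null key, so the choice of map matters if the grouping code is ever changed.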

Comment on lines 145 to 147
if (lastAuthTask.get() != null) {
    lastAuthTask.get().cancel();
}
Member

i don't think it matters since scheduleAndSendAuthorizationRequest checks the shutdown status, but in theory one thread could set a different ScheduledCancellable in between 145 and 146

Suggested change
- if (lastAuthTask.get() != null) {
-     lastAuthTask.get().cancel();
- }
+ var authTask = lastAuthTask.get();
+ if (authTask != null) {
+     authTask.cancel();
+ }
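
The suggestion above can be tried in isolation. The sketch below assumes lastAuthTask is an AtomicReference, which the get() calls suggest but the thread does not state. The Cancellable interface is a hypothetical stand-in for the scheduler's ScheduledCancellable. Reading the reference once into a local means the null check and cancel() always act on the same object, even if another thread swaps the reference in between.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

public class CancelLastTask {

    /** Hypothetical stand-in for the scheduler's cancellable handle. */
    public interface Cancellable {
        void cancel();
    }

    /** Reads the reference once, then checks and cancels that same snapshot. */
    public static void cancelLast(AtomicReference<Cancellable> lastAuthTask) {
        var authTask = lastAuthTask.get();
        if (authTask != null) {
            authTask.cancel();
        }
    }

    public static void main(String[] args) {
        var cancelled = new AtomicBoolean(false);
        var lastAuthTask = new AtomicReference<Cancellable>(() -> cancelled.set(true));

        cancelLast(lastAuthTask);
        System.out.println(cancelled.get()); // prints "true"

        // An empty reference is a no-op rather than a NullPointerException.
        cancelLast(new AtomicReference<>());
    }
}
```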

@jonathan-buttner jonathan-buttner enabled auto-merge (squash) November 7, 2025 16:31
@jonathan-buttner jonathan-buttner merged commit 26d49b9 into elastic:main Nov 10, 2025
35 checks passed
@jonathan-buttner jonathan-buttner deleted the ml-eis-auth-polling branch November 10, 2025 13:15
Kubik42 pushed a commit to Kubik42/elasticsearch that referenced this pull request Nov 10, 2025
…lastic#136713)

* Creating new cluster state listener to kick off polling logic

* Update docs/changelog/136713.yaml

* [CI] Auto commit changes from spotless

* Starting persistent tasks

* Switching to a persistent task, need to create the action though

* Adding master action

* Successful task creation

* Starting tests

* More tests

* Even more tests

* [CI] Auto commit changes from spotless

* Starting integration tests

* Adding test stub

* [CI] Auto commit changes from spotless

* Adding integration test

* Fixing relocation test

* [CI] Auto commit changes from spotless

* working test

* Some clean up

* Removing unneeded tests

* [CI] Auto commit changes from spotless

* Refactoring tests

* updating transport version

* [CI] Auto commit changes from spotless

* Fixing transport version

* Fixing check for preconfigured endpoints

* [CI] Auto commit changes from spotless

* Fixing tests

* Fixing text embedding test

* Addressing feedback

* Marking task as failed

* Fixing flaky test

---------

Co-authored-by: elasticsearchmachine <[email protected]>

Labels

cloud-deploy Publish cloud docker image for Cloud-First-Testing >enhancement :ml Machine learning Team:ML Meta label for the ML team v9.3.0

4 participants