[Inference API] Remove worst-case additional 50ms latency for non-rate limited requests #136167
Conversation
Pinging @elastic/ml-core (Team:ML)
Hi @timgrein, I've created a changelog YAML for you.
@timgrein Great work! Would you be able to similarly generate a graphic against a local EIS directly, so that we understand whether any overhead remains? I understand the logic behind the PR, but I'm curious what overhead still exists.
Great work! Left a few suggestions.
public void shutdown() {
    if (shutdown.compareAndSet(false, true)) {
        if (requestQueueTask != null) {
            boolean cancelled = FutureUtils.cancel(requestQueueTask);
If I remember correctly, I think it's up to our implementation to check if it is canceled, so I think we'll get stuck in the queue.take() 🤔 It doesn't seem like FutureUtils.cancel() will do an interrupt.

This is how we've handled that in the past: https://github.com/elastic/elasticsearch/blob/8.13/x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/external/http/sender/RequestExecutorService.java#L253. The shutdown() method there puts a noop task on the queue to ensure that it wakes up.
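For illustration, a minimal sketch of that wake-up pattern, reusing the NoopTask constant introduced elsewhere in this PR (simplified, not the exact code):

```java
public void shutdown() {
    if (shutdown.compareAndSet(false, true)) {
        // Offer a no-op task so a thread blocked in requestQueue.take() wakes up,
        // re-checks isShutdown() and can exit its loop instead of blocking forever.
        requestQueue.offer(NoopTask);
    }
}
```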
Can you double check the tests and make sure we're covering this case (we call shutdown and then await termination)?
Adjusted with Add NoopTask to wake up queue on shutdown.

> Can you double check the tests and make sure we're covering this case (we call shutdown and then await termination)?

AFAIU we always check that when calling submitShutdownRequest, right?
logger.debug("Inference request queue interrupted, exiting"); | ||
} catch (Exception e) { | ||
logger.warn("Error processing task in inference request queue", e); | ||
cleanup(); |
How about we move this to a finally block to ensure it gets called.
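Roughly, a sketch of what that restructuring could look like (simplified, using the existing messages from this class):

```java
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
    logger.debug("Inference request queue interrupted, exiting");
} catch (Exception e) {
    logger.warn("Error processing task in inference request queue", e);
} finally {
    // Runs on normal exit and on any exception, so cleanup() is never skipped.
    cleanup();
}
```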
Moved with Cleanup in finally block
    startHandlingRateLimitedTasks();
} catch (Exception e) {
    logger.warn("Failed to start request executor", e);
    cleanup();
I think there's a small potential for an edge case here (if we go the noop task route to do a shutdown for the queue.take()). If an exception occurs in startHandlingRateLimitedTasks(), it could cause the requestQueue to be drained, which could mean that it'd never get the noop task.

I'd have to think of a good way to solve that. Maybe we split up the cleanup methods so that this one doesn't drain the requestQueue; instead the processRequestQueue() would call a different cleanup() that'd handle doing that 🤔
try {
    while (isShutdown() == false) {
        // Blocks the request queue thread until a new request comes in
        var task = (RequestTask) requestQueue.take();
Do we need the cast?
I think it would be better to replace uses of RequestTask with RejectableTask in this class, since there's no reason that the interface can't be used throughout.
var task = (RequestTask) requestQueue.take();

if (isShutdown()) {
    logger.debug("Shutdown requested while handling request tasks, cleaning up");
If we're shutting down we need to reject the task we pulled off the requestQueue. Here's an example of doing that: https://github.com/elastic/elasticsearch/blob/8.13/x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/external/http/sender/RequestExecutorService.java#L192
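A rough sketch of that, reusing the rejectRequest helper from this PR (control flow simplified):

```java
var task = requestQueue.take();
if (isShutdown()) {
    // Don't silently drop a task we've already removed from the queue; tell its
    // listener that the executor is shutting down before exiting the loop.
    rejectRequest(task);
    break;
}
```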
Adjusted with Reject request on shutdown
if (isShutdown()) {
    logger.debug("Shutdown requested while handling request tasks, cleaning up");
    cleanup();
I think if we move this to a finally block we probably don't need it here and we could probably remove the return.
});

endpoint.enqueue(task);
if (taskAccepted == false) {
If the task was accepted, we'll want to check isShutdown() one last time to ensure we notify this task that we're shutting down.
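i.e. something along these lines (a sketch; helper names as used elsewhere in this PR):

```java
boolean taskAccepted = requestQueue.offer(task);
if (taskAccepted && isShutdown()) {
    // The offer succeeded, but shutdown may have started right afterwards and the
    // draining thread may already be gone, so notify this task of the shutdown too.
    rejectRequest(task);
}
```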
    endpoint.enqueue(task);
}

private boolean rateLimitingEnabled(RequestManager requestManager) {
For consistency how about we make this static and accept a RateLimitSettings object? Then we can use it in executeEnqueuedTaskInternal, which has a similar check. Technically executeEnqueuedTaskInternal should never receive a non-rate limited task, but it'd probably be good to check just in case.
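i.e. roughly the following, where the exact RateLimitSettings accessor (isEnabled() here) is an assumption on my side:

```java
private static boolean rateLimitingEnabled(RateLimitSettings rateLimitSettings) {
    // Shared by execute() and executeEnqueuedTaskInternal(); treats absent settings
    // as "rate limiting disabled". Accessor name is illustrative.
    return rateLimitSettings != null && rateLimitSettings.isEnabled();
}
```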
Good point! Adjusted with Reuse rateLimitSettingsEnabled check
Do you mean on my local machine with "local EIS"? If so, ES and EIS ran on my local machine for the results you see above.
    // Reject non-rate-limited requests
    List<RejectableTask> requests = new ArrayList<>();
    requestQueue.drainTo(requests);

    for (var request : requests) {
        rejectRequest(request);
    }
}

private void rejectRequest(RejectableTask task) {
    try {
        task.onRejection(
            new EsRejectedExecutionException(
                format(
                    "Failed to send request for inference id [%s] has shutdown prior to executing request",
                    task.getRequestManager().inferenceEntityId()
                ),
                true
            )
        );
    } catch (Exception e) {
        logger.warn(
            format(
                "Failed to notify request for inference id [%s] of rejection after executor service shutdown",
                task.getRequestManager().inferenceEntityId()
            )
        );
    }
This code is duplicated in RateLimitingEndpointHandler, with the only difference being the format of the error message. Would it be possible to extract it to a static method that both places could call? Rather than needing to use the id field directly for the service grouping, it can be derived from the task: Integer.toString(task.getRequestManager().rateLimitGrouping().hashCode())

There are also quite a few other places where we reject tasks due to shutdown that could be extracted to a method call to reduce code duplication and ensure consistency for the error messages.
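For example, something like this shared helper (a sketch; the name and exact log message are just illustrative):

```java
private static void rejectTaskAfterShutdown(RejectableTask task, String rejectionMessage) {
    try {
        task.onRejection(new EsRejectedExecutionException(rejectionMessage, true));
    } catch (Exception e) {
        // The listener itself failed; log and move on so the draining loop keeps going.
        logger.warn(format("Failed to notify task of rejection after shutdown: %s", rejectionMessage), e);
    }
}
```

Each call site would then only differ in the message it builds (inference entity id vs. rate limit grouping hash).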
I think the id (inferenceEntityId being the name of the inference endpoint AFAIU) and the service grouping are two different things, so IMO we should keep id in the error messages so it's clear which endpoint failed to execute a request. I think the hashCode of the rateLimitGrouping could become quite cryptic for a clear log/error message.

But nothing speaks against extracting the common part to a static method 👍
Sorry for the confusion, the id field in RateLimitingEndpointHandler is the service grouping hash code. The field is not well named in that respect.
thrownException.getMessage(),
is(
    Strings.format(
        "Failed to send request, request service [3355] for inference id [id] has shutdown prior to executing request",
This should be service [%s] for inference
I think it actually is correct error message formatting: AFAIU hashCode is simply executed on the inferenceEntityId, leading to 3355 in this case.
What I mean is that you're using String.format() to construct the error message and passing requestManager.rateLimitGrouping().hashCode() as the second argument to it, but the string in the first argument doesn't have any placeholders for that value to be used. Right now the test is passing because the rate limit grouping is a constant from run to run, but if it ever changes, then the hard-coded hash code of 3355 will no longer be correct and the test will fail.
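i.e. the assertion would end up roughly like this, so the expected message tracks whatever the grouping actually hashes to:

```java
assertThat(
    thrownException.getMessage(),
    is(
        Strings.format(
            "Failed to send request, request service [%s] for inference id [id] has shutdown prior to executing request",
            requestManager.rateLimitGrouping().hashCode()
        )
    )
);
```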
Ah got it now, adjusted with Use string placeholder in assertion
service.submitTaskToRateLimitedExecutionPath(
    new RequestTask(requestManager, new EmbeddingsInput(List.of(), InputTypeTests.randomWithNull()), null, threadPool, listener)
);
Is this change necessary? The test still passes if execute() is called instead of submitTaskToRateLimitedExecutionPath().

Also, while I know you didn't add it in this PR, this test's name is at odds with what it's actually asserting. The name implies that we don't expect an exception when polling the queue to cause the service to terminate, but the test explicitly terminates the service as part of throwing the exception, then asserts that it's terminated. Is this test actually testing anything other than that calling shutdown() causes the service to shut down?
Adjusted the test to use execute again.

> Is this test actually testing anything other than that calling shutdown() causes the service to shut down?

Good point, I've adjusted the test to assert that the service still runs after task.execute(...) threw an exception. I've used task.execute(...) instead of queue.poll(), because poll() is also used internally in AdjustableCapacityBlockingQueue when calling requestQueue.take() in processRequestQueue, which did in fact terminate the service and made the test green for the wrong reason. Commit: Adjust test to check that a throwing task does not terminate the service. (I'll add a similar test for the non-rate-limited execution path.)

Speaking of that: executeTaskImmediately inside processRequestQueue handles any exceptions thrown by the processed request task without terminating the service. AFAIU take() shouldn't throw except when its calling thread is interrupted, which we need to handle explicitly anyway. I've adjusted the error message in the general exception handler with Adjust error message in general exception handler to reflect that an exception there is not coming from a task or a task rejection, but is potentially a more fundamental issue leading to service termination. Just wanted to double-check that my reasoning is correct here and that we want this behavior?
    endpointHandler.init();
    return endpointHandler;
});
boolean taskAccepted = requestQueue.offer(task);
If I understand correctly, prior to this change, when execute() was called we would retrieve (or create) a RateLimitingEndpointHandler appropriate to the task's rate limit settings, then add the task to the queue associated with that handler. This meant that there was one queue per RateLimitingEndpointHandler, each with a capacity defined by the RequestExecutorServiceSettings, and that we wouldn't reject a task unless the queue for the handler associated with it was full.

With the new change, all tasks are first submitted to the single requestQueue, which has the same capacity as each of the queues managed by the handlers, meaning that we can begin rejecting tasks even though none of the queues associated with the handlers are full, effectively reducing the total number of requests we can process in a given time.

Would it be better to instead only add tasks with no rate limit settings to the request queue, and call submitTaskToRateLimitedExecutionPath() in the execute() method for tasks with rate limit settings? That way, we're not adding rate limited tasks to one queue just to later remove them and add them to another queue.
That's a very valid point @DonalEvans! I'll adjust the implementation, thanks for flagging
rejectRequest(
    task,
    format("Failed to send request for inference id [%s] has shutdown prior to executing request", inferenceEntityId),
This message might be better as "Failed to send request for inference id [%s] because the request executor service has been shutdown" to make it consistent with the error we report in execute() if we try to queue a task when we're shut down.
task,
format(
    "Failed to send request, request service [%s] for inference id [%s] has shutdown prior to executing request",
    id,
Not strictly related to this PR, but since you're making changes in this class, could you rename the id field on RateLimitingEndpointHandler to be something more descriptive, like rateLimitGroupingId?
        requestManager.rateLimitGrouping().hashCode()
    )
)
is("Failed to execute task for inference id [id] because the request service [3355] queue is full")
This should be using Strings.format() with [3355] replaced with [%s] and the other argument being requestManager.rateLimitGrouping().hashCode().
@SuppressWarnings("unchecked") | ||
var stubbing = when(mockExecutorService.submit(any(Runnable.class))).thenReturn(mock(Future.class)); |
This suppression and unused variable can be avoided using the thenAnswer() method, which I think is a little cleaner:

when(mockExecutorService.submit(any(Runnable.class))).thenAnswer(i -> mock(Future.class));
}

if (rateLimitingEnabled(requestManager.rateLimitSettings())) {
if (isEmbeddingsIngestInput(inferenceInputs)) {
The rateLimitingEnabled check should be retained:
if (isEmbeddingsIngestInput(inferenceInputs) || rateLimitingEnabled(requestManager.rateLimitSettings())) {
    }
}

private static final RejectableTask NoopTask = new RejectableTask() {
nit: When I first saw the comparison I thought this was a class name. How about we make this all caps or noopTask.
With this PR I've added a new request queue to the inference API, which immediately executes requests that are not ingest embeddings requests (e.g. embeddings generation on the search path, rerank, etc.). Otherwise the requests are simply submitted to the rate-limited execution path as before. This request queue is processed in a separate thread, which blocks until a new request becomes available, avoiding explicit polling. This removes an additional worst-case latency of xpack.inference.http.request_executor.task_poll_frequency (default: 50ms), which we observed when generating sparse text embeddings using EIS.

I've verified the improved latency by running ES (with and without the changes introduced by this PR) and EIS locally and executing requests using vegeta.

Without optimization:

With optimization:
