ES|QL Inference runner refactoring #131986

afoucret · 2025-07-28T06:58:00Z

This PR implements a batch of changes to the ES|QL inference runner that are required for the TEXT_EMBEDDING function implementation.

Changes:

Full rewrite of the task runner to get rid of a race condition
Improved inference resolution:
- More flexible implementation, making it easier to resolve inference IDs from plans and functions
- Reduced footprint in the PreAnalyzer
InferenceOperator simplification:
- Removed the extra step in addInput
- Simplified output building
- Better test coverage (using randomized input/output columns)

Related Issue:

ES|QL: Add TEXT_EMBEDDING function #131022

elasticsearchmachine · 2025-07-28T07:00:51Z

Pinging @elastic/es-analytical-engine (Team:Analytics)

afoucret · 2025-07-28T07:02:33Z

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/analysis/Analyzer.java

-            new ResolveInference(),
            new ResolveLookupTables(),
            new ResolveFunctions(),
+            new ResolveInference(),


ℹ️ Inference resolution is moved after function resolution, so we can resolve inference IDs used in functions like TEXT_EMBEDDING
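To illustrate why the ordering matters, here is a minimal, self-contained sketch. It assumes a single pass over the rules and uses hypothetical stand-in types (`UnresolvedFunction`, `InferenceFunction`) and a made-up inference ID. It is not the actual Analyzer code. The point it shows is that an inference ID carried by a function can only be resolved once the function itself has been resolved.

```java
import java.util.Set;

// Hypothetical stand-ins for analyzer nodes; not the real ES|QL classes.
final class RuleOrderSketch {

    interface Expr {}

    // A function call as written in the query, before function resolution.
    record UnresolvedFunction(String name, String inferenceId) implements Expr {}

    // A resolved function that carries an inference ID (for example TEXT_EMBEDDING).
    record InferenceFunction(String name, String inferenceId, boolean inferenceResolved) implements Expr {}

    // Assumed set of known inference endpoints, for illustration only.
    static final Set<String> KNOWN_INFERENCE_IDS = Set.of("my-embedding-endpoint");

    // Turns an unresolved TEXT_EMBEDDING call into an InferenceFunction.
    static Expr resolveFunctions(Expr e) {
        if (e instanceof UnresolvedFunction f && f.name().equalsIgnoreCase("text_embedding")) {
            return new InferenceFunction(f.name(), f.inferenceId(), false);
        }
        return e;
    }

    // Only sees inference IDs on functions that are already resolved.
    static Expr resolveInference(Expr e) {
        if (e instanceof InferenceFunction f && KNOWN_INFERENCE_IDS.contains(f.inferenceId())) {
            return new InferenceFunction(f.name(), f.inferenceId(), true);
        }
        return e;
    }

    // Previous order: inference resolution runs before the function exists.
    static Expr inferenceThenFunctions(Expr e) {
        return resolveFunctions(resolveInference(e));
    }

    // New order: the function is resolved first, then its inference ID.
    static Expr functionsThenInference(Expr e) {
        return resolveInference(resolveFunctions(e));
    }

    public static void main(String[] args) {
        Expr call = new UnresolvedFunction("TEXT_EMBEDDING", "my-embedding-endpoint");
        System.out.println(inferenceThenFunctions(call));
        System.out.println(functionsThenInference(call));
    }
}
```

In this sketch only `functionsThenInference` ends with `inferenceResolved=true`; with the old order the inference rule runs while the node is still an `UnresolvedFunction` and therefore skips it.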

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/analysis/PreAnalyzer.java

afoucret · 2025-07-28T07:08:46Z

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/inference/InferenceResolver.java

+/**
+ * Collects and resolves inference deployments inference IDs from ES|QL logical plans.
+ */
+public class InferenceResolver {


ℹ️ Now a separate class as the logic of inference resolution is becoming much more complex.

afoucret · 2025-07-28T07:11:22Z

...ugin/esql/src/main/java/org/elasticsearch/xpack/esql/inference/bulk/BulkInferenceRunner.java

+ * and other non-thread-safe components.
+ * </p>
+ */
+public class BulkInferenceRunner {


ℹ️ This is a full rewrite of the InferenceRunner that was causing race conditions.

ioanatia

There is a race condition mentioned in the PR description.
What is the race condition?
Since COMPLETION is already released in 9.1, do we need to backport a fix for the race condition to 9.1?
Is it necessary to rewrite the implementation of BulkInferenceExecutor to fix the race condition? Or can we have a more surgical fix that we can review separately?

...in/esql/src/test/java/org/elasticsearch/xpack/esql/inference/rerank/RerankOperatorTests.java

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/execution/PlanExecutor.java

ioanatia · 2025-07-29T11:18:18Z

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/analysis/Analyzer.java

        }
    }

-    private static class ResolveInference extends ParameterizedAnalyzerRule<InferencePlan<?>, AnalyzerContext> {


Is there any difference between this implementation and the new one you added?
I don't think there is: they both do the same thing, even though this one overrides the rule method and the new one overrides the apply method.

So this diff here looks like it was entirely unnecessary.
If we needed to make a change to ResolveInference, we could have made it here rather than moving it later in the file. Moving it forces the reviewers to check each line to figure out what actually changed.

Maybe the one thing that we needed in Analyzer.java was the order in which we apply rules. But even that is debatable, because it's not necessary yet, we could have switched the order when we actually add the text_embedding function. 🤷‍♀️

There is no major difference, but the new method is easier to extend since I can now chain more transforms.

return plan.transformDown(InferencePlan.class, p -> resolveInferencePlan(p, context));

will become the following when implementing the resolution of InferenceFunction:

return plan.transformDown(InferencePlan.class, p -> resolveInferencePlan(p, context))
    .transformExpressionsOnly(InferenceFunction.class, f -> resolveInferenceFunction(f, context));

The whole point of this PR is to anticipate the changes that are required in the existing framework and to isolate them, so we can focus on testing for potential regressions.

BTW, I am 100% sure that we will have to change the order of the resolution and that I will use this change to finish the implementation of TEXT_EMBEDDING.

...sql/src/main/java/org/elasticsearch/xpack/esql/inference/bulk/BulkInferenceRunnerConfig.java

afoucret · 2025-07-30T12:35:56Z

The race condition can happen when several batches of inference requests are submitted.
If previous batches have already exhausted the allocated permits, the newly submitted batch will never be started. As a consequence, the listener will never be called and the request will time out.

In our execution model we submit one batch per page (the requests inside a batch are then executed in parallel), with up to 10 concurrent batches (this last point is managed by the AsyncOperator). This means the problem mostly affects big requests that handle multiple pages (hundreds or thousands of rows), so it is not the most likely case.

I have added a test to verify this parallel execution, which I will also add to the 9.1 branch:

    public void testParallelBulkExecution() throws Exception {
        int batches = between(50, 100);
        CountDownLatch latch = new CountDownLatch(batches);

        for (int i = 0; i < batches; i++) {
            List<InferenceAction.Request> requests = randomInferenceRequestList(between(1, 1_000));
            List<InferenceAction.Response> responses = randomInferenceResponseList(requests.size());

            Client client = mockClient(invocation -> {
                runWithRandomDelay(() -> {
                    ActionListener<InferenceAction.Response> l = invocation.getArgument(2);
                    l.onResponse(responses.get(requests.indexOf(invocation.getArgument(1, InferenceAction.Request.class))));
                });
                return null;
            });

            ActionListener<List<InferenceAction.Response>> listener = ActionListener.wrap(r -> {
                assertThat(r, equalTo(responses));
                LogManager.getLogger(BulkInferenceRunnerTests.class).warn("Received [{}] responses", responses.size());
                latch.countDown();
            }, ESTestCase::fail);

            inferenceRunnerFactory(client).create(randomBulkExecutionConfig()).executeBulk(requestIterator(requests), listener);
        }

        latch.await();
    }

I think a surgical patch will not be too complicated.
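The failure mode described above (a batch submitted after the permits are exhausted is never started) can be modelled with a small single-threaded sketch. Everything here is hypothetical: the class, the `handOffToAnyBatch` flag and the permit accounting are illustrative assumptions, not the BulkInferenceRunner implementation or the actual 9.1 patch. The sketch only shows one way such a lost start can arise, and how offering a freed permit to any waiting batch avoids it.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical, single-threaded model of a permit-throttled batch runner.
// None of these names come from the Elasticsearch code base.
final class PermitHandoffSketch {

    /** A batch whose listener fires once all of its requests have completed. */
    static final class Batch {
        int toStart;
        int toFinish;
        boolean listenerCalled;

        Batch(int size) {
            toStart = size;
            toFinish = size;
        }
    }

    private final List<Batch> batches = new ArrayList<>();
    private final Deque<Batch> inFlight = new ArrayDeque<>(); // one entry per running request
    private final boolean handOffToAnyBatch;
    private int permits;

    PermitHandoffSketch(int permits, boolean handOffToAnyBatch) {
        this.permits = permits;
        this.handOffToAnyBatch = handOffToAnyBatch;
    }

    /** Starts as many requests of the batch as the free permits allow. */
    private void startRequests(Batch batch) {
        while (permits > 0 && batch.toStart > 0) {
            permits--;
            batch.toStart--;
            inFlight.add(batch);
        }
    }

    void submit(Batch batch) {
        batches.add(batch);
        startRequests(batch); // starts nothing if earlier batches hold every permit
    }

    /** Simulates one asynchronous response; returns false when nothing is running. */
    boolean completeOne() {
        Batch batch = inFlight.poll();
        if (batch == null) {
            return false;
        }
        permits++;
        if (--batch.toFinish == 0) {
            batch.listenerCalled = true;
        }
        if (handOffToAnyBatch) {
            batches.forEach(this::startRequests); // freed permit goes to any batch that still has work
        } else {
            startRequests(batch); // freed permit is only offered to the batch that just got a response
        }
        return true;
    }

    /** Submits all batches, drains every response, and counts the listeners that were called. */
    static int completedBatches(boolean handOffToAnyBatch, int permits, int... batchSizes) {
        PermitHandoffSketch runner = new PermitHandoffSketch(permits, handOffToAnyBatch);
        for (int size : batchSizes) {
            runner.submit(new Batch(size));
        }
        while (runner.completeOne()) {
            // keep delivering responses until nothing is in flight
        }
        return (int) runner.batches.stream().filter(b -> b.listenerCalled).count();
    }

    public static void main(String[] args) {
        System.out.println("without hand-off: " + completedBatches(false, 2, 5, 3));
        System.out.println("with hand-off:    " + completedBatches(true, 2, 5, 3));
    }
}
```

With 2 permits and batches of 5 and 3 requests, the variant without hand-off completes only the first batch (the second one never starts, so its listener is never called), while the hand-off variant completes both.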

...esql/src/test/java/org/elasticsearch/xpack/esql/inference/bulk/BulkInferenceRunnerTests.java

afoucret · 2025-07-30T13:29:38Z

@ioanatia Here is the patch for the race condition in 9.1: https://github.com/elastic/elasticsearch/pull/130991/files

tteofili

LGTM, nice work Aurelien 💯

tteofili · 2025-07-31T15:29:49Z

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/inference/InferenceResolver.java

+/**
+ * Collects and resolves inference deployments inference IDs from ES|QL logical plans.
+ */
+public class InferenceResolver {


tteofili · 2025-07-31T15:31:36Z

.../plugin/esql/src/main/java/org/elasticsearch/xpack/esql/inference/rerank/RerankOperator.java

    protected RerankOperatorRequestIterator requests(Page inputPage) {
-        int inputBlockChannel = inputPage.getBlockCount() - 1;
-        return new RerankOperatorRequestIterator(inputPage.getBlock(inputBlockChannel), inferenceId(), queryText, batchSize);
+        return new RerankOperatorRequestIterator((BytesRefBlock) rowEncoder.eval(inputPage), inferenceId(), queryText, batchSize);


Is it always a BytesRefBlock? It probably is, but I wonder if we should be safe and do a check here?

There is a small issue here, but I will handle it in a separate PR: it also applies to the 9.1 and 8.19 branches, so it will be easier to backport.
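For reference, the kind of defensive check suggested above could look like the following sketch. The `Block`, `BytesRefBlock` and `LongBlock` types here are simplified stand-ins defined only for this example, and `requireBytesRefBlock` is a hypothetical helper. The real compute-engine classes have a different shape, so treat this as an illustration of the pattern rather than a proposed patch.

```java
// Simplified stand-in types; not the real Elasticsearch compute-engine classes.
final class BlockCastSketch {

    interface Block {}

    record BytesRefBlock(String[] values) implements Block {}

    record LongBlock(long[] values) implements Block {}

    // Hypothetical helper: fails with a descriptive message instead of a bare ClassCastException.
    static BytesRefBlock requireBytesRefBlock(Block block) {
        if (block instanceof BytesRefBlock bytesRefBlock) {
            return bytesRefBlock;
        }
        String actual = block == null ? "null" : block.getClass().getSimpleName();
        throw new IllegalStateException("expected a BytesRefBlock but got [" + actual + "]");
    }

    public static void main(String[] args) {
        Block ok = new BytesRefBlock(new String[] { "doc-1", "doc-2" });
        System.out.println(requireBytesRefBlock(ok).values().length);
        try {
            requireBytesRefBlock(new LongBlock(new long[] { 1L }));
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```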

...src/main/java/org/elasticsearch/xpack/esql/inference/rerank/RerankOperatorOutputBuilder.java

elasticsearchmachine added the v9.2.0 label Jul 28, 2025

afoucret added >non-issue :Analytics/ES|QL AKA ESQL labels Jul 28, 2025

afoucret mentioned this pull request Jul 28, 2025

ES|QL: Add TEXT_EMBEDDING function #131022

Closed

6 tasks

afoucret marked this pull request as ready for review July 28, 2025 07:00

elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Jul 28, 2025

afoucret commented Jul 28, 2025

ioanatia reviewed Jul 29, 2025

afoucret commented Jul 30, 2025

afoucret mentioned this pull request Jul 31, 2025

Fix BulkInferenceExecutorTests timeout caused by a race condition. #130991

Merged

afoucret force-pushed the esql-inference-runner-refactoring branch from 615d89d to 56d6597 Compare July 31, 2025 09:50

afoucret requested review from a team as code owners July 31, 2025 09:50

afoucret added 13 commits July 31, 2025 12:14

Quite a big refactoring of ES|QL inference execution implementation.

e60510b

lint

879f570

Small improvements to the PreAnalyzer

aa77791

Reduced footprint on the PreAnalyzer

ad782b9

Additional cleaning in PreAnalyzer

7863781

Moving back InferenceRunnerConfig to BulkInferenceRunnerConfig

fd81884

Moving back InferenceRunnerConfig to BulkInferenceRunnerConfig

c6c7aac

Ensure CannedSourceOperator::deepCopyOf properly release blocks on fa…

29c33f2

…ilure.

Introduce an InferenceService

273006c

Fix typo

a324b32

Reverting useless change.

d77125a

Remove useless file

15ec18d

Add more test case.

ef60c09

afoucret and others added 7 commits July 31, 2025 12:19

Remove trailing debug log

db8cdb2

Small test improvement.

44f35fc

Better implementation of OutputBuilder.

6ae9a03

Remove useless line of code.

c4ae133

Fix import

933f484

Apply new RERANK/COMPLETION syntax

8c93015

Remove useless page release.

57fe411

afoucret force-pushed the esql-inference-runner-refactoring branch from 56d6597 to 57fe411 Compare July 31, 2025 10:20

[CI] Auto commit changes from spotless

a570854

tteofili approved these changes Jul 31, 2025

afoucret merged commit dd3e3c9 into elastic:main Jul 31, 2025
33 checks passed
