Add inner hits support to semantic query #111834

Mikep86 · 2024-08-13T12:08:16Z

Adds inner hits support to the semantic query through a restricted inner_hits parameter, which exposes from and size from the inner_hits options. We chose to do this because the other inner hits options don't make sense in the context of the semantic query.

The API and response look like:

PUT test-index
{
    "mappings": {
        "properties": {
            "infer_field": {
                "type": "semantic_text",
                "inference_id": "my-elser-endpoint"
            }
        }
    }
}

PUT test-index/_doc/doc1
{
    "infer_field": ["these are not the droids you're looking for. He's free to go around", "next time on battle bots" ]
}

GET test-index/_search
{
    "query": {
        "semantic": {
            "field": "infer_field",
            "query": "robots",
            "inner_hits": {
                "from": 0,
                "size": 2
            }
        }
    },
    "_source": [ "infer_field.text" ]
}

{
    "took": 87,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 4.733991,
        "hits": [
            {
                "_index": "test-index",
                "_id": "doc1",
                "_score": 4.733991,
                "_source": {
                    "infer_field": {
                        "text": [
                            "these are not the droids you're looking for. He's free to go around",
                            "next time on battle bots"
                        ]
                    }
                },
                "inner_hits": {
                    "infer_field": {
                        "hits": {
                            "total": {
                                "value": 2,
                                "relation": "eq"
                            },
                            "max_score": 4.733991,
                            "hits": [
                                {
                                    "_index": "test-index",
                                    "_id": "doc1",
                                    "_nested": {
                                        "field": "infer_field.inference.chunks",
                                        "offset": 1
                                    },
                                    "_score": 4.733991,
                                    "_source": {
                                        "text": "next time on battle bots"
                                    }
                                },
                                {
                                    "_index": "test-index",
                                    "_id": "doc1",
                                    "_nested": {
                                        "field": "infer_field.inference.chunks",
                                        "offset": 0
                                    },
                                    "_score": 4.283129,
                                    "_source": {
                                        "text": "these are not the droids you're looking for. He's free to go around"
                                    }
                                }
                            ]
                        }
                    }
                }
            }
        ]
    }
}
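As a rough illustration of the restriction described above, here is a minimal client-side sketch in Python that builds the same request body and only accepts from and size under inner_hits. The helper name build_semantic_search and the client-side check are assumptions made for this example; how the server responds to other options is not shown in this description.

```python
# Hypothetical client-side helper; not part of the Elasticsearch plugin.
ALLOWED_INNER_HITS_OPTIONS = {"from", "size"}


def build_semantic_search(field, query, inner_hits=None):
    """Build a _search body for the semantic query with optional inner hits."""
    semantic = {"field": field, "query": query}
    if inner_hits is not None:
        # Mirror the restricted parameter: only pagination options are accepted.
        unsupported = set(inner_hits) - ALLOWED_INNER_HITS_OPTIONS
        if unsupported:
            raise ValueError(f"unsupported inner_hits options: {sorted(unsupported)}")
        semantic["inner_hits"] = dict(inner_hits)
    return {"query": {"semantic": semantic}, "_source": [f"{field}.text"]}


# Same request as the example above.
body = build_semantic_search("infer_field", "robots", {"from": 0, "size": 2})
```

The resulting body can be sent to GET test-index/_search with any HTTP client.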

elasticsearchmachine · 2024-08-13T12:08:41Z

Hi @Mikep86, I've created a changelog YAML for you.

kderusso

Technical approach looks reasonable to me, nice work!

...gin/inference/src/main/java/org/elasticsearch/xpack/inference/queries/InnerChunkBuilder.java

...inference/src/yamlRestTest/resources/rest-api-spec/test/inference/40_semantic_text_query.yml

jimczi

+1 for the approach, I left some minor comments.

...gin/inference/src/main/java/org/elasticsearch/xpack/inference/queries/InnerChunkBuilder.java

.../inference/src/main/java/org/elasticsearch/xpack/inference/queries/SemanticQueryBuilder.java

joemcelroy · 2024-08-13T16:51:01Z

What happens when I query on two semantic_text fields? Do they collide under the inner_hits.chunks path? We should probably specify a custom name that's unique to the field to avoid collisions.

My first thought is that the retrieved chunks could be positioned closer to the field source, i.e. _source.infer_field.chunks, but there's a challenge when also retrieving via fields.

Looking at the API interface, I think the chunking configuration should move away from mirroring the inner_hits configuration and toward what a developer needs for chunk retrieval.

For example, I can see a need to retrieve the chunks adjacent to matched chunks. Chunks tend to be small (for precision), which makes them not very helpful to the LLM as context. It would be great if we could ask for the chunks adjacent to each matched chunk to be returned as well.

Example: does the from parameter make sense in this example?

Mikep86 · 2024-08-13T17:21:48Z

@joemcelroy

> What happens when I query on two semantic_text fields? Do they collide under the inner_hits.chunks path? We should probably specify a custom name that's unique to the field to avoid collisions.

It does indeed! Good callout, I will fix this.
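To make the collision concrete, here is a toy model in Python. It is not the plugin's code, and the field names title_semantic and body_semantic are hypothetical. It only shows that a fixed inner hits name is shared by every semantic query in a request, while a name derived from the field stays unique per field.

```python
from collections import Counter


def inner_hits_names(fields, naming):
    """Return the inner hits name each semantic query would use."""
    return [naming(field) for field in fields]


def colliding_names(names):
    """Return the names that are used by more than one query, sorted."""
    return sorted(name for name, count in Counter(names).items() if count > 1)


# Hypothetical semantic_text fields queried in the same search request.
fields = ["title_semantic", "body_semantic"]

# A fixed name: both queries end up under the same key.
fixed = inner_hits_names(fields, lambda field: "chunks")

# A name derived from the field: each query gets its own key.
per_field = inner_hits_names(fields, lambda field: field)
```

With the fixed naming, colliding_names(fixed) reports "chunks"; with per-field naming it reports nothing.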

> My first thought is that the retrieved chunks could be positioned closer to the field source, i.e. _source.infer_field.chunks, but there's a challenge when also retrieving via fields.

I don't quite follow here. Do you mean that the nested path infer_field.inference.chunks is too verbose?

> Looking at the API interface, I think the chunking configuration should move away from mirroring the inner_hits configuration and toward what a developer needs for chunk retrieval.

> For example, I can see a need to retrieve the chunks adjacent to matched chunks. Chunks tend to be small (for precision), which makes them not very helpful to the LLM as context. It would be great if we could ask for the chunks adjacent to each matched chunk to be returned as well.

> Example: does the from parameter make sense in this example?

Agreed that this could be useful, but it's quite the scope expansion. I think it would require implementing a new low-level inner hits retrieval mechanism specifically for retrieving chunks. IMO that's best left as a follow-up to this, WDYT?

I think the from parameter will always make sense as a pagination mechanism, but how it operates would potentially change with chunk-specific inner hits retrieval.
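For reference, the adjacent-chunk idea can be approximated on the client side today, assuming the caller has the matched chunk offsets (such as the _nested.offset values in the response) and the full ordered list of chunks for the document. This is only a sketch of the requested behavior, not a proposal for the server-side mechanism; expand_with_neighbors and the sample chunk list are hypothetical.

```python
def expand_with_neighbors(matched_offsets, chunk_count, window=1):
    """Return sorted, de-duplicated offsets of matched chunks plus neighbors.

    `window` is how many chunks to include on each side of a match.
    Offsets are clamped to the valid range [0, chunk_count).
    """
    selected = set()
    for offset in matched_offsets:
        start = max(0, offset - window)
        stop = min(chunk_count, offset + window + 1)
        selected.update(range(start, stop))
    return sorted(selected)


# Hypothetical document split into five chunks, with a match on chunk 2.
chunks = ["c0", "c1", "c2", "c3", "c4"]
offsets = expand_with_neighbors([2], len(chunks))
context = [chunks[i] for i in offsets]
```

Overlapping windows are merged, so two matches next to each other do not return the same chunk twice.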

…hunks

...inference/src/yamlRestTest/resources/rest-api-spec/test/inference/40_semantic_text_query.yml

…tible

joemcelroy · 2024-08-13T19:40:52Z

> I don't quite follow here. Do you mean that the nested path infer_field.inference.chunks is too verbose?

As a developer, I don't expect the chunk matches to come under inner_hits. It smells like we are leaking implementation details to the user (the user doesn't know the text is stored as nested field values; they only see that it's been chunked behind the scenes as a single field). My thought came from "what if it was closer to the field objects that are my result fields", though I don't like the idea of it being contained within _source.infer_field.chunks.

Maybe the API request could look like this:

GET test-index/_search
{
    "query": {
        "semantic": {
            "field": "infer_field",
            "query": "robots",
            "chunks": {
                "size": 2
            }
        }
    },
    "_source": [ "infer_field.text" ]
}

Maybe the API response could look like:

{
    "took": 87,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 4.733991,
        "hits": [
            {
                "_index": "test-index",
                "_id": "doc1",
                "_score": 4.733991,
                "_source": {
                    "infer_field": {
                        "text": [
                            "these are not the droids you're looking for. He's free to go around",
                            "next time on battle bots"
                        ]
                    }
                },
                "chunks": {
                    "infer_field": [
                        {
                            "_score": 4.733991,
                            "text": "next time on battle bots"
                        },
                        {
                            "_score": 2.733991,
                            "text": "these are not the droids you're looking for. He's free to go around"
                        }
                    ]
                }
            }
        ]
    }
}

My suggestion is that we design an interface that's focused on chunking needs rather than copying the nested field query design, making sure the design sets us up for future features. The examples weren't requirements for this PR; they were more about surfacing possible chunking requirements.

Also worth questioning whether returning the embedding is important here. By default it shouldn't be returned IMO.
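Until something like this exists server-side, the flatter shape can be produced on the client from the inner_hits response. The Python sketch below assumes the response layout shown in the PR description; to_chunks_shape is a hypothetical helper, and the sample hit is trimmed to the keys it reads.

```python
def to_chunks_shape(hit):
    """Return a copy of a search hit with inner_hits flattened into 'chunks'."""
    reshaped = {key: value for key, value in hit.items() if key != "inner_hits"}
    chunks = {}
    for name, result in hit.get("inner_hits", {}).items():
        # Keep only the score and text of each chunk, in ranked order.
        chunks[name] = [
            {"_score": inner["_score"], "text": inner["_source"]["text"]}
            for inner in result["hits"]["hits"]
        ]
    reshaped["chunks"] = chunks
    return reshaped


# Trimmed copy of the hit from the PR description.
hit = {
    "_id": "doc1",
    "_score": 4.733991,
    "inner_hits": {
        "infer_field": {
            "hits": {
                "hits": [
                    {"_score": 4.733991, "_source": {"text": "next time on battle bots"}},
                    {
                        "_score": 4.283129,
                        "_source": {
                            "text": "these are not the droids you're looking for. He's free to go around"
                        },
                    },
                ]
            }
        }
    },
}

flattened = to_chunks_shape(hit)
```

Because only _score and text are copied, nothing else stored on the chunk is carried over into the flattened result.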

elasticsearchmachine · 2024-08-14T15:55:20Z

Pinging @elastic/ent-search-eng (Team:SearchOrg)

benwtrent · 2024-08-14T16:58:32Z

My 2 cents is:

Either we change the parameter chunks to inner_hits and place restrictions/set specific defaults on it (easy)
Or we change the response value name to chunks (difficult)

Users will always have to read documentation on what the parameter name is. "chunks" vs "parts" vs. "passages" vs. "inner_hits".

Mikep86 · 2024-08-14T17:26:48Z

> My 2 cents is:

> Either we change the parameter chunks to inner_hits and place restrictions/set specific defaults on it (easy)

> Or we change the response value name to chunks (difficult)

In the spirit of progress over perfection, the first option allows us to merge this change and utilize passage ranking through the semantic query now. That makes it the clear choice for the moment IMO.

Given the LOE to make the second option happen, I just don't see it happening without a discovery phase with a lot of stakeholders and multiple PRs. I would likely spend the majority of 8.16 development time on changes related to the search response format, at the cost of other semantic text features we want to include in 8.16.

I also think there's a third option, though: keep the API as it is now. chunks is basically the request-side API we want, so if we keep it there is one less breaking change to make in the future.

carlosdelest

LGTM. 💯

I agree with the approach of using the inner_hits name in the request and keeping the inner_hits structure.

We're already leaking the inner workings of semantic_text - we're just making them convenient to use in terms of ingestion and query. But we're not hiding the details in case users want to dig deeper.

I understand @joemcelroy's point about making it better from the dev perspective. For my part, I think we benefit more from alignment for now: we're using the inner_hits config and structure, which is familiar to users and matches the internal representation.

We can include the chunks parameter later on to make something more aligned to the dev experience that builds on top of this change.