diff --git a/docs/reference/rest-api/common-parms.asciidoc b/docs/reference/rest-api/common-parms.asciidoc index 162e486158a95..8c1a922eeb8a3 100644 --- a/docs/reference/rest-api/common-parms.asciidoc +++ b/docs/reference/rest-api/common-parms.asciidoc @@ -1310,8 +1310,26 @@ See <>. end::wait_for_active_shards[] tag::rrf-retrievers[] + +[NOTE] +==== +Either `query` or `retrievers` must be specified. +Combining `query` and `retrievers` is not supported. +==== + +`query`:: +(Optional, String) ++ +The query to use when using the <>. + +`fields`:: +(Optional, array of strings) ++ +The fields to query when using the <>. +If not specified, uses the index's default fields from the `index.query.default_field` index setting, which is `*` by default. + `retrievers`:: -(Required, array of retriever objects) +(Optional, array of retriever objects) + A list of child retrievers to specify which sets of returned top documents will have the RRF formula applied to them. Each child retriever carries an @@ -1337,7 +1355,7 @@ This value determines the size of the individual result sets per query. A higher value will improve result relevance at the cost of performance. The final ranked result set is pruned down to the search request's <>. `rank_window_size` must be greater than or equal to `size` and greater than or equal to `1`. -Defaults to the `size` parameter. +Defaults to 10. end::compound-retriever-rank-window-size[] tag::compound-retriever-filter[] @@ -1349,39 +1367,68 @@ according to each retriever's specifications. end::compound-retriever-filter[] tag::linear-retriever-components[] + +[NOTE] +==== +Either `query` or `retrievers` must be specified. +Combining `query` and `retrievers` is not supported. +==== + +`query`:: +(Optional, String) ++ +The query to use when using the <>. + +`fields`:: +(Optional, array of strings) ++ +The fields to query when using the <>. +Fields can include boost values using the `^` notation (e.g., `"field^2"`). +If not specified, uses the index's default fields from the `index.query.default_field` index setting, which is `*` by default. + +`normalizer`:: +(Optional, String) ++ +The normalizer to use when using the <>. +See <> for supported values. +Required when `query` is specified. ++ +[WARNING] +==== +Avoid using `none`, as that will disable normalization and may bias the result set towards lexical matches. +See <> for more information. +==== + `retrievers`:: -(Required, array of objects) +(Optional, array of objects) + A list of the sub-retrievers' configurations that we will take into account and whose result sets we will merge through a weighted sum. Each configuration can have a different weight and normalization depending on the specified retriever. -Each entry specifies the following parameters: +include::common-parms.asciidoc[tag=compound-retriever-rank-window-size] + +include::common-parms.asciidoc[tag=compound-retriever-filter] -* `retriever`:: +`retriever`:: (Required, a <> object) + Specifies the retriever for which we will compute the top documents. The retriever will produce `rank_window_size` results, which will later be merged based on the specified `weight` and `normalizer`. -* `weight`:: +`weight`:: (Optional, float) + The weight that each score of this retriever's top docs will be multiplied by. Must be greater than or equal to 0. Defaults to 1.0. 
-* `normalizer`:: +`normalizer`:: (Optional, String) + -Specifies how we will normalize the retriever's scores, before applying the specified `weight`. -Available values are: `minmax`, and `none`. Defaults to `none`. - -** `none` -** `minmax` : -A `MinMaxScoreNormalizer` that normalizes scores based on the following formula -+ -``` -score = (score - min) / (max - min) -``` +Specifies how the retriever’s score will be normalized before applying the specified `weight`. +See <> for supported values. +Defaults to `none`. See also <> using a linear retriever on how to independently configure and apply normalizers to retrievers. diff --git a/docs/reference/search/retriever.asciidoc b/docs/reference/search/retriever.asciidoc index a3cc4734fd23a..c8636162af0e0 100644 --- a/docs/reference/search/retriever.asciidoc +++ b/docs/reference/search/retriever.asciidoc @@ -121,6 +121,28 @@ POST /restaurants/_bulk?refresh PUT /movies +PUT /books +{ + "mappings": { + "properties": { + "title": { + "type": "text", + "copy_to": "title_semantic" + }, + "description": { + "type": "text", + "copy_to": "description_semantic" + }, + "title_semantic": { + "type": "semantic_text" + }, + "description_semantic": { + "type": "semantic_text" + } + } + } +} + PUT _query_rules/my-ruleset { "rules": [ @@ -151,6 +173,8 @@ PUT _query_rules/my-ruleset DELETE /restaurants DELETE /movies + +DELETE /books -------------------------------------------------- // TEARDOWN //// @@ -282,9 +306,19 @@ A retriever that normalizes and linearly combines the scores of other retrievers include::{es-ref-dir}/rest-api/common-parms.asciidoc[tag=linear-retriever-components] -include::{es-ref-dir}/rest-api/common-parms.asciidoc[tag=compound-retriever-rank-window-size] +[[linear-retriever-normalizers]] +===== Normalizers -include::{es-ref-dir}/rest-api/common-parms.asciidoc[tag=compound-retriever-filter] +The `linear` retriever supports the following normalizers: + +* `none`: No normalization +* `minmax`: Normalizes scores based on the following formula: ++ +.... +score = (score - min) / (max - min) +.... + +* `l2_norm`: Normalizes scores using the L2 norm of the score values [[rrf-retriever]] ==== RRF Retriever @@ -912,6 +946,202 @@ GET movies/_search <1> The `rule` retriever is the outermost retriever, applying rules to the search results that were previously reranked using the `rrf` retriever. <2> The `rrf` retriever returns results from all of its sub-retrievers, and the output of the `rrf` retriever is used as input to the `rule` retriever. +[discrete] +[[multi-field-query-format]] +=== Multi-field query format + +The `linear` and `rrf` retrievers support a multi-field query format that provides a simplified way to define searches across multiple fields without explicitly specifying inner retrievers. +This format automatically generates appropriate inner retrievers based on the field types and query parameters. +This is a great way to search an index, knowing little to nothing about its schema, while also handling normalization across lexical and semantic matches. + +[discrete] +[[multi-field-field-grouping]] +==== Field grouping + +The multi-field query format groups queried fields into two categories: + +- **Lexical fields**: fields that support term queries, such as `keyword` and `text` fields. +- **Semantic fields**: <>. + +Each field group is queried separately and the scores/ranks are normalized such that each contributes 50% to the final score/rank. +This balances the importance of lexical and semantic fields. 
+Most indices contain more lexical than semantic fields, and without this grouping the results would often bias towards lexical field matches. + +[WARNING] +==== +In the `linear` retriever, this grouping relies on using a normalizer other than `none` (i.e., `minmax` or `l2_norm`). +If you use the `none` normalizer, the scores across field groups will not be normalized and the results may be biased towards lexical field matches. +==== + +[discrete] +[[multi-field-field-boosting]] +==== Linear retriever field boosting + +When using the `linear` retriever, fields can be boosted using the `^` notation: + +[source,console] +---- +GET books/_search +{ + "retriever": { + "linear": { + "query": "elasticsearch", + "fields": [ + "title^3", <1> + "description^2", <2> + "title_semantic", <3> + "description_semantic^2" + ], + "normalizer": "minmax" + } + } +} +---- +// TEST[continued] + +<1> 3x weight +<2> 2x weight +<3> 1x weight (default) + +Due to how the <> are normalized, per-field boosts have no effect on the range of the final score. +Instead, they affect the importance of the field's score within its group. + +For example, if the schema looks like: + +[source,console] +---- +PUT /books +{ + "mappings": { + "properties": { + "title": { + "type": "text", + "copy_to": "title_semantic" + }, + "description": { + "type": "text", + "copy_to": "description_semantic" + }, + "title_semantic": { + "type": "semantic_text" + }, + "description_semantic": { + "type": "semantic_text" + } + } + } +} +---- +// TEST[skip:index created in test setup] + +And we run this query: + +[source,console] +---- +GET books/_search +{ + "retriever": { + "linear": { + "query": "elasticsearch", + "fields": [ + "title", + "description", + "title_semantic", + "description_semantic" + ], + "normalizer": "minmax" + } + } +} +---- +// TEST[continued] + +The score breakdown would be: + +* Lexical fields (50% of score): + ** `title`: 50% of lexical fields group score, 25% of final score + ** `description`: 50% of lexical fields group score, 25% of final score +* Semantic fields (50% of score): + ** `title_semantic`: 50% of semantic fields group score, 25% of final score + ** `description_semantic`: 50% of semantic fields group score, 25% of final score + +If we apply per-field boosts like so: + +[source,console] +---- +GET books/_search +{ + "retriever": { + "linear": { + "query": "elasticsearch", + "fields": [ + "title^3", + "description^2", + "title_semantic", + "description_semantic^2" + ], + "normalizer": "minmax" + } + } +} +---- +// TEST[continued] + +The score breakdown would change to: + +* Lexical fields (50% of score): + ** `title`: 60% of lexical fields group score, 30% of final score + ** `description`: 40% of lexical fields group score, 20% of final score +* Semantic fields (50% of score): + ** `title_semantic`: 33% of semantic fields group score, 16.5% of final score + ** `description_semantic`: 66% of semantic fields group score, 33% of final score + +[discrete] +[[multi-field-wildcard-field-patterns]] +==== Wildcard field patterns + +Field names support the `*` wildcard character to match multiple fields: + +[source,console] +---- +GET books/_search +{ + "retriever": { + "rrf": { + "query": "machine learning", + "fields": [ + "title*", <1> + "*_text" <2> + ] + } + } +} +---- +// TEST[continued] + +<1> Match fields that start with `title` +<2> Match fields that end with `_text` + +Note, however, that wildcard field patterns will only resolve to fields that either: + +- Support term queries, such as `keyword` and `text` 
fields +- Are `semantic_text` fields + +[discrete] +[[multi-field-limitations]] +==== Limitations + +- **Single index**: Multi-field queries currently work with single index searches only +- **CCS (Cross Cluster Search)**: Multi-field queries do not support remote cluster searches + +[discrete] +[[multi-field-examples]] +==== Examples + +- <> +- <> + + [discrete] [[retriever-common-parameters]] === Common usage guidelines diff --git a/docs/reference/search/search-your-data/retrievers-examples.asciidoc b/docs/reference/search/search-your-data/retrievers-examples.asciidoc index 5ff97673b8926..e0a97c8ffc896 100644 --- a/docs/reference/search/search-your-data/retrievers-examples.asciidoc +++ b/docs/reference/search/search-your-data/retrievers-examples.asciidoc @@ -30,7 +30,11 @@ PUT retrievers_example } }, "text": { - "type": "text" + "type": "text", + "copy_to": "text_semantic" + }, + "text_semantic": { + "type": "semantic_text" }, "year": { "type": "integer" @@ -285,32 +289,32 @@ This returns the following response based on the normalized weighted score for e "value": 3, "relation": "eq" }, - "max_score": -1, + "max_score": 3.5, "hits": [ { "_index": "retrievers_example", "_id": "2", - "_score": -1 + "_score": 3.5 }, { "_index": "retrievers_example", "_id": "1", - "_score": -2 + "_score": 2.3 }, { "_index": "retrievers_example", "_id": "3", - "_score": -3 + "_score": 0.1 } ] } } ---- // TESTRESPONSE[s/"took": 42/"took": $body.took/] -// TESTRESPONSE[s/"max_score": -1/"max_score": $body.hits.max_score/] -// TESTRESPONSE[s/"_score": -1/"_score": $body.hits.hits.0._score/] -// TESTRESPONSE[s/"_score": -2/"_score": $body.hits.hits.1._score/] -// TESTRESPONSE[s/"_score": -3/"_score": $body.hits.hits.2._score/] +// TESTRESPONSE[s/"max_score": 3.5/"max_score": $body.hits.max_score/] +// TESTRESPONSE[s/"_score": 3.5/"_score": $body.hits.hits.0._score/] +// TESTRESPONSE[s/"_score": 2.3/"_score": $body.hits.hits.1._score/] +// TESTRESPONSE[s/"_score": 0.1/"_score": $body.hits.hits.2._score/] ============== By normalizing scores and leveraging `function_score` queries, we can also implement more complex ranking strategies, @@ -402,38 +406,304 @@ Which would return the following results: "value": 4, "relation": "eq" }, - "max_score": -1, + "max_score": 3.5, + "hits": [ + { + "_index": "retrievers_example", + "_id": "3", + "_score": 3.5 + }, + { + "_index": "retrievers_example", + "_id": "2", + "_score": 2.0 + }, + { + "_index": "retrievers_example", + "_id": "4", + "_score": 1.1 + }, + { + "_index": "retrievers_example", + "_id": "1", + "_score": 0.1 + } + ] + } +} +---- +// TESTRESPONSE[s/"took": 42/"took": $body.took/] +// TESTRESPONSE[s/"max_score": 3.5/"max_score": $body.hits.max_score/] +// TESTRESPONSE[s/"_score": 3.5/"_score": $body.hits.hits.0._score/] +// TESTRESPONSE[s/"_score": 2.0/"_score": $body.hits.hits.1._score/] +// TESTRESPONSE[s/"_score": 1.1/"_score": $body.hits.hits.2._score/] +// TESTRESPONSE[s/"_score": 0.1/"_score": $body.hits.hits.3._score/] +============== + +[discrete] +[[retrievers-examples-rrf-multi-field-query-format]] +==== Example: RRF with the multi-field query format + +There's an even simpler way to execute a hybrid search though: We can use the <>, which allows us to query multiple fields without explicitly specifying inner retrievers. +One of the major challenges with hybrid search is normalizing the scores across matches on all field types. 
+Scores from <> and <> fields don't always fall in the same range, so we need to normalize the ranks across matches on these fields to generate a result set. +For example, BM25 scores from `text` fields are unbounded, while vector similarity scores from `text_embedding` models are bounded between [0, 1]. +The multi-field query format <>. + +The following example uses the multi-field query format to query every field specified in the `index.query.default_field` index setting, which is set to `*` by default. +This default value will cause the retriever to query every field that either: + +- Supports term queries, such as `keyword` and `text` fields +- Is a `semantic_text` field + +In this example, that would translate to the `text`, `text_semantic`, `year`, `topic`, and `timestamp` fields. + +[source,console] +---- +GET /retrievers_example/_search +{ + "retriever": { + "rrf": { + "query": "artificial intelligence" + } + }, + "_source": false +} +---- +// TEST[continued] + +This returns the following response based on the final rrf score for each result. + +.Example response +[%collapsible] +============== +[source,console-result] +---- +{ + "took": 42, + "timed_out": false, + "_shards": { + "total": 1, + "successful": 1, + "skipped": 0, + "failed": 0 + }, + "hits": { + "total": { + "value": 5, + "relation": "eq" + }, + "max_score": 0.8333334, + "hits": [ + { + "_index": "retrievers_example", + "_id": "2", + "_score": 0.8333334 + }, + { + "_index": "retrievers_example", + "_id": "3", + "_score": 0.82 + }, + { + "_index": "retrievers_example", + "_id": "4", + "_score": 0.48 + }, + { + "_index": "retrievers_example", + "_id": "1", + "_score": 0.40 + }, + { + "_index": "retrievers_example", + "_id": "5", + "_score": 0.25 + } + ] + } +} +---- +// TESTRESPONSE[s/"took": 42/"took": $body.took/] +// TESTRESPONSE[s/"max_score": 0.8333334/"max_score": $body.hits.max_score/] +// TESTRESPONSE[s/"_score": 0.8333334/"_score": $body.hits.hits.0._score/] +// TESTRESPONSE[s/"_score": 0.82/"_score": $body.hits.hits.1._score/] +// TESTRESPONSE[s/"_score": 0.48/"_score": $body.hits.hits.2._score/] +// TESTRESPONSE[s/"_score": 0.40/"_score": $body.hits.hits.3._score/] +// TESTRESPONSE[s/"_score": 0.25/"_score": $body.hits.hits.4._score/] +============== + +We can also use the `fields` parameter to explicitly specify the fields to query. +The following example uses the multi-field query format to query the `text` and `text_semantic` fields. + +[source,console] +---- +GET /retrievers_example/_search +{ + "retriever": { + "rrf": { + "query": "artificial intelligence", + "fields": ["text", "text_semantic"] + } + }, + "_source": false +} +---- +// TEST[continued] + +[NOTE] +==== +The `fields` parameter also accepts <>. +==== + +This returns the following response based on the final rrf score for each result. 
+ +.Example response +[%collapsible] +============== +[source,console-result] +---- +{ + "took": 42, + "timed_out": false, + "_shards": { + "total": 1, + "successful": 1, + "skipped": 0, + "failed": 0 + }, + "hits": { + "total": { + "value": 5, + "relation": "eq" + }, + "max_score": 0.8333334, "hits": [ + { + "_index": "retrievers_example", + "_id": "2", + "_score": 0.8333334 + }, { "_index": "retrievers_example", "_id": "3", - "_score": -1 + "_score": 0.82 }, + { + "_index": "retrievers_example", + "_id": "4", + "_score": 0.48 + }, + { + "_index": "retrievers_example", + "_id": "1", + "_score": 0.40 + }, + { + "_index": "retrievers_example", + "_id": "5", + "_score": 0.25 + } + ] + } +} +---- +// TESTRESPONSE[s/"took": 42/"took": $body.took/] +// TESTRESPONSE[s/"max_score": 0.8333334/"max_score": $body.hits.max_score/] +// TESTRESPONSE[s/"_score": 0.8333334/"_score": $body.hits.hits.0._score/] +// TESTRESPONSE[s/"_score": 0.82/"_score": $body.hits.hits.1._score/] +// TESTRESPONSE[s/"_score": 0.48/"_score": $body.hits.hits.2._score/] +// TESTRESPONSE[s/"_score": 0.40/"_score": $body.hits.hits.3._score/] +// TESTRESPONSE[s/"_score": 0.25/"_score": $body.hits.hits.4._score/] +============== + +[discrete] +[[retrievers-examples-linear-multi-field-query-format]] +==== Example: Linear retriever with the multi-field query format + +We can also use the <> with the `linear` retriever. +It works much the same way as <>, with a couple key differences: + +- We can use `^` notation to specify a <> +- We must set the `normalizer` parameter to specify the normalization method used to combine <> + +The following example uses the `linear` retriever to query the `text`, `text_semantic`, and `topic` fields, with a boost of 2 on the `topic` field: + +[source,console] +---- +GET /retrievers_example/_search +{ + "retriever": { + "linear": { + "query": "artificial intelligence", + "fields": ["text", "text_semantic", "topic^2"], + "normalizer": "minmax" + } + }, + "_source": false +} +---- +// TEST[continued] + +This returns the following response based on the normalized score for each result: + +.Example response +[%collapsible] +============== +[source,console-result] +---- +{ + "took": 42, + "timed_out": false, + "_shards": { + "total": 1, + "successful": 1, + "skipped": 0, + "failed": 0 + }, + "hits": { + "total": { + "value": 5, + "relation": "eq" + }, + "max_score": 2.0, + "hits": [ { "_index": "retrievers_example", "_id": "2", - "_score": -2 + "_score": 2.0 + }, + { + "_index": "retrievers_example", + "_id": "3", + "_score": 1.2 }, { "_index": "retrievers_example", "_id": "4", - "_score": -3 + "_score": 1.0 }, { "_index": "retrievers_example", "_id": "1", - "_score": -4 + "_score": 0.8 + }, + { + "_index": "retrievers_example", + "_id": "5", + "_score": 0.1 } ] } } ---- // TESTRESPONSE[s/"took": 42/"took": $body.took/] -// TESTRESPONSE[s/"max_score": -1/"max_score": $body.hits.max_score/] -// TESTRESPONSE[s/"_score": -1/"_score": $body.hits.hits.0._score/] -// TESTRESPONSE[s/"_score": -2/"_score": $body.hits.hits.1._score/] -// TESTRESPONSE[s/"_score": -3/"_score": $body.hits.hits.2._score/] -// TESTRESPONSE[s/"_score": -4/"_score": $body.hits.hits.3._score/] +// TESTRESPONSE[s/"max_score": 2.0/"max_score": $body.hits.max_score/] +// TESTRESPONSE[s/"_score": 2.0/"_score": $body.hits.hits.0._score/] +// TESTRESPONSE[s/"_score": 1.2/"_score": $body.hits.hits.1._score/] +// TESTRESPONSE[s/"_score": 1.0/"_score": $body.hits.hits.2._score/] +// TESTRESPONSE[s/"_score": 0.8/"_score": $body.hits.hits.3._score/] 
+// TESTRESPONSE[s/"_score": 0.1/"_score": $body.hits.hits.4._score/] ============== [discrete] @@ -1277,59 +1547,65 @@ The output of which, albeit a bit verbose, will provide all the necessary info t "_score": 0.5, "_explanation": { "value": 0.5, - "description": "rrf score: [0.5] computed for initial ranks [0, 1] with rankConstant: [1] as sum of [1 / (rank + rankConstant)] for each query", + "description": "sum of:", "details": [ { - "value": 0.0, - "description": "rrf score: [0], result not found in query at index [0]", - "details": [] - }, - { - "value": 1, - "description": "rrf score: [0.5], for rank [1] in query at index [1] computed as [1 / (1 + 1)], for matching query with score", + "value": 0.5, + "description": "rrf score: [0.5] computed for initial ranks [0, 1] with rankConstant: [1] as sum of [1 / (rank + rankConstant)] for each query", "details": [ { - "value": 0.8333334, - "description": "rrf score: [0.8333334] computed for initial ranks [2, 1] with rankConstant: [1] as sum of [1 / (rank + rankConstant)] for each query", + "value": 0.0, + "description": "rrf score: [0], result not found in query at index [0]", + "details": [] + }, + { + "value": 1, + "description": "rrf score: [0.5], for rank [1] in query at index [1] computed as [1 / (1 + 1)], for matching query with score", "details": [ { - "value": 2, - "description": "rrf score: [0.33333334], for rank [2] in query at index [0] computed as [1 / (2 + 1)], for matching query with score", + "value": 0.8333334, + "description": "rrf score: [0.8333334] computed for initial ranks [2, 1] with rankConstant: [1] as sum of [1 / (rank + rankConstant)] for each query", "details": [ { - "value": 2.8129659, - "description": "sum of:", + "value": 2, + "description": "rrf score: [0.33333334], for rank [2] in query at index [0] computed as [1 / (2 + 1)], for matching query with score", "details": [ { - "value": 1.4064829, - "description": "weight(text:information in 0) [PerFieldSimilarity], result of:", + "value": 2.8129659, + "description": "sum of:", "details": [ - *** - ] - }, - { - "value": 1.4064829, - "description": "weight(text:retrieval in 0) [PerFieldSimilarity], result of:", - "details": [ - *** + { + "value": 1.4064829, + "description": "weight(text:information in 1) [PerFieldSimilarity], result of:", + "details": [ + *** + ] + }, + { + "value": 1.4064829, + "description": "weight(text:retrieval in 1) [PerFieldSimilarity], result of:", + "details": [ + *** + ] + } ] } ] - } - ] - }, - { - "value": 1, - "description": "rrf score: [0.5], for rank [1] in query at index [1] computed as [1 / (1 + 1)], for matching query with score", - "details": [ + }, { "value": 1, - "description": "doc [0] with an original score of [1.0] is at rank [1] from the following source queries.", + "description": "rrf score: [0.5], for rank [1] in query at index [1] computed as [1 / (1 + 1)], for matching query with score", "details": [ { - "value": 1.0, - "description": "found vector with calculated similarity: 1.0", - "details": [] + "value": 1, + "description": "doc [1] with an original score of [1.0] is at rank [1] from the following source queries.", + "details": [ + { + "value": 1.0, + "description": "found vector with calculated similarity: 1.0", + "details": [] + } + ] } ] } @@ -1338,6 +1614,22 @@ The output of which, albeit a bit verbose, will provide all the necessary info t ] } ] + }, + { + "value": 0.0, + "description": "match on required clause, product of:", + "details": [ + { + "value": 0.0, + "description": "# clause", + "details": [] + }, 
+ { + "value": 1.0, + "description": "FieldExistsQuery [field=_primary_term]", + "details": [] + } + ] } ] } @@ -1347,8 +1639,7 @@ The output of which, albeit a bit verbose, will provide all the necessary info t } ---- // TESTRESPONSE[s/"took": 42/"took": $body.took/] -// TESTRESPONSE[s/\.\.\./$body.hits.hits.0._explanation.details.1.details.0.details.0.details.0.details.0.details.0/] -// TESTRESPONSE[s/\*\*\*/$body.hits.hits.0._explanation.details.1.details.0.details.0.details.0.details.1.details.0/] +// TESTRESPONSE[s/\*\*\*/$body.hits.hits.0._explanation.details.0.details.1.details.0.details.0.details.0.details.1.details.0/] // TESTRESPONSE[s/jnrdZFKS3abUgWVsVdj2Vg/$body.hits.hits.0._node/] ==============
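
The `explain` parameter can be combined with the multi-field query format in the same way. The following request is a minimal sketch, reusing the `retrievers_example` index from this page's setup; the exact structure of the `_explanation` output will depend on your data, mappings, and the inner retrievers that the multi-field query format generates for the queried fields.

[source,console]
----
GET /retrievers_example/_search
{
  "retriever": {
    "rrf": {
      "query": "artificial intelligence",
      "fields": ["text", "text_semantic"]
    }
  },
  "explain": true,
  "_source": false
}
----

As in the explicit `rrf` example above, each hit's `_explanation` shows the rank each generated sub-query assigned to the document and how those ranks were combined with the `1 / (rank + rankConstant)` formula into the final RRF score.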