
## Add example data [retrievers-examples-setup]

To begin with, let's create the `retrievers_example` index and add some documents to it.
We will set `number_of_shards=1` for our examples to ensure consistent and reproducible ordering.

```console
PUT retrievers_example
{
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "properties": {
      "vector": {
        "type": "dense_vector",
        "dims": 3,
        "similarity": "l2_norm",
        "index": true,
        "index_options": {
          "type": "flat"
        }
      },
      "text": {
        "type": "text",
        "copy_to": "text_semantic"
      },
      "text_semantic": {
        "type": "semantic_text"
      },
      "year": {
        "type": "integer"
      },
      "topic": {
        "type": "keyword"
      },
      "timestamp": {
        "type": "date"
      }
    }
  }
}
```

## Example: Combining query and kNN with RRF [retrievers-examples-combining-standard-knn-retrievers-with-rrf]

First, let’s examine how to combine two different types of queries: a `kNN` query and a `query_string` query.
While these queries may produce scores in different ranges, we can use Reciprocal Rank Fusion (`rrf`) to combine the results and generate a merged final result list.
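
To make the fusion concrete, here is a minimal sketch, in Python rather than anything Elasticsearch ships, of the rank-based formula RRF uses: each document contributes `1 / (rank_constant + rank)` for every ranked list it appears in, and the contributions are summed. The document IDs and the `rank_constant` value below are illustrative:

```python
def rrf_merge(rankings, rank_constant=60):
    """Reciprocal Rank Fusion: score(doc) = sum of 1 / (rank_constant + rank) over all rankings."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank_constant + rank)
    # Sort by fused score, best first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

lexical_ranking = ["1", "2", "3"]  # e.g. from a query_string query
vector_ranking = ["2", "3", "4"]   # e.g. from a kNN query
print(rrf_merge([lexical_ranking, vector_ranking]))
# Doc "2" comes out on top: it ranks well in both lists,
# even though neither list ranks it first.
```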

To implement this in the retriever framework, we start with the top-level element: our `rrf` retriever.
This retriever operates on top of two other retrievers: a `knn` retriever and a `standard` retriever. Our query structure would look like this:

```console
GET /retrievers_example/_search
{
  "retriever": {
    "rrf": {
      "retrievers": [
        {
          "standard": {
            "query": {
              "query_string": {
                "query": "(information retrieval) OR (artificial intelligence)",
                "default_field": "text"
              }
            }
          }
        },
        {
          "knn": {
            "field": "vector",
            "query_vector": [0.23, 0.67, 0.89],
            "k": 3,
            "num_candidates": 5
          }
        }
      ],
      "rank_window_size": 10,
      "rank_constant": 1
    }
  }
}
```

## Example: Hybrid search with linear retriever [retrievers-examples-linear-retriever]

A different, more intuitive way to provide hybrid search is to linearly combine the top documents of different retrievers using a weighted sum of the original scores.
Since, as above, the scores could lie in different ranges, we can also specify a `normalizer` to ensure that all scores for the top-ranked documents of a retriever lie in a specific range.

To implement this, we define a `linear` retriever along with a set of retrievers that will generate the heterogeneous result sets that we will combine.
We will solve a problem similar to the above by merging the results of a `standard` and a `knn` retriever.
As the `standard` retriever's scores are based on BM25 and are not strictly bounded, we will also define a `minmax` normalizer to ensure that the scores lie in the [0, 1] range.
We will apply the same normalizer to `knn` as well to ensure that we capture the importance of each document within the result set.
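
As a rough illustration of what this normalization-plus-weighting computes, here is a small Python sketch with made-up scores and hypothetical per-retriever weights (not code from Elasticsearch):

```python
def minmax(scores):
    """Rescale one retriever's top-document scores into [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}  # all scores equal: avoid division by zero
    return {doc: (score - lo) / (hi - lo) for doc, score in scores.items()}

# Made-up raw scores: unbounded BM25 vs. bounded vector similarity.
bm25_scores = {"1": 9.2, "2": 7.4, "3": 1.1}
knn_scores = {"2": 0.98, "3": 0.91, "4": 0.77}

weights = [2.0, 1.5]  # hypothetical per-retriever weights
combined = {}
for weight, scores in zip(weights, [minmax(bm25_scores), minmax(knn_scores)]):
    for doc, score in scores.items():
        combined[doc] = combined.get(doc, 0.0) + weight * score

# Final ranking by weighted sum of normalized scores, best first.
print(sorted(combined.items(), key=lambda item: item[1], reverse=True))
```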

So, let’s now specify the `linear` retriever whose final score is computed as follows:

```
score = weight(standard) * score(standard) + weight(knn) * score(knn)
      = 2 * score(standard) + 1.5 * score(knn)
```

where `score(standard)` and `score(knn)` are the `minmax`-normalized scores of the `standard` and `knn` retrievers, respectively.

```console
GET /retrievers_example/_search
{
  "retriever": {
    "linear": {
      "retrievers": [
        {
          "retriever": {
            "standard": {
              "query": {
                "query_string": {
                  "query": "(information retrieval) OR (artificial intelligence)",
                  "default_field": "text"
                }
              }
            }
          },
          "weight": 2,
          "normalizer": "minmax"
        },
        {
          "retriever": {
            "knn": {
              "field": "vector",
              "query_vector": [0.23, 0.67, 0.89],
              "k": 3,
              "num_candidates": 5
            }
          },
          "weight": 1.5,
          "normalizer": "minmax"
        }
      ],
      "rank_window_size": 10
    }
  }
}
```

This returns the following response based on the normalized weighted score for each result:

::::{dropdown} Example response
```console-result
{
  "took": 42,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": 3.5,
    "hits": [
      {
        "_index": "retrievers_example",
        "_id": "2",
        "_score": 3.5
      },
      {
        "_index": "retrievers_example",
        "_id": "1",
        "_score": 2.3
      },
      {
        "_index": "retrievers_example",
        "_id": "3",
        "_score": 0.1
      }
    ]
  }
}
```

::::


By normalizing scores and leveraging `function_score` queries, we can also implement more complex ranking strategies, such as sorting results based on their timestamps, assigning the timestamp as a score, and then normalizing this score to [0, 1].
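
As a quick illustration of the timestamp idea, here is a hedged Python sketch (the timestamps are invented): treat each document's epoch-millisecond timestamp as its raw score, then minmax-normalize so the newest document gets 1.0 and the oldest 0.0:

```python
from datetime import datetime, timezone

# Invented timestamps standing in for the indexed `timestamp` field.
doc_timestamps = {
    "1": datetime(2021, 1, 1, tzinfo=timezone.utc),
    "2": datetime(2023, 6, 15, tzinfo=timezone.utc),
    "3": datetime(2024, 11, 2, tzinfo=timezone.utc),
}

# Raw score: epoch milliseconds, as a function_score script might compute.
millis = {doc: ts.timestamp() * 1000 for doc, ts in doc_timestamps.items()}
lo, hi = min(millis.values()), max(millis.values())
recency = {doc: (m - lo) / (hi - lo) for doc, m in millis.items()}
print(recency)  # "3" (newest) -> 1.0, "1" (oldest) -> 0.0
```
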
Then, we can easily combine the above with a `knn` retriever as follows:

```console
GET /retrievers_example/_search
{
  "retriever": {
    "linear": {
      "retrievers": [
        {
          "retriever": {
            "standard": {
              "query": {
                "function_score": {
                  "query": {
                    "match_all": {}
                  },
                  "functions": [
                    {
                      "script_score": {
                        "script": {
                          "source": "doc['timestamp'].value.millis"
                        }
                      }
                    }
                  ],
                  "boost_mode": "replace"
                }
              }
            }
          },
          "weight": 2,
          "normalizer": "minmax"
        },
        {
          "retriever": {
            "knn": {
              "field": "vector",
              "query_vector": [0.23, 0.67, 0.89],
              "k": 3,
              "num_candidates": 5
            }
          },
          "weight": 1.5,
          "normalizer": "minmax"
        }
      ],
      "rank_window_size": 10
    }
  }
}
```

This returns the following response based on the normalized weighted score for each result:

::::{dropdown} Example response
```console-result
{
  "took": 42,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 4,
      "relation": "eq"
    },
    "max_score": 3.5,
    "hits": [
      {
        "_index": "retrievers_example",
        "_id": "3",
        "_score": 3.5
      },
      {
        "_index": "retrievers_example",
        "_id": "2",
        "_score": 2.0
      },
      {
        "_index": "retrievers_example",
        "_id": "4",
        "_score": 1.1
      },
      {
        "_index": "retrievers_example",
        "_id": "1",
        "_score": 0.1
      }
    ]
  }
}
```

::::


## Example: RRF with the multi-field query format [retrievers-examples-rrf-multi-field-query-format]
```yaml {applies_to}
stack: ga 9.1
```

There's an even simpler way to execute a hybrid search though: we can use the [multi-field query format](elasticsearch://reference/elasticsearch/rest-apis/retrievers.md#multi-field-query-format), which allows us to query multiple fields without explicitly specifying inner retrievers.

The following example uses the multi-field query format to query the `text` and `text_semantic` fields.
Scores from [`text`](elasticsearch://reference/elasticsearch/mapping-reference/text.md) and [`semantic_text`](elasticsearch://reference/elasticsearch/mapping-reference/semantic-text.md) fields don't always fall in the same range, so we need to normalize the ranks across matches on these fields to generate a result set.
The multi-field query format [does this for us automatically](elasticsearch://reference/elasticsearch/rest-apis/retrievers.md#multi-field-field-grouping).

```console
GET /retrievers_example/_search
{
  "retriever": {
    "rrf": {
      "query": "artificial intelligence",
      "fields": ["text", "text_semantic"]
    }
  }
}
```

This returns the following response based on the final `rrf` score for each result:

::::{dropdown} Example response
```console-result
{
  "took": 42,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": 0.8333334,
    "hits": [
      {
        "_index": "retrievers_example",
        "_id": "1",
        "_score": 0.8333334
      },
      {
        "_index": "retrievers_example",
        "_id": "2",
        "_score": 0.8333334
      },
      {
        "_index": "retrievers_example",
        "_id": "3",
        "_score": 0.25
      }
    ]
  }
}
```

::::

We don't even need to specify the `fields` parameter when using the multi-field query format.
If we omit it, the retriever will automatically query every field that either:

- Supports term queries, such as `keyword` and `text` fields
- Is a `semantic_text` field

In this example, that would translate to the `text`, `text_semantic`, `year`, `topic`, and `timestamp` fields.

```console
GET /retrievers_example/_search
{
  "retriever": {
    "rrf": {
      "query": "artificial intelligence"
    }
  }
}
```

This returns the following response based on the final `rrf` score for each result:

::::{dropdown} Example response
```console-result
{
  "took": 42,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": 0.8333334,
    "hits": [
      {
        "_index": "retrievers_example",
        "_id": "1",
        "_score": 0.8333334
      },
      {
        "_index": "retrievers_example",
        "_id": "2",
        "_score": 0.8333334
      },
      {
        "_index": "retrievers_example",
        "_id": "3",
        "_score": 0.25
      }
    ]
  }
}
```

::::


## Example: Linear retriever with the multi-field query format [retrievers-examples-linear-multi-field-query-format]
```yaml {applies_to}
stack: ga 9.1
```

We can also use the [multi-field query format](elasticsearch://reference/elasticsearch/rest-apis/retrievers.md#multi-field-query-format) with the `linear` retriever.

It works much the same way as [on the `rrf` retriever](#retrievers-examples-rrf-multi-field-query-format), with a couple of key differences:

- We can use `^` notation to specify a [per-field boost](elasticsearch://reference/elasticsearch/rest-apis/retrievers.md#multi-field-field-boosting) (see the sketch after this list)
- We must set the `normalizer` parameter to specify the normalization method used to combine [field group scores](elasticsearch://reference/elasticsearch/rest-apis/retrievers.md#multi-field-field-grouping)
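
Since the `linear` retriever normalizes per-field scores first (here with `minmax`), a `^` boost effectively acts as a weight on that field's normalized contribution to the final sum. A tiny Python sketch with hypothetical normalized scores, not the actual implementation:

```python
# Hypothetical normalized per-field scores in [0, 1] for a single document.
field_scores = {"text": 0.4, "text_semantic": 0.7, "topic": 0.9}
field_boosts = {"text": 1.0, "text_semantic": 1.0, "topic": 2.0}  # "topic^2"

# Each field's normalized score is scaled by its boost, then summed.
final_score = sum(field_boosts[field] * score for field, score in field_scores.items())
print(final_score)  # 0.4 + 0.7 + 2 * 0.9 = 2.9
```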

The following example uses the `linear` retriever to query the `text`, `text_semantic`, and `topic` fields, with a boost of 2 on the `topic` field:

```console
GET /retrievers_example/_search
{
  "retriever": {
    "linear": {
      "query": "artificial intelligence",
      "fields": ["text", "text_semantic", "topic^2"],
      "normalizer": "minmax"
    }
  }
}
```

This returns the following response based on the normalized score for each result:

::::{dropdown} Example response
```console-result
{
  "took": 42,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": 2.0,
    "hits": [
      {
        "_index": "retrievers_example",
        "_id": "2",
        "_score": 2.0
      },
      {
        "_index": "retrievers_example",
        "_id": "1",
        "_score": 1.2
      },
      {
        "_index": "retrievers_example",
        "_id": "3",
        "_score": 0.1
      }
    ]
  }
}
```

::::

## Example: Grouping results by year with `collapse` [retrievers-examples-collapsing-retriever-results]
