Changes from 14 commits
61 commits
979e34c
Add an inference metadata fields instead of storing the inference in …
jimczi Nov 21, 2024
08f58e7
Merge remote-tracking branch 'upstream/main' into inference_metadata_…
jimczi Nov 28, 2024
8f5e234
iter
jimczi Nov 29, 2024
78ff84f
Merge remote-tracking branch 'upstream/main' into inference_metadata_…
jimczi Nov 29, 2024
1d1e819
iter
jimczi Nov 30, 2024
4f29bd8
iter
jimczi Nov 30, 2024
50222ed
spotless
jimczi Nov 30, 2024
5ab3e35
iter
jimczi Nov 30, 2024
9acf74d
iter
jimczi Nov 30, 2024
7bcd2b1
Merge remote-tracking branch 'upstream/main' into inference_metadata_…
jimczi Nov 30, 2024
cdae3cf
iter
jimczi Dec 1, 2024
fe60268
iter
jimczi Dec 1, 2024
184174f
Merge remote-tracking branch 'upstream/main' into inference_metadata_…
jimczi Dec 1, 2024
f204cc3
iter
jimczi Dec 2, 2024
d710efc
iter
jimczi Dec 2, 2024
b0a15cc
Merge remote-tracking branch 'upstream/main' into inference_metadata_…
jimczi Dec 2, 2024
22db126
iter
jimczi Dec 2, 2024
825d626
iter
jimczi Dec 2, 2024
2cb7963
iter
jimczi Dec 2, 2024
bcce2ea
Remove value fetcher as it also retrieves copy_to fields
jimczi Dec 2, 2024
d1b8d61
Merge remote-tracking branch 'upstream/main' into inference_metadata_…
jimczi Dec 2, 2024
9deee7e
Merge branch 'main' into inference_metadata_fields
Mikep86 Dec 5, 2024
1e76873
Merge branch 'main' into inference_metadata_fields
Mikep86 Dec 9, 2024
9b36f55
Merge branch 'main' into inference_metadata_fields
Mikep86 Dec 10, 2024
14eeb27
Merge branch 'main' into inference_metadata_fields
Mikep86 Dec 10, 2024
cc0f394
Merge branch 'main' into inference_metadata_fields
Mikep86 Dec 11, 2024
3117cf6
Support `_shard_doc` as a sort tiebreaker for query rescoring
jimczi Dec 12, 2024
9feae33
Add a new `rescorer` retriever
jimczi Dec 12, 2024
dd0d830
Update docs/changelog/118585.yaml
jimczi Dec 12, 2024
de13467
Add inference metadata fields feature flag and fix texts (#118190)
Mikep86 Dec 12, 2024
5966d1c
remove yaml test now that _shard_doc can be used without a PIT
jimczi Dec 13, 2024
768d716
Add downstream validation of the rank_window_size in the compound ret…
jimczi Dec 13, 2024
9634945
address review comments
jimczi Dec 13, 2024
54458e3
add missing size validation
jimczi Dec 13, 2024
c9d21f5
fix license header
jimczi Dec 13, 2024
61f025e
Merge branch 'main' into inference_metadata_fields
Mikep86 Dec 13, 2024
c0c3b08
Merge remote-tracking branch 'upstream/main' into rescorer_retriever
jimczi Dec 16, 2024
f0c4236
fix doc
jimczi Dec 16, 2024
44a5132
[CI] Auto commit changes from spotless
Dec 16, 2024
71c4823
fix tests
jimczi Dec 16, 2024
fa45c50
Merge branch 'main' into inference_metadata_fields
Mikep86 Dec 16, 2024
cb86fd4
Inference Metadata Fields - Chunk On Delimiter (#118694)
Mikep86 Dec 16, 2024
e80fca1
Merge branch 'main' into inference_metadata_fields
Mikep86 Dec 17, 2024
80c0bae
Merge remote-tracking branch 'upstream/main' into rescorer_retriever
jimczi Dec 17, 2024
b1ddf79
fix wrong link in docs
jimczi Dec 17, 2024
9533c7b
Merge branch 'main' into inference_metadata_fields
Mikep86 Dec 17, 2024
3a3e8cf
Update docs/reference/search/retriever.asciidoc
jimczi Dec 18, 2024
3f421ef
Merge remote-tracking branch 'upstream/inference_metadata_fields' int…
jimczi Dec 18, 2024
6e2a44c
improve error message related to window size
jimczi Dec 18, 2024
582d88a
Merge remote-tracking branch 'origin/rescorer_retriever' into rescore…
jimczi Dec 18, 2024
8d21423
Revert "Merge remote-tracking branch 'upstream/inference_metadata_fie…
jimczi Dec 18, 2024
fd66bd8
Merge remote-tracking branch 'upstream/main' into rescorer_retriever
jimczi Dec 18, 2024
84dcda0
revert changes after wrong merge
jimczi Dec 18, 2024
2ef709a
revert
jimczi Dec 18, 2024
e754439
Merge remote-tracking branch 'upstream/main' into rescorer_retriever
jimczi Dec 18, 2024
701066e
adapt error message
jimczi Dec 18, 2024
ce1d2d5
fix error message (bis)
jimczi Dec 18, 2024
fd40ce2
[CI] Auto commit changes from spotless
Dec 18, 2024
bfad0ea
Merge remote-tracking branch 'upstream/main' into rescorer_retriever
jimczi Dec 18, 2024
9ea3fcc
Merge remote-tracking branch 'origin/rescorer_retriever' into rescore…
jimczi Dec 18, 2024
653e855
fix error message (part 3)
jimczi Dec 18, 2024
7 changes: 7 additions & 0 deletions docs/changelog/118585.yaml
@@ -0,0 +1,7 @@
pr: 118585
summary: Add a generic `rescorer` retriever based on the search request's rescore
  functionality
area: Ranking
type: feature
issues:
 - 118327
119 changes: 118 additions & 1 deletion docs/reference/search/retriever.asciidoc
@@ -22,6 +22,9 @@ A <<standard-retriever, retriever>> that replaces the functionality of a traditi
`knn`::
A <<knn-retriever, retriever>> that replaces the functionality of a <<search-api-knn, knn search>>.

`rescorer`::
A <<rescorer-retriever, retriever>> that replaces the functionality of the <<rescore, query rescorer>>.

`rrf`::
A <<rrf-retriever, retriever>> that produces top documents from <<rrf, reciprocal rank fusion (RRF)>>.

@@ -371,6 +374,120 @@ GET movies/_search
----
// TEST[skip:uses ELSER]

[[rescorer-retriever]]
==== Rescorer Retriever

The `rescorer` retriever re-scores only the results produced by its child retriever.
For the `standard` and `knn` retrievers, the `window_size` parameter specifies the number of documents examined per shard.
Contributor

Did we name this window_size to be in sync with the top-level rescore API? We have made a distinction on that and retrievers (where the top-level documents that a retriever operates upon are specified through rank_window_size), but I can see how that would make it more straightforward for users to move to the new framework.

Either is fine though I guess :)

Contributor Author

Yep, I am re-using the rescorer builder parser to ease the integration. I like window_size better, what do you think of accepting both in the CompoundRetrieverBuilder? I can rename in the documentation to consistently refer to window_size while still accepting rank_window_size.

Contributor

Hmm, I think it's fine to accept either argument, but I'm not so sure about the documentation changes tbh. This was renamed in #106253 to differentiate it, at the time, from rescore's window size (as one is applied on the shards, and pagination/expected results could behave slightly differently). I don't have a strong preference, however.

Contributor Author

I am also fine keeping the distinction. The window_size here refers to rescorer so it would be inconsistent to name it rank_window_size while the rescorer doc keeps the other name?

Contributor

Yeah, ++ to keeping it as window_size just for the rescorer retriever. (Should we also accept rankWindowSize and/or properly handle the param references in CompoundRetrieverBuilder#validate?)


For compound retrievers like `rrf`, the `window_size` parameter defines the total number of documents examined globally.
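
As a rough illustration of the per-shard versus global window described above, the following Python sketch computes an upper bound on the number of documents examined. The function name and the `num_shards` argument are hypothetical, and the split between per-shard and global retriever kinds is taken only from the two sentences above; this is not Elasticsearch code.

```python
# Hypothetical helper: upper bound on documents examined for a given window.
# Assumption (from the text above): `standard` and `knn` apply the window on
# each shard, while compound retrievers such as `rrf` apply it globally.
PER_SHARD_KINDS = {"standard", "knn"}


def max_docs_examined(retriever_kind: str, window_size: int, num_shards: int) -> int:
    """Return the largest number of documents the window can cover."""
    if retriever_kind in PER_SHARD_KINDS:
        # Each shard contributes up to `window_size` documents.
        return window_size * num_shards
    # Compound retrievers cap the window across the whole result set.
    return window_size


print(max_docs_examined("standard", 50, 3))
print(max_docs_examined("rrf", 50, 3))
```

The value is only an upper bound: a shard that holds fewer matching documents than `window_size` contributes fewer.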

When using the `rescorer` retriever, an error is returned unless both of the following conditions are met:

* The smallest `window_size` among the configured rescorers is:
** Greater than or equal to the `size` of the parent retriever, when the `rescorer` retriever is nested.
** Greater than or equal to the `size` of the search request, when the `rescorer` retriever is the root of the retriever tree.

* The largest `window_size` among the configured rescorers is:
** Less than or equal to the `size` or `rank_window_size` of the child retriever.
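
The two window-size conditions above can be sketched as a small Python check. This is a minimal illustration, not the validation code from this PR: the function name, its arguments, and the message wording are all hypothetical.

```python
# Hypothetical sketch of the window-size rules listed above.
def validate_rescorer_windows(window_sizes, required_size, child_limit):
    """Return a list of violation messages (empty when the setup is valid).

    window_sizes  -- `window_size` of each configured rescorer
    required_size -- `size` of the search request, or of the parent
                     retriever when the rescorer retriever is nested
    child_limit   -- `size` or `rank_window_size` of the child retriever
    """
    errors = []
    smallest, largest = min(window_sizes), max(window_sizes)
    if smallest < required_size:
        errors.append(
            f"minimum window_size {smallest} must be >= size {required_size}"
        )
    if largest > child_limit:
        errors.append(
            f"maximum window_size {largest} must be <= child limit {child_limit}"
        )
    return errors


# size 10, window_size 50, child rank_window_size 100: both rules hold.
print(validate_rescorer_windows([50], required_size=10, child_limit=100))
# window_size 5 with size 10: the minimum rule is violated.
print(validate_rescorer_windows([5], required_size=10, child_limit=10))
```

Collecting all violations in a list, rather than failing on the first, is a design choice of this sketch only.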

===== Parameters

`rescore`::
(Required. <<rescore, A rescorer definition or an array of rescorer definitions>>)
+
Defines the <<rescore, rescorers>> applied sequentially to the top documents returned by the child retriever.

`retriever`::
(Required. <<retriever, retriever>>)
+
Specifies the child retriever responsible for generating the initial set of top documents to be re-ranked.

`filter`::
(Optional. <<query-dsl, query object or list of query objects>>)
+
Applies a <<query-dsl-bool-query, boolean query filter>> to the retriever, ensuring that all documents match the filter criteria without affecting their scores.
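
To make the interplay of the `retriever` and `rescore` parameters concrete, here is a simplified Python simulation of a single rescoring pass. It assumes the `query` rescorer's additive combination of weighted scores and ignores documents outside the window; `rescore_window` and its arguments are hypothetical names. The sample data mirrors the `first_stage` and `second_stage` feature values used in this PR's YAML test.

```python
# Simplified, hypothetical model of one rescoring pass over a child
# retriever's results. Documents outside the window are dropped here.
def rescore_window(hits, window_size, rescore_score,
                   query_weight=1.0, rescore_query_weight=1.0):
    """hits: (doc_id, first_stage_score) pairs from the child retriever."""
    # Keep only the top `window_size` documents by first-stage score.
    window = sorted(hits, key=lambda hit: hit[1], reverse=True)[:window_size]
    # Combine the weighted original score with the weighted rescore score.
    rescored = [
        (doc_id, query_weight * score + rescore_query_weight * rescore_score(doc_id))
        for doc_id, score in window
    ]
    return sorted(rescored, key=lambda hit: hit[1], reverse=True)


# Ten documents: first_stage equals the id, second_stage equals 11 - id.
first_stage = [(doc_id, float(doc_id)) for doc_id in range(1, 11)]
second_stage = {doc_id: float(11 - doc_id) for doc_id in range(1, 11)}

# window_size 3, query_weight 0, top 2 results requested.
top = rescore_window(first_stage, 3, second_stage.get, query_weight=0.0)[:2]
print(top)  # [(8, 3.0), (9, 2.0)]
```

With a window of 3, only documents 10, 9 and 8 are rescored, so the best second-stage documents (1 and 2) never reach the rescorer; widening the window to 10 returns them instead.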

[discrete]
[[rescorer-retriever-example]]
==== Example

The `rescorer` retriever can be placed at any level within the retriever tree.
The following example demonstrates a `rescorer` applied to the results produced by an `rrf` retriever:

[source,console]
----
GET movies/_search
{
  "size": 10, <1>
  "retriever": {
    "rescorer": { <2>
      "rescore": {
        "window_size": 50, <3>
        "query": { <4>
          "rescore_query": {
            "script_score": {
              "query": {
                "match_all": {}
              },
              "script": {
                "source": "cosineSimilarity(params.queryVector, 'product-vector_final_stage') + 1.0",
                "params": {
                  "queryVector": [-0.5, 90.0, -10, 14.8, -156.0]
                }
              }
            }
          }
        }
      },
      "retriever": { <5>
        "rrf": {
          "rank_window_size": 100, <6>
          "retrievers": [
            {
              "standard": {
                "query": {
                  "sparse_vector": {
                    "field": "plot_embedding",
                    "inference_id": "my-elser-model",
                    "query": "films that explore psychological depths"
                  }
                }
              }
            },
            {
              "standard": {
                "query": {
                  "multi_match": {
                    "query": "crime",
                    "fields": [
                      "plot",
                      "title"
                    ]
                  }
                }
              }
            },
            {
              "knn": {
                "field": "vector",
                "query_vector": [10, 22, 77],
                "k": 10,
                "num_candidates": 10
              }
            }
          ]
        }
      }
    }
  }
}
----
// TEST[skip:uses ELSER]
<1> Specifies the number of top documents to return in the final response.
<2> A `rescorer` retriever applied as the final step.
<3> Defines the number of documents to rescore from the child retriever.
<4> The definition of the `query` rescorer.
<5> Specifies the child retriever definition.
<6> Defines the number of documents returned by the `rrf` retriever, which limits the number of documents available to the rescorer.

[[text-similarity-reranker-retriever]]
==== Text Similarity Re-ranker Retriever

@@ -777,4 +894,4 @@ When a retriever is specified as part of a search, the following elements are no
* <<search-after, `search_after`>>
* <<request-body-search-terminate-after, `terminate_after`>>
* <<search-sort-param, `sort`>>
-* <<rescore, `rescore`>>
+* <<rescore, `rescore`>> use a <<rescorer-retriever, rescorer retriever>> instead
1 change: 1 addition & 0 deletions rest-api-spec/build.gradle
@@ -70,4 +70,5 @@ tasks.named("yamlRestCompatTestTransform").configure ({ task ->
  task.skipTest("search.vectors/41_knn_search_bbq_hnsw/Test knn search", "Scoring has changed in latest versions")
  task.skipTest("search.vectors/42_knn_search_bbq_flat/Test knn search", "Scoring has changed in latest versions")
  task.skipTest("synonyms/90_synonyms_reloading_for_synset/Reload analyzers for specific synonym set", "Can't work until auto-expand replicas is 0-1 for synonyms index")
  task.skipTest("search/90_search_after/_shard_doc sort", "restriction has been lifted in latest versions")
})
@@ -0,0 +1,225 @@
setup:
  - requires:
      cluster_features: [ "search.retriever.rescorer.enabled" ]
      reason: "Support for rescorer retriever"

  - do:
      indices.create:
        index: test
        body:
          settings:
            number_of_shards: 1
            number_of_replicas: 0
          mappings:
            properties:
              available:
                type: boolean
              features:
                type: rank_features

  - do:
      bulk:
        refresh: true
        index: test
        body:
          - '{"index": {"_id": 1 }}'
          - '{"features": { "first_stage": 1, "second_stage": 10}, "available": true, "group": 1}'
          - '{"index": {"_id": 2 }}'
          - '{"features": { "first_stage": 2, "second_stage": 9}, "available": false, "group": 1}'
          - '{"index": {"_id": 3 }}'
          - '{"features": { "first_stage": 3, "second_stage": 8}, "available": false, "group": 3}'
          - '{"index": {"_id": 4 }}'
          - '{"features": { "first_stage": 4, "second_stage": 7}, "available": true, "group": 1}'
          - '{"index": {"_id": 5 }}'
          - '{"features": { "first_stage": 5, "second_stage": 6}, "available": true, "group": 3}'
          - '{"index": {"_id": 6 }}'
          - '{"features": { "first_stage": 6, "second_stage": 5}, "available": false, "group": 2}'
          - '{"index": {"_id": 7 }}'
          - '{"features": { "first_stage": 7, "second_stage": 4}, "available": true, "group": 3}'
          - '{"index": {"_id": 8 }}'
          - '{"features": { "first_stage": 8, "second_stage": 3}, "available": true, "group": 1}'
          - '{"index": {"_id": 9 }}'
          - '{"features": { "first_stage": 9, "second_stage": 2}, "available": true, "group": 2}'
          - '{"index": {"_id": 10 }}'
          - '{"features": { "first_stage": 10, "second_stage": 1}, "available": false, "group": 1}'

---
"Rescorer retriever basic":
  - do:
      search:
        index: test
        body:
          retriever:
            rescorer:
              rescore:
                window_size: 10
                query:
                  rescore_query:
                    rank_feature:
                      field: "features.second_stage"
                      linear: { }
                  query_weight: 0
              retriever:
                standard:
                  query:
                    rank_feature:
                      field: "features.first_stage"
                      linear: { }
          size: 2

  - match: { hits.total.value: 10 }
  - match: { hits.hits.0._id: "1" }
  - match: { hits.hits.0._score: 10.0 }
  - match: { hits.hits.1._id: "2" }
  - match: { hits.hits.1._score: 9.0 }

  - do:
      search:
        index: test
        body:
          retriever:
            rescorer:
              rescore:
                window_size: 3
                query:
                  rescore_query:
                    rank_feature:
                      field: "features.second_stage"
                      linear: {}
                  query_weight: 0
              retriever:
                standard:
                  query:
                    rank_feature:
                      field: "features.first_stage"
                      linear: {}
          size: 2

  - match: {hits.total.value: 10}
  - match: {hits.hits.0._id: "8"}
  - match: { hits.hits.0._score: 3.0 }
  - match: {hits.hits.1._id: "9"}
  - match: { hits.hits.1._score: 2.0 }

---
"Rescorer retriever with pre-filters":
  - do:
      search:
        index: test
        body:
          retriever:
            rescorer:
              filter:
                match:
                  available: true
              rescore:
                window_size: 10
                query:
                  rescore_query:
                    rank_feature:
                      field: "features.second_stage"
                      linear: { }
                  query_weight: 0
              retriever:
                standard:
                  query:
                    rank_feature:
                      field: "features.first_stage"
                      linear: { }
          size: 2

  - match: { hits.total.value: 6 }
  - match: { hits.hits.0._id: "1" }
  - match: { hits.hits.0._score: 10.0 }
  - match: { hits.hits.1._id: "4" }
  - match: { hits.hits.1._score: 7.0 }

  - do:
      search:
        index: test
        body:
          retriever:
            rescorer:
              rescore:
                window_size: 4
                query:
                  rescore_query:
                    rank_feature:
                      field: "features.second_stage"
                      linear: { }
                  query_weight: 0
              retriever:
                standard:
                  filter:
                    match:
                      available: true
                  query:
                    rank_feature:
                      field: "features.first_stage"
                      linear: { }
          size: 2

  - match: { hits.total.value: 6 }
  - match: { hits.hits.0._id: "5" }
  - match: { hits.hits.0._score: 6.0 }
  - match: { hits.hits.1._id: "7" }
  - match: { hits.hits.1._score: 4.0 }

---
"Rescorer retriever and collapsing":
  - do:
      search:
        index: test
        body:
          retriever:
            rescorer:
              rescore:
                window_size: 10
                query:
                  rescore_query:
                    rank_feature:
                      field: "features.second_stage"
                      linear: { }
                  query_weight: 0
              retriever:
                standard:
                  query:
                    rank_feature:
                      field: "features.first_stage"
                      linear: { }
          collapse:
            field: group
          size: 3

  - match: { hits.total.value: 10 }
  - match: { hits.hits.0._id: "1" }
  - match: { hits.hits.0._score: 10.0 }
  - match: { hits.hits.1._id: "3" }
  - match: { hits.hits.1._score: 8.0 }
  - match: { hits.hits.2._id: "6" }
  - match: { hits.hits.2._score: 5.0 }

---
"Rescorer retriever and invalid window size":
  - do:
      catch: "/\\[rescorer\\] requires \\[rank_window_size: 5\\] be greater than or equal to \\[size: 10\\]/"
      search:
        index: test
        body:
          retriever:
            rescorer:
              rescore:
                window_size: 5
                query:
                  rescore_query:
                    rank_feature:
                      field: "features.second_stage"
                      linear: { }
                  query_weight: 0
              retriever:
                standard:
                  query:
                    rank_feature:
                      field: "features.first_stage"
                      linear: { }
          size: 10
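
The `catch` value in the last test is a regular expression wrapped in slashes, with brackets escaped twice because it sits inside a double-quoted YAML string. As a small sketch, the same pattern can be checked in Python; the `message` string below is a hypothetical error text written to match the pattern, not output captured from Elasticsearch.

```python
import re

# The pattern from the test's `catch` clause, with YAML's doubled
# backslashes reduced to single ones.
pattern = re.compile(
    r"\[rescorer\] requires \[rank_window_size: 5\] "
    r"be greater than or equal to \[size: 10\]"
)

# Hypothetical error text, constructed for illustration only.
message = (
    "[rescorer] requires [rank_window_size: 5] "
    "be greater than or equal to [size: 10]"
)

print(bool(pattern.search(message)))
```

Note that the expected message names `rank_window_size` even though the request sets `window_size` on the rescorer, which relates to the naming question raised in the review thread.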