docs/reference/mapping/types/dense-vector.asciidoc (+3 -1)

@@ -121,11 +121,13 @@ The three following quantization strategies are supported:
 * `bbq` - experimental:[] Better binary quantization which reduces each dimension to a single bit precision. This reduces the memory footprint by 96% (or 32x) at a larger cost of accuracy. Generally, oversampling during query time and reranking can help mitigate the accuracy loss.


-When using a quantized format, you may want to oversample and rescore the results to improve accuracy. See <<dense-vector-knn-search-reranking, oversampling and rescoring>> for more information.
+When using a quantized format, you may want to oversample and rescore the results to improve accuracy. See <<dense-vector-knn-search-rescoring, oversampling and rescoring>> for more information.

 To use a quantized index, you can set your index type to `int8_hnsw`, `int4_hnsw`, or `bbq_hnsw`. When indexing `float` vectors, the current default
 index type is `int8_hnsw`.

+Quantized vectors can use <<dense-vector-knn-search-rescoring,oversampling and rescoring>> to improve accuracy on approximate kNN search results.
+
 NOTE: Quantization will continue to keep the raw float vector values on disk for reranking, reindexing, and quantization improvements over the lifetime of the data.
 This means disk usage will increase by ~25% for `int8`, ~12.5% for `int4`, and ~3.1% for `bbq` due to the overhead of storing the quantized and raw vectors.
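For illustration, here is a minimal sketch of a mapping that opts into one of these quantized index types (the index name, field name, and `dims` are assumptions, not part of this diff):

[source,console]
----
PUT my-quantized-index
{
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "dense_vector",
        "dims": 64,
        "index": true,
        "index_options": {
          "type": "int8_hnsw" <1>
        }
      }
    }
  }
}
----
<1> Swap in `int4_hnsw` or `bbq_hnsw` for higher compression at a higher accuracy cost.

The disk-usage overhead quoted above follows from the storage sizes: `int8` keeps 1 extra byte per dimension alongside each 4-byte float (1/4 = 25%), `int4` half a byte (1/8 = 12.5%), and `bbq` a single bit (1/32 ≈ 3.1%).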
docs/reference/rest-api/common-parms.asciidoc (+24 -0)
@@ -1346,3 +1346,27 @@ tag::rrf-filter[]
 Applies the specified <<query-dsl-bool-query, boolean query filter>> to all of the specified sub-retrievers,
 according to each retriever's specifications.
 end::rrf-filter[]
+
+tag::knn-rescore-vector[]
+
+`rescore_vector`::
++
+--
+(Optional, object) Functionality in preview:[]. Apply oversampling and rescoring to quantized vectors.
+
+NOTE: Rescoring only makes sense for quantized vectors; when <<dense-vector-quantization,quantization>> is not used, the original vectors are used for scoring.
+The rescore option will be ignored for non-quantized `dense_vector` fields.
+
+`oversample`::
+(Required, float)
++
+Applies the specified oversample factor to `k` on the approximate kNN search.
+The approximate kNN search will:
+
+* Retrieve `num_candidates` candidates per shard.
+* From these candidates, the top `k * oversample` candidates per shard will be rescored using the original vectors.
+* The top `k` rescored candidates will be returned.
+
+See <<dense-vector-knn-search-rescoring,oversampling and rescoring quantized vectors>> for details.
+--
+end::knn-rescore-vector[]
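As a concrete reading of these steps (the numbers are purely illustrative): with `k: 10`, `num_candidates: 100`, and `"oversample": 2.0`, each shard retrieves 100 candidates approximately, rescores its top `10 * 2.0 = 20` candidates against the original float vectors, and the top 10 rescored hits are returned.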
@@ … @@
 The parameters `query_vector` and `query_vector_builder` cannot be used together.
@@ -571,15 +573,15 @@ This example demonstrates how to deploy the Elastic Rerank model and use it to

 Follow these steps:

-. Create an inference endpoint for the `rerank` task using the <<put-inference-api, Create {infer} API>>.
+. Create an inference endpoint for the `rerank` task using the <<put-inference-api, Create {infer} API>>.
 +
 [source,console]
 ----
 PUT _inference/rerank/my-elastic-rerank
 {
   "service": "elasticsearch",
   "service_settings": {
-    "model_id": ".rerank-v1",
+    "model_id": ".rerank-v1",
     "num_threads": 1,
     "adaptive_allocations": { <1>
       "enabled": true,
@@ -590,7 +592,7 @@ PUT _inference/rerank/my-elastic-rerank
 }
 ----
 // TEST[skip:uses ML]
-<1> {ml-docs}/ml-nlp-auto-scale.html#nlp-model-adaptive-allocations[Adaptive allocations] will be enabled with the minimum of 1 and the maximum of 10 allocations.
+<1> {ml-docs}/ml-nlp-auto-scale.html#nlp-model-adaptive-allocations[Adaptive allocations] will be enabled with a minimum of 1 and a maximum of 10 allocations.
@@ … @@
 the global top `k` matches across shards. You cannot set the
 `search_type` explicitly when running kNN search.
+
 [discrete]
-[[exact-knn]]
-=== Exact kNN
+[[dense-vector-knn-search-rescoring]]
+==== Oversampling and rescoring for quantized vectors

-To run an exact kNN search, use a `script_score` query with a vector function.
+When using <<dense-vector-quantization,quantized vectors>> for kNN search, you can optionally rescore results to balance performance and accuracy by:

-. Explicitly map one or more `dense_vector` fields. If you don't intend to use
-the field for approximate kNN, set the `index` mapping option to `false`. This
-can significantly improve indexing speed.
-+
-[source,console]
-----
-PUT product-index
-{
-  "mappings": {
-    "properties": {
-      "product-vector": {
-        "type": "dense_vector",
-        "dims": 5,
-        "index": false
-      },
-      "price": {
-        "type": "long"
-      }
-    }
-  }
-}
-----
+* *Oversampling*: Retrieve more candidates per shard.
+* *Rescoring*: Use the original vector values to recalculate the score of the oversampled candidates.
+
+As the non-quantized, original vectors are used to calculate the final score on the top results, rescoring combines:
+
+* The performance and memory gains of approximate retrieval using quantized vectors for retrieving the top candidates.
+* The accuracy of using the original vectors for rescoring the top candidates.
+
+All forms of quantization will result in some accuracy loss, and as the quantization level increases, the accuracy loss will also increase.
+Generally, we have found that:
+
+* `int8` requires minimal, if any, rescoring
+* `int4` requires some rescoring for higher accuracy and larger recall scenarios. Generally, oversampling by 1.5x-2x recovers most of the accuracy loss.
+* `bbq` requires rescoring except on exceptionally large indices or models specifically designed for quantization. We have found that 3x-5x oversampling is generally sufficient, but for fewer dimensions or vectors that do not quantize well, higher oversampling may be required.
+
+You can use the `rescore_vector` preview:[] option to automatically perform rescoring.
+When a rescore `oversample` parameter is specified, the approximate kNN search will:
+
+* Retrieve `num_candidates` candidates per shard.
+* From these candidates, the top `k * oversample` candidates per shard will be rescored using the original vectors.
+* The top `k` rescored candidates will be returned.
+
+Here is an example of using the `rescore_vector` option with the `oversample` parameter:
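The example request itself falls in a gap between diff hunks. As a hedged sketch consistent with the surrounding text (the index name, field name, and vector values are assumptions):

[source,console]
----
POST my-index/_search
{
  "knn": {
    "field": "my_vector",
    "query_vector": [0.3, 0.1, 1.2],
    "k": 10,
    "num_candidates": 100,
    "rescore_vector": {
      "oversample": 2.0
    }
  }
}
----

With these values, each shard rescores its top `10 * 2.0 = 20` candidates using the original vectors and the top 10 are returned.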

-. Use the <<search-search,search API>> to run a `script_score` query containing
-a <<vector-functions,vector function>>.
-+
-TIP: To limit the number of matched documents passed to the vector function, we
-recommend you specify a filter query in the `script_score.query` parameter. If
-needed, you can use a <<query-dsl-match-all-query,`match_all` query>> in this
-parameter to match all documents. However, matching all documents can
@@ … @@
 All forms of quantization will result in some accuracy loss and as the quantization level increases the accuracy loss will also increase.
-Generally, we have found that:
-- `int8` requires minimal if any rescoring
-- `int4` requires some rescoring for higher accuracy and larger recall scenarios. Generally, oversampling by 1.5x-2x recovers most of the accuracy loss.
-- `bbq` requires rescoring except on exceptionally large indices or models specifically designed for quantization. We have found that between 3x-5x oversampling is generally sufficient. But for fewer dimensions or vectors that do not quantize well, higher oversampling may be required.
+The following sections provide additional ways of rescoring:
@@ … @@
+====== Use the `rescore` section for top-level kNN search
+
+You can use this option when you don't want to rescore on each shard, but on the top results from all shards.

-There are two main ways to oversample and rescore. The first is to utilize the <<rescore, rescore section>> in the `_search` request.
+Use the <<rescore, rescore section>> in the `_search` request to rescore the top results from a kNN search.

 Here is an example using the top level `knn` search with oversampling and using `rescore` to rerank the results:
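The example body sits in the gap before the next hunk; the callouts `<5>` and `<6>` below refer to it. As a hedged sketch of the shape such a request takes (the names, vectors, and sizes are assumptions, not the doc's actual example):

[source,console]
----
POST my-index/_search
{
  "size": 10,
  "knn": {
    "field": "my_vector",
    "query_vector": [0.3, 0.1, 1.2],
    "k": 20,
    "num_candidates": 100
  },
  "rescore": {
    "window_size": 10,
    "query": {
      "rescore_query": {
        "script_score": {
          "query": { "match_all": {} },
          "script": {
            "source": "cosineSimilarity(params.queryVector, 'my_vector') + 1.0",
            "params": { "queryVector": [0.3, 0.1, 1.2] }
          }
        }
      },
      "query_weight": 0,
      "rescore_query_weight": 1
    }
  }
}
----

Setting `query_weight` to 0 throws away the original quantized score, and a `rescore_query_weight` of 1 keeps only the rescore query's score, matching callouts `<5>` and `<6>`.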
@@ -1208,8 +1185,16 @@ gathering 20 nearest neighbors according to quantized scoring and rescoring with
 <5> The weight of the original query, here we simply throw away the original score
 <6> The weight of the rescore query, here we only use the rescore query

-The second way is to score per shard with the <<query-dsl-knn-query, knn query>> and <<query-dsl-script-score-query, script_score query>>. Generally, this means that there will be more rescoring per shard, but this
-can increase overall recall at the cost of compute.