Initial draft for custom scoring

aetter · aetter · commit ed9b0001dbb0 · 2020-10-20T16:42:46.000-07:00
diff --git a/docs/knn/index.md b/docs/knn/index.md
@@ -13,10 +13,10 @@ Short for its associated *k-nearest neighbors* algorithm, the KNN plugin lets yo
 
 ## Get started
 
-To use the KNN plugin, you must create an index with the `index.knn` setting and add one or more fields of the `knn_vector` data type. Additionally, you can specify the `index.knn.space_type` with `l2` or `cosinesimil`, respectively, to use either Euclidean distance or cosine similarity for calculations. By default, `index.knn.space_type` is set to `l2`. Here is an example that creates an index with two knn_vector fields and uses cosine similarity:
+To use the KNN query type, you must create an index with the `index.knn` setting and add one or more fields of the `knn_vector` data type. Additionally, you can set `index.knn.space_type` to `l2` or `cosinesimil` to use Euclidean distance or cosine similarity, respectively, for calculations. By default, `index.knn.space_type` is `l2`. Here is an example that creates an index with two `knn_vector` fields and uses cosine similarity:
 
 ```json
-PUT my-index
+PUT my-knn-index-1
 {
   "settings": {
     "index": {
@@ -48,31 +48,31 @@ After you create the index, add some data to it:
 
 ```json
 POST _bulk
-{ "index": { "_index": "my-index", "_id": "1" } }
+{ "index": { "_index": "my-knn-index-1", "_id": "1" } }
 { "my_vector1": [1.5, 2.5], "price": 12.2 }
-{ "index": { "_index": "my-index", "_id": "2" } }
+{ "index": { "_index": "my-knn-index-1", "_id": "2" } }
 { "my_vector1": [2.5, 3.5], "price": 7.1 }
-{ "index": { "_index": "my-index", "_id": "3" } }
+{ "index": { "_index": "my-knn-index-1", "_id": "3" } }
 { "my_vector1": [3.5, 4.5], "price": 12.9 }
-{ "index": { "_index": "my-index", "_id": "4" } }
+{ "index": { "_index": "my-knn-index-1", "_id": "4" } }
 { "my_vector1": [5.5, 6.5], "price": 1.2 }
-{ "index": { "_index": "my-index", "_id": "5" } }
+{ "index": { "_index": "my-knn-index-1", "_id": "5" } }
 { "my_vector1": [4.5, 5.5], "price": 3.7 }
-{ "index": { "_index": "my-index", "_id": "6" } }
+{ "index": { "_index": "my-knn-index-1", "_id": "6" } }
 { "my_vector2": [1.5, 5.5, 4.5, 6.4], "price": 10.3 }
-{ "index": { "_index": "my-index", "_id": "7" } }
+{ "index": { "_index": "my-knn-index-1", "_id": "7" } }
 { "my_vector2": [2.5, 3.5, 5.6, 6.7], "price": 5.5 }
-{ "index": { "_index": "my-index", "_id": "8" } }
+{ "index": { "_index": "my-knn-index-1", "_id": "8" } }
 { "my_vector2": [4.5, 5.5, 6.7, 3.7], "price": 4.4 }
-{ "index": { "_index": "my-index", "_id": "9" } }
+{ "index": { "_index": "my-knn-index-1", "_id": "9" } }
 { "my_vector2": [1.5, 5.5, 4.5, 6.4], "price": 8.9 }
 
 ```
 
 Then you can search the data using the `knn` query type:
 
 ```json
-GET my-index/_search
+GET my-knn-index-1/_search
 {
   "size": 2,
   "query": {
@@ -88,10 +88,13 @@ GET my-index/_search
 
 In this case, `k` is the number of neighbors you want the query to return, but you must also include the `size` option. Otherwise, you get `k` results for each shard (and each segment) rather than `k` results for the entire query. The plugin supports a maximum `k` value of 10,000.
 
-If you mix the `knn` query with other clauses, you might receive fewer than `k` results. In this example, the `post_filter` clause reduces the number of results from 2 to 1:
+
+## Mixing queries
+
+If you mix the `knn` query with filters or other queries, you might receive fewer than `k` results. In this example, `post_filter` reduces the number of results from 2 to 1:
 
 ```json
-GET my-index/_search
+GET my-knn-index-1/_search
 {
   "size": 2,
   "query": {
@@ -112,3 +115,98 @@ GET my-index/_search
   }
 }
 ```
+
+
+## Custom scoring
+
+The [previous example](#mixing-queries) shows a search that returns fewer than `k` results. If you want to avoid this situation, KNN's custom scoring option lets you essentially invert the order of events: filter the documents first, then identify the nearest neighbors among the documents that remain.
+
+First, add another index:
+
+```json
+PUT my-knn-index-2
+{
+  "settings": {
+    "index.knn": true
+  },
+  "mappings": {
+    "properties": {
+      "my_vector": {
+        "type": "knn_vector",
+        "dimension": 2
+      },
+      "color": {
+        "type": "keyword"
+      }
+    }
+  }
+}
+```
+
+If you *only* want to use KNN's custom scoring, you can omit `"index.knn": true`, but you lose the ability to perform standard KNN queries on the index. The benefit of this approach is faster indexing speed and lower memory usage.
+{: .tip}
+
+Then add some documents:
+
+```json
+POST _bulk
+{ "index": { "_index": "my-knn-index-2", "_id": "1" } }
+{ "my_vector": [1, 1], "color" : "RED" }
+{ "index": { "_index": "my-knn-index-2", "_id": "2" } }
+{ "my_vector": [2, 2], "color" : "RED" }
+{ "index": { "_index": "my-knn-index-2", "_id": "3" } }
+{ "my_vector": [3, 3], "color" : "RED" }
+{ "index": { "_index": "my-knn-index-2", "_id": "4" } }
+{ "my_vector": [10, 10], "color" : "BLUE" }
+{ "index": { "_index": "my-knn-index-2", "_id": "5" } }
+{ "my_vector": [20, 20], "color" : "BLUE" }
+{ "index": { "_index": "my-knn-index-2", "_id": "6" } }
+{ "my_vector": [30, 30], "color" : "BLUE" }
+
+```
+
+Finally, use the `script_score` query to pre-filter your documents before identifying nearest neighbors:
+
+```json
+GET my-knn-index-2/_search
+{
+  "size": 2,
+  "query": {
+    "script_score": {
+      "query": {
+        "bool": {
+          "filter": {
+            "term": {
+              "color": "BLUE"
+            }
+          }
+        }
+      },
+      "script": {
+        "lang": "knn",
+        "source": "knn_score",
+        "params": {
+          "field": "my_vector",
+          "vector": [9.9, 9.9],
+          "space_type": "l2"
+        }
+      }
+    }
+  }
+}
+```
+
+All options are required.
+
+- `lang` is the script type. This value is usually `painless`, but here you must specify `knn`.
+- `source` is the name of the stored script, `knn_score`.
+- `field` is the field that contains your vector data.
+- `vector` is the point you want to find the nearest neighbors for.
+- `space_type` is either `l2` or `cosinesimil`.
+
+
+## Performance considerations
+
+The standard KNN query and custom scoring option have performance tradeoffs. You should test both using a representative set of documents to see if the search results and latencies match your expectations.
+
+In general, larger `k` values benefit from the standard KNN query. If you have a smaller `k` value and expect the initial pre-filter to reduce the number of documents to the thousands (not millions), custom scoring can work well.
diff --git a/docs/knn/settings.md b/docs/knn/settings.md
@@ -5,7 +5,7 @@ parent: KNN
 nav_order: 10
 ---
 
-# KNN Settings and Statistics
+# KNN settings and statistics
 
 The KNN plugin adds several new index settings, cluster settings, and statistics.
 
@@ -60,3 +60,7 @@ Statistic |  Description
 `graphMemoryUsage` | Current cache size (total size of all graphs in memory) in kilobytes.
 `missCount` | The number of cache misses. A cache miss occurs when a user queries a graph and it has not yet been loaded into memory.
 `loadExceptionCount` | The number of times an exception occurred when trying to load a graph into the cache.
+`script_compilations` | The number of times the KNN script has been compiled. This value should usually be 1 or 0, but if the cache containing the compiled scripts is filled, the KNN script might be recompiled.
+`script_compilation_errors` | The number of errors during script compilation.
+`script_query_requests` | The number of query requests that use [the KNN script](../#custom-scoring).
+`script_query_errors` | The number of errors during script queries.