This repository was archived by the owner on Aug 16, 2022. It is now read-only.

Commit ed9b000

Initial draft for custom scoring
1 parent 81a9cb3 commit ed9b000

File tree

2 files changed

+117
-15
lines changed

docs/knn/index.md

Lines changed: 112 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -13,10 +13,10 @@ Short for its associated *k-nearest neighbors* algorithm, the KNN plugin lets yo
1313

1414
## Get started
1515

16-
To use the KNN plugin, you must create an index with the `index.knn` setting and add one or more fields of the `knn_vector` data type. Additionally, you can specify the `index.knn.space_type` with `l2` or `cosinesimil`, respectively, to use either Euclidean distance or cosine similarity for calculations. By default, `index.knn.space_type` is set to `l2`. Here is an example that creates an index with two knn_vector fields and uses cosine similarity:
16+
To use the KNN query type, you must create an index with the `index.knn` setting and add one or more fields of the `knn_vector` data type. Additionally, you can specify the `index.knn.space_type` with `l2` or `cosinesimil` to use, respectively, either Euclidean distance or cosine similarity for calculations. By default, `index.knn.space_type` is `l2`. Here is an example that creates an index with two `knn_vector` fields and uses cosine similarity:
1717

1818
```json
19-
PUT my-index
19+
PUT my-knn-index-1
2020
{
2121
"settings": {
2222
"index": {
@@ -48,31 +48,31 @@ After you create the index, add some data to it:
4848

4949
```json
5050
POST _bulk
51-
{ "index": { "_index": "my-index", "_id": "1" } }
51+
{ "index": { "_index": "my-knn-index-1", "_id": "1" } }
5252
{ "my_vector1": [1.5, 2.5], "price": 12.2 }
53-
{ "index": { "_index": "my-index", "_id": "2" } }
53+
{ "index": { "_index": "my-knn-index-1", "_id": "2" } }
5454
{ "my_vector1": [2.5, 3.5], "price": 7.1 }
55-
{ "index": { "_index": "my-index", "_id": "3" } }
55+
{ "index": { "_index": "my-knn-index-1", "_id": "3" } }
5656
{ "my_vector1": [3.5, 4.5], "price": 12.9 }
57-
{ "index": { "_index": "my-index", "_id": "4" } }
57+
{ "index": { "_index": "my-knn-index-1", "_id": "4" } }
5858
{ "my_vector1": [5.5, 6.5], "price": 1.2 }
59-
{ "index": { "_index": "my-index", "_id": "5" } }
59+
{ "index": { "_index": "my-knn-index-1", "_id": "5" } }
6060
{ "my_vector1": [4.5, 5.5], "price": 3.7 }
61-
{ "index": { "_index": "my-index", "_id": "6" } }
61+
{ "index": { "_index": "my-knn-index-1", "_id": "6" } }
6262
{ "my_vector2": [1.5, 5.5, 4.5, 6.4], "price": 10.3 }
63-
{ "index": { "_index": "my-index", "_id": "7" } }
63+
{ "index": { "_index": "my-knn-index-1", "_id": "7" } }
6464
{ "my_vector2": [2.5, 3.5, 5.6, 6.7], "price": 5.5 }
65-
{ "index": { "_index": "my-index", "_id": "8" } }
65+
{ "index": { "_index": "my-knn-index-1", "_id": "8" } }
6666
{ "my_vector2": [4.5, 5.5, 6.7, 3.7], "price": 4.4 }
67-
{ "index": { "_index": "my-index", "_id": "9" } }
67+
{ "index": { "_index": "my-knn-index-1", "_id": "9" } }
6868
{ "my_vector2": [1.5, 5.5, 4.5, 6.4], "price": 8.9 }
6969

7070
```
7171

7272
Then you can search the data using the `knn` query type:
7373

7474
```json
75-
GET my-index/_search
75+
GET my-knn-index-1/_search
7676
{
7777
"size": 2,
7878
"query": {
@@ -88,10 +88,13 @@ GET my-index/_search
8888

8989
In this case, `k` is the number of neighbors you want the query to return, but you must also include the `size` option. Otherwise, you get `k` results for each shard (and each segment) rather than `k` results for the entire query. The plugin supports a maximum `k` value of 10,000.
9090

91-
If you mix the `knn` query with other clauses, you might receive fewer than `k` results. In this example, the `post_filter` clause reduces the number of results from 2 to 1:
91+
92+
## Mixing queries
93+
94+
If you mix the `knn` query with filters or other queries, you might receive fewer than `k` results. In this example, `post_filter` reduces the number of results from 2 to 1:
9295

9396
```json
94-
GET my-index/_search
97+
GET my-knn-index-1/_search
9598
{
9699
"size": 2,
97100
"query": {
@@ -112,3 +115,98 @@ GET my-index/_search
112115
}
113116
}
114117
```
118+
119+
120+
## Custom scoring
121+
122+
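The [previous example](#mixing-queries) shows a search that returns fewer than `k` results because the filter runs after the nearest neighbors are identified. To avoid this situation, use the KNN plugin's custom scoring option, which inverts the order: it filters the documents first and then identifies the nearest neighbors among the remaining ones.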
123+
124+
First, add another index:
125+
126+
```json
127+
PUT my-knn-index-2
128+
{
129+
"settings": {
130+
"index.knn": true
131+
},
132+
"mappings": {
133+
"properties": {
134+
"my_vector": {
135+
"type": "knn_vector",
136+
"dimension": 2
137+
},
138+
"color": {
139+
"type": "keyword"
140+
}
141+
}
142+
}
143+
}
144+
```
145+
146+
If you *only* want to use KNN's custom scoring, you can omit `"index.knn": true`, but you lose the ability to perform standard KNN queries on the index. The benefit of this approach is faster indexing speed and lower memory usage.
147+
{: .tip}
148+
149+
Then add some documents:
150+
151+
```json
152+
POST _bulk
153+
{ "index": { "_index": "my-knn-index-2", "_id": "1" } }
154+
{ "my_vector": [1, 1], "color" : "RED" }
155+
{ "index": { "_index": "my-knn-index-2", "_id": "2" } }
156+
{ "my_vector": [2, 2], "color" : "RED" }
157+
{ "index": { "_index": "my-knn-index-2", "_id": "3" } }
158+
{ "my_vector": [3, 3], "color" : "RED" }
159+
{ "index": { "_index": "my-knn-index-2", "_id": "4" } }
160+
{ "my_vector": [10, 10], "color" : "BLUE" }
161+
{ "index": { "_index": "my-knn-index-2", "_id": "5" } }
162+
{ "my_vector": [20, 20], "color" : "BLUE" }
163+
{ "index": { "_index": "my-knn-index-2", "_id": "6" } }
164+
{ "my_vector": [30, 30], "color" : "BLUE" }
165+
166+
```
167+
168+
Finally, use the `script_score` query to pre-filter your documents before identifying nearest neighbors:
169+
170+
```json
171+
GET my-knn-index-2/_search
172+
{
173+
"size": 2,
174+
"query": {
175+
"script_score": {
176+
"query": {
177+
"bool": {
178+
"filter": {
179+
"term": {
180+
"color": "BLUE"
181+
}
182+
}
183+
}
184+
},
185+
"script": {
186+
"lang": "knn",
187+
"source": "knn_score",
188+
"params": {
189+
"field": "my_vector",
190+
"vector": [9.9, 9.9],
191+
"space_type": "l2"
192+
}
193+
}
194+
}
195+
}
196+
}
197+
```
198+
199+
All options are required.
200+
201+
- `lang` is the script type. This value is usually `painless`, but here you must specify `knn`.
202+
- `source` is the name of the stored script, `knn_score`.
203+
- `field` is the field that contains your vector data.
204+
- `vector` is the point you want to find the nearest neighbors for.
205+
- `space_type` is either `l2` or `cosinesimil`.
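
The options above can also be assembled programmatically. The following is a minimal Python sketch, not part of the plugin: the helper name `build_knn_script_query` and its arguments are hypothetical, and it only builds the request body shown earlier. Sending the body to the cluster is left to whatever client you use.

```python
# Hypothetical helper (not part of the KNN plugin): builds the request body
# for the custom scoring query shown in the example above.
VALID_SPACE_TYPES = {"l2", "cosinesimil"}


def build_knn_script_query(field, vector, space_type, filter_query, size):
    """Return a script_score request body that pre-filters with filter_query."""
    if space_type not in VALID_SPACE_TYPES:
        raise ValueError(f"space_type must be one of {sorted(VALID_SPACE_TYPES)}")
    return {
        "size": size,
        "query": {
            "script_score": {
                # The inner query selects the candidate documents (the pre-filter).
                "query": {"bool": {"filter": filter_query}},
                "script": {
                    "lang": "knn",
                    "source": "knn_score",
                    "params": {
                        "field": field,
                        "vector": list(vector),
                        "space_type": space_type,
                    },
                },
            }
        },
    }


body = build_knn_script_query(
    field="my_vector",
    vector=[9.9, 9.9],
    space_type="l2",
    filter_query={"term": {"color": "BLUE"}},
    size=2,
)
```

Validating `space_type` on the client side is a design choice of this sketch; it simply mirrors the two values listed above.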
206+
207+
208+
## Performance considerations
209+
210+
The standard KNN query and custom scoring option have performance tradeoffs. You should test both using a representative set of documents to see if the search results and latencies match your expectations.
211+
212+
In general, larger `k` values benefit from the standard KNN query. If you have a smaller `k` value and expect the initial pre-filter to reduce the number of documents to the thousands (not millions), custom scoring can work well.
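
To see why the order of filtering changes the result count, here is a conceptual Python sketch over the six example documents from `my-knn-index-2`. It is an assumption-laden illustration, not the plugin's implementation: it uses an exhaustive squared-Euclidean comparison in place of the plugin's search, and the function names `post_filter` and `pre_filter` are hypothetical.

```python
# Conceptual sketch only: exhaustive comparison over the example documents,
# not how the KNN plugin is implemented.
DOCS = {
    "1": ([1, 1], "RED"),
    "2": ([2, 2], "RED"),
    "3": ([3, 3], "RED"),
    "4": ([10, 10], "BLUE"),
    "5": ([20, 20], "BLUE"),
    "6": ([30, 30], "BLUE"),
}


def squared_l2(a, b):
    """Squared Euclidean distance; ranks neighbors the same way as l2 distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(doc_ids, query, k):
    """Return the k document IDs closest to the query vector."""
    return sorted(doc_ids, key=lambda i: squared_l2(DOCS[i][0], query))[:k]


def post_filter(query, k, color):
    """Find k neighbors first, then drop those that fail the filter."""
    return [i for i in nearest(DOCS, query, k) if DOCS[i][1] == color]


def pre_filter(query, k, color):
    """Filter first, then find k neighbors among the remaining documents."""
    candidates = [i for i in DOCS if DOCS[i][1] == color]
    return nearest(candidates, query, k)


query = [9.9, 9.9]
print(post_filter(query, 2, "BLUE"))  # ['4']
print(pre_filter(query, 2, "BLUE"))   # ['4', '5']
```

In this sketch, the pre-filter variant compares the query against every document that passes the filter, so its cost grows with the size of the filtered set. That is consistent with the guidance above to prefer custom scoring when the filter leaves thousands of documents rather than millions.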

docs/knn/settings.md

Lines changed: 5 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -5,7 +5,7 @@ parent: KNN
55
nav_order: 10
66
---
77

8-
# KNN Settings and Statistics
8+
# KNN settings and statistics
99

1010
The KNN plugin adds several new index settings, cluster settings, and statistics.
1111

@@ -60,3 +60,7 @@ Statistic | Description
6060
`graphMemoryUsage` | Current cache size (total size of all graphs in memory) in kilobytes.
6161
`missCount` | The number of cache misses. A cache miss occurs when a user queries a graph and it has not yet been loaded into memory.
6262
`loadExceptionCount` | The number of times an exception occurred when trying to load a graph into the cache.
63+
`script_compilations` | The number of times the KNN script has been compiled. This value should usually be 1 or 0, but if the cache containing the compiled scripts is filled, the KNN script might be recompiled.
64+
`script_compilation_errors` | The number of errors during script compilation.
65+
`script_query_requests` | The number of query requests that use [the KNN script](../#custom-scoring).
66+
`script_query_errors` | The number of errors during script queries.
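
As an illustration of how the four script statistics might be interpreted together, here is a small Python sketch. The helper name `script_stat_warnings` and the flat dictionary it accepts are assumptions made for this example, not the plugin's documented response format; only the statistic names and their meanings come from the table above.

```python
# Hypothetical helper: the flat dict of counters is an assumed shape for
# illustration, not the plugin's documented response format.
def script_stat_warnings(stats):
    """Return human-readable warnings based on the KNN script statistics."""
    warnings = []
    # More than one compilation suggests the compiled-script cache filled up.
    if stats.get("script_compilations", 0) > 1:
        warnings.append("KNN script was recompiled; the script cache may be full")
    if stats.get("script_compilation_errors", 0) > 0:
        warnings.append("errors occurred during script compilation")
    if stats.get("script_query_errors", 0) > 0:
        warnings.append("errors occurred during script queries")
    return warnings


healthy = {"script_compilations": 1, "script_query_requests": 40}
print(script_stat_warnings(healthy))  # []
```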
