This repository was archived by the owner on Aug 16, 2022. It is now read-only.

Commit a5cf609

committed: update knn docs
1 parent 5ae56e1 commit a5cf609


8 files changed: +22 −22 lines changed


docs/knn/api.md

Lines changed: 2 additions & 2 deletions
@@ -2,7 +2,7 @@
layout: default
title: API
nav_order: 4
-parent: KNN
+parent: k-NN
has_children: false
---

@@ -38,7 +38,7 @@ Statistic | Description
`script_query_requests` | The total number of script queries. This is only relevant to k-NN score script search.
`script_query_errors` | The number of errors during script queries. This is only relevant to k-NN score script search.

-### Examples
+### Usage

```
GET /_opendistro/_knn/stats?pretty
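For orientation: the stats endpoint whose heading is renamed above also supports filtering by node and by stat name. A minimal sketch, assuming the documented `/_opendistro/_knn` base path (the node IDs and stat names below are illustrative placeholders, not part of this commit):

```
# All stats from all nodes
GET /_opendistro/_knn/stats?pretty

# Selected stats from selected nodes (IDs and names are illustrative)
GET /_opendistro/_knn/node1,node2/stats/graph_memory_usage,hit_count?pretty
```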

docs/knn/approximate-knn.md

Lines changed: 3 additions & 3 deletions
@@ -2,7 +2,7 @@
layout: default
title: Approximate Search
nav_order: 1
-parent: KNN
+parent: k-NN
has_children: false
has_math: true
---
@@ -11,13 +11,13 @@ has_math: true

The approximate k-NN method uses [nmslib's](https://github.com/nmslib/nmslib/) implementation of the HNSW algorithm to power k-NN search. In this case, approximate means that for a given search, the neighbors returned are an estimate of the true k-nearest neighbors. Of the three methods, this method offers the best search scalability for large data sets. Generally speaking, once the data set gets into the hundreds of thousands of vectors, this approach should be preferred.

-This plugin builds an HNSW graph of the vectors for each "knn-vector field"/"Lucene segment" pair during indexing that can be used to efficiently find the k-nearest neighbors to a query vector during search. These graphs are loaded into native memory during search and managed by a cache. To pre-load the graphs into memory, please refer to the [warmup API](../api#Warmup). In order to see what graphs are loaded in memory as well as other stats, please refer to the [stats API](../api#Stats). To learn more about segments, please refer to [Apache Lucene's documentation](https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/codecs/lucene87/package-summary.html#package.description). Because the graphs are constructed during indexing, it is not possible to apply a filter on an index and then use this search method. All filters will be applied on the results produced by the approximate nearest neighbor search.
+This plugin builds an HNSW graph of the vectors for each "knn-vector field"/"Lucene segment" pair during indexing that can be used to efficiently find the k-nearest neighbors to a query vector during search. These graphs are loaded into native memory during search and managed by a cache. To pre-load the graphs into memory, please refer to the [warmup API](api#Warmup). In order to see what graphs are loaded in memory as well as other stats, please refer to the [stats API](api#Stats). To learn more about segments, please refer to [Apache Lucene's documentation](https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/codecs/lucene87/package-summary.html#package.description). Because the graphs are constructed during indexing, it is not possible to apply a filter on an index and then use this search method. All filters will be applied on the results produced by the approximate nearest neighbor search.

## Get started with approximate k-NN

To use the k-NN plugin's approximate search functionality, you must first create a k-NN index with the index setting, `index.knn` to `true`. This setting tells the plugin to create HNSW graphs for the index.

-Additionally, if you are using the approximate k-nearest neighbor method, you should specify `knn.space_type` to the space that you are interested in. This setting cannot be changed after it is set. Please refer to the [spaces section](#spaces) to see what spaces we support! By default, `index.knn.space_type` is `l2`. For more information on index settings, such as algorithm parameters that can be tweaked to tune performance, please refer to the [documentation](../settings#IndexSettings).
+Additionally, if you are using the approximate k-nearest neighbor method, you should specify `knn.space_type` to the space that you are interested in. This setting cannot be changed after it is set. Please refer to the [spaces section](#spaces) to see what spaces we support! By default, `index.knn.space_type` is `l2`. For more information on index settings, such as algorithm parameters that can be tweaked to tune performance, please refer to the [documentation](settings#IndexSettings).

Next, you must add one or more fields of the `knn_vector` data type. Here is an example that creates an index with two `knn_vector` fields and uses cosine similarity:
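The example body itself lies outside this hunk. As a hedged sketch of what such a mapping looks like under the plugin's documented conventions (the index name, field names, and dimensions below are illustrative, not part of this commit):

```
PUT my-knn-index-1
{
  "settings": {
    "index": {
      "knn": true,
      "knn.space_type": "cosinesimil"
    }
  },
  "mappings": {
    "properties": {
      "my_vector1": {
        "type": "knn_vector",
        "dimension": 2
      },
      "my_vector2": {
        "type": "knn_vector",
        "dimension": 4
      }
    }
  }
}
```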

docs/knn/index.md

Lines changed: 4 additions & 4 deletions
@@ -1,6 +1,6 @@
---
layout: default
-title: KNN
+title: k-NN
nav_order: 50
has_children: true
has_toc: false
@@ -20,23 +20,23 @@ This plugin supports three different methods for obtaining the k-nearest neighbo

Approximate k-NN is the best choice for searches over large indices (i.e. hundreds of thousands of vectors or more) that require low latency. Approximate k-NN should not be used if a filter will be applied on the index before the k-NN search, greatly reducing the number of vectors to be searched. In this case, either the script scoring method or the painless extensions should be used.

-For more details refer to the [Approximate k-NN section](../approximate-knn).
+For more details refer to the [Approximate k-NN section](approximate-knn).

2. **Script Score k-NN**

The second method extends Elasticsearch's script scoring functionality to execute a brute force, exact k-NN search over "knn_vector" fields or fields that can represent binary objects. With this approach, users are able to run k-NN search on a subset of vectors in their index (sometimes referred to as a pre-filter search).

This approach should be used for searches over smaller bodies of documents or when a pre-filter is needed. Using this approach on large indices may lead to high latencies.

-For more details refer to the [k-NN Script Score section](../knn-score-script).
+For more details refer to the [k-NN Script Score section](knn-score-script).

3. **Painless extensions**

The third method adds the distance functions as painless extensions that can be used in more complex combinations. Similar to the k-NN Script Score, this method can be used to perform a brute force, exact k-NN search across an index and supports pre-filtering.

This approach has slightly slower query performance compared to Script Score k-NN. This approach should be preferred over Script Score k-NN if the use case requires more customization over the final score.

-For more details refer to the [painless functions sectior](../painless-functions).
+For more details refer to the [painless functions section](painless-functions).


Overall, for larger data sets, users should generally choose the approximate nearest neighbor method, because it scales significantly better. For smaller data sets, where a user may want to apply a filter, they should choose the custom scoring approach. If users have a more complex use case where they need to use a distance function as part of their scoring method, they should use the painless scripting approach.

docs/knn/jni-library.md

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
layout: default
title: JNI Library
nav_order: 5
-parent: KNN
+parent: k-NN
has_children: false
---


docs/knn/knn-score-script.md

Lines changed: 2 additions & 2 deletions
@@ -2,7 +2,7 @@
layout: default
title: Exact k-NN with Scoring Script
nav_order: 2
-parent: KNN
+parent: k-NN
has_children: false
has_math: true
---
@@ -101,7 +101,7 @@ All parameters are required.
*Note* -- After ODFE 1.11, `vector` was replaced by `query_value` due to the addition of the `bithamming` space.


-The [post filter example in the approximate approach](../approximate-knn#UsingApproximatek-NNWithFilters) shows a search that returns fewer than `k` results. If you want to avoid this situation, the score script method lets you essentially invert the order of events. In other words, you can filter down the set of documents you want to execute the k-nearest neighbor search over.
+The [post filter example in the approximate approach](../approximate-knn/#using-approximate-k-nn-with-filters) shows a search that returns fewer than `k` results. If you want to avoid this situation, the score script method lets you essentially invert the order of events. In other words, you can filter down the set of documents you want to execute the k-nearest neighbor search over.

This example shows a pre-filter approach to k-NN search with the score script approach. First, create the index:
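The index-creation body and the query that follow are outside this hunk. As a hedged sketch of the pre-filtered score script search the paragraph above describes, using the plugin's documented `script_score` pattern (the index, field, filter term, and vector values are illustrative placeholders):

```
GET my-knn-index-2/_search
{
  "size": 2,
  "query": {
    "script_score": {
      "query": {
        "bool": {
          "filter": {
            "term": { "color": "BLUE" }
          }
        }
      },
      "script": {
        "source": "knn_score",
        "lang": "knn",
        "params": {
          "field": "my_vector",
          "query_value": [9.9, 9.9],
          "space_type": "l2"
        }
      }
    }
  }
}
```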

docs/knn/painless-functions.md

Lines changed: 6 additions & 6 deletions
@@ -2,7 +2,7 @@
layout: default
title: k-NN Painless Extensions
nav_order: 3
-parent: KNN
+parent: k-NN
has_children: false
has_math: true
---
@@ -13,7 +13,7 @@ With the k-NN Plugin's Painless Scripting extensions, you can use k-NN distance

## Get started with k-NN's Painless Scripting Functions

-To use k-NN's Painless Scripting functions, first, you still need to create an index with `knn_vector` fields as was done in [k-NN score script](../knn-score-script#Getting_started_with_the_score_script). Once the index is created and you have ingested some data, you can use the painless extensions like so:
+To use k-NN's Painless Scripting functions, first, you still need to create an index with `knn_vector` fields as was done in [k-NN score script](../knn-score-script#Getting-started-with-the-score-script). Once the index is created and you have ingested some data, you can use the painless extensions like so:

```
GET my-knn-index-2/_search
@@ -57,19 +57,19 @@ The following table contains the available painless functions the k-NN plugin pr
</thead>
<tr>
<td>l2Squared</td>
-<td>`float l2Squared (float[] queryVector, doc['vector field'])`</td>
+<td><code>float l2Squared (float[] queryVector, doc['vector field'])</code></td>
<td>This function calculates the square of the L2 distance (Euclidean distance) between a given query vector and document vectors. The shorter the distance, the more relevant the document is, so this example inverts the return value of the l2Squared function. If the document vector matches the query vector, the result is 0, so this example also adds 1 to the distance to avoid divide by zero errors.</td>
</tr>
<tr>
<td>cosineSimilarity</td>
-<td>float cosineSimilarity (float[] queryVector, doc['vector field'])</td>
-<td>Cosine similarity is inner product of the query vector and document vector normalized to both have length 1. If magnitude of the query vector does not change throughout the query, users can pass magnitude of query vector optionally to improve the performance instead of calculating the magnitude every time for every filtered document: `float cosineSimilarity (float[] queryVector, doc['vector field'], float normQueryVector)`. In general, range of cosine similarity is [-1, 1], but in case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since tf-idf cannot be negative. Hence, we add 1.0 to the cosine similarity to score always positive. </td>
+<td><code>float cosineSimilarity (float[] queryVector, doc['vector field'])</code></td>
+<td>Cosine similarity is inner product of the query vector and document vector normalized to both have length 1. If magnitude of the query vector does not change throughout the query, users can pass magnitude of query vector optionally to improve the performance instead of calculating the magnitude every time for every filtered document: <code>float cosineSimilarity (float[] queryVector, doc['vector field'], float normQueryVector)</code>. In general, range of cosine similarity is [-1, 1], but in case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since tf-idf cannot be negative. Hence, we add 1.0 to the cosine similarity to score always positive. </td>
</tr>
</table>


## Constraints
-1. If a document’s knn_vector field has different dimensions than the query, the function throws an IllegalArgumentException.
+1. If a document’s `knn_vector` field has different dimensions than the query, the function throws an `IllegalArgumentException`.
2. If a vector field doesn't have a value, the function throws an IllegalStateException.
You can avoid this situation by first checking if a document has a value for the field:
```
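The fenced example that opens at line 75 is truncated in this hunk. As a hedged sketch of the documented pattern, combining a painless extension call with the null-check from constraint 2 (the index name, field name, and vector values are illustrative placeholders):

```
GET my-knn-index-2/_search
{
  "size": 2,
  "query": {
    "script_score": {
      "query": { "match_all": {} },
      "script": {
        "source": "doc['my_vector'].size() == 0 ? 0 : 1 / (1 + l2Squared(params.query_value, doc['my_vector']))",
        "params": {
          "query_value": [9.9, 9.9]
        }
      }
    }
  }
}
```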

docs/knn/performance-tuning.md

Lines changed: 3 additions & 3 deletions
@@ -1,7 +1,7 @@
---
layout: default
title: Performance Tuning
-parent: KNN
+parent: k-NN
nav_order: 7
---


@@ -35,7 +35,7 @@ Having replicas set to 0, will avoid duplicate construction of graphs in both pr

3. Increase number of indexing threads

-If the hardware we choose has multiple cores, we could allow multiple threads in graph construction and there by speed up the indexing process. You could determine the number of threads to be alloted by using the [knnalgo_paramindex_thread_qty]() setting.
+If the hardware we choose has multiple cores, we could allow multiple threads in graph construction and there by speed up the indexing process. You could determine the number of threads to be alloted by using the [knn.algo_param.index_thread_qty](../settings/#Cluster-settings) setting.

Please keep an eye on CPU utilization and choose right number of threads. Since graph construction is costly, having multiple threads can put additional load on CPU.
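As a hedged sketch of applying the setting this hunk now links to, via the standard cluster settings endpoint (the thread count of 2 is illustrative; choose it based on available cores):

```
PUT /_cluster/settings
{
  "persistent": {
    "knn.algo_param.index_thread_qty": 2
  }
}
```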

@@ -94,4 +94,4 @@ As an example, assume that we have 1 Million vectors with dimension of 256 and M

The standard KNN query and custom scoring option perform differently. Test using a representative set of documents to see if the search results and latencies match your expectations.

-Custom scoring works best if the initial filter reduces the number of documents to no more than 20,000. Increasing shard count can improve latencies, but be sure to keep shard size within [the recommended guidelines](../elasticsearch/#primary-and-replica-shards).
+Custom scoring works best if the initial filter reduces the number of documents to no more than 20,000. Increasing shard count can improve latencies, but be sure to keep shard size within [the recommended guidelines](../../elasticsearch/#primary-and-replica-shards).

docs/knn/settings.md

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
---
layout: default
title: Settings
-parent: KNN
+parent: k-NN
nav_order: 6
---

