Commit cfc37f2

Merge branch 'main' into remote-lookup-join
2 parents af92abf + ba103f1 commit cfc37f2

164 files changed (+6607, -2416 lines changed)

build-tools-internal/src/main/resources/forbidden/es-all-signatures.txt

Lines changed: 4 additions & 0 deletions
@@ -61,3 +61,7 @@ org.apache.logging.log4j.message.ParameterizedMessage#<init>(java.lang.String, j
 
 @defaultMessage Use WriteLoadForecaster#getForecastedWriteLoad instead
 org.elasticsearch.cluster.metadata.IndexMetadata#getForecastedWriteLoad()
+
+@defaultMessage Use org.elasticsearch.index.codec.vectors.OptimizedScalarQuantizer instead
+org.apache.lucene.util.quantization.OptimizedScalarQuantizer#<init>(org.apache.lucene.index.VectorSimilarityFunction, float, int)
+org.apache.lucene.util.quantization.OptimizedScalarQuantizer#<init>(org.apache.lucene.index.VectorSimilarityFunction)

docs/changelog/128854.yaml

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+pr: 128854
+summary: Mark token pruning for sparse vector as GA
+area: Machine Learning
+type: feature
+issues: []
+highlight:
+  title: Mark Token Pruning for Sparse Vector as GA
+  body: |-
+    Token pruning for sparse_vector queries has been live since 8.13 as tech preview.
+    As of 8.19.0 and 9.1.0, this is now generally available.
+  notable: true

docs/changelog/129089.yaml

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+pr: 129089
+summary: Update `sparse_vector` field mapping to include default setting for token pruning
+area: Mapping
+type: enhancement
+issues: []

docs/changelog/129413.yaml

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+pr: 129413
+summary: '`SageMaker` Elastic Payload'
+area: Machine Learning
+type: enhancement
+issues: []

docs/changelog/129904.yaml

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+pr: 129904
+summary: Reverse disordered-version warning message
+area: Infra/Core
+type: bug
+issues: []

docs/reference/elasticsearch/mapping-reference/dense-vector.md

Lines changed: 4 additions & 4 deletions
@@ -55,7 +55,7 @@ In many cases, a brute-force kNN search is not efficient enough. For this reason
 
 Unmapped array fields of float elements with size between 128 and 4096 are dynamically mapped as `dense_vector` with a default similarity of `cosine`. You can override the default similarity by explicitly mapping the field as `dense_vector` with the desired similarity.
 
-Indexing is enabled by default for dense vector fields and indexed as `int8_hnsw`. When indexing is enabled, you can define the vector similarity to use in kNN search:
+Indexing is enabled by default for dense vector fields and indexed as `bbq_hnsw` if dimensions are greater than or equal to 384, otherwise they are indexed as `int8_hnsw`. When indexing is enabled, you can define the vector similarity to use in kNN search:
 
 ```console
 PUT my-index-2
@@ -105,7 +105,7 @@ The `dense_vector` type supports quantization to reduce the memory footprint req
 
 When using a quantized format, you may want to oversample and rescore the results to improve accuracy. See [oversampling and rescoring](docs-content://solutions/search/vector/knn.md#dense-vector-knn-search-rescoring) for more information.
 
-To use a quantized index, you can set your index type to `int8_hnsw`, `int4_hnsw`, or `bbq_hnsw`. When indexing `float` vectors, the current default index type is `int8_hnsw`.
+To use a quantized index, you can set your index type to `int8_hnsw`, `int4_hnsw`, or `bbq_hnsw`. When indexing `float` vectors, the current default index type is `bbq_hnsw` for vectors with greater than or equal to 384 dimensions, otherwise it's `int8_hnsw`.
 
 Quantized vectors can use [oversampling and rescoring](docs-content://solutions/search/vector/knn.md#dense-vector-knn-search-rescoring) to improve accuracy on approximate kNN search results.
 
@@ -255,9 +255,9 @@ $$$dense-vector-index-options$$$
 `type`
 : (Required, string) The type of kNN algorithm to use. Can be either any of:
 * `hnsw` - This utilizes the [HNSW algorithm](https://arxiv.org/abs/1603.09320) for scalable approximate kNN search. This supports all `element_type` values.
-* `int8_hnsw` - The default index type for float vectors. This utilizes the [HNSW algorithm](https://arxiv.org/abs/1603.09320) in addition to automatically scalar quantization for scalable approximate kNN search with `element_type` of `float`. This can reduce the memory footprint by 4x at the cost of some accuracy. See [Automatically quantize vectors for kNN search](#dense-vector-quantization).
+* `int8_hnsw` - The default index type for float vectors with fewer than 384 dimensions. This utilizes the [HNSW algorithm](https://arxiv.org/abs/1603.09320) in addition to automatically scalar quantization for scalable approximate kNN search with `element_type` of `float`. This can reduce the memory footprint by 4x at the cost of some accuracy. See [Automatically quantize vectors for kNN search](#dense-vector-quantization).
 * `int4_hnsw` - This utilizes the [HNSW algorithm](https://arxiv.org/abs/1603.09320) in addition to automatically scalar quantization for scalable approximate kNN search with `element_type` of `float`. This can reduce the memory footprint by 8x at the cost of some accuracy. See [Automatically quantize vectors for kNN search](#dense-vector-quantization).
-* `bbq_hnsw` - This utilizes the [HNSW algorithm](https://arxiv.org/abs/1603.09320) in addition to automatically binary quantization for scalable approximate kNN search with `element_type` of `float`. This can reduce the memory footprint by 32x at the cost of accuracy. See [Automatically quantize vectors for kNN search](#dense-vector-quantization).
+* `bbq_hnsw` - The default index type for float vectors with greater than or equal to 384 dimensions. This utilizes the [HNSW algorithm](https://arxiv.org/abs/1603.09320) in addition to automatically binary quantization for scalable approximate kNN search with `element_type` of `float`. This can reduce the memory footprint by 32x at the cost of accuracy. See [Automatically quantize vectors for kNN search](#dense-vector-quantization).
 * `flat` - This utilizes a brute-force search algorithm for exact kNN search. This supports all `element_type` values.
 * `int8_flat` - This utilizes a brute-force search algorithm in addition to automatically scalar quantization. Only supports `element_type` of `float`.
 * `int4_flat` - This utilizes a brute-force search algorithm in addition to automatically half-byte scalar quantization. Only supports `element_type` of `float`.
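
Illustrative note (not part of the commit): the default index type rule documented in the `dense-vector.md` change above can be sketched in a few lines of Python. The function names `default_dense_vector_index_type` and `approx_vector_bytes` are hypothetical and are not Elasticsearch APIs. The size estimate assumes 4 bytes per `float` dimension and applies the 4x, 8x, and 32x reduction factors quoted in the documentation; it ignores the HNSW graph and any other per-vector overhead, so treat it as a rough model only.

```python
# Reduction factors as quoted in the documentation text above.
# "hnsw" is unquantized, so its factor is 1.
REDUCTION_FACTOR = {
    "hnsw": 1,
    "int8_hnsw": 4,
    "int4_hnsw": 8,
    "bbq_hnsw": 32,
}

# Threshold taken from the documentation change: 384 or more dimensions -> bbq_hnsw.
BBQ_DEFAULT_MIN_DIMS = 384


def default_dense_vector_index_type(dims: int) -> str:
    """Return the documented default index type for a float vector field."""
    if dims <= 0:
        raise ValueError("dims must be positive")
    return "bbq_hnsw" if dims >= BBQ_DEFAULT_MIN_DIMS else "int8_hnsw"


def approx_vector_bytes(dims: int, index_type: str) -> float:
    """Rough per-vector size, assuming 4-byte floats and no extra overhead."""
    return dims * 4 / REDUCTION_FACTOR[index_type]


if __name__ == "__main__":
    for dims in (128, 383, 384, 1024):
        index_type = default_dense_vector_index_type(dims)
        print(dims, index_type, approx_vector_bytes(dims, index_type))
```

Under these assumptions, a 1024-dimension field moves from the old default (`int8_hnsw`) to `bbq_hnsw`, which is why the documentation pairs the new default with the pointer to oversampling and rescoring.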

docs/reference/elasticsearch/mapping-reference/sparse-vector.md

Lines changed: 59 additions & 0 deletions
@@ -24,6 +24,33 @@ PUT my-index
 }
 ```
 
+## Token pruning
+```{applies_to}
+stack: preview 9.1
+```
+
+For newly created indices, token pruning is enabled by default with appropriate default values. You can control this behaviour using the optional `index_options` parameters for the field:
+
+```console
+PUT my-index
+{
+  "mappings": {
+    "properties": {
+      "text.tokens": {
+        "type": "sparse_vector",
+        "index_options": {
+          "prune": true,
+          "pruning_config": {
+            "tokens_freq_ratio_threshold": 5,
+            "tokens_weight_threshold": 0.4
+          }
+        }
+      }
+    }
+  }
+}
+```
+
 See [semantic search with ELSER](docs-content://solutions/search/semantic-search/semantic-search-elser-ingest-pipelines.md) for a complete example on adding documents to a `sparse_vector` mapped field using ELSER.
 
 ## Parameters for `sparse_vector` fields [sparse-vectors-params]
@@ -36,6 +63,38 @@ The following parameters are accepted by `sparse_vector` fields:
 * Exclude the field from [_source](/reference/elasticsearch/rest-apis/retrieve-selected-fields.md#source-filtering).
 * Use [synthetic `_source`](/reference/elasticsearch/mapping-reference/mapping-source-field.md#synthetic-source).
 
+index_options {applies_to}`stack: preview 9.1`
+: (Optional, object) You can set index options for your `sparse_vector` field to determine if you should prune tokens, and the parameter configurations for the token pruning. If pruning options are not set in your [`sparse_vector` query](/reference/query-languages/query-dsl/query-dsl-sparse-vector-query.md), Elasticsearch will use the default options configured for the field, if any.
+
+Parameters for `index_options` are:
+
+`prune` {applies_to}`stack: preview 9.1`
+: (Optional, boolean) Whether to perform pruning, omitting the non-significant tokens from the query to improve query performance. If `prune` is true but the `pruning_config` is not specified, pruning will occur but default values will be used. Default: true.
+
+`pruning_config` {applies_to}`stack: preview 9.1`
+: (Optional, object) Optional pruning configuration. If enabled, this will omit non-significant tokens from the query in order to improve query performance. This is only used if `prune` is set to `true`. If `prune` is set to `true` but `pruning_config` is not specified, default values will be used. If `prune` is set to false but `pruning_config` is specified, an exception will occur.
+
+Parameters for `pruning_config` include:
+
+`tokens_freq_ratio_threshold` {applies_to}`stack: preview 9.1`
+: (Optional, integer) Tokens whose frequency is more than `tokens_freq_ratio_threshold` times the average frequency of all tokens in the specified field are considered outliers and pruned. This value must be between 1 and 100. Default: `5`.
+
+`tokens_weight_threshold` {applies_to}`stack: preview 9.1`
+: (Optional, float) Tokens whose weight is less than `tokens_weight_threshold` are considered insignificant and pruned. This value must be between 0 and 1. Default: `0.4`.
+
+::::{note}
+The default values for `tokens_freq_ratio_threshold` and `tokens_weight_threshold` were chosen based on tests using ELSERv2, where they provided the best results.
+::::
+
+When token pruning is applied, non-significant tokens will be pruned from the query.
+Non-significant tokens can be defined as tokens that meet both of the following criteria:
+* The token appears much more frequently than most tokens, indicating that it is a very common word and may not benefit the overall search results much.
+* The weight/score is so low that the token is likely not very relevant to the original term.
+
+Both the token frequency threshold and weight threshold must show the token is non-significant in order for the token to be pruned.
+This ensures that:
+* The tokens that are kept are frequent enough and have significant scoring.
+* Very infrequent tokens that may not have as high a score are removed.
 
 
 ## Multi-value sparse vectors [index-multi-value-sparse-vectors]
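
Illustrative note (not part of the commit): the pruning rule described in the `sparse-vector.md` change above, where a token is dropped only when both thresholds mark it as non-significant, can be sketched as follows. This is a simplified model written from the documentation wording, not the Elasticsearch implementation. The names `PruningConfig` and `prune_tokens`, and the plain dictionaries used for token weights and field frequencies, are assumptions made for the example. The real implementation may compute frequencies or normalize weights differently. Range bounds are assumed inclusive.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class PruningConfig:
    """Hypothetical stand-in for `pruning_config`, with the documented defaults."""
    tokens_freq_ratio_threshold: int = 5
    tokens_weight_threshold: float = 0.4

    def __post_init__(self):
        # Documented ranges; inclusive bounds are an assumption.
        if not 1 <= self.tokens_freq_ratio_threshold <= 100:
            raise ValueError("tokens_freq_ratio_threshold must be between 1 and 100")
        if not 0 <= self.tokens_weight_threshold <= 1:
            raise ValueError("tokens_weight_threshold must be between 0 and 1")


def prune_tokens(
    query_tokens: Dict[str, float],
    field_token_freqs: Dict[str, int],
    config: PruningConfig = PruningConfig(),
) -> Dict[str, float]:
    """Drop query tokens that are both unusually frequent and low-weight.

    query_tokens maps token -> weight in the query.
    field_token_freqs maps token -> frequency of that token in the field.
    """
    if not field_token_freqs:
        return dict(query_tokens)  # nothing to compare against, keep everything

    avg_freq = sum(field_token_freqs.values()) / len(field_token_freqs)
    freq_cutoff = config.tokens_freq_ratio_threshold * avg_freq

    kept = {}
    for token, weight in query_tokens.items():
        too_frequent = field_token_freqs.get(token, 0) > freq_cutoff
        low_weight = weight < config.tokens_weight_threshold
        # Both criteria must hold for the token to be pruned.
        if not (too_frequent and low_weight):
            kept[token] = weight
    return kept


if __name__ == "__main__":
    freqs = {"the": 1000, "cat": 10, "sat": 10, "mat": 10, "dog": 10, "ran": 10, "far": 10}
    query = {"the": 0.1, "cat": 0.9, "sat": 0.2}
    print(prune_tokens(query, freqs))
```

In this sketch, `sat` survives despite its low weight because it is not an outlier in frequency, and a very frequent token with a high weight would also survive. Only a token that fails both tests is removed.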
