
Commit 043e584

Update sparse_vector field mapping to include default setting for token pruning (elastic#129089)
* Initial checkin of refactored index_options code
* [CI] Auto commit changes from spotless
* initial unit testing
* complete unit tests; add yaml tests
* [CI] Auto commit changes from spotless
* register test feature for sparse vector
* Update docs/changelog/129089.yaml
* update changelog
* add docs
* explicit set default index_options if null
* [CI] Auto commit changes from spotless
* update yaml tests; update docs
* fix yaml tests
* readd auth for teardown
* only serialize index options if not default
* [CI] Auto commit changes from spotless
* serialization refactor; pass index version around
* [CI] Auto commit changes from spotless
* fix transport versions merge
* fix up docs
* [CI] Auto commit changes from spotless
* fix docs; add include_defaults unit and yaml test
* [CI] Auto commit changes from spotless
* override getIndexReaderManager for SemanticQueryBuilderTests
* [CI] Auto commit changes from spotless
* cleanup mapper/builder/tests; index vers. in type still need to refactor / clean YAML tests
* [CI] Auto commit changes from spotless
* cleanups to mapper tests for clarity
* [CI] Auto commit changes from spotless
* move feature into mappers; fix yaml tests
* cleanups; add comments; remove redundant test
* [CI] Auto commit changes from spotless
* escape more periods in the YAML tests
* cleanup mapper and type tests
* [CI] Auto commit changes from spotless
* rename mapping for previous index test
* set explicit number of shards for yaml test

---------

Co-authored-by: elasticsearchmachine <[email protected]>
Co-authored-by: Kathleen DeRusso <[email protected]>
(cherry picked from commit a671505)

# Conflicts:
# docs/reference/elasticsearch/mapping-reference/sparse-vector.md
# server/src/main/java/org/elasticsearch/TransportVersions.java
# server/src/main/java/org/elasticsearch/index/IndexVersions.java
# server/src/main/java/org/elasticsearch/index/mapper/MapperFeatures.java
# server/src/test/java/org/elasticsearch/index/mapper/vectors/SparseVectorFieldMapperTests.java
# x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/queries/SemanticQueryBuilderTests.java
1 parent eec192f commit 043e584

17 files changed

+2505
-36
lines changed

docs/changelog/129089.yaml

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+pr: 129089
+summary: Update `sparse_vector` field mapping to include default setting for token pruning
+area: Mapping
+type: enhancement
+issues: []
Lines changed: 173 additions & 0 deletions
@@ -0,0 +1,173 @@
1+
---
2+
navigation_title: "Sparse vector"
3+
mapped_pages:
4+
- https://www.elastic.co/guide/en/elasticsearch/reference/current/sparse-vector.html
5+
---
6+
7+
# Sparse vector field type [sparse-vector]
8+
9+
10+
A `sparse_vector` field can index features and weights so that they can later be used to query documents with a [`sparse_vector`](/reference/query-languages/query-dsl/query-dsl-sparse-vector-query.md) query. This field can also be used with a legacy [`text_expansion`](/reference/query-languages/query-dsl/query-dsl-text-expansion-query.md) query.
11+
12+
`sparse_vector` is the field type that should be used with [ELSER mappings](docs-content://solutions/search/semantic-search/semantic-search-elser-ingest-pipelines.md#elser-mappings).
13+
14+
```console
15+
PUT my-index
16+
{
17+
"mappings": {
18+
"properties": {
19+
"text.tokens": {
20+
"type": "sparse_vector"
21+
}
22+
}
23+
}
24+
}
25+
```
26+
27+
## Token pruning
28+
```{applies_to}
29+
stack: preview 9.1
30+
```
31+
32+
For newly created indices, token pruning is turned on by default and uses the default pruning configuration. You can control this behaviour using the optional `index_options` parameter for the field:
33+
34+
```console
35+
PUT my-index
36+
{
37+
"mappings": {
38+
"properties": {
39+
"text.tokens": {
40+
"type": "sparse_vector",
41+
"index_options": {
42+
"prune": true,
43+
"pruning_config": {
44+
"tokens_freq_ratio_threshold": 5,
45+
"tokens_weight_threshold": 0.4
46+
}
47+
}
48+
}
49+
}
50+
}
51+
}
52+
```
53+
54+
See [semantic search with ELSER](docs-content://solutions/search/semantic-search/semantic-search-elser-ingest-pipelines.md) for a complete example on adding documents to a `sparse_vector` mapped field using ELSER.
55+
56+
## Parameters for `sparse_vector` fields [sparse-vectors-params]
57+
58+
The following parameters are accepted by `sparse_vector` fields:
59+
60+
[store](/reference/elasticsearch/mapping-reference/mapping-store.md)
61+
: Indicates whether the field value should be stored and retrievable independently of the [_source](/reference/elasticsearch/mapping-reference/mapping-source-field.md) field. Accepted values: true or false (default). The field’s data is stored using term vectors, a disk-efficient structure compared to the original JSON input. The input map can be retrieved during a search request via the [`fields` parameter](/reference/elasticsearch/rest-apis/retrieve-selected-fields.md#search-fields-param). To benefit from reduced disk usage, you must either:
62+
63+
* Exclude the field from [_source](/reference/elasticsearch/rest-apis/retrieve-selected-fields.md#source-filtering).
64+
* Use [synthetic `_source`](/reference/elasticsearch/mapping-reference/mapping-source-field.md#synthetic-source).
65+
66+
index_options {applies_to}`stack: preview 9.1`
67+
: (Optional, object) Index options for the `sparse_vector` field that control whether tokens are pruned and how token pruning is configured. If pruning options are not set in your [`sparse_vector` query](/reference/query-languages/query-dsl/query-dsl-sparse-vector-query.md), Elasticsearch will use the default options configured for the field, if any.
68+
69+
Parameters for `index_options` are:
70+
71+
`prune` {applies_to}`stack: preview 9.1`
72+
: (Optional, boolean) Whether to perform pruning, omitting the non-significant tokens from the query to improve query performance. If `prune` is true but the `pruning_config` is not specified, pruning will occur but default values will be used. Default: true.
73+
74+
`pruning_config` {applies_to}`stack: preview 9.1`
75+
: (Optional, object) Optional pruning configuration. If enabled, this will omit non-significant tokens from the query in order to improve query performance. This is only used if `prune` is set to `true`. If `prune` is set to `true` but `pruning_config` is not specified, default values will be used. If `prune` is set to false but `pruning_config` is specified, an exception will occur.
76+
77+
Parameters for `pruning_config` include:
78+
79+
`tokens_freq_ratio_threshold` {applies_to}`stack: preview 9.1`
80+
: (Optional, integer) Tokens whose frequency is more than `tokens_freq_ratio_threshold` times the average frequency of all tokens in the specified field are considered outliers and pruned. This value must be between 1 and 100. Default: `5`.
81+
82+
`tokens_weight_threshold` {applies_to}`stack: preview 9.1`
83+
: (Optional, float) Tokens whose weight is less than `tokens_weight_threshold` are considered insignificant and pruned. This value must be between 0 and 1. Default: `0.4`.
84+
85+
::::{note}
86+
The default values for `tokens_freq_ratio_threshold` and `tokens_weight_threshold` were chosen based on tests using ELSERv2, where they provided the best results.
87+
::::
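
The constraints described for `prune` and `pruning_config` can be sketched as a small validation routine. This is an illustration only, not the mapper's actual code: the class `PruningConfigCheckSketch`, the method `validate`, the use of `IllegalArgumentException`, and the inclusive treatment of the bounds are all assumptions made for the example.

```java
final class PruningConfigCheckSketch {

    // Hypothetical helper. A null argument stands for "not specified in the mapping".
    static void validate(Boolean prune, Integer freqRatioThreshold, Double weightThreshold) {
        // Per the parameter description, prune defaults to true when omitted.
        boolean pruneEnabled = prune == null || prune;
        boolean hasConfig = freqRatioThreshold != null || weightThreshold != null;

        // A pruning_config combined with prune=false is rejected.
        if (!pruneEnabled && hasConfig) {
            throw new IllegalArgumentException("pruning_config requires prune to be true");
        }
        // Assumption: the documented ranges are treated as inclusive here.
        if (freqRatioThreshold != null && (freqRatioThreshold < 1 || freqRatioThreshold > 100)) {
            throw new IllegalArgumentException("tokens_freq_ratio_threshold must be between 1 and 100");
        }
        if (weightThreshold != null && (weightThreshold < 0.0 || weightThreshold > 1.0)) {
            throw new IllegalArgumentException("tokens_weight_threshold must be between 0 and 1");
        }
    }
}
```

For instance, `validate(true, 5, 0.4)` and `validate(null, null, null)` pass, while `validate(false, 5, 0.4)` throws.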
88+
89+
When token pruning is applied, non-significant tokens will be pruned from the query.
90+
Non-significant tokens can be defined as tokens that meet both of the following criteria:
91+
* The token appears much more frequently than most tokens, indicating that it is a very common word and may not benefit the overall search results much.
92+
* The weight/score is so low that the token is likely not very relevant to the original term.
93+
94+
Both the token frequency threshold and weight threshold must show the token is non-significant in order for the token to be pruned.
95+
This ensures that:
96+
* The tokens that are kept are frequent enough and have significant scoring.
97+
* Very infrequent tokens that may not have as high of a score are removed.
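
The two-part rule above can be sketched in a few lines of Java. This is a minimal illustration of the documented criteria, not the actual query-time implementation: the class name, the method name, and the way token and average frequencies are passed in are assumptions made for the example.

```java
final class TokenPruningSketch {

    // Returns true only when BOTH documented criteria mark the token as non-significant.
    static boolean shouldPrune(
        double tokenFreq,           // how often the token occurs in the field (hypothetical input)
        double averageFreq,         // average frequency of all tokens in the field (hypothetical input)
        double weight,              // weight of the token in the query
        double freqRatioThreshold,  // tokens_freq_ratio_threshold
        double weightThreshold      // tokens_weight_threshold
    ) {
        boolean tooFrequent = tokenFreq > freqRatioThreshold * averageFreq;
        boolean lowWeight = weight < weightThreshold;
        return tooFrequent && lowWeight;
    }
}
```

With the defaults (`5` and `0.4`) and an average frequency of 100, a token with frequency 600 and weight 0.1 is pruned, while a token with the same frequency and weight 0.9, or a token with frequency 50 and weight 0.1, is kept.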
98+
99+
100+
## Multi-value sparse vectors [index-multi-value-sparse-vectors]
101+
102+
When passing in arrays of values for sparse vectors, the max value for similarly named features is selected.
103+
104+
The paper Adapting Learned Sparse Retrieval for Long Documents ([https://arxiv.org/pdf/2305.18494.pdf](https://arxiv.org/pdf/2305.18494.pdf)) discusses this in more detail. In summary, research findings support representation aggregation typically outperforming score aggregation.
105+
106+
For instances where you want to have overlapping feature names, you should store them separately or use nested fields.
107+
108+
Below is an example of passing in a document with overlapping feature names. In this example, two categories exist: positive sentiment and negative sentiment. However, for the purposes of retrieval we also want the overall impact rather than a specific sentiment. Here `impact` is stored as a multi-value sparse vector, and only the max values of overlapping names are stored. More specifically, the final `GET` query returns a `_score` of ~1.2, which is `max(impact.delicious[0], impact.delicious[1])`. The score is approximate because of the relative error of about 0.4% explained below.
109+
110+
```console
111+
PUT my-index-000001
112+
{
113+
"mappings": {
114+
"properties": {
115+
"text": {
116+
"type": "text",
117+
"analyzer": "standard"
118+
},
119+
"impact": {
120+
"type": "sparse_vector"
121+
},
122+
"positive": {
123+
"type": "sparse_vector"
124+
},
125+
"negative": {
126+
"type": "sparse_vector"
127+
}
128+
}
129+
}
130+
}
131+
132+
POST my-index-000001/_doc
133+
{
134+
"text": "I had some terribly delicious carrots.",
135+
"impact": [{"I": 0.55, "had": 0.4, "some": 0.28, "terribly": 0.01, "delicious": 1.2, "carrots": 0.8},
136+
{"I": 0.54, "had": 0.4, "some": 0.28, "terribly": 2.01, "delicious": 0.02, "carrots": 0.4}],
137+
"positive": {"I": 0.55, "had": 0.4, "some": 0.28, "terribly": 0.01, "delicious": 1.2, "carrots": 0.8},
138+
"negative": {"I": 0.54, "had": 0.4, "some": 0.28, "terribly": 2.01, "delicious": 0.02, "carrots": 0.4}
139+
}
140+
141+
GET my-index-000001/_search
142+
{
143+
"query": {
144+
"term": {
145+
"impact": {
146+
"value": "delicious"
147+
}
148+
}
149+
}
150+
}
151+
```
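
The max selection applied to the `impact` array can be sketched as follows. This is an illustration of the documented behaviour under stated assumptions, not the field mapper's code: the class `MultiValueSparseVectorSketch` and the method `mergeByMax` are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class MultiValueSparseVectorSketch {

    // Merges several feature-to-weight maps, keeping the largest weight per feature name.
    static Map<String, Double> mergeByMax(List<Map<String, Double>> vectors) {
        Map<String, Double> merged = new LinkedHashMap<>();
        for (Map<String, Double> vector : vectors) {
            for (Map.Entry<String, Double> feature : vector.entrySet()) {
                // Double::max keeps the larger of the stored and the incoming weight.
                merged.merge(feature.getKey(), feature.getValue(), Double::max);
            }
        }
        return merged;
    }
}
```

Applied to the two `impact` maps in the request above, `delicious` resolves to `max(1.2, 0.02) = 1.2` and `terribly` to `max(0.01, 2.01) = 2.01`, before the precision loss noted below.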
152+
153+
::::{note}
154+
`sparse_vector` fields cannot be included in indices that were **created** on {{es}} versions between 8.0 and 8.10.
155+
::::
156+
157+
158+
::::{note}
159+
`sparse_vector` fields only support strictly positive values. Negative values will be rejected.
160+
::::
161+
162+
163+
::::{note}
164+
`sparse_vector` fields do not support [analyzers](docs-content://manage-data/data-store/text-analysis.md), querying, sorting or aggregating. They may only be used within specialized queries. The recommended query to use on these fields is the [`sparse_vector`](/reference/query-languages/query-dsl/query-dsl-sparse-vector-query.md) query. They may also be used within legacy [`text_expansion`](/reference/query-languages/query-dsl/query-dsl-text-expansion-query.md) queries.
165+
::::
166+
167+
168+
::::{note}
169+
`sparse_vector` fields only preserve 9 significant bits for the precision, which translates to a relative error of about 0.4%.
170+
::::
171+
172+
173+

server/src/main/java/org/elasticsearch/TransportVersions.java

Lines changed: 2 additions & 0 deletions
@@ -247,6 +247,8 @@ static TransportVersion def(int id) {
     public static final TransportVersion STREAMS_LOGS_SUPPORT_8_19 = def(8_841_0_54);
     public static final TransportVersion ML_INFERENCE_CUSTOM_SERVICE_INPUT_TYPE_8_19 = def(8_841_0_55);
     public static final TransportVersion RANDOM_SAMPLER_QUERY_BUILDER_8_19 = def(8_841_0_56);
+    public static final TransportVersion ML_INFERENCE_SAGEMAKER_ELASTIC_8_19 = def(8_841_0_57);
+    public static final TransportVersion SPARSE_VECTOR_FIELD_PRUNING_OPTIONS_8_19 = def(8_841_0_58);
 
     /*
      * STOP! READ THIS FIRST! No, really,

server/src/main/java/org/elasticsearch/index/IndexVersions.java

Lines changed: 1 addition & 0 deletions
@@ -135,6 +135,7 @@ private static IndexVersion def(int id, Version luceneVersion) {
     public static final IndexVersion INDEX_INT_SORT_INT_TYPE_8_19 = def(8_532_0_00, Version.LUCENE_9_12_1);
     public static final IndexVersion MAPPER_TEXT_MATCH_ONLY_MULTI_FIELDS_DEFAULT_NOT_STORED_8_19 = def(8_533_0_00, Version.LUCENE_9_12_1);
     public static final IndexVersion UPGRADE_TO_LUCENE_9_12_2 = def(8_534_0_00, Version.LUCENE_9_12_2);
+    public static final IndexVersion SPARSE_VECTOR_PRUNING_INDEX_OPTIONS_SUPPORT_BACKPORT_8_X = def(8_535_0_00, Version.LUCENE_9_12_2);
 
     /*
      * STOP! READ THIS FIRST! No, really,

server/src/main/java/org/elasticsearch/index/mapper/MapperFeatures.java

Lines changed: 3 additions & 1 deletion
@@ -20,6 +20,7 @@
 import static org.elasticsearch.index.mapper.vectors.DenseVectorFieldMapper.RESCORE_VECTOR_QUANTIZED_VECTOR_MAPPING;
 import static org.elasticsearch.index.mapper.vectors.DenseVectorFieldMapper.RESCORE_ZERO_VECTOR_QUANTIZED_VECTOR_MAPPING;
 import static org.elasticsearch.index.mapper.vectors.DenseVectorFieldMapper.USE_DEFAULT_OVERSAMPLE_VALUE_FOR_BBQ;
+import static org.elasticsearch.index.mapper.vectors.SparseVectorFieldMapper.SPARSE_VECTOR_INDEX_OPTIONS_FEATURE;
 
 /**
  * Spec for mapper-related features.
@@ -97,7 +98,8 @@ public Set<NodeFeature> getTestFeatures() {
             NPE_ON_DIMS_UPDATE_FIX,
             RESCORE_VECTOR_QUANTIZED_VECTOR_MAPPING,
             RESCORE_ZERO_VECTOR_QUANTIZED_VECTOR_MAPPING,
-            USE_DEFAULT_OVERSAMPLE_VALUE_FOR_BBQ
+            USE_DEFAULT_OVERSAMPLE_VALUE_FOR_BBQ,
+            SPARSE_VECTOR_INDEX_OPTIONS_FEATURE
         );
     }
 }
