
Commit 4fad6e1

markjhoy and elasticsearchmachine authored
[8.19] Update sparse_vector field mapping to include default setting for token pruning (#129089) (#129890)
* Update sparse_vector field mapping to include default setting for token pruning (#129089)
* Initial checkin of refactored index_options code
* [CI] Auto commit changes from spotless
* initial unit testing
* complete unit tests; add yaml tests
* [CI] Auto commit changes from spotless
* register test feature for sparse vector
* Update docs/changelog/129089.yaml
* update changelog
* add docs
* explicit set default index_options if null
* [CI] Auto commit changes from spotless
* update yaml tests; update docs
* fix yaml tests
* readd auth for teardown
* only serialize index options if not default
* [CI] Auto commit changes from spotless
* serialization refactor; pass index version around
* [CI] Auto commit changes from spotless
* fix transport versions merge
* fix up docs
* [CI] Auto commit changes from spotless
* fix docs; add include_defaults unit and yaml test
* [CI] Auto commit changes from spotless
* override getIndexReaderManager for SemanticQueryBuilderTests
* [CI] Auto commit changes from spotless
* cleanup mapper/builder/tests; index vers. in type still need to refactor / clean YAML tests
* [CI] Auto commit changes from spotless
* cleanups to mapper tests for clarity
* [CI] Auto commit changes from spotless
* move feature into mappers; fix yaml tests
* cleanups; add comments; remove redundant test
* [CI] Auto commit changes from spotless
* escape more periods in the YAML tests
* cleanup mapper and type tests
* [CI] Auto commit changes from spotless
* rename mapping for previous index test
* set explicit number of shards for yaml test

---------

Co-authored-by: elasticsearchmachine <[email protected]>
Co-authored-by: Kathleen DeRusso <[email protected]>
(cherry picked from commit a671505)

# Conflicts:
# docs/reference/elasticsearch/mapping-reference/sparse-vector.md
# server/src/main/java/org/elasticsearch/TransportVersions.java
# server/src/main/java/org/elasticsearch/index/IndexVersions.java
# server/src/main/java/org/elasticsearch/index/mapper/MapperFeatures.java
# server/src/test/java/org/elasticsearch/index/mapper/vectors/SparseVectorFieldMapperTests.java
# x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/queries/SemanticQueryBuilderTests.java

* Update changelog for version
* [CI] Auto commit changes from spotless
* Update docs to replace 9.1 with 8.19
* Rename 129089.yaml to 129890.yaml
* proper asciidocs; cleanups
* remove doc preview labels; cleanup test index ver.
* clean up docs
* add sparse vector token pruning tag

---------

Co-authored-by: elasticsearchmachine <[email protected]>
1 parent 77def07 commit 4fad6e1

17 files changed: +2389 -37 lines changed

docs/changelog/129089.yaml

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+pr: 129089
+summary: Update `sparse_vector` field mapping to include default setting for token pruning
+area: Mapping
+type: enhancement
+issues: []

docs/reference/mapping/types/sparse-vector.asciidoc

Lines changed: 63 additions & 0 deletions
@@ -24,6 +24,32 @@ PUT my-index
 }
 --------------------------------------------------
 
+[[sparse-vector-token-pruning]]
+==== Token pruning
+
+With any new indices created, token pruning will be turned on by default with appropriate defaults. You can control this behaviour using the optional `index_options` parameters for the field:
+
+[source,console]
+--------------------------------------------------
+PUT my-index
+{
+  "mappings": {
+    "properties": {
+      "text.tokens": {
+        "type": "sparse_vector",
+        "index_options": {
+          "prune": true,
+          "pruning_config": {
+            "tokens_freq_ratio_threshold": 5,
+            "tokens_weight_threshold": 0.4
+          }
+        }
+      }
+    }
+  }
+}
+--------------------------------------------------
+
 See <<semantic-search-elser, semantic search with ELSER>> for a complete example on adding documents to a `sparse_vector` mapped field using ELSER.
 
 [[sparse-vectors-params]]
@@ -43,6 +69,43 @@ To benefit from reduced disk usage, you must either:
 * Exclude the field from <<source-filtering, _source>>.
 * Use <<synthetic-source,synthetic `_source`>>.
 
+`index_options`::
+(Optional, object) You can set index options for your `sparse_vector` field to determine if you should prune tokens, and the parameter configurations for the token pruning. If pruning options are not set in your <<query-dsl-sparse-vector-query, `sparse_vector query`>>, Elasticsearch will use the default options configured for the field, if any.
+
+Parameters for `index_options` are:
+
+[horizontal]
+
+`prune`::
+(Optional, boolean) Whether to perform pruning, omitting the non-significant tokens from the query to improve query performance. If `prune` is true but the `pruning_config` is not specified, pruning will occur but default values will be used. Default: true.
+
+`pruning_config`::
+(Optional, object) Optional pruning configuration. If enabled, this will omit non-significant tokens from the query in order to improve query performance. This is only used if `prune` is set to `true`. If `prune` is set to `true` but `pruning_config` is not specified, default values will be used. If `prune` is set to false but `pruning_config` is specified, an exception will occur.
+
+Parameters for `pruning_config` include:
+
+[horizontal]
+
+`tokens_freq_ratio_threshold`::
+(Optional, integer) Tokens whose frequency is more than `tokens_freq_ratio_threshold` times the average frequency of all tokens in the specified field are considered outliers and pruned. This value must be between 1 and 100. Default: `5`.
+
+`tokens_weight_threshold`::
+(Optional, float) Tokens whose weight is less than `tokens_weight_threshold` are considered insignificant and pruned. This value must be between 0 and 1. Default: `0.4`.
+
+NOTE: The default values for `tokens_freq_ratio_threshold` and `tokens_weight_threshold` were chosen based on tests using ELSERv2 that provided the most optimal results.
+
+When token pruning is applied, non-significant tokens will be pruned from the query.
+Non-significant tokens can be defined as tokens that meet both of the following criteria:
+
+* The token appears much more frequently than most tokens, indicating that it is a very common word and may not benefit the overall search results much.
+* The weight/score is so low that the token is likely not very relevant to the original term.
+
+Both the token frequency threshold and weight threshold must show the token is non-significant in order for the token to be pruned.
+This ensures that:
+
+* The tokens that are kept are frequent enough and have significant scoring.
+* Very infrequent tokens that may not have as high of a score are removed.
+
 [[index-multi-value-sparse-vectors]]
 ==== Multi-value sparse vectors
 
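
The parameter constraints and the two-criteria pruning rule documented in this file can be sketched in Java as follows. This is a minimal illustration, not the Elasticsearch implementation: the class `TokenPruningSketch`, its method names, and the way a token's frequency and the field's average frequency are passed in are assumptions made for the example. Only the defaults, the value ranges, the "prune false plus pruning_config" error, and the requirement that both criteria hold come from the documentation text.

```java
// Hypothetical sketch of the documented rules; not Elasticsearch code.
final class TokenPruningSketch {
    // Defaults stated in the documentation for `pruning_config`.
    static final int DEFAULT_TOKENS_FREQ_RATIO_THRESHOLD = 5;
    static final double DEFAULT_TOKENS_WEIGHT_THRESHOLD = 0.4;

    /**
     * Checks the documented constraints on `index_options`.
     * A null argument means the setting was not specified.
     */
    static void validate(Boolean prune, Integer freqRatioThreshold, Double weightThreshold) {
        boolean hasPruningConfig = freqRatioThreshold != null || weightThreshold != null;
        if (Boolean.FALSE.equals(prune) && hasPruningConfig) {
            throw new IllegalArgumentException("pruning_config must not be set when prune is false");
        }
        if (freqRatioThreshold != null && (freqRatioThreshold < 1 || freqRatioThreshold > 100)) {
            throw new IllegalArgumentException("tokens_freq_ratio_threshold must be between 1 and 100");
        }
        if (weightThreshold != null && (weightThreshold < 0.0 || weightThreshold > 1.0)) {
            throw new IllegalArgumentException("tokens_weight_threshold must be between 0 and 1");
        }
    }

    /**
     * Returns true only when BOTH criteria mark the token as non-significant:
     * it is a frequency outlier and its weight is below the weight threshold.
     */
    static boolean isPruned(double tokenFreq, double averageFreq, double weight,
                            int freqRatioThreshold, double weightThreshold) {
        boolean frequencyOutlier = tokenFreq > freqRatioThreshold * averageFreq;
        boolean lowWeight = weight < weightThreshold;
        return frequencyOutlier && lowWeight;
    }

    public static void main(String[] args) {
        // The values from the mapping example in the docs pass validation.
        validate(true, DEFAULT_TOKENS_FREQ_RATIO_THRESHOLD, DEFAULT_TOKENS_WEIGHT_THRESHOLD);

        // Very frequent and low weight: both criteria hold, so the token is pruned.
        System.out.println(isPruned(600, 100, 0.1, 5, 0.4));
        // Very frequent but high weight: kept.
        System.out.println(isPruned(600, 100, 0.9, 5, 0.4));
        // Low weight but not a frequency outlier: kept.
        System.out.println(isPruned(50, 100, 0.1, 5, 0.4));
    }
}
```

The sketch only rejects a `pruning_config` when `prune` is explicitly false, because that is the one error case the documentation names; how an unset `prune` combined with a `pruning_config` is handled is not stated in this section.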

server/src/main/java/org/elasticsearch/TransportVersions.java

Lines changed: 1 addition & 0 deletions
@@ -248,6 +248,7 @@ static TransportVersion def(int id) {
     public static final TransportVersion ML_INFERENCE_CUSTOM_SERVICE_INPUT_TYPE_8_19 = def(8_841_0_55);
     public static final TransportVersion RANDOM_SAMPLER_QUERY_BUILDER_8_19 = def(8_841_0_56);
     public static final TransportVersion ML_INFERENCE_SAGEMAKER_ELASTIC_8_19 = def(8_841_0_57);
+    public static final TransportVersion SPARSE_VECTOR_FIELD_PRUNING_OPTIONS_8_19 = def(8_841_0_58);
 
     /*
      * STOP! READ THIS FIRST! No, really,

server/src/main/java/org/elasticsearch/index/IndexVersions.java

Lines changed: 1 addition & 0 deletions
@@ -135,6 +135,7 @@ private static IndexVersion def(int id, Version luceneVersion) {
     public static final IndexVersion INDEX_INT_SORT_INT_TYPE_8_19 = def(8_532_0_00, Version.LUCENE_9_12_1);
     public static final IndexVersion MAPPER_TEXT_MATCH_ONLY_MULTI_FIELDS_DEFAULT_NOT_STORED_8_19 = def(8_533_0_00, Version.LUCENE_9_12_1);
     public static final IndexVersion UPGRADE_TO_LUCENE_9_12_2 = def(8_534_0_00, Version.LUCENE_9_12_2);
+    public static final IndexVersion SPARSE_VECTOR_PRUNING_INDEX_OPTIONS_SUPPORT_BACKPORT_8_X = def(8_535_0_00, Version.LUCENE_9_12_2);
 
     /*
      * STOP! READ THIS FIRST! No, really,
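
The documentation added in this commit says token pruning is on by default for newly created indices, and this hunk adds a new index version constant. One plausible way to connect the two is to gate the default on the version the index was created with. The sketch below is an assumption about that design, not the mapper code from this commit: the class `SparseVectorDefaultsSketch` and the method `prunesByDefault` are hypothetical, and only the numeric ids are taken from the diff.

```java
// Hypothetical sketch; the actual mapper logic is not shown in this section.
final class SparseVectorDefaultsSketch {
    // Numeric id copied from the diff: def(8_535_0_00, Version.LUCENE_9_12_2).
    static final int SPARSE_VECTOR_PRUNING_INDEX_OPTIONS_SUPPORT_BACKPORT_8_X = 8_535_0_00;

    /** Assumed rule: indices created on or after the new version prune by default. */
    static boolean prunesByDefault(int indexCreatedVersionId) {
        return indexCreatedVersionId >= SPARSE_VECTOR_PRUNING_INDEX_OPTIONS_SUPPORT_BACKPORT_8_X;
    }

    public static void main(String[] args) {
        int previousVersionId = 8_534_0_00; // id of UPGRADE_TO_LUCENE_9_12_2 in the diff
        // An index created before the new version keeps its old behaviour.
        System.out.println(prunesByDefault(previousVersionId));
        // An index created at the new version gets the pruning default.
        System.out.println(prunesByDefault(SPARSE_VECTOR_PRUNING_INDEX_OPTIONS_SUPPORT_BACKPORT_8_X));
    }
}
```

A single `>=` comparison is a simplification. Because the constant is named as a backport, a real check would probably need to treat version ranges from other release lines separately, which this sketch ignores.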

server/src/main/java/org/elasticsearch/index/mapper/MapperFeatures.java

Lines changed: 3 additions & 1 deletion
@@ -20,6 +20,7 @@
 import static org.elasticsearch.index.mapper.vectors.DenseVectorFieldMapper.RESCORE_VECTOR_QUANTIZED_VECTOR_MAPPING;
 import static org.elasticsearch.index.mapper.vectors.DenseVectorFieldMapper.RESCORE_ZERO_VECTOR_QUANTIZED_VECTOR_MAPPING;
 import static org.elasticsearch.index.mapper.vectors.DenseVectorFieldMapper.USE_DEFAULT_OVERSAMPLE_VALUE_FOR_BBQ;
+import static org.elasticsearch.index.mapper.vectors.SparseVectorFieldMapper.SPARSE_VECTOR_INDEX_OPTIONS_FEATURE;
 
 /**
  * Spec for mapper-related features.
@@ -97,7 +98,8 @@ public Set<NodeFeature> getTestFeatures() {
             NPE_ON_DIMS_UPDATE_FIX,
             RESCORE_VECTOR_QUANTIZED_VECTOR_MAPPING,
             RESCORE_ZERO_VECTOR_QUANTIZED_VECTOR_MAPPING,
-            USE_DEFAULT_OVERSAMPLE_VALUE_FOR_BBQ
+            USE_DEFAULT_OVERSAMPLE_VALUE_FOR_BBQ,
+            SPARSE_VECTOR_INDEX_OPTIONS_FEATURE
         );
     }
 }
