[8.19] Update sparse_vector field mapping to include default setting for token pruning (#129089) (#129890)
* Update sparse_vector field mapping to include default setting for token pruning (#129089)
* Initial checkin of refactored index_options code
* [CI] Auto commit changes from spotless
* initial unit testing
* complete unit tests; add yaml tests
* [CI] Auto commit changes from spotless
* register test feature for sparse vector
* Update docs/changelog/129089.yaml
* update changelog
* add docs
* explicit set default index_options if null
* [CI] Auto commit changes from spotless
* update yaml tests; update docs
* fix yaml tests
* readd auth for teardown
* only serialize index options if not default
* [CI] Auto commit changes from spotless
* serialization refactor; pass index version around
* [CI] Auto commit changes from spotless
* fix transport versions merge
* fix up docs
* [CI] Auto commit changes from spotless
* fix docs; add include_defaults unit and yaml test
* [CI] Auto commit changes from spotless
* override getIndexReaderManager for SemanticQueryBuilderTests
* [CI] Auto commit changes from spotless
* cleanup mapper/builder/tests; index vers. in type
still need to refactor / clean YAML tests
* [CI] Auto commit changes from spotless
* cleanups to mapper tests for clarity
* [CI] Auto commit changes from spotless
* move feature into mappers; fix yaml tests
* cleanups; add comments; remove redundant test
* [CI] Auto commit changes from spotless
* escape more periods in the YAML tests
* cleanup mapper and type tests
* [CI] Auto commit changes from spotless
* rename mapping for previous index test
* set explicit number of shards for yaml test
---------
Co-authored-by: elasticsearchmachine <[email protected]>
Co-authored-by: Kathleen DeRusso <[email protected]>
(cherry picked from commit a671505)
# Conflicts:
# docs/reference/elasticsearch/mapping-reference/sparse-vector.md
# server/src/main/java/org/elasticsearch/TransportVersions.java
# server/src/main/java/org/elasticsearch/index/IndexVersions.java
# server/src/main/java/org/elasticsearch/index/mapper/MapperFeatures.java
# server/src/test/java/org/elasticsearch/index/mapper/vectors/SparseVectorFieldMapperTests.java
# x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/queries/SemanticQueryBuilderTests.java
* Update changelog for version
* [CI] Auto commit changes from spotless
* Update docs to replace 9.1 with 8.19
* Rename 129089.yaml to 129890.yaml
* proper asciidocs; cleanups
* remove doc preview labels; cleanup test index ver.
* clean up docs
* add sparse vector token pruning tag
---------
Co-authored-by: elasticsearchmachine <[email protected]>
For newly created indices, token pruning is enabled by default and uses the default pruning settings. You can control this behavior using the optional `index_options` parameters for the field:
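As a rough illustration of how these `index_options` parameters fit together, the Python sketch below builds a create-index request body that sets them explicitly. The helper name `sparse_vector_mapping` and the field name `tokens` are hypothetical, and treating the documented ranges as inclusive is an assumption. The parameter names, ranges, and the rule that `pruning_config` cannot be combined with `prune: false` come from the parameter descriptions in this diff.

```python
import json


def sparse_vector_mapping(field, prune=True, freq_ratio=None, weight=None):
    """Build a create-index body for a sparse_vector field (hypothetical helper)."""
    has_config = freq_ratio is not None or weight is not None
    if not prune and has_config:
        # Documented rule: pruning_config with prune=false raises an exception.
        raise ValueError("pruning_config requires prune to be true")
    # Assumption: the documented ranges are treated as inclusive here.
    if freq_ratio is not None and not 1 <= freq_ratio <= 100:
        raise ValueError("tokens_freq_ratio_threshold must be between 1 and 100")
    if weight is not None and not 0 <= weight <= 1:
        raise ValueError("tokens_weight_threshold must be between 0 and 1")

    index_options = {"prune": prune}
    if has_config:
        config = {}
        if freq_ratio is not None:
            config["tokens_freq_ratio_threshold"] = freq_ratio
        if weight is not None:
            config["tokens_weight_threshold"] = weight
        index_options["pruning_config"] = config

    return {
        "mappings": {
            "properties": {
                field: {"type": "sparse_vector", "index_options": index_options}
            }
        }
    }


if __name__ == "__main__":
    # Print a body that spells out the documented default thresholds.
    print(json.dumps(sparse_vector_mapping("tokens", freq_ratio=5, weight=0.4), indent=2))
```

The printed JSON could be sent as the body of an index-creation request. Omitting `freq_ratio` and `weight` leaves `pruning_config` out so that the field's defaults apply.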
See <<semantic-search-elser, semantic search with ELSER>> for a complete example on adding documents to a `sparse_vector` mapped field using ELSER.
[[sparse-vectors-params]]
@@ -43,6 +69,43 @@ To benefit from reduced disk usage, you must either:
* Exclude the field from <<source-filtering, _source>>.
* Use <<synthetic-source,synthetic `_source`>>.

+`index_options`::
+(Optional, object) Index options for your `sparse_vector` field. They control whether tokens are pruned and how the token pruning is configured. If pruning options are not set in your <<query-dsl-sparse-vector-query, `sparse_vector` query>>, Elasticsearch uses the default options configured for the field, if any.
+
+Parameters for `index_options` are:
+
+[horizontal]
+
+`prune`::
+(Optional, boolean) Whether to prune non-significant tokens from the query to improve query performance. If `prune` is `true` but `pruning_config` is not specified, pruning uses the default values. Default: `true`.
+
+`pruning_config`::
+(Optional, object) Pruning configuration, which controls which non-significant tokens are omitted from the query to improve query performance. It is only used if `prune` is `true`. If `prune` is `true` but `pruning_config` is not specified, default values are used. If `prune` is `false` but `pruning_config` is specified, an exception occurs.
+
+Parameters for `pruning_config` include:
+
+[horizontal]
+
+`tokens_freq_ratio_threshold`::
+(Optional, integer) Tokens whose frequency is more than `tokens_freq_ratio_threshold` times the average frequency of all tokens in the specified field are considered outliers and pruned. This value must be between 1 and 100. Default: `5`.
+
+`tokens_weight_threshold`::
+(Optional, float) Tokens whose weight is less than `tokens_weight_threshold` are considered insignificant and pruned. This value must be between 0 and 1. Default: `0.4`.
+
+NOTE: The default values for `tokens_freq_ratio_threshold` and `tokens_weight_threshold` were chosen based on tests using ELSERv2, where they provided the best results.
+
+When token pruning is applied, non-significant tokens are pruned from the query.
+Non-significant tokens are tokens that meet both of the following criteria:
+
+* The token appears much more frequently than most tokens, indicating that it is a very common word that may not contribute much to the overall search results.
+* The token's weight is so low that the token is likely not very relevant to the original term.
+
+Both the token frequency threshold and the weight threshold must indicate that the token is non-significant for it to be pruned.
+This ensures that:
+
+* Tokens that are kept either are not unusually frequent or have a significant weight.
+* Only very frequent tokens that also have a low weight are removed.
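The two-criteria pruning rule described in this diff can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Elasticsearch's implementation: the function names `is_pruned` and `prune_tokens` are hypothetical, and token frequencies and the field-wide average frequency are passed in as plain numbers rather than read from index statistics. The default thresholds (`5` and `0.4`) are the documented defaults.

```python
def is_pruned(freq, avg_freq, weight, freq_ratio=5, weight_threshold=0.4):
    """Return True when a token meets BOTH documented pruning criteria."""
    # Criterion 1: frequency is more than freq_ratio times the average frequency.
    too_frequent = freq > freq_ratio * avg_freq
    # Criterion 2: weight is below the weight threshold.
    too_light = weight < weight_threshold
    return too_frequent and too_light


def prune_tokens(weights, freqs, avg_freq, **thresholds):
    """Drop non-significant tokens from a {token: weight} query vector."""
    return {
        token: weight
        for token, weight in weights.items()
        if not is_pruned(freqs[token], avg_freq, weight, **thresholds)
    }


if __name__ == "__main__":
    # Made-up query weights and frequencies, for illustration only.
    weights = {"the": 0.1, "pruning": 1.2, "rare": 0.2, "search": 0.9}
    freqs = {"the": 900, "pruning": 40, "rare": 3, "search": 800}
    print(prune_tokens(weights, freqs, avg_freq=50))
```

In this made-up example only `the` is dropped, because it is both far more frequent than average and lightly weighted. `rare` survives despite its low weight, and `search` survives despite its high frequency, because each fails one of the two criteria.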