5 changes: 5 additions & 0 deletions docs/changelog/129089.yaml
@@ -0,0 +1,5 @@
pr: 129089
summary: Update `sparse_vector` field mapping to include default setting for token pruning
area: Mapping
type: enhancement
issues: []
69 changes: 69 additions & 0 deletions docs/reference/mapping/types/sparse-vector.asciidoc
@@ -24,6 +24,33 @@ PUT my-index
}
--------------------------------------------------

==== Token pruning

preview::["Index options for token pruning on `sparse_vector` fields are in technical preview and may be changed or removed in a future release. Elastic will work to fix any issues, but features in technical preview are not subject to the support SLA of official GA features."]

Token pruning is enabled by default for newly created indices and uses the default thresholds described below. You can control this behavior with the optional `index_options` parameter for the field:

[source,console]
--------------------------------------------------
PUT my-index
{
  "mappings": {
    "properties": {
      "text.tokens": {
        "type": "sparse_vector",
        "index_options": {
          "prune": true,
          "pruning_config": {
            "tokens_freq_ratio_threshold": 5,
            "tokens_weight_threshold": 0.4
          }
        }
      }
    }
  }
}
--------------------------------------------------

See <<semantic-search-elser, semantic search with ELSER>> for a complete example on adding documents to a `sparse_vector` mapped field using ELSER.

[[sparse-vectors-params]]
@@ -43,6 +70,48 @@ To benefit from reduced disk usage, you must either:
* Exclude the field from <<source-filtering, _source>>.
* Use <<synthetic-source,synthetic `_source`>>.

<<mapping-index-options,index_options>>::
preview::[]
(Optional, object) Index options for the `sparse_vector` field. These options control whether tokens are pruned and which pruning configuration is used. If pruning options are not set in your <<query-dsl-sparse-vector-query, `sparse_vector` query>>, Elasticsearch uses the default options configured for the field, if any.

Parameters for `index_options` are:

[horizontal]

<<mapping-index-options-prune,prune>>::
preview::[]
(Optional, boolean) Whether to perform pruning, which omits non-significant tokens from the query to improve query performance. If `prune` is `true` but `pruning_config` is not specified, pruning occurs with the default values. Default: `true`.

<<mapping-index-options-pruning_config,pruning_config>>::
preview::[]
(Optional, object) Pruning configuration. When pruning is enabled, non-significant tokens are omitted from the query to improve query performance. This configuration is only used if `prune` is set to `true`. If `prune` is set to `true` but `pruning_config` is not specified, default values are used. If `prune` is set to `false` but `pruning_config` is specified, an exception occurs.

Parameters for `pruning_config` include:

[horizontal]

<<mapping-index-options-pruning_config-tokens_freq_ratio_threshold, tokens_freq_ratio_threshold>>::
preview::[]
(Optional, integer) Tokens whose frequency is more than `tokens_freq_ratio_threshold` times the average frequency of all tokens in the specified field are considered outliers and pruned. This value must be between 1 and 100. Default: `5`.

<<mapping-index-options-pruning_config-tokens_weight_threshold,tokens_weight_threshold>>::
preview::[]
(Optional, float) Tokens whose weight is less than `tokens_weight_threshold` are considered insignificant and pruned. This value must be between 0 and 1. Default: `0.4`.

NOTE: The default values for `tokens_freq_ratio_threshold` and `tokens_weight_threshold` were chosen because they produced the best results in tests using ELSERv2.
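
The constraints described for `prune` and `pruning_config` can be summarized as a small validation routine. The following Java sketch is illustrative only and is not the Elasticsearch implementation: the class `PruningConfigValidationSketch`, the method `validate`, the exception type, and the error messages are hypothetical, and treating the documented ranges as inclusive is an assumption.

```java
public class PruningConfigValidationSketch {

    // Hypothetical validator. A null threshold means "not specified, use the default".
    static void validate(boolean prune, Integer freqRatioThreshold, Float weightThreshold) {
        boolean hasConfig = freqRatioThreshold != null || weightThreshold != null;

        // A pruning_config without pruning enabled is rejected.
        // The exception type is an assumption; the docs only say "an exception".
        if (!prune && hasConfig) {
            throw new IllegalArgumentException("pruning_config requires prune=true");
        }
        // tokens_freq_ratio_threshold: 1 to 100 (bounds assumed inclusive).
        if (freqRatioThreshold != null && (freqRatioThreshold < 1 || freqRatioThreshold > 100)) {
            throw new IllegalArgumentException("tokens_freq_ratio_threshold must be between 1 and 100");
        }
        // tokens_weight_threshold: 0 to 1 (bounds assumed inclusive).
        if (weightThreshold != null && (weightThreshold < 0f || weightThreshold > 1f)) {
            throw new IllegalArgumentException("tokens_weight_threshold must be between 0 and 1");
        }
    }

    public static void main(String[] args) {
        validate(true, 5, 0.4f);     // documented defaults: accepted
        validate(true, null, null);  // prune without pruning_config: accepted
        try {
            validate(false, 5, 0.4f); // pruning_config with prune=false: rejected
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Passing `null` for a threshold stands in for an omitted parameter, which mirrors the documented behavior that defaults apply when `pruning_config` is not specified.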

When token pruning is applied, non-significant tokens will be pruned from the query.
Non-significant tokens can be defined as tokens that meet both of the following criteria:

* The token appears much more frequently than most tokens, indicating that it is a very common word and may not benefit the overall search results much.
* The weight/score is so low that the token is likely not very relevant to the original term.

Both the token frequency threshold and weight threshold must show the token is non-significant in order for the token to be pruned.
This ensures that:

* Tokens that are kept are either not excessively frequent or have a significant weight.
* Only tokens that are both very frequent and low in weight are removed.
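
The two-criteria rule can be sketched in a few lines of Java. This is a minimal illustration based only on the parameter descriptions on this page, not the Elasticsearch implementation: the class `TokenPruningSketch`, the method `shouldPrune`, and its inputs (a token's frequency, the average token frequency for the field, and the token's weight) are hypothetical names chosen for the example.

```java
public class TokenPruningSketch {

    // Hypothetical helper: a token is pruned only when BOTH criteria
    // mark it as non-significant.
    static boolean shouldPrune(double tokenFreq, double averageFreq, double weight,
                               double freqRatioThreshold, double weightThreshold) {
        // Criterion 1: the token is a frequency outlier for the field.
        boolean tooFrequent = tokenFreq > freqRatioThreshold * averageFreq;
        // Criterion 2: the token's weight is below the weight threshold.
        boolean lowWeight = weight < weightThreshold;
        return tooFrequent && lowWeight;
    }

    public static void main(String[] args) {
        // Thresholds use the documented defaults: 5 and 0.4.
        System.out.println(shouldPrune(60, 10, 0.1, 5, 0.4)); // frequent and low weight
        System.out.println(shouldPrune(60, 10, 0.9, 5, 0.4)); // frequent but high weight
        System.out.println(shouldPrune(20, 10, 0.1, 5, 0.4)); // low weight but not an outlier
    }
}
```

With an average frequency of 10 and the default ratio of 5, a token counts as an outlier only above a frequency of 50, so only the first call reports that the token would be pruned.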

[[index-multi-value-sparse-vectors]]
==== Multi-value sparse vectors

@@ -248,6 +248,7 @@ static TransportVersion def(int id) {
public static final TransportVersion ML_INFERENCE_CUSTOM_SERVICE_INPUT_TYPE_8_19 = def(8_841_0_55);
public static final TransportVersion RANDOM_SAMPLER_QUERY_BUILDER_8_19 = def(8_841_0_56);
public static final TransportVersion ML_INFERENCE_SAGEMAKER_ELASTIC_8_19 = def(8_841_0_57);
public static final TransportVersion SPARSE_VECTOR_FIELD_PRUNING_OPTIONS_8_19 = def(8_841_0_58);

/*
* STOP! READ THIS FIRST! No, really,
@@ -135,6 +135,7 @@ private static IndexVersion def(int id, Version luceneVersion) {
public static final IndexVersion INDEX_INT_SORT_INT_TYPE_8_19 = def(8_532_0_00, Version.LUCENE_9_12_1);
public static final IndexVersion MAPPER_TEXT_MATCH_ONLY_MULTI_FIELDS_DEFAULT_NOT_STORED_8_19 = def(8_533_0_00, Version.LUCENE_9_12_1);
public static final IndexVersion UPGRADE_TO_LUCENE_9_12_2 = def(8_534_0_00, Version.LUCENE_9_12_2);
public static final IndexVersion SPARSE_VECTOR_PRUNING_INDEX_OPTIONS_SUPPORT_BACKPORT_8_X = def(8_535_0_00, Version.LUCENE_9_12_2);

/*
* STOP! READ THIS FIRST! No, really,
@@ -20,6 +20,7 @@
import static org.elasticsearch.index.mapper.vectors.DenseVectorFieldMapper.RESCORE_VECTOR_QUANTIZED_VECTOR_MAPPING;
import static org.elasticsearch.index.mapper.vectors.DenseVectorFieldMapper.RESCORE_ZERO_VECTOR_QUANTIZED_VECTOR_MAPPING;
import static org.elasticsearch.index.mapper.vectors.DenseVectorFieldMapper.USE_DEFAULT_OVERSAMPLE_VALUE_FOR_BBQ;
import static org.elasticsearch.index.mapper.vectors.SparseVectorFieldMapper.SPARSE_VECTOR_INDEX_OPTIONS_FEATURE;

/**
* Spec for mapper-related features.
@@ -97,7 +98,8 @@ public Set<NodeFeature> getTestFeatures() {
NPE_ON_DIMS_UPDATE_FIX,
RESCORE_VECTOR_QUANTIZED_VECTOR_MAPPING,
RESCORE_ZERO_VECTOR_QUANTIZED_VECTOR_MAPPING,
- USE_DEFAULT_OVERSAMPLE_VALUE_FOR_BBQ
+ USE_DEFAULT_OVERSAMPLE_VALUE_FOR_BBQ,
+ SPARSE_VECTOR_INDEX_OPTIONS_FEATURE
);
}
}