Merged
64 commits
4be9290
Initial checkin of refactored index_options code
markjhoy Jun 6, 2025
f0f0279
[CI] Auto commit changes from spotless
Jun 6, 2025
3281cc2
initial unit testing
markjhoy Jun 9, 2025
39406c3
Merge branch 'main' into markjhoy/add_sparse_vector_token_pruning_ind…
markjhoy Jun 9, 2025
854a78e
complete unit tests; add yaml tests
markjhoy Jun 9, 2025
fb8623a
Merge branch 'main' into markjhoy/add_sparse_vector_token_pruning_ind…
markjhoy Jun 9, 2025
110d04e
[CI] Auto commit changes from spotless
Jun 9, 2025
0ce5679
Merge branch 'main' into markjhoy/add_sparse_vector_token_pruning_ind…
markjhoy Jun 10, 2025
46bd54d
register test feature for sparse vector
markjhoy Jun 10, 2025
e5d5a98
Update docs/changelog/129089.yaml
markjhoy Jun 10, 2025
a8cfb81
Merge branch 'main' into markjhoy/add_sparse_vector_token_pruning_ind…
markjhoy Jun 10, 2025
f1a07c8
update changelog
markjhoy Jun 10, 2025
2fa6a88
add docs
markjhoy Jun 10, 2025
7151d4a
Merge branch 'main' into markjhoy/add_sparse_vector_token_pruning_ind…
markjhoy Jun 11, 2025
f98c894
explicit set default index_options if null
markjhoy Jun 11, 2025
68949c0
[CI] Auto commit changes from spotless
Jun 11, 2025
aeedc14
update yaml tests; update docs
markjhoy Jun 12, 2025
c98af00
Merge branch 'main' into markjhoy/add_sparse_vector_token_pruning_ind…
markjhoy Jun 12, 2025
6e6c667
fix yaml tests
markjhoy Jun 12, 2025
87ab9dd
readd auth for teardown
markjhoy Jun 12, 2025
afd01d1
only serialize index options if not default
markjhoy Jun 16, 2025
dc4b71b
Merge branch 'main' into markjhoy/add_sparse_vector_token_pruning_ind…
markjhoy Jun 16, 2025
3f4801b
[CI] Auto commit changes from spotless
Jun 16, 2025
6307f93
serialization refactor; pass index version around
markjhoy Jun 17, 2025
8294191
Merge branch 'main' into markjhoy/add_sparse_vector_token_pruning_ind…
markjhoy Jun 17, 2025
fd61d59
[CI] Auto commit changes from spotless
Jun 17, 2025
d812821
fix transport versions merge
markjhoy Jun 17, 2025
b7c9904
fix up docs
markjhoy Jun 17, 2025
350910c
[CI] Auto commit changes from spotless
Jun 17, 2025
aeec0ca
fix docs; add include_defaults unit and yaml test
markjhoy Jun 17, 2025
820ea84
Merge branch 'main' into markjhoy/add_sparse_vector_token_pruning_ind…
markjhoy Jun 17, 2025
7a68727
[CI] Auto commit changes from spotless
Jun 17, 2025
7f40f95
override getIndexReaderManager for SemanticQueryBuilderTests
markjhoy Jun 17, 2025
e99b510
Merge branch 'main' into markjhoy/add_sparse_vector_token_pruning_ind…
markjhoy Jun 17, 2025
67e1f8d
[CI] Auto commit changes from spotless
Jun 17, 2025
c165867
cleanup mapper/builder/tests; index vers. in type
markjhoy Jun 18, 2025
8ac4ebc
[CI] Auto commit changes from spotless
Jun 18, 2025
584030c
cleanups to mapper tests for clarity
markjhoy Jun 18, 2025
05ff647
[CI] Auto commit changes from spotless
Jun 18, 2025
1cab345
move feature into mappers; fix yaml tests
markjhoy Jun 18, 2025
71af331
cleanups; add comments; remove redundant test
markjhoy Jun 18, 2025
93872db
Merge branch 'main' into markjhoy/add_sparse_vector_token_pruning_ind…
markjhoy Jun 18, 2025
308961f
[CI] Auto commit changes from spotless
Jun 18, 2025
2d5bfb7
escape more periods in the YAML tests
markjhoy Jun 18, 2025
e9f7c40
cleanup mapper and type tests
markjhoy Jun 18, 2025
733ac3f
Merge branch 'main' into markjhoy/add_sparse_vector_token_pruning_ind…
markjhoy Jun 18, 2025
b30377c
[CI] Auto commit changes from spotless
Jun 18, 2025
5895490
rename mapping for previous index test
markjhoy Jun 18, 2025
31a3dfc
Merge branch 'main' into markjhoy/add_sparse_vector_token_pruning_ind…
markjhoy Jun 18, 2025
2957699
Merge branch 'main' into markjhoy/add_sparse_vector_token_pruning_ind…
markjhoy Jun 19, 2025
20d355d
Merge branch 'main' into markjhoy/add_sparse_vector_token_pruning_ind…
markjhoy Jun 19, 2025
cc37995
Merge branch 'main' into markjhoy/add_sparse_vector_token_pruning_ind…
markjhoy Jun 19, 2025
434a4a6
Merge branch 'main' into markjhoy/add_sparse_vector_token_pruning_ind…
markjhoy Jun 19, 2025
275cc30
Merge branch 'main' into markjhoy/add_sparse_vector_token_pruning_ind…
markjhoy Jun 20, 2025
4934dab
Merge branch 'main' into markjhoy/add_sparse_vector_token_pruning_ind…
markjhoy Jun 20, 2025
f659d39
Merge branch 'main' into markjhoy/add_sparse_vector_token_pruning_ind…
markjhoy Jun 22, 2025
4fca2eb
Merge branch 'main' into markjhoy/add_sparse_vector_token_pruning_ind…
markjhoy Jun 22, 2025
8a80167
Merge branch 'main' into markjhoy/add_sparse_vector_token_pruning_ind…
markjhoy Jun 23, 2025
583e5f1
Merge branch 'main' into markjhoy/add_sparse_vector_token_pruning_ind…
markjhoy Jun 23, 2025
65817d2
Merge branch 'main' into markjhoy/add_sparse_vector_token_pruning_ind…
kderusso Jun 23, 2025
874b8b3
Merge branch 'main' into markjhoy/add_sparse_vector_token_pruning_ind…
markjhoy Jun 23, 2025
4d6a83e
Merge branch 'main' into markjhoy/add_sparse_vector_token_pruning_ind…
markjhoy Jun 23, 2025
8623573
set explicit number of shards for yaml test
markjhoy Jun 23, 2025
a2f7ad2
Merge branch 'main' into markjhoy/add_sparse_vector_token_pruning_ind…
markjhoy Jun 23, 2025
5 changes: 5 additions & 0 deletions docs/changelog/129089.yaml
@@ -0,0 +1,5 @@
pr: 129089
summary: Update `sparse_vector` field mapping to include default setting for token pruning
area: Mapping
type: enhancement
issues: []
59 changes: 59 additions & 0 deletions docs/reference/elasticsearch/mapping-reference/sparse-vector.md
@@ -24,6 +24,33 @@ PUT my-index
}
```

## Token pruning
```{applies_to}
stack: preview 9.1
```

Token pruning is enabled by default for newly created indices, using the default pruning settings. You can control this behavior with the optional `index_options` parameter for the field:

```console
PUT my-index
{
  "mappings": {
    "properties": {
      "text.tokens": {
        "type": "sparse_vector",
        "index_options": {
          "prune": true,
          "pruning_config": {
            "tokens_freq_ratio_threshold": 5,
            "tokens_weight_threshold": 0.4
          }
        }
      }
    }
  }
}
```

See [semantic search with ELSER](docs-content://solutions/search/semantic-search/semantic-search-elser-ingest-pipelines.md) for a complete example on adding documents to a `sparse_vector` mapped field using ELSER.

## Parameters for `sparse_vector` fields [sparse-vectors-params]
@@ -36,6 +63,38 @@ The following parameters are accepted by `sparse_vector` fields:
* Exclude the field from [_source](/reference/elasticsearch/rest-apis/retrieve-selected-fields.md#source-filtering).
* Use [synthetic `_source`](/reference/elasticsearch/mapping-reference/mapping-source-field.md#synthetic-source).

index_options {applies_to}`stack: preview 9.1`
: (Optional, object) Index options for the `sparse_vector` field. They determine whether tokens are pruned and, if so, which pruning configuration is used. If pruning options are not set in your [`sparse_vector` query](/reference/query-languages/query-dsl/query-dsl-sparse-vector-query.md), Elasticsearch uses the default options configured for the field, if any.

Parameters for `index_options` are:

`prune` {applies_to}`stack: preview 9.1`
: (Optional, boolean) Whether to perform pruning, omitting non-significant tokens from the query to improve query performance. If `prune` is `true` but `pruning_config` is not specified, pruning occurs with default values. Default: `true`.

`pruning_config` {applies_to}`stack: preview 9.1`
: (Optional, object) Pruning configuration, used only if `prune` is set to `true`. It controls which non-significant tokens are omitted from the query to improve query performance. If `prune` is `true` but `pruning_config` is not specified, default values are used. If `prune` is `false` but `pruning_config` is specified, an exception is thrown.

Parameters for `pruning_config` include:

`tokens_freq_ratio_threshold` {applies_to}`stack: preview 9.1`
: (Optional, integer) Tokens whose frequency is more than `tokens_freq_ratio_threshold` times the average frequency of all tokens in the specified field are considered outliers and pruned. This value must be between 1 and 100. Default: `5`.

`tokens_weight_threshold` {applies_to}`stack: preview 9.1`
: (Optional, float) Tokens whose weight is less than `tokens_weight_threshold` are considered insignificant and pruned. This value must be between 0 and 1. Default: `0.4`.

::::{note}
The default values for `tokens_freq_ratio_threshold` and `tokens_weight_threshold` were chosen based on tests using ELSERv2, where they provided the best results.
::::
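
The constraints described for `index_options` can be summarized in a small validation sketch. This is not the Elasticsearch mapper code: the class `IndexOptionsSketch` and its `of` method are hypothetical names, `null` stands in for "not specified", and the documented ranges are assumed to be inclusive.

```java
/** Hypothetical holder for the documented index_options; not an Elasticsearch class. */
final class IndexOptionsSketch {
    static final int DEFAULT_FREQ_RATIO_THRESHOLD = 5;   // documented default
    static final double DEFAULT_WEIGHT_THRESHOLD = 0.4;  // documented default

    final boolean prune;
    final int tokensFreqRatioThreshold;
    final double tokensWeightThreshold;

    private IndexOptionsSketch(boolean prune, int ratio, double weight) {
        this.prune = prune;
        this.tokensFreqRatioThreshold = ratio;
        this.tokensWeightThreshold = weight;
    }

    /** A null argument means the setting was not specified in the mapping. */
    static IndexOptionsSketch of(Boolean prune, Integer ratio, Double weight) {
        boolean effectivePrune = prune == null || prune;  // prune defaults to true
        boolean hasPruningConfig = ratio != null || weight != null;
        if (!effectivePrune && hasPruningConfig) {
            throw new IllegalArgumentException("pruning_config requires prune to be true");
        }
        int r = ratio == null ? DEFAULT_FREQ_RATIO_THRESHOLD : ratio;
        double w = weight == null ? DEFAULT_WEIGHT_THRESHOLD : weight;
        // Assumption: both documented ranges are inclusive.
        if (r < 1 || r > 100) {
            throw new IllegalArgumentException("tokens_freq_ratio_threshold must be between 1 and 100");
        }
        if (w < 0.0 || w > 1.0) {
            throw new IllegalArgumentException("tokens_weight_threshold must be between 0 and 1");
        }
        return new IndexOptionsSketch(effectivePrune, r, w);
    }

    public static void main(String[] args) {
        IndexOptionsSketch defaults = of(null, null, null);
        System.out.println(defaults.prune + " " + defaults.tokensFreqRatioThreshold + " " + defaults.tokensWeightThreshold);
    }
}
```

Calling `of(false, 5, null)` in this sketch throws, mirroring the documented rule that `pruning_config` cannot be combined with `prune: false`.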

When token pruning is applied, non-significant tokens are pruned from the query.
A token is non-significant when it meets both of the following criteria:
* The token appears much more frequently than most tokens, indicating that it is a very common word that may contribute little to the overall search results.
* The weight/score is so low that the token is likely not very relevant to the original term.

Both the token frequency threshold and weight threshold must show the token is non-significant in order for the token to be pruned.
This ensures that:
* The tokens that are kept are frequent enough and have significant scoring.
* Very infrequent tokens that may not have as high a score are removed.
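
The "both criteria" rule can be illustrated with a minimal sketch. It follows the wording of the parameter descriptions above rather than the actual Elasticsearch implementation, which is not shown in this change; `TokenPruningSketch` and `shouldPrune` are hypothetical names, and the frequency and weight inputs are assumed to be already available as plain numbers.

```java
/** Hypothetical illustration of the documented pruning rule; not Elasticsearch code. */
final class TokenPruningSketch {

    /**
     * A token is pruned only when BOTH conditions hold:
     * its frequency exceeds freqRatioThreshold times the average token frequency,
     * and its weight is below weightThreshold.
     */
    static boolean shouldPrune(double tokenFreq, double averageFreq, double weight,
                               double freqRatioThreshold, double weightThreshold) {
        boolean frequencyOutlier = tokenFreq > freqRatioThreshold * averageFreq;
        boolean lowWeight = weight < weightThreshold;
        return frequencyOutlier && lowWeight;
    }

    public static void main(String[] args) {
        // Documented defaults: tokens_freq_ratio_threshold = 5, tokens_weight_threshold = 0.4
        System.out.println(shouldPrune(600, 100, 0.1, 5, 0.4)); // frequent and low weight
        System.out.println(shouldPrune(600, 100, 0.9, 5, 0.4)); // frequent but high weight
        System.out.println(shouldPrune(100, 100, 0.1, 5, 0.4)); // low weight but not an outlier
    }
}
```

Only the first call satisfies both conditions, so a frequent token with a high weight, or a low-weight token with ordinary frequency, is kept in this sketch.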


## Multi-value sparse vectors [index-multi-value-sparse-vectors]
@@ -203,7 +203,7 @@ static TransportVersion def(int id) {
public static final TransportVersion ML_INFERENCE_CUSTOM_SERVICE_INPUT_TYPE_8_19 = def(8_841_0_55);
public static final TransportVersion RANDOM_SAMPLER_QUERY_BUILDER_8_19 = def(8_841_0_56);
public static final TransportVersion ML_INFERENCE_SAGEMAKER_ELASTIC_8_19 = def(8_841_0_57);

public static final TransportVersion SPARSE_VECTOR_FIELD_PRUNING_OPTIONS_8_19 = def(8_841_0_58);
public static final TransportVersion V_9_0_0 = def(9_000_0_09);
public static final TransportVersion INITIAL_ELASTICSEARCH_9_0_1 = def(9_000_0_10);
public static final TransportVersion INITIAL_ELASTICSEARCH_9_0_2 = def(9_000_0_11);
@@ -313,6 +313,7 @@ static TransportVersion def(int id) {
public static final TransportVersion STREAMS_LOGS_SUPPORT = def(9_104_0_00);
public static final TransportVersion ML_INFERENCE_CUSTOM_SERVICE_INPUT_TYPE = def(9_105_0_00);
public static final TransportVersion ML_INFERENCE_SAGEMAKER_ELASTIC = def(9_106_0_00);
public static final TransportVersion SPARSE_VECTOR_FIELD_PRUNING_OPTIONS = def(9_107_0_00);

/*
* STOP! READ THIS FIRST! No, really,
@@ -144,6 +144,7 @@ private static Version parseUnchecked(String version) {
public static final IndexVersion INDEX_INT_SORT_INT_TYPE_8_19 = def(8_532_0_00, Version.LUCENE_9_12_1);
public static final IndexVersion MAPPER_TEXT_MATCH_ONLY_MULTI_FIELDS_DEFAULT_NOT_STORED_8_19 = def(8_533_0_00, Version.LUCENE_9_12_1);
public static final IndexVersion UPGRADE_TO_LUCENE_9_12_2 = def(8_534_0_00, Version.LUCENE_9_12_2);
public static final IndexVersion SPARSE_VECTOR_PRUNING_INDEX_OPTIONS_SUPPORT_BACKPORT_8_X = def(8_535_0_00, Version.LUCENE_9_12_2);
public static final IndexVersion UPGRADE_TO_LUCENE_10_0_0 = def(9_000_0_00, Version.LUCENE_10_0_0);
public static final IndexVersion LOGSDB_DEFAULT_IGNORE_DYNAMIC_BEYOND_LIMIT = def(9_001_0_00, Version.LUCENE_10_0_0);
public static final IndexVersion TIME_BASED_K_ORDERED_DOC_ID = def(9_002_0_00, Version.LUCENE_10_0_0);
@@ -175,6 +176,7 @@ private static Version parseUnchecked(String version) {
public static final IndexVersion INDEX_INT_SORT_INT_TYPE = def(9_028_0_00, Version.LUCENE_10_2_1);
public static final IndexVersion MAPPER_TEXT_MATCH_ONLY_MULTI_FIELDS_DEFAULT_NOT_STORED = def(9_029_0_00, Version.LUCENE_10_2_1);
public static final IndexVersion UPGRADE_TO_LUCENE_10_2_2 = def(9_030_0_00, Version.LUCENE_10_2_2);
public static final IndexVersion SPARSE_VECTOR_PRUNING_INDEX_OPTIONS_SUPPORT = def(9_031_0_00, Version.LUCENE_10_2_2);

/*
* STOP! READ THIS FIRST! No, really,
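
The two new `IndexVersion` constants suggest that the pruning default is gated on the version an index was created with. The mapper code that consumes them is not part of this excerpt, so the check below is an assumption about how such a gate could look, not the PR's implementation. `PruningDefaultGateSketch` and `pruningOnByDefault` are hypothetical names, and the versions are modeled as the plain integer ids passed to `def(...)` above.

```java
/** Hypothetical sketch of an index-version gate; ids copied from the def(...) calls above. */
final class PruningDefaultGateSketch {
    static final int SPARSE_VECTOR_PRUNING_INDEX_OPTIONS_SUPPORT_BACKPORT_8_X = 8_535_0_00;
    static final int UPGRADE_TO_LUCENE_10_0_0 = 9_000_0_00;
    static final int SPARSE_VECTOR_PRUNING_INDEX_OPTIONS_SUPPORT = 9_031_0_00;

    /**
     * Assumed rule: the default applies to indices created on the 8.x line at or after
     * the backport version, or on the 9.x line at or after the support version.
     */
    static boolean pruningOnByDefault(int indexVersionId) {
        boolean inBackportRange = indexVersionId >= SPARSE_VECTOR_PRUNING_INDEX_OPTIONS_SUPPORT_BACKPORT_8_X
            && indexVersionId < UPGRADE_TO_LUCENE_10_0_0;
        return inBackportRange || indexVersionId >= SPARSE_VECTOR_PRUNING_INDEX_OPTIONS_SUPPORT;
    }

    public static void main(String[] args) {
        System.out.println(pruningOnByDefault(8_534_0_00)); // 8.x index created before the backport
        System.out.println(pruningOnByDefault(9_031_0_00)); // 9.x index at the support version
    }
}
```

A range check is used for the 8.x line because 9.x index versions created before `9_031_0_00` are numerically larger than the backport id but, under this assumption, would not carry the new default.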
@@ -17,6 +17,7 @@
import static org.elasticsearch.index.mapper.vectors.DenseVectorFieldMapper.RESCORE_VECTOR_QUANTIZED_VECTOR_MAPPING;
import static org.elasticsearch.index.mapper.vectors.DenseVectorFieldMapper.RESCORE_ZERO_VECTOR_QUANTIZED_VECTOR_MAPPING;
import static org.elasticsearch.index.mapper.vectors.DenseVectorFieldMapper.USE_DEFAULT_OVERSAMPLE_VALUE_FOR_BBQ;
import static org.elasticsearch.index.mapper.vectors.SparseVectorFieldMapper.SPARSE_VECTOR_INDEX_OPTIONS_FEATURE;

/**
* Spec for mapper-related features.
@@ -74,7 +75,8 @@ public Set<NodeFeature> getTestFeatures() {
USE_DEFAULT_OVERSAMPLE_VALUE_FOR_BBQ,
IVF_FORMAT_CLUSTER_FEATURE,
IVF_NESTED_SUPPORT,
- SEARCH_LOAD_PER_SHARD
+ SEARCH_LOAD_PER_SHARD,
+ SPARSE_VECTOR_INDEX_OPTIONS_FEATURE
);
}
}