Commit 8036a08

Enable exclude_source_vectors by default for new indices (#131907)
This commit sets `index.mapping.exclude_source_vectors` to `true` by default for newly created indices. When enabled, vector fields (`dense_vector`, `sparse_vector`, `rank_vectors`) are excluded from `_source` on disk and are not returned in API responses unless explicitly requested. The change improves indexing performance, reduces storage size, and avoids unnecessary payload bloat in responses. Vector values continue to be rehydrated transparently for partial updates, reindex, and recovery. Existing indices are not affected and continue to store vectors in `_source` by default.
1 parent b50b882 commit 8036a08

36 files changed: +503 -235 lines changed

docs/changelog/131907.yaml

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
+pr: 131907
+summary: Enable `exclude_source_vectors` by default for new indices
+area: Vector Search
+type: breaking
+issues: []
+breaking:
+  title: Enable `exclude_source_vectors` by default for new indices
+  area: Search
+  details: |-
+    The `exclude_source_vectors` setting is now enabled by default for newly created indices.
+    This means that vector fields (e.g., `dense_vector`) are no longer stored in the `_source` field
+    by default, although they remain fully accessible through search and retrieval operations.
+
+    Instead of being persisted in `_source`, vectors are now rehydrated on demand from the underlying
+    index structures when needed. This reduces index size and improves performance for typical vector
+    search workloads where the original vector values do not need to be part of the `_source`.
+
+    If your use case requires vector fields to be stored in `_source`, you can disable this behavior by
+    setting `exclude_source_vectors: false` at index creation time.
+  impact: |-
+    Vector fields will no longer be stored in `_source` by default for new indices. Applications or tools
+    that expect to see vector fields in `_source` (for raw document inspection)
+    may need to be updated or configured to explicitly retain vectors using `exclude_source_vectors: false`.
+
+    Retrieval of vector fields via search or the `_source` API remains fully supported.
+  notable: true

docs/reference/elasticsearch/mapping-reference/dense-vector.md

Lines changed: 83 additions & 8 deletions
@@ -102,6 +102,81 @@ PUT my-index-2

 {{es}} uses the [HNSW algorithm](https://arxiv.org/abs/1603.09320) to support efficient kNN search. Like most kNN algorithms, HNSW is an approximate method that sacrifices result accuracy for improved speed.

+## Accessing `dense_vector` fields in search responses
+```{applies_to}
+stack: ga 9.2
+serverless: ga
+```
+
+By default, `dense_vector` fields are **not included in `_source`** in responses from the `_search`, `_msearch`, `_get`, and `_mget` APIs.
+This helps reduce response size and improve performance, especially in scenarios where vectors are used solely for similarity scoring and not required in the output.
+
+To retrieve vector values explicitly, you can use:
+
+* The `fields` option to request specific vector fields directly:
+
+```console
+POST my-index-2/_search
+{
+  "fields": ["my_vector"]
+}
+```
+
+* The `_source.exclude_vectors` flag to re-enable vector inclusion in `_source` responses:
+
+```console
+POST my-index-2/_search
+{
+  "_source": {
+    "exclude_vectors": false
+  }
+}
+```
+
+### Storage behavior and `_source`
+
+By default, `dense_vector` fields are **not stored in `_source`** on disk. This is also controlled by the index setting `index.mapping.exclude_source_vectors`.
+This setting is enabled by default for newly created indices and can only be set at index creation time.
+
+When enabled:
+
+* `dense_vector` fields are removed from `_source` and the rest of the `_source` is stored as usual.
+* If a request includes `_source` and vector values are needed (e.g., during recovery or reindex), the vectors are rehydrated from their internal format.
+
+This setting is compatible with synthetic `_source`, where the entire `_source` document is reconstructed from columnar storage. In full synthetic mode, no `_source` is stored on disk, and all fields — including vectors — are rebuilt when needed.
+
+### Rehydration and precision
+
+When vector values are rehydrated (e.g., for reindex, recovery, or explicit `_source` requests), they are restored from their internal format. Internally, vectors are stored at float precision, so if they were originally indexed as higher-precision types (e.g., `double` or `long`), the rehydrated values will have reduced precision. This lossy representation is intended to save space while preserving search quality.
+
+### Storing original vectors in `_source`
+
+If you want to preserve the original vector values exactly as they were provided, you can re-enable vector storage in `_source`:
+
+```console
+PUT my-index-include-vectors
+{
+  "settings": {
+    "index.mapping.exclude_source_vectors": false
+  },
+  "mappings": {
+    "properties": {
+      "my_vector": {
+        "type": "dense_vector"
+      }
+    }
+  }
+}
+```
+
+When this setting is disabled:
+
+* `dense_vector` fields are stored as part of the `_source`, exactly as indexed.
+* The index will store both the original `_source` value and the internal representation used for vector search, resulting in increased storage usage.
+* Vectors are once again returned in `_source` by default in all relevant APIs, with no need to use `exclude_vectors` or `fields`.
+
+This configuration is appropriate when full source fidelity is required, such as for auditing or round-tripping exact input values.
+
 ## Automatically quantize vectors for kNN search [dense-vector-quantization]

 The `dense_vector` type supports quantization to reduce the memory footprint required when [searching](docs-content://solutions/search/vector/knn.md#approximate-knn) `float` vectors. The following three quantization strategies are supported:
@@ -266,16 +341,16 @@ $$$dense-vector-index-options$$$
 `type`
 : (Required, string) The type of kNN algorithm to use. Can be any of:
 * `hnsw` - This utilizes the [HNSW algorithm](https://arxiv.org/abs/1603.09320) for scalable approximate kNN search. This supports all `element_type` values.
-* `int8_hnsw` - The default index type for some float vectors:
-
-* {applies_to}`stack: ga 9.1` Default for float vectors with less than 384 dimensions.
+* `int8_hnsw` - The default index type for some float vectors:
+
+* {applies_to}`stack: ga 9.1` Default for float vectors with less than 384 dimensions.
 * {applies_to}`stack: ga 9.0` Default for all float vectors.
-
+
 This utilizes the [HNSW algorithm](https://arxiv.org/abs/1603.09320) in addition to automatic scalar quantization for scalable approximate kNN search with `element_type` of `float`. This can reduce the memory footprint by 4x at the cost of some accuracy. See [Automatically quantize vectors for kNN search](#dense-vector-quantization).
 * `int4_hnsw` - This utilizes the [HNSW algorithm](https://arxiv.org/abs/1603.09320) in addition to automatic scalar quantization for scalable approximate kNN search with `element_type` of `float`. This can reduce the memory footprint by 8x at the cost of some accuracy. See [Automatically quantize vectors for kNN search](#dense-vector-quantization).
 * `bbq_hnsw` - This utilizes the [HNSW algorithm](https://arxiv.org/abs/1603.09320) in addition to automatic binary quantization for scalable approximate kNN search with `element_type` of `float`. This can reduce the memory footprint by 32x at the cost of accuracy. See [Automatically quantize vectors for kNN search](#dense-vector-quantization).
-
-{applies_to}`stack: ga 9.1` `bbq_hnsw` is the default index type for float vectors with greater than or equal to 384 dimensions.
+
+{applies_to}`stack: ga 9.1` `bbq_hnsw` is the default index type for float vectors with greater than or equal to 384 dimensions.
 * `flat` - This utilizes a brute-force search algorithm for exact kNN search. This supports all `element_type` values.
 * `int8_flat` - This utilizes a brute-force search algorithm in addition to automatic scalar quantization. Only supports `element_type` of `float`.
 * `int4_flat` - This utilizes a brute-force search algorithm in addition to automatic half-byte scalar quantization. Only supports `element_type` of `float`.
@@ -295,8 +370,8 @@ $$$dense-vector-index-options$$$
 : (Optional, object) An optional section that configures automatic vector rescoring on knn queries for the given field. Only applicable to quantized index types.
 :::::{dropdown} Properties of rescore_vector
 `oversample`
-: (required, float) The amount to oversample the search results by. This value should be one of the following:
-* Greater than `1.0` and less than `10.0`
+: (required, float) The amount to oversample the search results by. This value should be one of the following:
+* Greater than `1.0` and less than `10.0`
 * Exactly `0` to indicate no oversampling and rescoring should occur {applies_to}`stack: ga 9.1`
 : The higher the value, the more vectors will be gathered and rescored with the raw values per shard.
 : In case a knn query specifies a `rescore_vector` parameter, the query `rescore_vector` parameter will be used instead.
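
The "Rehydration and precision" text in `dense-vector.md` says that values supplied as `double` or `long` come back with reduced precision because vectors are kept at float precision. The following is a minimal Java sketch of that effect only. The class and method names are hypothetical, and it models nothing more than a plain narrowing conversion to `float`; it is not the Elasticsearch rehydration code.

```java
// Hypothetical illustration: what "stored at float precision" implies for
// values that were originally supplied as double or long.
final class VectorPrecisionSketch {

    /** Narrows a double component to float, then widens it back to double. */
    static double doubleRoundTrip(double original) {
        float stored = (float) original; // rounds to the nearest representable float
        return stored;                   // widening float -> double is exact
    }

    /** Narrows a long component to float, then converts it back to long. */
    static long longRoundTrip(long original) {
        float stored = (float) original; // a float has only 24 significant bits
        return (long) stored;
    }

    public static void main(String[] args) {
        double d = 0.1;
        System.out.println(d + " -> " + doubleRoundTrip(d));

        long l = 16_777_217L; // 2^24 + 1 cannot be represented exactly as a float
        System.out.println(l + " -> " + longRoundTrip(l));
    }
}
```

Values that are already exactly representable as a float (for example `0.5`) survive the round trip unchanged, so the loss only matters when the input carries more precision than a float can hold.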

docs/reference/elasticsearch/mapping-reference/rank-vectors.md

Lines changed: 72 additions & 2 deletions
@@ -108,11 +108,81 @@ $$$rank-vectors-element-type$$$
 `dims`
 : (Optional, integer) Number of vector dimensions. Can’t exceed `4096`. If `dims` is not specified, it will be set to the length of the first vector added to the field.

+## Accessing `rank_vectors` fields in search responses
+```{applies_to}
+stack: ga 9.2
+serverless: ga
+```
+
+By default, `rank_vectors` fields are **not included in `_source`** in responses from the `_search`, `_msearch`, `_get`, and `_mget` APIs.
+This helps reduce response size and improve performance, especially in scenarios where vectors are used solely for similarity scoring and not required in the output.
+
+To retrieve vector values explicitly, you can use:
+
+* The `fields` option to request specific vector fields directly:
+
+```console
+POST my-index-2/_search
+{
+  "fields": ["my_vector"]
+}
+```
+
+* The `_source.exclude_vectors` flag to re-enable vector inclusion in `_source` responses:
+
+```console
+POST my-index-2/_search
+{
+  "_source": {
+    "exclude_vectors": false
+  }
+}
+```
+
+### Storage behavior and `_source`
+
+By default, `rank_vectors` fields are not stored in `_source` on disk. This is also controlled by the index setting `index.mapping.exclude_source_vectors`.
+This setting is enabled by default for newly created indices and can only be set at index creation time.
+
+When enabled:
+
+* `rank_vectors` fields are removed from `_source` and the rest of the `_source` is stored as usual.
+* If a request includes `_source` and vector values are needed (e.g., during recovery or reindex), the vectors are rehydrated from their internal format.
+
+This setting is compatible with synthetic `_source`, where the entire `_source` document is reconstructed from columnar storage. In full synthetic mode, no `_source` is stored on disk, and all fields — including vectors — are rebuilt when needed.
+
+### Rehydration and precision
+
+When vector values are rehydrated (e.g., for reindex, recovery, or explicit `_source` requests), they are restored from their internal format. Internally, vectors are stored at float precision, so if they were originally indexed as higher-precision types (e.g., `double` or `long`), the rehydrated values will have reduced precision. This lossy representation is intended to save space while preserving search quality.
+
+### Storing original vectors in `_source`
+
+If you want to preserve the original vector values exactly as they were provided, you can re-enable vector storage in `_source`:
+
+```console
+PUT my-index-include-vectors
+{
+  "settings": {
+    "index.mapping.exclude_source_vectors": false
+  },
+  "mappings": {
+    "properties": {
+      "my_vector": {
+        "type": "rank_vectors",
+        "dims": 128
+      }
+    }
+  }
+}
+```

-## Synthetic `_source` [rank-vectors-synthetic-source]
+When this setting is disabled:

-`rank_vectors` fields support [synthetic `_source`](mapping-source-field.md#synthetic-source) .
+* `rank_vectors` fields are stored as part of the `_source`, exactly as indexed.
+* The index will store both the original `_source` value and the internal representation used for vector search, resulting in increased storage usage.
+* Vectors are once again returned in `_source` by default in all relevant APIs, with no need to use `exclude_vectors` or `fields`.

+This configuration is appropriate when full source fidelity is required, such as for auditing or round-tripping exact input values.

 ## Scoring with rank vectors [rank-vectors-scoring]

docs/reference/elasticsearch/mapping-reference/sparse-vector.md

Lines changed: 76 additions & 6 deletions
@@ -57,12 +57,6 @@ See [semantic search with ELSER](docs-content://solutions/search/semantic-search

 The following parameters are accepted by `sparse_vector` fields:

-[store](/reference/elasticsearch/mapping-reference/mapping-store.md)
-: Indicates whether the field value should be stored and retrievable independently of the [_source](/reference/elasticsearch/mapping-reference/mapping-source-field.md) field. Accepted values: true or false (default). The field’s data is stored using term vectors, a disk-efficient structure compared to the original JSON input. The input map can be retrieved during a search request via the [`fields` parameter](/reference/elasticsearch/rest-apis/retrieve-selected-fields.md#search-fields-param). To benefit from reduced disk usage, you must either:
-
-* Exclude the field from [_source](/reference/elasticsearch/rest-apis/retrieve-selected-fields.md#source-filtering).
-* Use [synthetic `_source`](/reference/elasticsearch/mapping-reference/mapping-source-field.md#synthetic-source).
-
 index_options {applies_to}`stack: ga 9.1`
 : (Optional, object) You can set index options for your `sparse_vector` field to determine if you should prune tokens, and the parameter configurations for the token pruning. If pruning options are not set in your [`sparse_vector` query](/reference/query-languages/query-dsl/query-dsl-sparse-vector-query.md), Elasticsearch will use the default options configured for the field, if any.

@@ -96,6 +90,82 @@ This ensures that:
 * The tokens that are kept are frequent enough and have significant scoring.
 * Very infrequent tokens that may not have as high of a score are removed.

+## Accessing `sparse_vector` fields in search responses
+```{applies_to}
+stack: ga 9.2
+serverless: ga
+```
+
+By default, `sparse_vector` fields are **not included in `_source`** in responses from the `_search`, `_msearch`, `_get`, and `_mget` APIs.
+This helps reduce response size and improve performance, especially in scenarios where vectors are used solely for similarity scoring and not required in the output.
+
+To retrieve vector values explicitly, you can use:
+
+* The `fields` option to request specific vector fields directly:
+
+```console
+POST my-index-2/_search
+{
+  "fields": ["my_vector"]
+}
+```
+
+* The `_source.exclude_vectors` flag to re-enable vector inclusion in `_source` responses:
+
+```console
+POST my-index-2/_search
+{
+  "_source": {
+    "exclude_vectors": false
+  }
+}
+```
+
+### Storage behavior and `_source`
+
+By default, `sparse_vector` fields are not stored in `_source` on disk. This is also controlled by the index setting `index.mapping.exclude_source_vectors`.
+This setting is enabled by default for newly created indices and can only be set at index creation time.
+
+When enabled:
+
+* `sparse_vector` fields are removed from `_source` and the rest of the `_source` is stored as usual.
+* If a request includes `_source` and vector values are needed (e.g., during recovery or reindex), the vectors are rehydrated from their internal format.
+
+This setting is compatible with synthetic `_source`, where the entire `_source` document is reconstructed from columnar storage. In full synthetic mode, no `_source` is stored on disk, and all fields — including vectors — are rebuilt when needed.
+
+### Rehydration and precision
+
+When vector values are rehydrated (e.g., for reindex, recovery, or explicit `_source` requests), they are restored from their internal format.
+Internally, vectors are stored as floats with 9 significant bits for the precision, so the rehydrated values will have reduced precision.
+This lossy representation is intended to save space while preserving search quality.
+
+### Storing original vectors in `_source`
+
+If you want to preserve the original vector values exactly as they were provided, you can re-enable vector storage in `_source`:
+
+```console
+PUT my-index-include-vectors
+{
+  "settings": {
+    "index.mapping.exclude_source_vectors": false
+  },
+  "mappings": {
+    "properties": {
+      "my_vector": {
+        "type": "sparse_vector"
+      }
+    }
+  }
+}
+```
+
+When this setting is disabled:
+
+* `sparse_vector` fields are stored as part of the `_source`, exactly as indexed.
+* The index will store both the original `_source` value and the internal representation used for vector search, resulting in increased storage usage.
+* Vectors are once again returned in `_source` by default in all relevant APIs, with no need to use `exclude_vectors` or `fields`.
+
+This configuration is appropriate when full source fidelity is required, such as for auditing or round-tripping exact input values.

 ## Multi-value sparse vectors [index-multi-value-sparse-vectors]

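The `sparse-vector.md` text above describes rehydrated weights as floats with 9 significant bits. One way to get a feel for how coarse that is: the Java sketch below keeps the sign, the 8 exponent bits, and the top 8 explicit mantissa bits of an IEEE 754 float (9 significant bits once the implicit leading 1 is counted) and drops the remaining 15. That bit layout, the class name, and the method name are assumptions made for illustration; the source does not spell out the encoding, and this is not presented as the Elasticsearch or Lucene implementation.

```java
// Hypothetical illustration of a float reduced to 9 significant bits.
final class SparseWeightPrecisionSketch {

    // Assumption: a float has 23 explicit mantissa bits; keeping 8 of them
    // (plus the implicit leading 1) leaves 9 significant bits, so 15 are dropped.
    private static final int DROPPED_BITS = 15;

    /** Returns the weight with its 15 lowest mantissa bits cleared (truncation). */
    static float truncate(float weight) {
        int bits = Float.floatToIntBits(weight);
        int kept = (bits >>> DROPPED_BITS) << DROPPED_BITS;
        return Float.intBitsToFloat(kept);
    }

    public static void main(String[] args) {
        float[] weights = { 0.1f, 1.0f, 2.7182817f };
        for (float w : weights) {
            System.out.println(w + " -> " + truncate(w));
        }
    }
}
```

Under this assumption the relative error of a rehydrated weight stays below 2^-8 (about 0.4%), which is consistent with the claim that search quality is preserved while exact input values are not.
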
modules/reindex/src/test/java/org/elasticsearch/reindex/ReindexBasicTests.java

Lines changed: 2 additions & 5 deletions
@@ -23,7 +23,6 @@
 import java.util.Map;
 import java.util.stream.Collectors;

-import static org.elasticsearch.index.IndexSettings.SYNTHETIC_VECTORS;
 import static org.elasticsearch.index.query.QueryBuilders.termQuery;
 import static org.elasticsearch.test.hamcrest.ElasticsearchAssertions.assertAcked;
 import static org.elasticsearch.test.hamcrest.ElasticsearchAssertions.assertHitCount;
@@ -182,14 +181,13 @@ public void testReindexFromComplexDateMathIndexName() throws Exception {
     }

     public void testReindexIncludeVectors() throws Exception {
-        assumeTrue("This test requires synthetic vectors to be enabled", SYNTHETIC_VECTORS);
         var resp1 = prepareCreate("test").setSettings(
-            Settings.builder().put(IndexSettings.INDEX_MAPPING_SOURCE_SYNTHETIC_VECTORS_SETTING.getKey(), true).build()
+            Settings.builder().put(IndexSettings.INDEX_MAPPING_EXCLUDE_SOURCE_VECTORS_SETTING.getKey(), true).build()
         ).setMapping("foo", "type=dense_vector,similarity=l2_norm", "bar", "type=sparse_vector").get();
         assertAcked(resp1);

         var resp2 = prepareCreate("test_reindex").setSettings(
-            Settings.builder().put(IndexSettings.INDEX_MAPPING_SOURCE_SYNTHETIC_VECTORS_SETTING.getKey(), true).build()
+            Settings.builder().put(IndexSettings.INDEX_MAPPING_EXCLUDE_SOURCE_VECTORS_SETTING.getKey(), true).build()
         ).setMapping("foo", "type=dense_vector,similarity=l2_norm", "bar", "type=sparse_vector").get();
         assertAcked(resp2);

@@ -237,5 +235,4 @@ public void testReindexIncludeVectors() throws Exception {
             searchResponse.decRef();
         }
     }
-
 }

modules/reindex/src/test/java/org/elasticsearch/reindex/UpdateByQueryBasicTests.java

Lines changed: 1 addition & 3 deletions
@@ -24,7 +24,6 @@
 import java.util.Map;
 import java.util.stream.Collectors;

-import static org.elasticsearch.index.IndexSettings.SYNTHETIC_VECTORS;
 import static org.elasticsearch.index.query.QueryBuilders.termQuery;
 import static org.elasticsearch.test.hamcrest.ElasticsearchAssertions.assertAcked;
 import static org.elasticsearch.test.hamcrest.ElasticsearchAssertions.assertHitCount;
@@ -158,9 +157,8 @@ public void testMissingSources() {
     }

     public void testUpdateByQueryIncludeVectors() throws Exception {
-        assumeTrue("This test requires synthetic vectors to be enabled", SYNTHETIC_VECTORS);
         var resp1 = prepareCreate("test").setSettings(
-            Settings.builder().put(IndexSettings.INDEX_MAPPING_SOURCE_SYNTHETIC_VECTORS_SETTING.getKey(), true).build()
+            Settings.builder().put(IndexSettings.INDEX_MAPPING_EXCLUDE_SOURCE_VECTORS_SETTING.getKey(), true).build()
         ).setMapping("foo", "type=dense_vector,similarity=l2_norm", "bar", "type=sparse_vector").get();
         assertAcked(resp1);

