Add documentation for multi-vector and nested-vector field support (#2261)
- With `[email protected]` we introduced support for retrieving documents that
contain multiple vectors for a field. This change adds documentation for
it.
- Fix markdown lint errors
---------
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Abhinav Dangeti <[email protected]>
**docs/index_update.md** (8 additions, 4 deletions)

If the update fails, the index is unchanged and an error is returned explaining why the update was unsuccessful.

## What can be deleted and what can't be deleted?

Fields can be partially deleted by changing their Index, Store, and DocValues parameters from true to false, or completely removed by deleting the field itself.

Additionally, document mappings can be deleted either by fully removing them from the index mapping or by setting the Enabled value to false, which deletes all fields defined within that mapping.

However, if any of the following conditions is met, the index is considered non-updatable:

* Any additional fields or enabled document mappings in the new index mapping
* Any changes to IncludeInAll, type, IncludeTermVectors and SkipFreqNorm
* Any document mapping having its enabled value changing from false to true
* Text fields with a different analyser, or date time fields with a different date time format
* Vector and VectorBase64 fields changing dims, similarity or vectorIndexOptimizedFor
* Any changes when a field is part of `_all`
* If multiple fields sharing the same field name, either from different type mappings or aliases, are present, then any non-compatible changes across all of these fields

## How to enforce immediate deletion?

Since the deletion is only done during merging, a [force merge](https://github.com/blevesearch/bleve/blob/b82baf10b205511cf12da5cb24330abd9f5b1b74/index/scorch/merge.go#L164) may be used to completely remove the stale data.

## Sample code to update an existing index

```go
newMapping := `<Updated Index Mapping>`
config := map[string]interface{}{
	"updated_mapping": newMapping,
}
index, err := bleve.OpenUsing("<Path to Index>", config)
if err != nil {
	// the update was rejected; the index on disk is unchanged
}
```
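The partial-deletion rule above (a field parameter may go from true to false, but never from false to true) can be sketched as a toy check in plain Go. This is an illustrative model, not Bleve's actual validation code, and the `fieldOpts`/`updatable` names are hypothetical:

```go
package main

import "fmt"

// fieldOpts holds the three per-field parameters that may be partially
// deleted. A hypothetical struct for illustration, not a Bleve type.
type fieldOpts struct {
	Index, Store, DocValues bool
}

// updatable reports whether changing old to updated is a pure deletion:
// every parameter either stays the same or goes from true to false.
func updatable(old, updated fieldOpts) bool {
	ok := func(o, n bool) bool { return o || !n } // forbids false -> true
	return ok(old.Index, updated.Index) &&
		ok(old.Store, updated.Store) &&
		ok(old.DocValues, updated.DocValues)
}

func main() {
	was := fieldOpts{Index: true, Store: true, DocValues: true}
	// Dropping Store and DocValues is a deletion, so it is allowed.
	fmt.Println(updatable(was, fieldOpts{Index: true})) // prints true
	// Re-enabling a disabled parameter is not allowed.
	fmt.Println(updatable(fieldOpts{}, fieldOpts{Index: true})) // prints false
}
```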
**docs/pagination.md** (5 additions, 4 deletions)

## Why pagination matters

Search queries can match many documents. Pagination lets you fetch and display results in chunks, keeping responses small and fast.

By default, Bleve returns the first 10 hits sorted by relevance (score), highest first.
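For simple offset paging, the request's `from` and `size` fields select the page; the query below is illustrative:

```json
{
  "query": { "match": "example", "field": "description" },
  "size": 10,
  "from": 20
}
```

This fetches hits 21 through 30; deep offsets get increasingly expensive, which is where the keyset options below come in.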
Where do sort keys come from?

- Each hit includes `Sort` (and `DecodedSort` from Bleve v2.5.2). Take the last hit's sort keys for `SearchAfter`, or the first hit's sort keys for `SearchBefore`.
- If the fields being sorted on are numeric, datetime or geo, the values in the `Sort` field may be garbled; this is because of how Bleve represents such data types internally. To use such fields as sort keys, use the `DecodedSort` field, which decodes the internal representations. This feature is available from Bleve v2.5.4.

> When using `DecodedSort`, the `Sort` array in the search request needs to explicitly declare the type of the field for proper decoding. Hence, the `Sort` array must contain either `SortField` objects (for numeric and datetime) or `SortGeoDistance` objects (for geo) rather than just the field names. More info on `SortField` and `SortGeoDistance` can be found in [sort_facet.md](sort_facet.md).
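The keyset mechanics behind `SearchAfter` can be sketched in plain Go. This is a toy model over an already-sorted slice, not Bleve's implementation, and the `hit`/`nextPage` names are hypothetical:

```go
package main

import "fmt"

// hit is a toy stand-in for a search hit: an ID plus its sort keys.
type hit struct {
	ID   string
	Sort []string // sort-key values, compared position by position
}

// after reports whether h sorts strictly after the cursor keys.
func after(h hit, cursor []string) bool {
	for i := range cursor {
		if h.Sort[i] != cursor[i] {
			return h.Sort[i] > cursor[i]
		}
	}
	return false // all keys equal: not after
}

// nextPage returns up to size hits that sort after cursor; pass the last
// returned hit's Sort as the cursor for the following page.
func nextPage(sorted []hit, cursor []string, size int) []hit {
	var page []hit
	for _, h := range sorted {
		if cursor == nil || after(h, cursor) {
			page = append(page, h)
			if len(page) == size {
				break
			}
		}
	}
	return page
}

func main() {
	hits := []hit{
		{"a", []string{"2021", "a"}},
		{"b", []string{"2022", "b"}},
		{"c", []string{"2023", "c"}},
	}
	p1 := nextPage(hits, nil, 2)
	p2 := nextPage(hits, p1[len(p1)-1].Sort, 2) // cursor = last hit's keys
	fmt.Println(p1[0].ID, p1[1].ID, p2[0].ID)   // prints: a b c
}
```

The same idea applies in reverse for `SearchBefore`, using the first hit's sort keys.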
Pagination using numeric, datetime and geo fields. Notice how we specify the sort objects, with the "type" field explicitly declared in the case of numeric and datetime:

```json
{
  "query": {
```
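For reference, a sketch of what such sort objects may look like; the field names here are hypothetical and the exact keys for `SortField` and `SortGeoDistance` should be confirmed against [sort_facet.md](sort_facet.md):

```json
{
  "sort": [
    { "by": "field", "field": "price", "type": "number", "desc": false },
    { "by": "geo_distance", "field": "location", "location": { "lon": -2.23, "lat": 53.47 }, "unit": "km" }
  ],
  "search_after": ["99.5", "12.3"]
}
```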
**docs/persister.md** (28 additions, 27 deletions)

## Memory Management

When data is indexed in Scorch — using either the `index.Index()` or `index.Batch()` API — it is added as part of an in-memory "segment". Memory management in Scorch indexing mainly relates to handling these in-memory segments during workloads that involve inserts or updates.

In scenarios with a continuous stream of incoming data, a large number of in-memory segments can accumulate over time. This is where the persister component comes into play: its job is to flush these in-memory segments to disk.

Starting with v2.5.0, Scorch supports parallel flushing of in-memory segments to disk, controlled by two configs:

- `NumPersisterWorkers`: This config decides the maximum number of workers that can be spawned to flush out the in-memory segments. Each worker works on a disjoint subset of segments, merges them, and flushes them out to disk. By default the persister deploys only one worker.
- `MaxSizeInMemoryMergePerWorker`: This config decides the maximum amount of input data, in bytes, a single worker can work upon. By default this value is 0, which means the config is disabled and the worker tries to merge all the data in one shot. Note that it's imperative that the user set this config if `NumPersisterWorkers > 1`.

If the index is tuned to have a higher `NumPersisterWorkers` value, memory can potentially drain out faster and ensure stronger consistency behaviour — but there would be many on-disk files, and the background merger would experience the pressure of managing this large number of files, which can be resource-intensive.

- Tuning this config is very dependent on the available CPU resources, and something to keep in mind here is that the process's RSS can increase if the number of workers, each working upon a large amount of data, is high.

Increasing the `MaxSizeInMemoryMergePerWorker` value would mean that each worker acts upon a larger amount of data and spends more time merging and flushing it out to disk — which can be healthy behaviour in terms of I/O, although it comes at the cost of time.

- Changing this config is use-case dependent; for example, in use cases where the payload or per-doc size is generally large (e.g. vector use cases), it would be beneficial to set a larger value.

So, the ideal values for these two configs are definitely dependent on the use case and can involve a bunch of experiments, keeping the resource usage in mind.

## File Management

The persister introducing some number of file segments into the system changes the state of the system, and the merger wakes up and tries to manage these on-disk files.

Management of these files is crucial when it comes to query latency, because a higher number of files would dictate searching through a larger number of files, and also higher read amplification to some extent, because the backing data structures can potentially be compacted in size across files.

The merger sees the files on disk and plans out which segments to merge so that the final layout of segment tiers (each tier having multiple files), which grow in a logarithmic way (the chances of larger tiers growing in number decrease), is maintained. This also implies that deciding this first-tier size becomes important in deciding the number of segment files across all tiers.

Starting with v2.5.0, this first-tier size is dependent on the file size using the `FloorSegmentFileSize` config, because that's a better metric to consider (unlike the legacy live doc count metric) in order to ensure that the behaviour is in line with the use case and aware of the payload/doc size.

- This config can also be tuned to dictate how the I/O behaviour should be within an index. While tuning it, keep it in proportion to `MaxSizeInMemoryMergePerWorker`, since that dictates the amount of data flushed out per flush.
- The observation here is that `FloorSegmentFileSize` is less than `MaxSizeInMemoryMergePerWorker`, and for optimal I/O during indexing, this value can be set close to `MaxSizeInMemoryMergePerWorker/6`.

## Setting a Persister/Merger Config in Index

The configs are set via the `kvConfig` parameter in the `NewUsing()` or `OpenUsing()` API:
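A sketch of wiring these configs through `kvConfig`: the option-key names (`scorchPersisterOptions`, `scorchMergePlanOptions`) and field names below are assumptions based on recent Scorch versions, so verify them against the Bleve release you use. The block only builds the map; the actual `NewUsing` call is shown as a comment:

```go
package main

import "fmt"

// buildKVConfig assembles a kvConfig map. The option keys and field names
// are assumptions for illustration; check them against your Bleve version.
func buildKVConfig(maxPerWorker int) map[string]interface{} {
	return map[string]interface{}{
		"scorchPersisterOptions": map[string]interface{}{
			"NumPersisterWorkers":           4,
			"MaxSizeInMemoryMergePerWorker": maxPerWorker,
		},
		"scorchMergePlanOptions": map[string]interface{}{
			// Rule of thumb from the section above: floor ~ max-per-worker / 6.
			"FloorSegmentFileSize": maxPerWorker / 6,
		},
	}
}

func main() {
	const mib = 1 << 20
	kvConfig := buildKVConfig(192 * mib)

	// Hypothetical usage, requiring the bleve and scorch packages:
	//   index, err := bleve.NewUsing(path, indexMapping, scorch.Name, scorch.Name, kvConfig)

	fmt.Println(kvConfig["scorchMergePlanOptions"].(map[string]interface{})["FloorSegmentFileSize"]) // prints 33554432
}
```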