
Commit ac55c2c

CascadingRadium, Copilot, and abhinavdangeti authored
Add documentation for multi-vector and nested-vector field support (#2261)
- With `[email protected]` we have introduced support to retrieve documents containing multiple vectors for a field. Added documentation for the same.
- Fix markdown lint errors

---------

Co-authored-by: Copilot <[email protected]>
Co-authored-by: Abhinav Dangeti <[email protected]>
1 parent 4eb144e commit ac55c2c

File tree

5 files changed: +207 / -69 lines


docs/index_update.md

Lines changed: 8 additions & 4 deletions
````diff
@@ -10,14 +10,16 @@ While opening an index, if an updated mapping is provided as a string under the
 If the update fails, the index is unchanged and an error is returned explaining why the update was unsuccessful.
 
 ## What can be deleted and what can't be deleted?
+
 Fields can be partially deleted by changing their Index, Store, and DocValues parameters from true to false, or completely removed by deleting the field itself.
 
 Additionally, document mappings can be deleted either by fully removing them from the index mapping or by setting the Enabled value to false, which deletes all fields defined within that mapping.
 
 However, if any of the following conditions are met, the index is considered non-updatable.
+
 * Any additional fields or enabled document mappings in the new index mapping
 * Any changes to IncludeInAll, type, IncludeTermVectors and SkipFreqNorm
-* Any document mapping having it's enabled value changing from false to true
+* Any document mapping having its enabled value changing from false to true
 * Text fields with a different analyser or date time fields with a different date time format
 * Vector and VectorBase64 fields changing dims, similarity or vectorIndexOptimizedFor
 * Any changes when field is part of `_all`
@@ -26,15 +28,17 @@ However, if any of the following conditions are met, the index is considered non
 * If multiple fields sharing the same field name either from different type mappings or aliases are present, then any non compatible changes across all of these fields
 
 ## How to enforce immediate deletion?
+
 Since the deletion is only done during merging, a [force merge](https://github.com/blevesearch/bleve/blob/b82baf10b205511cf12da5cb24330abd9f5b1b74/index/scorch/merge.go#L164) may be used to completely remove the stale data.
 
 ## Sample code to update an existing index
-```
+
+```go
 newMapping := `<Updated Index Mapping>`
 config := map[string]interface{}{
-	"updated_mapping": newMapping
+	"updated_mapping": newMapping,
 }
-index, err := OpenUsing("<Path to Index>", config)
+index, err := bleve.OpenUsing("<Path to Index>", config)
 if err != nil {
 	return err
 }
````
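For readers skimming the diff, here is a minimal, self-contained sketch of the update flow this sample belongs to. The index path and the mapping string are placeholders of our own, not values from this commit; only the `updated_mapping` config key and `bleve.OpenUsing` come from the documentation above.

```go
package main

import (
	"log"

	"github.com/blevesearch/bleve/v2"
)

func main() {
	// Hypothetical updated mapping string; a real one would be the full
	// index mapping JSON with only permitted changes (e.g. a field's
	// Store flag flipped from true to false).
	newMapping := `{"default_mapping": {"enabled": true, "dynamic": true}}`

	// The updated mapping is passed as a string under "updated_mapping".
	config := map[string]interface{}{
		"updated_mapping": newMapping,
	}

	// If the new mapping is not an updatable subset of the existing one,
	// OpenUsing returns an error and the index is left unchanged.
	index, err := bleve.OpenUsing("example.bleve", config)
	if err != nil {
		log.Fatal(err)
	}
	defer index.Close()
}
```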

docs/pagination.md

Lines changed: 5 additions & 4 deletions
````diff
@@ -2,7 +2,7 @@
 
 ## Why pagination matters
 
-Search queries can match many documents. Pagination lets you fetch and display results in chunks, keeping responses small and fast.
+Search queries can match many documents. Pagination lets you fetch and display results in chunks, keeping responses small and fast.
 
 By default, Bleve returns the first 10 hits sorted by relevance (score), highest first.
 
@@ -48,7 +48,7 @@ Rules:
 
 Where do sort keys come from?
 
-- Each hit includes `Sort` (and `DecodedSort` from Bleve v2.5.2). Take the last hits sort keys for `SearchAfter`, or the first hits sort keys for `SearchBefore`.
+- Each hit includes `Sort` (and `DecodedSort` from Bleve v2.5.2). Take the last hit's sort keys for `SearchAfter`, or the first hit's sort keys for `SearchBefore`.
 - If the field/fields to be searched over is numeric, datetime or geo, the values in the `Sort` field may have garbled values; this is because of how Bleve represents such data types internally. To use such fields as sort keys, use the `DecodedSort` field, which decodes the internal representations. This feature is available from Bleve v2.5.4.
 
 > When using `DecodedSort`, the `Sort` array in the search request needs to explicitly declare the type of the field for proper decoding. Hence, the `Sort` array must contain either `SortField` objects (for numeric and datetime) or `SortGeoDistance` objects (for geo) rather than just the field names. More info on `SortField` and `SortGeoDistance` can be found in [sort_facet.md](sort_facet.md).
@@ -76,6 +76,7 @@ Backward pagination over `_id` and `_score`:
 ```
 
 Pagination using numeric, datetime and geo fields. Notice how we specify the sort objects, with the "type" field explicitly declared in case of numeric and datetime:
+
 ```json
 {
   "query": {
@@ -89,8 +90,8 @@ Pagination using numeric, datetime and geo fields. Notice how we specify the sor
   ],
   "search_after": ["99.99", "2023-10-15T10:30:00Z", "5.2"]
 }
-
 ```
+
 ## Total Sort Order
 
 Pagination is deterministic. Ensure your `Sort` defines a total order, so that documents with the same sort keys are not left out:
@@ -105,4 +106,4 @@ Pagination is deterministic. Ensure your `Sort` defines a total order, so that d
 
 - Offset pagination cost grows with `From` (collects at least `Size + From` results before slicing).
 - `SearchAfter`/`SearchBefore` keeps memory and network proportional to `Size`.
-- For large datasets and deep navigation, prefer using `SearchAfter` and `SearchBefore`.
+- For large datasets and deep navigation, prefer using `SearchAfter` and `SearchBefore`.
````
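As a companion to the pagination doc touched here, a keyset-pagination sketch in Go, assuming an index with a sortable text field named `name` (our placeholder, not a field from this commit); the `Sort`, `SearchAfter`, and `DecodedSort` fields are the ones the doc describes.

```go
package main

import (
	"fmt"

	"github.com/blevesearch/bleve/v2"
)

// pageThrough walks an index page by page using SearchAfter.
func pageThrough(index bleve.Index) error {
	var after []string // nil on the first page
	for {
		req := bleve.NewSearchRequest(bleve.NewMatchAllQuery())
		req.Size = 10
		req.SortBy([]string{"name", "_id"}) // "_id" tiebreak gives a total order
		if after != nil {
			req.SearchAfter = after // previous page's last hit's sort keys
		}
		res, err := index.Search(req)
		if err != nil {
			return err
		}
		if len(res.Hits) == 0 {
			return nil // no more pages
		}
		for _, hit := range res.Hits {
			fmt.Println(hit.ID)
		}
		// For text fields the raw Sort keys are usable as-is; for numeric,
		// datetime or geo sort keys, prefer DecodedSort (Bleve v2.5.4+)
		// together with typed sort objects, as the doc above explains.
		after = res.Hits[len(res.Hits)-1].Sort
	}
}
```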

docs/persister.md

Lines changed: 28 additions & 27 deletions
````diff
@@ -2,7 +2,7 @@
 
 ## Memory Management
 
-When data is indexed in Scorch — using either the `index.Index()` or `index.Batch()` API — it is added as part of an in-memory "segment". Memory management in Scorch indexing mainly relates to handling these in-memory segments during workloads that involve inserts or updates.
+When data is indexed in Scorch — using either the `index.Index()` or `index.Batch()` API — it is added as part of an in-memory "segment". Memory management in Scorch indexing mainly relates to handling these in-memory segments during workloads that involve inserts or updates.
 
 In scenarios with a continuous stream of incoming data, a large number of in-memory segments can accumulate over time. This is where the persister component comes into play—its job is to flush these in-memory segments to disk.
 
@@ -11,46 +11,47 @@ Starting with v2.5.0, Scorch supports parallel flushing of in-memory segments to
 - `NumPersisterWorkers`: This factor decides how many maximum workers can be spawned to flush out the in-memory segments. Each worker will work on a disjoint subset of segments, merge them, and flush them out to the disk. By default the persister deploys only one worker.
 - `MaxSizeInMemoryMergePerWorker`: This config decides what's the maximum amount of input data in bytes a single worker can work upon. By default this value is equal to 0 which means that this config is disabled and the worker tries to merge all the data in one shot. Also note that it's imperative that the user set this config if `NumPersisterWorkers > 1`.
 
-If the index is tuned to have a higher `NumPersisterWorkers` value, the memory can potentially drain out faster and ensure stronger consistency behaviour — but there would be a lot of on-disk files, and the background merger would experience the pressure of managing this large number of files, which can be resource-intensive.
-- Tuning this config is very dependent on the available CPU resources, and something to keep in mind here is that the process's RSS can increase if the number of workers — and each of them working upon a large amount of data — is high.
+If the index is tuned to have a higher `NumPersisterWorkers` value, the memory can potentially drain out faster and ensure stronger consistency behaviour — but there would be a lot of on-disk files, and the background merger would experience the pressure of managing this large number of files, which can be resource-intensive.
 
-Increasing the `MaxSizeInMemoryMergePerWorker` value would mean that each worker acts upon a larger amount of data and spends more time merging and flushing it out to disk — which can be healthy behaviour in terms of I/O, although it comes at the cost of time.
-- Changing this config is usecase dependent, for example in usecases where the payload or per doc size is generally large in size (for eg vector usecases), it would be beneficial to have a larger value for this.
+- Tuning this config is very dependent on the available CPU resources, and something to keep in mind here is that the process's RSS can increase if the number of workers — and each of them working upon a large amount of data — is high.
 
-So, having the ideal values for these two configs is definitely dependent on the use case and can involve a bunch of experiments, keeping the resource usage in mind.
+Increasing the `MaxSizeInMemoryMergePerWorker` value would mean that each worker acts upon a larger amount of data and spends more time merging and flushing it out to disk — which can be healthy behaviour in terms of I/O, although it comes at the cost of time.
 
+- Changing this config is usecase dependent, for example in usecases where the payload or per doc size is generally large in size (for eg vector usecases), it would be beneficial to have a larger value for this.
+
+So, having the ideal values for these two configs is definitely dependent on the use case and can involve a bunch of experiments, keeping the resource usage in mind.
 
 ## File Management
 
-The persister introducing some number of file segments into the system would change the state of the system, and the merger would wake up and try to manage these on-disk files.
+The persister introducing some number of file segments into the system would change the state of the system, and the merger would wake up and try to manage these on-disk files.
 
-Management of these files is crucial when it comes to query latency because a higher number of files would dictate searching through a larger number of files and also higher read amplification to some extent, because the backing data structures can potentially be compacted in size across files.
+Management of these files is crucial when it comes to query latency because a higher number of files would dictate searching through a larger number of files and also higher read amplification to some extent, because the backing data structures can potentially be compacted in size across files.
 
-The merger sees the files on disk and plans out which segments to merge so that the final layout of segment tiers (each tier having multiple files), which grow in a logarithmic way (the chances of larger tiers growing in number would decrease), is maintained. This also implies that deciding this first-tier size becomes important in deciding the number of segment files across all tiers.
+The merger sees the files on disk and plans out which segments to merge so that the final layout of segment tiers (each tier having multiple files), which grow in a logarithmic way (the chances of larger tiers growing in number would decrease), is maintained. This also implies that deciding this first-tier size becomes important in deciding the number of segment files across all tiers.
 
-Starting with v2.5.0, this first-tier size is dependent on the file size using the `FloorSegmentFileSize` config, because that's a better metric to consider (unlike the legacy live doc count metric) in order to ensure that the behaviour is in line with the use case and aware of the payload/doc size.
-- This config can also be tuned to dictate how the I/O behaviour should be within an index. While tuning this config, it should be in proportion to the `MaxSizeInMemoryMergePerWorker` since that dictates the amount of data flushed out per flush.
-- The observation here is that `FloorSegmentFileSize` is lesser than `MaxSizeInMemoryMergePerWorker` and for an optimal I/O during indexing, this value can be set close to `MaxSizeInMemoryMergePerWorker/6`.
+Starting with v2.5.0, this first-tier size is dependent on the file size using the `FloorSegmentFileSize` config, because that's a better metric to consider (unlike the legacy live doc count metric) in order to ensure that the behaviour is in line with the use case and aware of the payload/doc size.
 
+- This config can also be tuned to dictate how the I/O behaviour should be within an index. While tuning this config, it should be in proportion to the `MaxSizeInMemoryMergePerWorker` since that dictates the amount of data flushed out per flush.
+- The observation here is that `FloorSegmentFileSize` is lesser than `MaxSizeInMemoryMergePerWorker` and for an optimal I/O during indexing, this value can be set close to `MaxSizeInMemoryMergePerWorker/6`.
 
 ## Setting a Persister/Merger Config in Index
 
 The configs are set via the `kvConfig` parameter in the `NewUsing()` or `OpenUsing()` API:
 
 ```go
-// setting the persister and merger configs
-kvConfig := map[string]interface{}{
-	"scorchPersisterOptions": map[string]interface{}{
-		"NumPersisterWorkers": 4,
-		"MaxSizeInMemoryMergePerWorker": 20000000,
-	},
-	"scorchMergePlanOptions": map[string]interface{}{
-		"FloorSegmentFileSize": 10000000,
-	},
-}
-// passing the config to the index
-index, err := bleve.NewUsing("example.bleve", bleve.NewIndexMapping(), bleve.Config.DefaultIndexType, bleve.Config.DefaultMemKVStore, kvConfig)
-if err != nil {
-	panic(err)
-}
+// setting the persister and merger configs
+kvConfig := map[string]interface{}{
+	"scorchPersisterOptions": map[string]interface{}{
+		"NumPersisterWorkers": 4,
+		"MaxSizeInMemoryMergePerWorker": 20000000,
+	},
+	"scorchMergePlanOptions": map[string]interface{}{
+		"FloorSegmentFileSize": 10000000,
+	},
+}
+// passing the config to the index
+index, err := bleve.NewUsing("example.bleve", bleve.NewIndexMapping(), bleve.Config.DefaultIndexType, bleve.Config.DefaultMemKVStore, kvConfig)
+if err != nil {
+	panic(err)
+}
 ```
````
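To make the tuning advice concrete, here is a sketch that applies the `FloorSegmentFileSize ≈ MaxSizeInMemoryMergePerWorker/6` heuristic from the doc; the byte sizes and worker count are illustrative assumptions, not recommendations from this commit.

```go
package main

import "github.com/blevesearch/bleve/v2"

func main() {
	// Example sizing: each persister worker merges up to ~30 MB in memory,
	// and the first merge tier floors at roughly a sixth of that.
	maxMergePerWorker := 30000000
	floorFileSize := maxMergePerWorker / 6 // ~5 MB, per the /6 heuristic

	kvConfig := map[string]interface{}{
		"scorchPersisterOptions": map[string]interface{}{
			"NumPersisterWorkers":           2,
			"MaxSizeInMemoryMergePerWorker": maxMergePerWorker,
		},
		"scorchMergePlanOptions": map[string]interface{}{
			"FloorSegmentFileSize": floorFileSize,
		},
	}

	index, err := bleve.NewUsing("example.bleve", bleve.NewIndexMapping(),
		bleve.Config.DefaultIndexType, bleve.Config.DefaultMemKVStore, kvConfig)
	if err != nil {
		panic(err)
	}
	defer index.Close()
}
```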

docs/score_fusion.md

Lines changed: 17 additions & 9 deletions
````diff
@@ -28,6 +28,7 @@ RRF\_score = w_{\text{fts}} \cdot \frac{1}{k + \text{rank}_{\text{fts}}} + \sum_
 ```
 
 Where:
+
 * $\text{rank}_{\text{fts}}$: 1-indexed rank of the document in the FTS result list (or 0 if not present)
 * $\text{rank}_{\text{knn}_i}$: 1-indexed rank of the document in the i-th kNN result list (or 0 if not present)
 * $k$: rank constant (default: 60) that dampens the impact of rank differences
@@ -36,12 +37,14 @@ Where:
 * $\sum_{i=1}^{n}$: summation over all kNN queries (you can add multiple kNN queries)
 
 **Advantages:**
-* Distribution-agnostic – no need for score normalization
+
+* Distribution-agnostic - no need for score normalization
 * Works out of the box with minimal tuning
 * Prioritizes documents appearing in both result lists
 * Robust to outliers since only ranks matter
 
 **Disadvantages:**
+
 * Ignores score magnitude (loses some information)
 * May be sensitive to imbalanced result list sizes
 
@@ -86,28 +89,31 @@ Relative Score Fusion is a **score-based** strategy that normalizes scores from
 
 1. **Min-max normalize** each result set independently:
 
-```math
-\text{normalized\_score} = \frac{\text{score} - \text{min\_score}}{\text{max\_score} - \text{min\_score}}
-```
+```math
+\text{normalized\_score} = \frac{\text{score} - \text{min\_score}}{\text{max\_score} - \text{min\_score}}
+```
 
 2. **Combine** normalized scores using weighted addition:
 
-```math
-RSF\_score = w_{\text{fts}} \cdot \text{normalized\_score\_fts} + \sum_{i=1}^{n} w_{\text{knn}_i} \cdot \text{normalized\_score\_knn}_i
-```
+```math
+RSF\_score = w_{\text{fts}} \cdot \text{normalized\_score\_fts} + \sum_{i=1}^{n} w_{\text{knn}_i} \cdot \text{normalized\_score\_knn}_i
+```
 
 Where:
+
 * $w_{\text{fts}}$: weight from the FTS query boost value
 * $w_{\text{knn}_i}$: weight from the i-th kNN query boost value
 * $\sum_{i=1}^{n}$: summation over all kNN queries (you can add multiple kNN queries)
 
 **Advantages:**
-* Score-aware – retains relevance magnitude information
+
+* Score-aware - retains relevance magnitude information
 * Resolves incompatible score ranges
 * Easy to understand
 
 **Disadvantages:**
-* Sensitive to outliers – a single extreme score can skew normalization
+
+* Sensitive to outliers - a single extreme score can skew normalization
 * Doesn't account for the shape or distribution of scores
 
 **Usage:**
@@ -171,6 +177,7 @@ From + Size <= ScoreWindowSize
 ```
 
 **Example:**
+
 ```json
 {
   "score": "rrf",
@@ -195,6 +202,7 @@ With window size set to 150, you can paginate through up to 150 results. If you
 * **Effect**: Higher values dampen the impact of rank differences
 
 **Example:**
+
 ```json
 {
   "score": "rrf",
````

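The two fusion formulas in the doc above are easy to mirror in a few lines of Go. This is a sketch of the math only, not Bleve's internal implementation; the function names are ours.

```go
package rank

// rrfScore applies the Reciprocal Rank Fusion formula from the doc. ranks
// holds the 1-indexed rank of one document in each result list (FTS first,
// then each kNN list), with 0 meaning "not present"; weights holds the
// corresponding boost values; k is the rank constant (default 60).
func rrfScore(ranks []int, weights []float64, k int) float64 {
	score := 0.0
	for i, r := range ranks {
		if r == 0 {
			continue // document absent from this result list
		}
		score += weights[i] * (1.0 / float64(k+r))
	}
	return score
}

// rsfNormalize min-max normalizes one result list's scores into [0, 1],
// the first step of Relative Score Fusion described in the doc.
func rsfNormalize(scores []float64) []float64 {
	min, max := scores[0], scores[0]
	for _, s := range scores {
		if s < min {
			min = s
		}
		if s > max {
			max = s
		}
	}
	out := make([]float64, len(scores))
	for i, s := range scores {
		if max == min {
			out[i] = 0 // degenerate list: all scores equal
			continue
		}
		out[i] = (s - min) / (max - min)
	}
	return out
}
```

A document appearing near the top of both the FTS and kNN lists accumulates two large reciprocal terms under RRF, which is why it outranks documents present in only one list.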