
Commit ac55c2c

CascadingRadium, Copilot, and abhinavdangeti authored
Add documentation for multi-vector and nested-vector field support (#2261)
- With `[email protected]` we have introduced support to retrieve documents containing multiple vectors for a field. Added documentation for the same.
- Fix markdown lint errors

---------

Co-authored-by: Copilot <[email protected]>
Co-authored-by: Abhinav Dangeti <[email protected]>
1 parent 4eb144e commit ac55c2c

File tree

5 files changed: +207 / -69 lines


docs/index_update.md

Lines changed: 8 additions & 4 deletions
````diff
@@ -10,14 +10,16 @@ While opening an index, if an updated mapping is provided as a string under the
 If the update fails, the index is unchanged and an error is returned explaining why the update was unsuccessful.
 
 ## What can be deleted and what can't be deleted?
+
 Fields can be partially deleted by changing their Index, Store, and DocValues parameters from true to false, or completely removed by deleting the field itself.
 
 Additionally, document mappings can be deleted either by fully removing them from the index mapping or by setting the Enabled value to false, which deletes all fields defined within that mapping.
 
 However, if any of the following conditions are met, the index is considered non-updatable.
+
 * Any additional fields or enabled document mappings in the new index mapping
 * Any changes to IncludeInAll, type, IncludeTermVectors and SkipFreqNorm
-* Any document mapping having it's enabled value changing from false to true
+* Any document mapping having its enabled value changing from false to true
 * Text fields with a different analyser or date time fields with a different date time format
 * Vector and VectorBase64 fields changing dims, similarity or vectorIndexOptimizedFor
 * Any changes when field is part of `_all`
@@ -26,15 +28,17 @@ However, if any of the following conditions are met, the index is considered non
 * If multiple fields sharing the same field name either from different type mappings or aliases are present, then any non compatible changes across all of these fields
 
 ## How to enforce immediate deletion?
+
 Since the deletion is only done during merging, a [force merge](https://github.com/blevesearch/bleve/blob/b82baf10b205511cf12da5cb24330abd9f5b1b74/index/scorch/merge.go#L164) may be used to completely remove the stale data.
 
 ## Sample code to update an existing index
-```
+
+```go
 newMapping := `<Updated Index Mapping>`
 config := map[string]interface{}{
-	"updated_mapping": newMapping
+	"updated_mapping": newMapping,
 }
-index, err := OpenUsing("<Path to Index>", config)
+index, err := bleve.OpenUsing("<Path to Index>", config)
 if err != nil {
 	return err
 }
````
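For readers skimming the diff, here is a minimal, self-contained sketch of the update flow this sample belongs to. The index path and the mapping string are placeholders of our own, not values from this commit; only the `updated_mapping` config key and `bleve.OpenUsing` come from the documentation above.

```go
package main

import (
	"log"

	"github.com/blevesearch/bleve/v2"
)

func main() {
	// Hypothetical updated mapping string; a real one would be the full
	// index mapping JSON with only permitted changes (e.g. a field's
	// Store flag flipped from true to false).
	newMapping := `{"default_mapping": {"enabled": true, "dynamic": true}}`

	// The updated mapping is passed as a string under "updated_mapping".
	config := map[string]interface{}{
		"updated_mapping": newMapping,
	}

	// If the new mapping is not an updatable subset of the existing one,
	// OpenUsing returns an error and the index is left unchanged.
	index, err := bleve.OpenUsing("example.bleve", config)
	if err != nil {
		log.Fatal(err)
	}
	defer index.Close()
}
```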

docs/pagination.md

Lines changed: 5 additions & 4 deletions
````diff
@@ -2,7 +2,7 @@
 
 ## Why pagination matters
 
-Search queries can match many documents. Pagination lets you fetch and display results in chunks, keeping responses small and fast.
+Search queries can match many documents. Pagination lets you fetch and display results in chunks, keeping responses small and fast.
 
 By default, Bleve returns the first 10 hits sorted by relevance (score), highest first.
 
@@ -48,7 +48,7 @@ Rules:
 
 Where do sort keys come from?
 
-- Each hit includes `Sort` (and `DecodedSort` from Bleve v2.5.2). Take the last hits sort keys for `SearchAfter`, or the first hits sort keys for `SearchBefore`.
+- Each hit includes `Sort` (and `DecodedSort` from Bleve v2.5.2). Take the last hit's sort keys for `SearchAfter`, or the first hit's sort keys for `SearchBefore`.
 - If the field/fields to be searched over is numeric, datetime or geo, the values in the `Sort` field may have garbled values; this is because of how Bleve represents such data types internally. To use such fields as sort keys, use the `DecodedSort` field, which decodes the internal representations. This feature is available from Bleve v2.5.4.
 
 > When using `DecodedSort`, the `Sort` array in the search request needs to explicitly declare the type of the field for proper decoding. Hence, the `Sort` array must contain either `SortField` objects (for numeric and datetime) or `SortGeoDistance` objects (for geo) rather than just the field names. More info on `SortField` and `SortGeoDistance` can be found in [sort_facet.md](sort_facet.md).
@@ -76,6 +76,7 @@ Backward pagination over `_id` and `_score`:
 ```
 
 Pagination using numeric, datetime and geo fields. Notice how we specify the sort objects, with the "type" field explicitly declared in case of numeric and datetime:
+
 ```json
 {
   "query": {
@@ -89,8 +90,8 @@ Pagination using numeric, datetime and geo fields. Notice how we specify the sor
   ],
   "search_after": ["99.99", "2023-10-15T10:30:00Z", "5.2"]
 }
-
 ```
+
 ## Total Sort Order
 
 Pagination is deterministic. Ensure your `Sort` defines a total order, so that documents with the same sort keys are not left out:
@@ -105,4 +106,4 @@ Pagination is deterministic. Ensure your `Sort` defines a total order, so that d
 
 - Offset pagination cost grows with `From` (collects at least `Size + From` results before slicing).
 - `SearchAfter`/`SearchBefore` keeps memory and network proportional to `Size`.
-- For large datasets and deep navigation, prefer using `SearchAfter` and `SearchBefore`.
+- For large datasets and deep navigation, prefer using `SearchAfter` and `SearchBefore`.
````
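As a companion to the pagination doc touched here, a keyset-pagination sketch in Go, assuming an index with a sortable text field named `name` (our placeholder, not a field from this commit); the `Sort`, `SearchAfter`, and `DecodedSort` fields are the ones the doc describes.

```go
package main

import (
	"fmt"

	"github.com/blevesearch/bleve/v2"
)

// pageThrough walks an index page by page using SearchAfter.
func pageThrough(index bleve.Index) error {
	var after []string // nil on the first page
	for {
		req := bleve.NewSearchRequest(bleve.NewMatchAllQuery())
		req.Size = 10
		req.SortBy([]string{"name", "_id"}) // "_id" tiebreak gives a total order
		if after != nil {
			req.SearchAfter = after // previous page's last hit's sort keys
		}
		res, err := index.Search(req)
		if err != nil {
			return err
		}
		if len(res.Hits) == 0 {
			return nil // no more pages
		}
		for _, hit := range res.Hits {
			fmt.Println(hit.ID)
		}
		// For text fields the raw Sort keys are usable as-is; for numeric,
		// datetime or geo sort keys, prefer DecodedSort (Bleve v2.5.4+)
		// together with typed sort objects, as the doc above explains.
		after = res.Hits[len(res.Hits)-1].Sort
	}
}
```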

docs/persister.md

Lines changed: 28 additions & 27 deletions
````diff
@@ -2,7 +2,7 @@
 
 ## Memory Management
 
-When data is indexed in Scorch — using either the `index.Index()` or `index.Batch()` API — it is added as part of an in-memory "segment". Memory management in Scorch indexing mainly relates to handling these in-memory segments during workloads that involve inserts or updates.
+When data is indexed in Scorch — using either the `index.Index()` or `index.Batch()` API — it is added as part of an in-memory "segment". Memory management in Scorch indexing mainly relates to handling these in-memory segments during workloads that involve inserts or updates.
 
 In scenarios with a continuous stream of incoming data, a large number of in-memory segments can accumulate over time. This is where the persister component comes into play—its job is to flush these in-memory segments to disk.
 
@@ -11,46 +11,47 @@ Starting with v2.5.0, Scorch supports parallel flushing of in-memory segments to
 - `NumPersisterWorkers`: This factor decides how many maximum workers can be spawned to flush out the in-memory segments. Each worker will work on a disjoint subset of segments, merge them, and flush them out to the disk. By default the persister deploys only one worker.
 - `MaxSizeInMemoryMergePerWorker`: This config decides what's the maximum amount of input data in bytes a single worker can work upon. By default this value is equal to 0 which means that this config is disabled and the worker tries to merge all the data in one shot. Also note that it's imperative that the user set this config if `NumPersisterWorkers > 1`.
 
-If the index is tuned to have a higher `NumPersisterWorkers` value, the memory can potentially drain out faster and ensure stronger consistency behaviour — but there would be a lot of on-disk files, and the background merger would experience the pressure of managing this large number of files, which can be resource-intensive.
-- Tuning this config is very dependent on the available CPU resources, and something to keep in mind here is that the process's RSS can increase if the number of workers — and each of them working upon a large amount of data — is high.
+If the index is tuned to have a higher `NumPersisterWorkers` value, the memory can potentially drain out faster and ensure stronger consistency behaviour — but there would be a lot of on-disk files, and the background merger would experience the pressure of managing this large number of files, which can be resource-intensive.
 
-Increasing the `MaxSizeInMemoryMergePerWorker` value would mean that each worker acts upon a larger amount of data and spends more time merging and flushing it out to disk — which can be healthy behaviour in terms of I/O, although it comes at the cost of time.
-- Changing this config is usecase dependent, for example in usecases where the payload or per doc size is generally large in size (for eg vector usecases), it would be beneficial to have a larger value for this.
+- Tuning this config is very dependent on the available CPU resources, and something to keep in mind here is that the process's RSS can increase if the number of workers — and each of them working upon a large amount of data — is high.
 
-So, having the ideal values for these two configs is definitely dependent on the use case and can involve a bunch of experiments, keeping the resource usage in mind.
+Increasing the `MaxSizeInMemoryMergePerWorker` value would mean that each worker acts upon a larger amount of data and spends more time merging and flushing it out to disk — which can be healthy behaviour in terms of I/O, although it comes at the cost of time.
 
+- Changing this config is usecase dependent, for example in usecases where the payload or per doc size is generally large in size (for eg vector usecases), it would be beneficial to have a larger value for this.
+
+So, having the ideal values for these two configs is definitely dependent on the use case and can involve a bunch of experiments, keeping the resource usage in mind.
 
 ## File Management
 
-The persister introducing some number of file segments into the system would change the state of the system, and the merger would wake up and try to manage these on-disk files.
+The persister introducing some number of file segments into the system would change the state of the system, and the merger would wake up and try to manage these on-disk files.
 
-Management of these files is crucial when it comes to query latency because a higher number of files would dictate searching through a larger number of files and also higher read amplification to some extent, because the backing data structures can potentially be compacted in size across files.
+Management of these files is crucial when it comes to query latency because a higher number of files would dictate searching through a larger number of files and also higher read amplification to some extent, because the backing data structures can potentially be compacted in size across files.
 
-The merger sees the files on disk and plans out which segments to merge so that the final layout of segment tiers (each tier having multiple files), which grow in a logarithmic way (the chances of larger tiers growing in number would decrease), is maintained. This also implies that deciding this first-tier size becomes important in deciding the number of segment files across all tiers.
+The merger sees the files on disk and plans out which segments to merge so that the final layout of segment tiers (each tier having multiple files), which grow in a logarithmic way (the chances of larger tiers growing in number would decrease), is maintained. This also implies that deciding this first-tier size becomes important in deciding the number of segment files across all tiers.
 
-Starting with v2.5.0, this first-tier size is dependent on the file size using the `FloorSegmentFileSize` config, because that's a better metric to consider (unlike the legacy live doc count metric) in order to ensure that the behaviour is in line with the use case and aware of the payload/doc size.
-- This config can also be tuned to dictate how the I/O behaviour should be within an index. While tuning this config, it should be in proportion to the `MaxSizeInMemoryMergePerWorker` since that dictates the amount of data flushed out per flush.
-- The observation here is that `FloorSegmentFileSize` is lesser than `MaxSizeInMemoryMergePerWorker` and for an optimal I/O during indexing, this value can be set close to `MaxSizeInMemoryMergePerWorker/6`.
+Starting with v2.5.0, this first-tier size is dependent on the file size using the `FloorSegmentFileSize` config, because that's a better metric to consider (unlike the legacy live doc count metric) in order to ensure that the behaviour is in line with the use case and aware of the payload/doc size.
 
+- This config can also be tuned to dictate how the I/O behaviour should be within an index. While tuning this config, it should be in proportion to the `MaxSizeInMemoryMergePerWorker` since that dictates the amount of data flushed out per flush.
+- The observation here is that `FloorSegmentFileSize` is lesser than `MaxSizeInMemoryMergePerWorker` and for an optimal I/O during indexing, this value can be set close to `MaxSizeInMemoryMergePerWorker/6`.
 
 ## Setting a Persister/Merger Config in Index
 
 The configs are set via the `kvConfig` parameter in the `NewUsing()` or `OpenUsing()` API:
 
 ```go
-// setting the persister and merger configs
-kvConfig := map[string]interface{}{
-	"scorchPersisterOptions": map[string]interface{}{
-		"NumPersisterWorkers": 4,
-		"MaxSizeInMemoryMergePerWorker": 20000000,
-	},
-	"scorchMergePlanOptions": map[string]interface{}{
-		"FloorSegmentFileSize": 10000000,
-	},
-}
-// passing the config to the index
-index, err := bleve.NewUsing("example.bleve", bleve.NewIndexMapping(), bleve.Config.DefaultIndexType, bleve.Config.DefaultMemKVStore, kvConfig)
-if err != nil {
-	panic(err)
-}
+// setting the persister and merger configs
+kvConfig := map[string]interface{}{
+	"scorchPersisterOptions": map[string]interface{}{
+		"NumPersisterWorkers": 4,
+		"MaxSizeInMemoryMergePerWorker": 20000000,
+	},
+	"scorchMergePlanOptions": map[string]interface{}{
+		"FloorSegmentFileSize": 10000000,
+	},
+}
+// passing the config to the index
+index, err := bleve.NewUsing("example.bleve", bleve.NewIndexMapping(), bleve.Config.DefaultIndexType, bleve.Config.DefaultMemKVStore, kvConfig)
+if err != nil {
+	panic(err)
+}
 ```
````
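To make the tuning advice concrete, here is a sketch that applies the `FloorSegmentFileSize ≈ MaxSizeInMemoryMergePerWorker/6` heuristic from the doc; the byte sizes and worker count are illustrative assumptions, not recommendations from this commit.

```go
package main

import "github.com/blevesearch/bleve/v2"

func main() {
	// Example sizing: each persister worker merges up to ~30 MB in memory,
	// and the first merge tier floors at roughly a sixth of that.
	maxMergePerWorker := 30000000
	floorFileSize := maxMergePerWorker / 6 // ~5 MB, per the /6 heuristic

	kvConfig := map[string]interface{}{
		"scorchPersisterOptions": map[string]interface{}{
			"NumPersisterWorkers":           2,
			"MaxSizeInMemoryMergePerWorker": maxMergePerWorker,
		},
		"scorchMergePlanOptions": map[string]interface{}{
			"FloorSegmentFileSize": floorFileSize,
		},
	}

	index, err := bleve.NewUsing("example.bleve", bleve.NewIndexMapping(),
		bleve.Config.DefaultIndexType, bleve.Config.DefaultMemKVStore, kvConfig)
	if err != nil {
		panic(err)
	}
	defer index.Close()
}
```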

docs/score_fusion.md

Lines changed: 17 additions & 9 deletions
````diff
@@ -28,6 +28,7 @@ RRF\_score = w_{\text{fts}} \cdot \frac{1}{k + \text{rank}_{\text{fts}}} + \sum_
 ```
 
 Where:
+
 * $\text{rank}_{\text{fts}}$: 1-indexed rank of the document in the FTS result list (or 0 if not present)
 * $\text{rank}_{\text{knn}_i}$: 1-indexed rank of the document in the i-th kNN result list (or 0 if not present)
 * $k$: rank constant (default: 60) that dampens the impact of rank differences
@@ -36,12 +37,14 @@ Where:
 * $\sum_{i=1}^{n}$: summation over all kNN queries (you can add multiple kNN queries)
 
 **Advantages:**
-* Distribution-agnostic – no need for score normalization
+
+* Distribution-agnostic - no need for score normalization
 * Works out of the box with minimal tuning
 * Prioritizes documents appearing in both result lists
 * Robust to outliers since only ranks matter
 
 **Disadvantages:**
+
 * Ignores score magnitude (loses some information)
 * May be sensitive to imbalanced result list sizes
 
@@ -86,28 +89,31 @@ Relative Score Fusion is a **score-based** strategy that normalizes scores from
 
 1. **Min-max normalize** each result set independently:
 
-```math
-\text{normalized\_score} = \frac{\text{score} - \text{min\_score}}{\text{max\_score} - \text{min\_score}}
-```
+```math
+\text{normalized\_score} = \frac{\text{score} - \text{min\_score}}{\text{max\_score} - \text{min\_score}}
+```
 
 2. **Combine** normalized scores using weighted addition:
 
-```math
-RSF\_score = w_{\text{fts}} \cdot \text{normalized\_score\_fts} + \sum_{i=1}^{n} w_{\text{knn}_i} \cdot \text{normalized\_score\_knn}_i
-```
+```math
+RSF\_score = w_{\text{fts}} \cdot \text{normalized\_score\_fts} + \sum_{i=1}^{n} w_{\text{knn}_i} \cdot \text{normalized\_score\_knn}_i
+```
 
 Where:
+
 * $w_{\text{fts}}$: weight from the FTS query boost value
 * $w_{\text{knn}_i}$: weight from the i-th kNN query boost value
 * $\sum_{i=1}^{n}$: summation over all kNN queries (you can add multiple kNN queries)
 
 **Advantages:**
-* Score-aware – retains relevance magnitude information
+
+* Score-aware - retains relevance magnitude information
 * Resolves incompatible score ranges
 * Easy to understand
 
 **Disadvantages:**
-* Sensitive to outliers – a single extreme score can skew normalization
+
+* Sensitive to outliers - a single extreme score can skew normalization
 * Doesn't account for the shape or distribution of scores
 
 **Usage:**
@@ -171,6 +177,7 @@ From + Size <= ScoreWindowSize
 ```
 
 **Example:**
+
 ```json
 {
   "score": "rrf",
@@ -195,6 +202,7 @@ With window size set to 150, you can paginate through up to 150 results. If you
 * **Effect**: Higher values dampen the impact of rank differences
 
 **Example:**
+
 ```json
 {
   "score": "rrf",
````

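The two fusion formulas in the doc above are easy to mirror in a few lines of Go. This is a sketch of the math only, not Bleve's internal implementation; the function names are ours.

```go
package rank

// rrfScore applies the Reciprocal Rank Fusion formula from the doc. ranks
// holds the 1-indexed rank of one document in each result list (FTS first,
// then each kNN list), with 0 meaning "not present"; weights holds the
// corresponding boost values; k is the rank constant (default 60).
func rrfScore(ranks []int, weights []float64, k int) float64 {
	score := 0.0
	for i, r := range ranks {
		if r == 0 {
			continue // document absent from this result list
		}
		score += weights[i] * (1.0 / float64(k+r))
	}
	return score
}

// rsfNormalize min-max normalizes one result list's scores into [0, 1],
// the first step of Relative Score Fusion described in the doc.
func rsfNormalize(scores []float64) []float64 {
	min, max := scores[0], scores[0]
	for _, s := range scores {
		if s < min {
			min = s
		}
		if s > max {
			max = s
		}
	}
	out := make([]float64, len(scores))
	for i, s := range scores {
		if max == min {
			out[i] = 0 // degenerate list: all scores equal
			continue
		}
		out[i] = (s - min) / (max - min)
	}
	return out
}
```

A document appearing near the top of both the FTS and kNN lists accumulates two large reciprocal terms under RRF, which is why it outranks documents present in only one list.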