Sparse vector indexes (#770)

Simran-B · nerpaula · web-flow · commit d596f51d616a · 2025-10-22T14:13:31.000+02:00
* Vector index type now supports sparse option

* Use same wording for sparse HTTP API docs, replace hash with persistent index, fix statements about estimates
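
For illustration, the sparse behavior documented in this change can be sketched in plain JavaScript. This is a simplified model, not ArangoDB's implementation: the helper names `getPath` and `isIndexedBySparse`, the sample documents, and the `dimension` and `nLists` values are assumptions made up for the example, and attribute paths are resolved with a naive split on `.`.

```js
"use strict";

// Hypothetical vector index definition using the new `sparse` option.
// The option names type, fields, sparse, and params.metric come from the docs
// in this change; dimension and nLists are assumed values for this sketch.
const indexDefinition = {
  type: "vector",
  fields: ["embedding"],
  sparse: true,
  params: { metric: "cosine", dimension: 3, nLists: 1 }
};

// Resolve a dotted attribute path; yields undefined if any part is missing.
function getPath(doc, path) {
  return path.split(".").reduce(
    (value, key) => (value === null || value === undefined ? undefined : value[key]),
    doc
  );
}

// Model of the documented rule: a sparse index excludes a document if at
// least one of the attributes for indexing is missing or set to null.
function isIndexedBySparse(doc, fields) {
  return fields.every((path) => {
    const value = getPath(doc, path);
    return value !== undefined && value !== null;
  });
}

// Made-up sample documents: one with an embedding, one null, one missing.
const docs = [
  { _key: "a", embedding: [0.1, 0.2, 0.3] },
  { _key: "b", embedding: null },
  { _key: "c" }
];

// A non-sparse index would cover every document; a sparse one filters.
const indexedKeys = docs
  .filter((doc) => !indexDefinition.sparse || isIndexedBySparse(doc, indexDefinition.fields))
  .map((doc) => doc._key);

console.log(indexedKeys); // only "a" has a non-null embedding
```

The same check applies to `persistent` and multi-dimensional indexes if `fields` (plus `prefixFields` for `mdi-prefixed`) is passed as the list of attribute paths.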

---------

Co-authored-by: Paula Mihu <97217318+nerpaula@users.noreply.github.com>
diff --git a/site/content/3.12/develop/foxx-microservices/guides/authentication-and-sessions.md b/site/content/3.12/develop/foxx-microservices/guides/authentication-and-sessions.md
@@ -23,7 +23,7 @@ authentication.
 In this example we'll use two collections: a `users` collection to store the
 user objects with names and credentials, and a `sessions` collection to store
 the session data. We'll also make sure usernames are unique
-by adding a hash index:
+by adding a `persistent` index:
 
 ```js
 "use strict";
@@ -37,7 +37,7 @@ if (!db._collection(sessions)) {
   db._createDocumentCollection(sessions);
 }
 module.context.collection("users").ensureIndex({
-  type: "hash",
+  type: "persistent",
   unique: true,
   fields: ["username"]
 });
diff --git a/site/content/3.12/develop/http-api/indexes/_index.md b/site/content/3.12/develop/http-api/indexes/_index.md
@@ -221,8 +221,8 @@ paths:
         insert a value into the index that already exists in the index always fails,
         regardless of the value of this attribute.
 
-        The optional **estimates** attribute is supported by persistent indexes.
-        This attribute controls whether index selectivity estimates are
+        The optional **estimates** attribute is supported by `persistent`, `mdi`, and
+        `mdi-prefixed` indexes. This attribute controls whether index selectivity estimates are
         maintained for the index. Not maintaining index selectivity estimates can have
         a slightly positive impact on write performance.
         The downside of turning off index selectivity estimates will be that
diff --git a/site/content/3.12/develop/http-api/indexes/multi-dimensional.md b/site/content/3.12/develop/http-api/indexes/multi-dimensional.md
@@ -111,7 +111,10 @@ paths:
                   default: false
                 sparse:
                   description: |
-                    If `true`, then create a sparse index.
+                    Whether to create a sparse index that excludes documents with
+                    at least one of the attributes for indexing missing or set to
+                    `null`. These attributes are defined by `fields` and (for
+                    `mdi-prefixed` indexes) by `prefixFields`.
                   type: boolean
                   default: false
                 estimates:
diff --git a/site/content/3.12/develop/http-api/indexes/persistent.md b/site/content/3.12/develop/http-api/indexes/persistent.md
@@ -113,8 +113,9 @@ paths:
                   default: false
                 sparse:
                   description: |
-                    Whether create a sparse index that excludes documents with at least
-                    one of the `fields` missing or set to `null`.
+                    Whether to create a sparse index that excludes documents with
+                    at least one of the attributes for indexing missing or set to
+                    `null`. These attributes are defined by `fields`.
                   type: boolean
                   default: false
                 deduplicate:
diff --git a/site/content/3.12/develop/http-api/indexes/vector.md b/site/content/3.12/develop/http-api/indexes/vector.md
@@ -57,14 +57,21 @@ paths:
                     A list with exactly one attribute path to specify
                     where the vector embedding is stored in each document. The vector data needs
                     to be populated before creating the index.
-                    
+
                     If you want to index another vector embedding attribute, you need to create a
                     separate vector index.
                   type: array
                   minItems: 1
                   maxItems: 1
                   items:
                     type: string
+                sparse:
+                  description: |
+                    Whether to create a sparse index that excludes documents with
+                    the attribute for indexing missing or set to `null`. This
+                    attribute is defined by `fields`.
+                  type: boolean
+                  default: false
                 parallelism:
                   description: |
                     The number of threads to use for indexing.
diff --git a/site/content/3.12/index-and-search/indexing/index-utilization.md b/site/content/3.12/index-and-search/indexing/index-utilization.md
@@ -19,8 +19,9 @@ It is often beneficial to create an index on more than just one attribute. By ad
 to an index, an index can become more selective and thus reduce the number of documents that 
 queries need to process.
 
-ArangoDB's primary indexes, edges indexes and hash indexes will automatically provide selectivity
-estimates. Index selectivity estimates are provided in the web interface, the `indexes()` return 
+ArangoDB's `primary` and `edge` indexes automatically provide selectivity estimates.
+The `persistent`, `mdi`, and `mdi-prefixed` indexes do too, by default.
+Index selectivity estimates are provided in the web interface, the `indexes()` return 
 value and in the `explain()` output for a given query. 
 
 The more selective an index is, the more documents it will filter on average. The index selectivity 
diff --git a/site/content/3.12/index-and-search/indexing/which-index-to-use-when.md b/site/content/3.12/index-and-search/indexing/which-index-to-use-when.md
@@ -175,11 +175,11 @@ db.collection.ensureIndex({ type: "persistent", fields: [ "attributeName1", "att
 When not explicitly set, the `sparse` attribute defaults to `false` for new indexes.
 Indexes other than persistent do not support the `sparse` option.
 
-As sparse indexes may exclude some documents from the collection, they cannot be used for
-all types of queries. Sparse hash indexes cannot be used to find documents for which at
-least one of the indexed attributes has a value of `null`. For example, the following AQL
-query cannot use a sparse index, even if one was created on attribute `attr`:
-<!-- TODO Remove above statement? -->
+As sparse indexes may exclude some documents from the collection, they cannot
+be used for all types of queries. For example, sparse persistent indexes cannot
+be used to find documents for which at least one of the indexed attributes
+is missing or has a value of `null`. For example, the following AQL
+query cannot use a sparse index over the attribute `attr`:
 
 ```aql
 FOR doc In collection
@@ -189,15 +189,25 @@ FOR doc In collection
 
 If the lookup value is non-constant, a sparse index may or may not be used, depending on
 the other types of conditions in the query. If the optimizer can safely determine that
-the lookup value cannot be `null`, a sparse index may be used. When uncertain, the optimizer
-does not make use of a sparse index in a query in order to produce correct results.
+the lookup value cannot be `null`, a sparse index may be used.
+
+```aql
+FOR doc In collection
+  LET random = RAND() * 5
+  FILTER doc.attr < random // Includes numbers < random but also true, false, and null!
+  FILTER doc.attr != null  // Explicitly exclude null to make a sparse index eligible
+  RETURN doc
+```
+
+When uncertain, the optimizer does not make use of a sparse index in a query in
+order to produce correct results.
 
 For example, the following queries cannot use a sparse index on `attr` because the optimizer
 does not know beforehand whether the values which are compared to `doc.attr` include `null`:
 
 ```aql
 FOR doc In collection 
-  FILTER doc.attr == SOME_FUNCTION(...) 
+  FILTER doc.attr == SOME_FUNCTION(...)
   RETURN doc
 ```
 
diff --git a/site/content/3.12/index-and-search/indexing/working-with-indexes/vector-indexes.md b/site/content/3.12/index-and-search/indexing/working-with-indexes/vector-indexes.md
@@ -62,15 +62,18 @@ centroids and the quality of vector search thus degrades.
 - **fields** (array of strings): A list with a single attribute path to specify
   where the vector embedding is stored in each document. The vector data needs
   to be populated before creating the index.
-  
+
   If you want to index another vector embedding attribute, you need to create a
   separate vector index.
+- **sparse** (boolean): Whether to create a sparse index that excludes documents
+  with the attribute for indexing missing or set to `null`. This attribute is
+  defined by `fields`. Default: `false`.
 - **parallelism** (number):
-  The number of threads to use for indexing. The default is `2`.
+  The number of threads to use for indexing. Default: `2`.
 - **inBackground** (boolean):
   Set this option to `true` to keep the collection/shards available for
   write operations by not using an exclusive write lock for the duration
-  of the index creation. The default is `false`.
+  of the index creation. Default: `false`.
 - **params**: The parameters as used by the Faiss library.
   - **metric** (string): The measure for calculating the vector similarity:
     - `"cosine"`: Angular similarity. Vectors are automatically
@@ -92,11 +95,11 @@ centroids and the quality of vector search thus degrades.
     number of documents.
   - **defaultNProbe** (number, _optional_): How many neighboring centroids to
     consider for the search results by default. The larger the number, the slower
-    the search but the better the search results. The default is `1`. You should
+    the search but the better the search results. Default: `1`. You should
     generally use a higher value here or per query via the `nProbe` option of
     the vector similarity functions.
   - **trainingIterations** (number, _optional_): The number of iterations in the
-    training process. The default is `25`. Smaller values lead to a faster index
+    training process. Default: `25`. Smaller values lead to a faster index
     creation but may yield worse search results. 
   - **factory** (string, _optional_): You can specify an index factory string that is
     forwarded to the underlying Faiss library, allowing you to combine different
diff --git a/site/content/3.12/release-notes/version-3.12/whats-new-in-3-12.md b/site/content/3.12/release-notes/version-3.12/whats-new-in-3-12.md
@@ -1465,10 +1465,12 @@ has been added.
 
 <small>Introduced in: v3.12.6</small>
 
+Vector indexes can now be sparse to exclude documents with the embedding attribute
+for indexing missing or set to `null`.
+
 Another metric has been added. The `innerProduct` is a vector similarity measure
 calculated using the dot product of two vectors without normalizing them.
 Therefore, it compares not only the angle but also the magnitudes.
-
 The accompanying AQL function is the following:
 
 - `APPROX_NEAR_INNER_PRODUCT()`
diff --git a/site/content/3.13/develop/foxx-microservices/guides/authentication-and-sessions.md b/site/content/3.13/develop/foxx-microservices/guides/authentication-and-sessions.md
@@ -23,7 +23,7 @@ authentication.
 In this example we'll use two collections: a `users` collection to store the
 user objects with names and credentials, and a `sessions` collection to store
 the session data. We'll also make sure usernames are unique
-by adding a hash index:
+by adding a `persistent` index:
 
 ```js
 "use strict";
@@ -37,7 +37,7 @@ if (!db._collection(sessions)) {
   db._createDocumentCollection(sessions);
 }
 module.context.collection("users").ensureIndex({
-  type: "hash",
+  type: "persistent",
   unique: true,
   fields: ["username"]
 });
diff --git a/site/content/3.13/develop/http-api/indexes/_index.md b/site/content/3.13/develop/http-api/indexes/_index.md
@@ -221,8 +221,8 @@ paths:
         insert a value into the index that already exists in the index always fails,
         regardless of the value of this attribute.
 
-        The optional **estimates** attribute is supported by persistent indexes.
-        This attribute controls whether index selectivity estimates are
+        The optional **estimates** attribute is supported by `persistent`, `mdi`, and
+        `mdi-prefixed` indexes. This attribute controls whether index selectivity estimates are
         maintained for the index. Not maintaining index selectivity estimates can have
         a slightly positive impact on write performance.
         The downside of turning off index selectivity estimates will be that
diff --git a/site/content/3.13/develop/http-api/indexes/multi-dimensional.md b/site/content/3.13/develop/http-api/indexes/multi-dimensional.md
@@ -111,7 +111,10 @@ paths:
                   default: false
                 sparse:
                   description: |
-                    If `true`, then create a sparse index.
+                    Whether to create a sparse index that excludes documents with
+                    at least one of the attributes for indexing missing or set to
+                    `null`. These attributes are defined by `fields` and (for
+                    `mdi-prefixed` indexes) by `prefixFields`.
                   type: boolean
                   default: false
                 estimates:
diff --git a/site/content/3.13/develop/http-api/indexes/persistent.md b/site/content/3.13/develop/http-api/indexes/persistent.md
@@ -113,8 +113,9 @@ paths:
                   default: false
                 sparse:
                   description: |
-                    Whether create a sparse index that excludes documents with at least
-                    one of the `fields` missing or set to `null`.
+                    Whether to create a sparse index that excludes documents with
+                    at least one of the attributes for indexing missing or set to
+                    `null`. These attributes are defined by `fields`.
                   type: boolean
                   default: false
                 deduplicate:
diff --git a/site/content/3.13/develop/http-api/indexes/vector.md b/site/content/3.13/develop/http-api/indexes/vector.md
@@ -57,14 +57,21 @@ paths:
                     A list with exactly one attribute path to specify
                     where the vector embedding is stored in each document. The vector data needs
                     to be populated before creating the index.
-                    
+
                     If you want to index another vector embedding attribute, you need to create a
                     separate vector index.
                   type: array
                   minItems: 1
                   maxItems: 1
                   items:
                     type: string
+                sparse:
+                  description: |
+                    Whether to create a sparse index that excludes documents with
+                    the attribute for indexing missing or set to `null`. This
+                    attribute is defined by `fields`.
+                  type: boolean
+                  default: false
                 parallelism:
                   description: |
                     The number of threads to use for indexing.
diff --git a/site/content/3.13/index-and-search/indexing/index-utilization.md b/site/content/3.13/index-and-search/indexing/index-utilization.md
@@ -19,8 +19,9 @@ It is often beneficial to create an index on more than just one attribute. By ad
 to an index, an index can become more selective and thus reduce the number of documents that 
 queries need to process.
 
-ArangoDB's primary indexes, edges indexes and hash indexes will automatically provide selectivity
-estimates. Index selectivity estimates are provided in the web interface, the `indexes()` return 
+ArangoDB's `primary` and `edge` indexes automatically provide selectivity estimates.
+The `persistent`, `mdi`, and `mdi-prefixed` indexes do too, by default.
+Index selectivity estimates are provided in the web interface, the `indexes()` return 
 value and in the `explain()` output for a given query. 
 
 The more selective an index is, the more documents it will filter on average. The index selectivity 
diff --git a/site/content/3.13/index-and-search/indexing/which-index-to-use-when.md b/site/content/3.13/index-and-search/indexing/which-index-to-use-when.md
@@ -175,11 +175,11 @@ db.collection.ensureIndex({ type: "persistent", fields: [ "attributeName1", "att
 When not explicitly set, the `sparse` attribute defaults to `false` for new indexes.
 Indexes other than persistent do not support the `sparse` option.
 
-As sparse indexes may exclude some documents from the collection, they cannot be used for
-all types of queries. Sparse hash indexes cannot be used to find documents for which at
-least one of the indexed attributes has a value of `null`. For example, the following AQL
-query cannot use a sparse index, even if one was created on attribute `attr`:
-<!-- TODO Remove above statement? -->
+As sparse indexes may exclude some documents from the collection, they cannot
+be used for all types of queries. For example, sparse persistent indexes cannot
+be used to find documents for which at least one of the indexed attributes
+is missing or has a value of `null`. For example, the following AQL
+query cannot use a sparse index over the attribute `attr`:
 
 ```aql
 FOR doc In collection
@@ -189,15 +189,25 @@ FOR doc In collection
 
 If the lookup value is non-constant, a sparse index may or may not be used, depending on
 the other types of conditions in the query. If the optimizer can safely determine that
-the lookup value cannot be `null`, a sparse index may be used. When uncertain, the optimizer
-does not make use of a sparse index in a query in order to produce correct results.
+the lookup value cannot be `null`, a sparse index may be used.
+
+```aql
+FOR doc In collection
+  LET random = RAND() * 5
+  FILTER doc.attr < random // Includes numbers < random but also true, false, and null!
+  FILTER doc.attr != null  // Explicitly exclude null to make a sparse index eligible
+  RETURN doc
+```
+
+When uncertain, the optimizer does not make use of a sparse index in a query in
+order to produce correct results.
 
 For example, the following queries cannot use a sparse index on `attr` because the optimizer
 does not know beforehand whether the values which are compared to `doc.attr` include `null`:
 
 ```aql
 FOR doc In collection 
-  FILTER doc.attr == SOME_FUNCTION(...) 
+  FILTER doc.attr == SOME_FUNCTION(...)
   RETURN doc
 ```
 
diff --git a/site/content/3.13/index-and-search/indexing/working-with-indexes/vector-indexes.md b/site/content/3.13/index-and-search/indexing/working-with-indexes/vector-indexes.md
@@ -62,15 +62,18 @@ centroids and the quality of vector search thus degrades.
 - **fields** (array of strings): A list with a single attribute path to specify
   where the vector embedding is stored in each document. The vector data needs
   to be populated before creating the index.
-  
+
   If you want to index another vector embedding attribute, you need to create a
   separate vector index.
+- **sparse** (boolean): Whether to create a sparse index that excludes documents
+  with the attribute for indexing missing or set to `null`. This attribute is
+  defined by `fields`. Default: `false`.
 - **parallelism** (number):
-  The number of threads to use for indexing. The default is `2`.
+  The number of threads to use for indexing. Default: `2`.
 - **inBackground** (boolean):
   Set this option to `true` to keep the collection/shards available for
   write operations by not using an exclusive write lock for the duration
-  of the index creation. The default is `false`.
+  of the index creation. Default: `false`.
 - **params**: The parameters as used by the Faiss library.
   - **metric** (string): The measure for calculating the vector similarity:
     - `"cosine"`: Angular similarity. Vectors are automatically
@@ -92,11 +95,11 @@ centroids and the quality of vector search thus degrades.
     number of documents.
   - **defaultNProbe** (number, _optional_): How many neighboring centroids to
     consider for the search results by default. The larger the number, the slower
-    the search but the better the search results. The default is `1`. You should
+    the search but the better the search results. Default: `1`. You should
     generally use a higher value here or per query via the `nProbe` option of
     the vector similarity functions.
   - **trainingIterations** (number, _optional_): The number of iterations in the
-    training process. The default is `25`. Smaller values lead to a faster index
+    training process. Default: `25`. Smaller values lead to a faster index
     creation but may yield worse search results. 
   - **factory** (string, _optional_): You can specify an index factory string that is
     forwarded to the underlying Faiss library, allowing you to combine different
diff --git a/site/content/3.13/release-notes/version-3.12/whats-new-in-3-12.md b/site/content/3.13/release-notes/version-3.12/whats-new-in-3-12.md
@@ -1465,10 +1465,12 @@ has been added.
 
 <small>Introduced in: v3.12.6</small>
 
+Vector indexes can now be sparse to exclude documents with the embedding attribute
+for indexing missing or set to `null`.
+
 Another metric has been added. The `innerProduct` is a vector similarity measure
 calculated using the dot product of two vectors without normalizing them.
 Therefore, it compares not only the angle but also the magnitudes.
-
 The accompanying AQL function is the following:
 
 - `APPROX_NEAR_INNER_PRODUCT()`