
Commit d596f51

Simran-B and nerpaula authored
Sparse vector indexes (#770)
* Vector index type now supports sparse option
* Use same wording for sparse HTTP API docs, replace hash with persistent index, fix statements about estimates

---------

Co-authored-by: Paula Mihu <[email protected]>
1 parent 3de2dbf commit d596f51


18 files changed: +102 -48 lines changed


site/content/3.12/develop/foxx-microservices/guides/authentication-and-sessions.md

Lines changed: 2 additions & 2 deletions
@@ -23,7 +23,7 @@ authentication.
2323
In this example we'll use two collections: a `users` collection to store the
2424
user objects with names and credentials, and a `sessions` collection to store
2525
the session data. We'll also make sure usernames are unique
26-
by adding a hash index:
26+
by adding a `persistent` index:
2727

2828
```js
2929
"use strict";
@@ -37,7 +37,7 @@ if (!db._collection(sessions)) {
3737
db._createDocumentCollection(sessions);
3838
}
3939
module.context.collection("users").ensureIndex({
40-
type: "hash",
40+
type: "persistent",
4141
unique: true,
4242
fields: ["username"]
4343
});

site/content/3.12/develop/http-api/indexes/_index.md

Lines changed: 2 additions & 2 deletions
@@ -221,8 +221,8 @@ paths:
221221
insert a value into the index that already exists in the index always fails,
222222
regardless of the value of this attribute.
223223
224-
The optional **estimates** attribute is supported by persistent indexes.
225-
This attribute controls whether index selectivity estimates are
224+
The optional **estimates** attribute is supported by `persistent`, `mdi`, and
225+
`mdi-prefixed` indexes. This attribute controls whether index selectivity estimates are
226226
maintained for the index. Not maintaining index selectivity estimates can have
227227
a slightly positive impact on write performance.
228228
The downside of turning off index selectivity estimates will be that

site/content/3.12/develop/http-api/indexes/multi-dimensional.md

Lines changed: 4 additions & 1 deletion
@@ -111,7 +111,10 @@ paths:
111111
default: false
112112
sparse:
113113
description: |
114-
If `true`, then create a sparse index.
114+
Whether to create a sparse index that excludes documents with
115+
at least one of the attributes for indexing missing or set to
116+
`null`. These attributes are defined by `fields` and (for
117+
`mdi-prefixed` indexes) by `prefixFields`.
115118
type: boolean
116119
default: false
117120
estimates:
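
The new `sparse` wording above can be illustrated with a small plain-JavaScript sketch. It is not ArangoDB code: `getPath()` and `isIndexedBySparseIndex()` are hypothetical helpers, and the index definition is only an example object. The sketch models the documented rule that a sparse index skips a document if any attribute named in `fields` (or, for `mdi-prefixed` indexes, in `prefixFields`) is missing or `null`.

```js
"use strict";

// Resolve a dotted attribute path such as "location.lat" on a document.
function getPath(doc, path) {
  return path.split(".").reduce(
    (value, key) =>
      value === null || value === undefined ? undefined : value[key],
    doc
  );
}

// Hypothetical helper (not an ArangoDB API): returns true if a sparse index
// with the given definition would contain the document.
function isIndexedBySparseIndex(doc, indexDefinition) {
  const paths = [
    ...(indexDefinition.prefixFields || []),
    ...indexDefinition.fields,
  ];
  // Every indexed attribute must be present and not null.
  return paths.every((path) => {
    const value = getPath(doc, path);
    return value !== undefined && value !== null;
  });
}

// Example definition; the attribute names are made up for illustration.
const mdiPrefixed = {
  type: "mdi-prefixed",
  fields: ["x", "y"],
  prefixFields: ["tenant"],
  fieldValueTypes: "double",
  sparse: true,
};

console.log(isIndexedBySparseIndex({ tenant: "a", x: 1, y: 2 }, mdiPrefixed)); // true
console.log(isIndexedBySparseIndex({ tenant: "a", x: 1 }, mdiPrefixed)); // false, y is missing
console.log(isIndexedBySparseIndex({ tenant: null, x: 1, y: 2 }, mdiPrefixed)); // false, tenant is null
```

For a `persistent` index the same check applies with `fields` only, because that index type has no `prefixFields`.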

site/content/3.12/develop/http-api/indexes/persistent.md

Lines changed: 3 additions & 2 deletions
@@ -113,8 +113,9 @@ paths:
113113
default: false
114114
sparse:
115115
description: |
116-
Whether create a sparse index that excludes documents with at least
117-
one of the `fields` missing or set to `null`.
116+
Whether to create a sparse index that excludes documents with
117+
at least one of the attributes for indexing missing or set to
118+
`null`. These attributes are defined by `fields`.
118119
type: boolean
119120
default: false
120121
deduplicate:

site/content/3.12/develop/http-api/indexes/vector.md

Lines changed: 8 additions & 1 deletion
@@ -57,14 +57,21 @@ paths:
5757
A list with exactly one attribute path to specify
5858
where the vector embedding is stored in each document. The vector data needs
5959
to be populated before creating the index.
60-
60+
6161
If you want to index another vector embedding attribute, you need to create a
6262
separate vector index.
6363
type: array
6464
minItems: 1
6565
maxItems: 1
6666
items:
6767
type: string
68+
sparse:
69+
description: |
70+
Whether to create a sparse index that excludes documents with
71+
the attribute for indexing missing or set to `null`. This
72+
attribute is defined by `fields`.
73+
type: boolean
74+
default: false
6875
parallelism:
6976
description: |
7077
The number of threads to use for indexing.
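
As a rough illustration of the schema change above, the following sketch builds a vector index definition with the new `sparse` option and checks it against the constraints stated in the hunk (`fields` must contain exactly one string, `sparse` is a boolean). The attribute name `embedding`, the `params` values, and `checkVectorIndexDefinition()` are assumptions made for this example, and `dimension` and `nLists` do not appear in this hunk.

```js
"use strict";

// Example index definition. All values are illustrative assumptions.
const vectorIndex = {
  type: "vector",
  fields: ["embedding"], // exactly one attribute path
  sparse: true, // skip documents where `embedding` is missing or null
  params: {
    metric: "cosine",
    dimension: 4, // assumption: not shown in this hunk
    nLists: 10, // assumption: not shown in this hunk
  },
};

// Hypothetical client-side check that mirrors the schema constraints
// from the hunk (minItems: 1, maxItems: 1, sparse: boolean).
function checkVectorIndexDefinition(def) {
  const errors = [];
  if (
    !Array.isArray(def.fields) ||
    def.fields.length !== 1 ||
    typeof def.fields[0] !== "string"
  ) {
    errors.push("fields must be an array with exactly one attribute path");
  }
  if (def.sparse !== undefined && typeof def.sparse !== "boolean") {
    errors.push("sparse must be a boolean");
  }
  return errors;
}

console.log(checkVectorIndexDefinition(vectorIndex)); // []

// In arangosh, the definition could then be passed to ensureIndex(),
// with `items` being a hypothetical collection name:
// db.items.ensureIndex(vectorIndex);
```

Leaving out `sparse` is equivalent to `sparse: false`, which is the documented default.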

site/content/3.12/index-and-search/indexing/index-utilization.md

Lines changed: 3 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -19,8 +19,9 @@ It is often beneficial to create an index on more than just one attribute. By ad
1919
to an index, an index can become more selective and thus reduce the number of documents that
2020
queries need to process.
2121

22-
ArangoDB's primary indexes, edges indexes and hash indexes will automatically provide selectivity
23-
estimates. Index selectivity estimates are provided in the web interface, the `indexes()` return
22+
ArangoDB's `primary` and `edge` indexes automatically provide selectivity estimates.
23+
The `persistent`, `mdi`, and `mdi-prefixed` indexes do too, by default.
24+
Index selectivity estimates are provided in the web interface, the `indexes()` return
2425
value and in the `explain()` output for a given query.
2526

2627
The more selective an index is, the more documents it will filter on average. The index selectivity

site/content/3.12/index-and-search/indexing/which-index-to-use-when.md

Lines changed: 18 additions & 8 deletions
@@ -175,11 +175,11 @@ db.collection.ensureIndex({ type: "persistent", fields: [ "attributeName1", "att
175175
When not explicitly set, the `sparse` attribute defaults to `false` for new indexes.
176176
Indexes other than persistent do not support the `sparse` option.
177177

178-
As sparse indexes may exclude some documents from the collection, they cannot be used for
179-
all types of queries. Sparse hash indexes cannot be used to find documents for which at
180-
least one of the indexed attributes has a value of `null`. For example, the following AQL
181-
query cannot use a sparse index, even if one was created on attribute `attr`:
182-
<!-- TODO Remove above statement? -->
178+
As sparse indexes may exclude some documents from the collection, they cannot
179+
be used for all types of queries. For example, sparse persistent indexes cannot
180+
be used to find documents for which at least one of the indexed attributes
181+
is missing or has a value of `null`. The following AQL
182+
query cannot use a sparse index over the attribute `attr`:
183183

184184
```aql
185185
FOR doc In collection
@@ -189,15 +189,25 @@ FOR doc In collection
189189

190190
If the lookup value is non-constant, a sparse index may or may not be used, depending on
191191
the other types of conditions in the query. If the optimizer can safely determine that
192-
the lookup value cannot be `null`, a sparse index may be used. When uncertain, the optimizer
193-
does not make use of a sparse index in a query in order to produce correct results.
192+
the lookup value cannot be `null`, a sparse index may be used.
193+
194+
```aql
195+
FOR doc In collection
196+
LET random = RAND() * 5
197+
FILTER doc.attr < random // Includes numbers < random but also true, false, and null!
198+
FILTER doc.attr != null // Explicitly exclude null to make a sparse index eligible
199+
RETURN doc
200+
```
201+
202+
When uncertain, the optimizer does not make use of a sparse index in a query in
203+
order to produce correct results.
194204

195205
For example, the following queries cannot use a sparse index on `attr` because the optimizer
196206
does not know beforehand whether the values which are compared to `doc.attr` include `null`:
197207

198208
```aql
199209
FOR doc In collection
200-
FILTER doc.attr == SOME_FUNCTION(...)
210+
FILTER doc.attr == SOME_FUNCTION(...)
201211
RETURN doc
202212
```
203213
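
The reasoning in this hunk, that a sparse index cannot serve lookups for `null`, can be sketched with a toy in-memory index in plain JavaScript. This is a simplified model, not how ArangoDB implements indexes. `buildSparseIndex()` and the sample documents are hypothetical. The scan treats a missing attribute like `null`, which is how AQL evaluates `doc.attr == null`.

```js
"use strict";

// Sample documents (made up for illustration).
const docs = [
  { _key: "a", attr: 1 },
  { _key: "b", attr: null },
  { _key: "c" }, // attr is missing
  { _key: "d", attr: 3 },
];

// Toy sparse index on one attribute: maps each indexed value to the keys
// of the documents that have it. Documents without a value are skipped.
function buildSparseIndex(documents, field) {
  const index = new Map();
  for (const doc of documents) {
    const value = doc[field];
    if (value === undefined || value === null) continue; // sparse: not indexed
    if (!index.has(value)) index.set(value, []);
    index.get(value).push(doc._key);
  }
  return index;
}

const sparseIndex = buildSparseIndex(docs, "attr");

// Correct answer for `FILTER doc.attr == null`, found by scanning all documents.
const viaScan = docs
  .filter((doc) => doc.attr === undefined || doc.attr === null)
  .map((doc) => doc._key);

// The same lookup through the sparse index finds nothing.
const viaIndex = sparseIndex.get(null) || [];

console.log(viaScan); // [ 'b', 'c' ]
console.log(viaIndex); // []
```

Because the index-based lookup would silently drop `b` and `c`, an optimizer has to fall back to a different access path whenever it cannot rule out `null` as the lookup value.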

site/content/3.12/index-and-search/indexing/working-with-indexes/vector-indexes.md

Lines changed: 8 additions & 5 deletions
@@ -62,15 +62,18 @@ centroids and the quality of vector search thus degrades.
6262
- **fields** (array of strings): A list with a single attribute path to specify
6363
where the vector embedding is stored in each document. The vector data needs
6464
to be populated before creating the index.
65-
65+
6666
If you want to index another vector embedding attribute, you need to create a
6767
separate vector index.
68+
- **sparse** (boolean): Whether to create a sparse index that excludes documents
69+
with the attribute for indexing missing or set to `null`. This attribute is
70+
defined by `fields`. Default: `false`.
6871
- **parallelism** (number):
69-
The number of threads to use for indexing. The default is `2`.
72+
The number of threads to use for indexing. Default: `2`.
7073
- **inBackground** (boolean):
7174
Set this option to `true` to keep the collection/shards available for
7275
write operations by not using an exclusive write lock for the duration
73-
of the index creation. The default is `false`.
76+
of the index creation. Default: `false`.
7477
- **params**: The parameters as used by the Faiss library.
7578
- **metric** (string): The measure for calculating the vector similarity:
7679
- `"cosine"`: Angular similarity. Vectors are automatically
@@ -92,11 +95,11 @@ centroids and the quality of vector search thus degrades.
9295
number of documents.
9396
- **defaultNProbe** (number, _optional_): How many neighboring centroids to
9497
consider for the search results by default. The larger the number, the slower
95-
the search but the better the search results. The default is `1`. You should
98+
the search but the better the search results. Default: `1`. You should
9699
generally use a higher value here or per query via the `nProbe` option of
97100
the vector similarity functions.
98101
- **trainingIterations** (number, _optional_): The number of iterations in the
99-
training process. The default is `25`. Smaller values lead to a faster index
102+
training process. Default: `25`. Smaller values lead to a faster index
100103
creation but may yield worse search results.
101104
- **factory** (string, _optional_): You can specify an index factory string that is
102105
forwarded to the underlying Faiss library, allowing you to combine different

site/content/3.12/release-notes/version-3.12/whats-new-in-3-12.md

Lines changed: 3 additions & 1 deletion
@@ -1465,10 +1465,12 @@ has been added.
14651465

14661466
<small>Introduced in: v3.12.6</small>
14671467

1468+
Vector indexes can now be sparse to exclude documents with the embedding attribute
1469+
for indexing missing or set to `null`.
1470+
14681471
Another metric has been added. The `innerProduct` is a vector similarity measure
14691472
calculated using the dot product of two vectors without normalizing them.
14701473
Therefore, it compares not only the angle but also the magnitudes.
1471-
14721474
The accompanying AQL function is the following:
14731475

14741476
- `APPROX_NEAR_INNER_PRODUCT()`

site/content/3.13/develop/foxx-microservices/guides/authentication-and-sessions.md

Lines changed: 2 additions & 2 deletions
@@ -23,7 +23,7 @@ authentication.
2323
In this example we'll use two collections: a `users` collection to store the
2424
user objects with names and credentials, and a `sessions` collection to store
2525
the session data. We'll also make sure usernames are unique
26-
by adding a hash index:
26+
by adding a `persistent` index:
2727

2828
```js
2929
"use strict";
@@ -37,7 +37,7 @@ if (!db._collection(sessions)) {
3737
db._createDocumentCollection(sessions);
3838
}
3939
module.context.collection("users").ensureIndex({
40-
type: "hash",
40+
type: "persistent",
4141
unique: true,
4242
fields: ["username"]
4343
});
