Commit c3a6a51

Add Vector Search storage format spec (#456)
1 parent 64f6f6b commit c3a6a51

2 files changed

+243
-0
lines changed

_quarto.yml

Lines changed: 1 addition & 0 deletions
@@ -71,6 +71,7 @@ website:
7171
- href: "documentation/index.md"
7272
- href: "documentation/Building.md"
7373
- href: "documentation/Benchmarks.md"
74+
- href: "documentation/storage-format-spec.md"
7475

7576
# - section: "Examples"
7677
# contents:
Lines changed: 242 additions & 0 deletions
@@ -0,0 +1,242 @@
1+
---
2+
title: Storage Format Spec
3+
description: "Learn about the vector search storage format specification for different indexing algorithms."
4+
---
5+
6+
The storage model used for indexing vectors in TileDB-Vector-Search depends heavily on the indexing algorithm. However, some high-level structures are shared across algorithms.
7+
8+
## Cross algorithm storage format
9+
10+
All data and metadata required for a TileDB-Vector-Search index are stored inside a TileDB group (`index_uri`). All the named arrays listed below are stored under this URI.
11+
12+
### Index metadata
13+
14+
Metadata values that configure the properties of an index are stored in the `index_uri` group metadata. Some values are required by all algorithm implementations, while others are specific to a single algorithm. The table below lists the metadata values recorded for all algorithms.
15+
16+
| Name | Description |
17+
|------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
18+
| `dataset_type` | The asset type for disambiguation in TileDB Cloud. Value: `vector_search` |
19+
| `index_type` | The index algorithm used for this index. Can be one of the following values: `FLAT`, `IVF_FLAT`, `VAMANA`, `IVF_PQ` |
20+
| `storage_version` | The storage version used for the index. It allows indexing algorithms to change their storage logic without affecting previously created indexes, which maintains backwards compatibility. |
21+
| `dtype` | The data type of the vector values. |
22+
| `ingestion_timestamps` | An ordered list of timestamps that correspond to different calls of ingestion and update consolidation through the lifetime of the index. |
23+
| `base_sizes` | An ordered list of the number of vectors in the base index at each ingestion timestamp. |
24+
| `has_updates` | Boolean value denoting if there are updates recorded in the updates array. |
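
The pairing of `ingestion_timestamps` and `base_sizes` can be illustrated with a small sketch. It assumes the two lists are aligned by position and that a read at a given timestamp sees the most recent ingestion at or before it. The helper `base_size_at` and the sample values are hypothetical, and the sketch says nothing about how the lists are encoded in the group metadata.

```python
from bisect import bisect_right

# Hypothetical metadata values, modeled as plain Python lists.
ingestion_timestamps = [1000, 2000, 3000]
base_sizes = [500, 750, 900]


def base_size_at(timestamp, timestamps, sizes):
    """Return the base index size visible at `timestamp`, or 0 if none."""
    # Count the ingestions that happened at or before `timestamp`.
    count = bisect_right(timestamps, timestamp)
    return sizes[count - 1] if count > 0 else 0


print(base_size_at(2500, ingestion_timestamps, base_sizes))
```

Because both lists are ordered, a binary search is enough to locate the entry that applies to a given point in time.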
25+
26+
### Object metadata
27+
28+
This is a 1D sparse array with `external_id` as its dimension. Its attributes are the user-defined metadata attributes for the respective vectors.
29+
30+
#### Basic schema parameters
31+
32+
| **Parameter** | **Value** |
33+
|:--------------|:----------|
34+
| Array type | Sparse |
35+
| Rank | 1D |
36+
| Cell order | Row-major |
37+
| Tile order | Row-major |
38+
39+
#### Dimensions
40+
41+
| Dimension Name | TileDB Datatype |
42+
| :------------- | :-------------------- |
43+
| `external_id` | `uint64_t` |
44+
45+
### Updates
46+
47+
TileDB-Vector-Search supports updates for all index algorithms by recording updates outside the main indexing storage structure and periodically consolidating them. Updates are recorded in the `updates` array, a sparse 1D array whose dimension is the `external_id` of each vector and whose single variable-length attribute holds the vector itself, or an empty value if the vector was deleted.
48+
49+
#### Basic schema parameters
50+
51+
| **Parameter** | **Value** |
52+
| :------------ | :-------- |
53+
| Array type | Sparse |
54+
| Rank | 1D |
55+
| Cell order | Row-major |
56+
| Tile order | Row-major |
57+
58+
#### Dimensions
59+
60+
| Dimension Name | TileDB Datatype |
61+
| :------------- | :-------------------- |
62+
| `external_id` | `uint64_t` |
63+
64+
#### Attributes
65+
66+
| Attribute Name | TileDB Datatype | Description |
67+
| :--------------- | :-------------- | :--------------------------------------------------------------------------------------------- |
68+
| `vector` | variable `dtype`| Contains the vector value. Empty values correspond to vector deletions. |
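
The update semantics described above can be sketched in plain Python. This is a minimal model, not the library's consolidation code: the `updates` array is represented as a dictionary from `external_id` to vector, and the helper `apply_updates` is a hypothetical name introduced for illustration.

```python
def apply_updates(base, updates):
    """Merge update records into a base mapping of external_id -> vector."""
    merged = dict(base)
    for external_id, vector in updates.items():
        if len(vector) == 0:
            # An empty value marks the vector as deleted.
            merged.pop(external_id, None)
        else:
            # A non-empty value inserts a new vector or overwrites an old one.
            merged[external_id] = list(vector)
    return merged


base = {10: [0.0, 1.0], 11: [2.0, 3.0]}
updates = {11: [], 12: [4.0, 5.0], 10: [9.0, 9.0]}
result = apply_updates(base, updates)
print(result)
```

Here vector `11` is deleted, vector `12` is added, and vector `10` is replaced.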
69+
70+
## Algorithm specific storage format
71+
72+
### FLAT
73+
74+
#### `shuffled_vectors`
75+
76+
This is a 2D dense array that holds all the vectors with no specific ordering.
77+
78+
#### Basic schema parameters
79+
80+
| **Parameter** | **Value** |
81+
|:--------------|:----------|
82+
| Array type | Dense |
83+
| Rank | 2D |
84+
| Cell order | Col-major |
85+
| Tile order | Col-major |
86+
87+
#### Dimensions
88+
89+
| Dimension Name | TileDB Datatype | Domain | Description |
90+
|:---------------|:----------------|:------------------|:----------------------------------------------------------|
91+
| `rows` | `int32_t` | `[0, dimensions]` | Corresponds to the vector dimensions. |
92+
| `cols` | `int32_t` | `[0, MAX_INT32]` | Corresponds to the vector position in the set of vectors. |
93+
94+
#### Attributes
95+
96+
| Attribute Name | TileDB Datatype | Description |
97+
| :--------------- | :-------------- | :---------------------------------------------------------------------------|
98+
| `values` | `dtype` | Contains the vector value at the specific dimension. |
99+
100+
#### `shuffled_ids`
101+
102+
This is a 1D dense array that maps vector positions in the `shuffled_vectors` array to `external_ids` of each vector.
103+
104+
#### Basic schema parameters
105+
106+
| **Parameter** | **Value** |
107+
| :------------ | :-------- |
108+
| Array type | Dense |
109+
| Rank | 1D |
110+
| Cell order | Col-major |
111+
| Tile order | Col-major |
112+
113+
#### Dimensions
114+
115+
| Dimension Name | TileDB Datatype | Domain | Description |
116+
| :------------- | :-------------------- | :-----------------| :--------------------------------------------------------- |
117+
| `rows` | `int32_t` | `[0, MAX_INT32]` | Corresponds to the vector position in `shuffled_vectors`. |
118+
119+
#### Attributes
120+
121+
| Attribute Name | TileDB Datatype | Description |
122+
| :--------------- | :-------------- | :---------------------------------------------------------------------------|
123+
| `values` | `uint64_t` | Contains the vector's `external_id`. |
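
As a rough illustration of how the two FLAT arrays work together, the sketch below models `shuffled_vectors` as a flat column-major buffer, so that each vector is contiguous, and `shuffled_ids` as a list indexed by vector position. It ignores tiling and assumes each vector has exactly `dimensions` values. The helpers `vector_at` and `flat_query` are hypothetical names, and the squared Euclidean distance is an illustrative choice.

```python
def vector_at(values, dimensions, position):
    """Return the vector stored at column `position` of a column-major buffer."""
    start = position * dimensions
    return values[start:start + dimensions]


def flat_query(values, ids, dimensions, query, k):
    """Brute-force search: return the external_ids of the k closest vectors."""
    scored = []
    for position, external_id in enumerate(ids):
        vector = vector_at(values, dimensions, position)
        distance = sum((a - b) ** 2 for a, b in zip(vector, query))
        scored.append((distance, external_id))
    scored.sort()
    return [external_id for _, external_id in scored[:k]]


# Three 2-dimensional vectors laid out one after another.
shuffled_vectors = [0.0, 0.0, 1.0, 1.0, 5.0, 5.0]
shuffled_ids = [42, 7, 99]
nearest = flat_query(shuffled_vectors, shuffled_ids, 2, [0.9, 0.9], 2)
print(nearest)
```

The search works on positions only; `shuffled_ids` is consulted at the end to translate positions into the identifiers the user supplied.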
124+
125+
### IVF_FLAT
126+
127+
#### Metadata
128+
129+
| Name | Description |
130+
| ------ | ------ |
131+
| `partition_history` | An ordered list of the number of partitions used at different ingestion timestamps. |
132+
133+
#### `partition_centroids`
134+
135+
This is a 2D dense array storing the k-means centroids for the different vector partitions.
136+
137+
#### Basic schema parameters
138+
139+
| **Parameter** | **Value** |
140+
|:--------------|:----------|
141+
| Array type | Dense |
142+
| Rank | 2D |
143+
| Cell order | Col-major |
144+
| Tile order | Col-major |
145+
146+
#### Dimensions
147+
148+
| Dimension Name | TileDB Datatype | Domain | Description |
149+
|:---------------|:----------------|:------------------|:----------------------------------------|
150+
| `rows` | `int32_t` | `[0, dimensions]` | Corresponds to the centroid dimensions. |
151+
| `cols` | `int32_t` | `[0, MAX_INT32]` | Corresponds to the centroid id. |
152+
153+
#### Attributes
154+
155+
| Attribute Name | TileDB Datatype | Description |
156+
| :--------------- | :-------------- | :---------------------------------------------------------------------------|
157+
| `centroids` | `dtype` | Contains the centroid value at the specific dimension. |
158+
159+
#### `partition_indexes`
160+
161+
This is a 1D dense array recording the start-end index of each partition of vectors in the `shuffled_vectors` array.
162+
163+
#### Basic schema parameters
164+
165+
| **Parameter** | **Value** |
166+
|:--------------|:----------|
167+
| Array type | Dense |
168+
| Rank | 1D |
169+
| Cell order | Col-major |
170+
| Tile order | Col-major |
171+
172+
#### Dimensions
173+
174+
| Dimension Name | TileDB Datatype | Domain | Description |
175+
| :------------- | :-------------------- | :-----------------| :------------------------------- |
176+
| `rows` | `int32_t` | `[0, MAX_INT32]` | Corresponds to the partition id. |
177+
178+
#### Attributes
179+
180+
| Attribute Name | TileDB Datatype | Description |
181+
| :--------------- | :-------------- | :--------------------------------------------------------------------------------|
182+
| `values` | `uint64_t` | Contains the position of the partition split in the `shuffled_vectors` array. |
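
One way to read the split positions is as a list of boundaries, where partition `p` covers the half-open range from entry `p` to entry `p + 1`. The sketch below assumes this layout, which implies one more entry than there are partitions; that detail is an assumption and is not stated in this spec. The helper `partition_range` is a hypothetical name.

```python
def partition_range(partition_indexes, partition_id):
    """Return the [start, end) column range of a partition in shuffled_vectors."""
    return partition_indexes[partition_id], partition_indexes[partition_id + 1]


# Hypothetical split positions for three partitions; the second one is empty.
partition_indexes = [0, 3, 3, 7]
shuffled_ids = [5, 8, 2, 11, 4, 9, 1]

start, end = partition_range(partition_indexes, 2)
print(shuffled_ids[start:end])
```

Under this reading, an empty partition shows up as two equal consecutive entries.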
183+
184+
#### `shuffled_vectors`
185+
186+
This is a 2D dense array that holds all the vectors. Each vector partition is stored in a consecutive index range of this array.
187+
188+
#### Basic schema parameters
189+
190+
| **Parameter** | **Value** |
191+
|:--------------|:----------|
192+
| Array type | Dense |
193+
| Rank | 2D |
194+
| Cell order | Col-major |
195+
| Tile order | Col-major |
196+
197+
#### Dimensions
198+
199+
| Dimension Name | TileDB Datatype | Domain | Description |
200+
| :------------- | :-------------------- | :-----------------| :--------------------------------------------------------- |
201+
| `rows` | `int32_t` | `[0, dimensions]` | Corresponds to the vector dimensions. |
202+
| `cols` | `int32_t` | `[0, MAX_INT32]` | Corresponds to the vector position in the set of vectors. |
203+
204+
#### Attributes
205+
206+
| Attribute Name | TileDB Datatype | Description |
207+
| :--------------- | :-------------- | :---------------------------------------------------------------------------|
208+
| `values` | `dtype` | Contains the vector value at the specific dimension. |
209+
210+
#### `shuffled_ids`
211+
212+
This is a 1D dense array that maps vector indices in the `shuffled_vectors` array to `external_ids` of each vector.
213+
214+
#### Basic schema parameters
215+
216+
| **Parameter** | **Value** |
217+
|:--------------|:----------|
218+
| Array type | Dense |
219+
| Rank | 1D |
220+
| Cell order | Col-major |
221+
| Tile order | Col-major |
222+
223+
#### Dimensions
224+
225+
| Dimension Name | TileDB Datatype | Domain | Description |
226+
| :------------- | :-------------------- | :-----------------| :--------------------------------------------------------- |
227+
| `rows` | `int32_t` | `[0, MAX_INT32]` | Corresponds to the vector position in `shuffled_vectors`. |
228+
229+
#### Attributes
230+
231+
| Attribute Name | TileDB Datatype | Description |
232+
| :--------------- | :-------------- | :---------------------------------------------------------------------------|
233+
| `values` | `uint64_t` | Contains the vector's `external_id`. |
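
The sketch below shows how the four IVF_FLAT arrays could cooperate at query time. It is a simplified model under stated assumptions, not the library's implementation: arrays are plain Python lists, `partition_indexes` holds one boundary more than the number of partitions, distances are squared Euclidean, and the function name `ivf_flat_query` and the parameter `nprobe` (the number of partitions to scan) are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ivf_flat_query(centroids, partition_indexes, vectors, ids, query, k, nprobe):
    """Scan the nprobe partitions closest to `query` and return k external_ids."""
    # Rank partitions by the distance of their centroid to the query.
    order = sorted(range(len(centroids)), key=lambda p: sq_dist(centroids[p], query))
    scored = []
    for p in order[:nprobe]:
        # Each partition occupies a consecutive position range.
        start, end = partition_indexes[p], partition_indexes[p + 1]
        for position in range(start, end):
            scored.append((sq_dist(vectors[position], query), ids[position]))
    scored.sort()
    return [external_id for _, external_id in scored[:k]]


partition_centroids = [[0.0, 0.0], [10.0, 10.0]]
partition_indexes = [0, 2, 4]
shuffled_vectors = [[0.0, 1.0], [1.0, 0.0], [9.0, 10.0], [10.0, 9.0]]
shuffled_ids = [100, 101, 200, 201]

hits = ivf_flat_query(partition_centroids, partition_indexes, shuffled_vectors,
                      shuffled_ids, [9.5, 10.0], 1, 1)
print(hits)
```

Storing each partition in a consecutive range means that scanning a partition is a single contiguous read of `shuffled_vectors` and `shuffled_ids`.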
234+
235+
236+
### IVF_PQ
237+
238+
TODO
239+
240+
### VAMANA
241+
242+
TODO
