|
| 1 | +--- |
| 2 | +title: Storage Format Spec |
| 3 | +description: "Learn about the vector search storage format specification for different indexing algorithms." |
| 4 | +--- |
| 5 | + |
| 6 | +The storage model that TileDB-Vector-Search uses to index vectors depends heavily on the indexing algorithm. However, some high-level structures are shared across all algorithms. |
| 7 | + |
| 8 | +## Cross algorithm storage format |
| 9 | + |
| 10 | +All data and metadata required for a TileDB-Vector-Search index are stored inside a TileDB group (`index_uri`). All the named arrays listed below are stored under this URI. |
| 11 | + |
| 12 | +### Index metadata |
| 13 | + |
| 14 | +Metadata values that configure the properties of an index are stored in the `index_uri` group metadata. Some metadata values are required for all algorithm implementations, while others are algorithm-specific. The table below lists the metadata values recorded for all algorithms. |
| 15 | + |
| 16 | +| Name | Description | |
| 17 | +|------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| |
| 18 | +| `dataset_type` | The asset type for disambiguation in TileDB Cloud. Value: `vector_search` | |
| 19 | +| `index_type` | The index algorithm used for this index. Can be one of the following values: `FLAT`, `IVF_FLAT`, `VAMANA`, `IVF_PQ` | |
| 20 | +| `storage_version` | The storage version used for the index. It allows indexing algorithms to change their storage logic without affecting previously created indexes, which maintains backwards compatibility. | |
| 21 | +| `dtype` | The data type of the vector values. | |
| 22 | +| `ingestion_timestamps` | An ordered list of timestamps, one for each ingestion or update consolidation call made during the lifetime of the index. | |
| 23 | +| `base_sizes` | An ordered list of the number of vectors in the base index at each ingestion timestamp. | |
| 24 | +| `has_updates` | Boolean value denoting if there are updates recorded in the updates array. | |
| 25 | + |
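As a rough illustration of how `ingestion_timestamps` and `base_sizes` relate, the sketch below models the group metadata as a plain Python dictionary and looks up the base size in effect at a given timestamp. The dictionary layout, the sample values, and the `base_size_at` helper are assumptions made for illustration. The spec above does not state how the lists are encoded in the group metadata.

```python
from bisect import bisect_right

# Hypothetical in-memory view of the group metadata described above.
# The values are illustrative, not taken from a real index.
index_metadata = {
    "dataset_type": "vector_search",
    "index_type": "IVF_FLAT",
    "dtype": "float32",
    "ingestion_timestamps": [100, 250, 400],
    "base_sizes": [1000, 1500, 1200],
    "has_updates": True,
}


def base_size_at(metadata, timestamp):
    """Return the base index size in effect at `timestamp`.

    Returns 0 if `timestamp` precedes the first ingestion.
    """
    timestamps = metadata["ingestion_timestamps"]
    sizes = metadata["base_sizes"]
    # Number of ingestion timestamps that are <= timestamp.
    pos = bisect_right(timestamps, timestamp)
    return sizes[pos - 1] if pos > 0 else 0
```

Because the two lists are ordered and parallel, a binary search over the timestamps is enough to find the matching base size.
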
| 26 | +### Object metadata |
| 27 | + |
| 28 | +This is a 1D sparse array with `external_id` as its dimension. Its attributes are the user-defined metadata attributes of the respective vectors. |
| 29 | + |
| 30 | +#### Basic schema parameters |
| 31 | + |
| 32 | +| **Parameter** | **Value** | |
| 33 | +|:--------------|:----------| |
| 34 | +| Array type | Sparse | |
| 35 | +| Rank | 1D | |
| 36 | +| Cell order | Row-major | |
| 37 | +| Tile order | Row-major | |
| 38 | + |
| 39 | +#### Dimensions |
| 40 | + |
| 41 | +| Dimension Name | TileDB Datatype | |
| 42 | +| :------------- | :-------------------- | |
| 43 | +| `external_id` | `uint64_t` | |
| 44 | + |
| 45 | +### Updates |
| 46 | + |
| 47 | +TileDB-Vector-Search supports updates for all index algorithms by recording updates outside the main indexing storage structure and periodically consolidating them. The implementation uses the `updates` array, a 1D sparse array whose dimension is the `external_id` of each vector and whose single variable-length attribute encodes the vector itself, or an empty value if the vector has been deleted. |
| 48 | + |
| 49 | +#### Basic schema parameters |
| 50 | + |
| 51 | +| **Parameter** | **Value** | |
| 52 | +| :------------ | :-------- | |
| 53 | +| Array type | Sparse | |
| 54 | +| Rank | 1D | |
| 55 | +| Cell order | Row-major | |
| 56 | +| Tile order | Row-major | |
| 57 | + |
| 58 | +#### Dimensions |
| 59 | + |
| 60 | +| Dimension Name | TileDB Datatype | |
| 61 | +| :------------- | :-------------------- | |
| 62 | +| `external_id` | `uint64_t` | |
| 63 | + |
| 64 | +#### Attributes |
| 65 | + |
| 66 | +| Attribute Name | TileDB Datatype | Description | |
| 67 | +| :--------------- | :-------------- | :--------------------------------------------------------------------------------------------- | |
| 68 | +| `vector` | variable `dtype`| Contains the vector value. Empty values correspond to vector deletions. | |
| 69 | + |
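The empty-value convention for deletions can be sketched with plain Python dictionaries. This is a minimal in-memory model, not the library's consolidation code: `consolidate_updates` and the sample data are hypothetical, and the real implementation operates on TileDB arrays.

```python
def consolidate_updates(base, updates):
    """Apply recorded updates to a base mapping of external_id -> vector.

    `updates` maps external_id -> vector. An empty vector marks a deletion,
    mirroring the empty-value convention of the `vector` attribute.
    """
    result = dict(base)
    for external_id, vector in updates.items():
        if len(vector) == 0:
            result.pop(external_id, None)  # deletion
        else:
            result[external_id] = list(vector)  # insertion or overwrite
    return result


base = {1: [0.0, 1.0], 2: [1.0, 0.0], 3: [1.0, 1.0]}
updates = {2: [], 3: [0.5, 0.5], 4: [2.0, 2.0]}
consolidated = consolidate_updates(base, updates)
```

Here vector 2 is deleted, vector 3 is overwritten, and vector 4 is added, while the original `base` mapping is left unchanged.
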
| 70 | +## Algorithm specific storage format |
| 71 | + |
| 72 | +### FLAT |
| 73 | + |
| 74 | +#### `shuffled_vectors` |
| 75 | + |
| 76 | +This is a 2D dense array that holds all the vectors with no specific ordering. |
| 77 | + |
| 78 | +#### Basic schema parameters |
| 79 | + |
| 80 | +| **Parameter** | **Value** | |
| 81 | +|:--------------|:----------| |
| 82 | +| Array type | Dense | |
| 83 | +| Rank | 2D | |
| 84 | +| Cell order | Col-major | |
| 85 | +| Tile order | Col-major | |
| 86 | + |
| 87 | +#### Dimensions |
| 88 | + |
| 89 | +| Dimension Name | TileDB Datatype | Domain | Description | |
| 90 | +|:---------------|:----------------|:------------------|:----------------------------------------------------------| |
| 91 | +| `rows` | `int32_t` | `[0, dimensions]` | Corresponds to the vector dimensions. | |
| 92 | +| `cols` | `int32_t` | `[0, MAX_INT32]` | Corresponds to the vector position in the set of vectors. | |
| 93 | + |
| 94 | +#### Attributes |
| 95 | + |
| 96 | +| Attribute Name | TileDB Datatype | Description | |
| 97 | +| :--------------- | :-------------- | :---------------------------------------------------------------------------| |
| 98 | +| `values` | `dtype` | Contains the vector value at the specific dimension. | |
| 99 | + |
| 100 | +#### `shuffled_ids` |
| 101 | + |
| 102 | +This is a 1D dense array that maps vector positions in the `shuffled_vectors` array to the `external_id` of each vector. |
| 103 | + |
| 104 | +#### Basic schema parameters |
| 105 | + |
| 106 | +| **Parameter** | **Value** | |
| 107 | +| :------------ | :-------- | |
| 108 | +| Array type | Dense | |
| 109 | +| Rank | 1D | |
| 110 | +| Cell order | Col-major | |
| 111 | +| Tile order | Col-major | |
| 112 | + |
| 113 | +#### Dimensions |
| 114 | + |
| 115 | +| Dimension Name | TileDB Datatype | Domain | Description | |
| 116 | +| :------------- | :-------------------- | :-----------------| :--------------------------------------------------------- | |
| 117 | +| `rows` | `int32_t` | `[0, MAX_INT32]` | Corresponds to the vector position in `shuffled_vectors`. | |
| 118 | + |
| 119 | +#### Attributes |
| 120 | + |
| 121 | +| Attribute Name | TileDB Datatype | Description | |
| 122 | +| :--------------- | :-------------- | :---------------------------------------------------------------------------| |
| 123 | +| `values` | `uint64_t` | Contains the vector's `external_id`. | |
| 124 | + |
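To show how the two FLAT arrays work together, the sketch below runs a brute-force nearest-neighbor search over an in-memory stand-in for `shuffled_vectors` and translates the resulting positions into external ids through `shuffled_ids`. The `flat_query` helper, the squared L2 distance, and the list-of-lists layout (one inner list per column of the 2D array) are assumptions for illustration, not the library's query path.

```python
def flat_query(shuffled_vectors, shuffled_ids, query, k):
    """Return the external_ids of the k vectors closest to `query`.

    `shuffled_vectors[j]` is the vector at position j (one column of the
    2D array) and `shuffled_ids[j]` is its external_id.
    """
    def squared_l2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Rank positions by distance, then translate positions to external_ids.
    positions = sorted(
        range(len(shuffled_vectors)),
        key=lambda j: squared_l2(shuffled_vectors[j], query),
    )
    return [shuffled_ids[j] for j in positions[:k]]


shuffled_vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
shuffled_ids = [42, 7, 99]
nearest = flat_query(shuffled_vectors, shuffled_ids, [0.9, 0.9], k=2)
```

The search itself only sees positions. The `shuffled_ids` lookup is what turns them back into the identifiers the user supplied.
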
| 125 | +### IVF_FLAT |
| 126 | + |
| 127 | +#### Metadata |
| 128 | + |
| 129 | +| Name | Description | |
| 130 | +| ------ | ------ | |
| 131 | +| `partition_history` | An ordered list of the number of partitions used at different ingestion timestamps. | |
| 132 | + |
| 133 | +#### `partition_centroids` |
| 134 | + |
| 135 | +This is a 2D dense array storing the k-means centroids for the different vector partitions. |
| 136 | + |
| 137 | +#### Basic schema parameters |
| 138 | + |
| 139 | +| **Parameter** | **Value** | |
| 140 | +|:--------------|:----------| |
| 141 | +| Array type | Dense | |
| 142 | +| Rank | 2D | |
| 143 | +| Cell order | Col-major | |
| 144 | +| Tile order | Col-major | |
| 145 | + |
| 146 | +#### Dimensions |
| 147 | + |
| 148 | +| Dimension Name | TileDB Datatype | Domain | Description | |
| 149 | +|:---------------|:----------------|:------------------|:----------------------------------------| |
| 150 | +| `rows` | `int32_t` | `[0, dimensions]` | Corresponds to the centroid dimensions. | |
| 151 | +| `cols` | `int32_t` | `[0, MAX_INT32]` | Corresponds to the centroid id. | |
| 152 | + |
| 153 | +#### Attributes |
| 154 | + |
| 155 | +| Attribute Name | TileDB Datatype | Description | |
| 156 | +| :--------------- | :-------------- | :---------------------------------------------------------------------------| |
| 157 | +| `centroids` | `dtype` | Contains the centroid value at the specific dimension. | |
| 158 | + |
| 159 | +#### `partition_indexes` |
| 160 | + |
| 161 | +This is a 1D dense array that records the start and end index of each partition of vectors in the `shuffled_vectors` array. |
| 162 | + |
| 163 | +#### Basic schema parameters |
| 164 | + |
| 165 | +| **Parameter** | **Value** | |
| 166 | +|:--------------|:----------| |
| 167 | +| Array type | Dense | |
| 168 | +| Rank | 1D | |
| 169 | +| Cell order | Col-major | |
| 170 | +| Tile order | Col-major | |
| 171 | + |
| 172 | +#### Dimensions |
| 173 | + |
| 174 | +| Dimension Name | TileDB Datatype | Domain | Description | |
| 175 | +| :------------- | :-------------------- | :-----------------| :------------------------------- | |
| 176 | +| `rows` | `int32_t` | `[0, MAX_INT32]` | Corresponds to the partition id. | |
| 177 | + |
| 178 | +#### Attributes |
| 179 | + |
| 180 | +| Attribute Name | TileDB Datatype | Description | |
| 181 | +| :--------------- | :-------------- | :--------------------------------------------------------------------------------| |
| 182 | +| `values` | `uint64_t` | Contains the position of the partition split in the `shuffled_vectors` array. | |
| 183 | + |
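The relationship between `partition_indexes` and `shuffled_vectors` can be sketched as follows. The example assumes that the array holds one more split position than there are partitions, so that partition `p` covers the half-open position range `[partition_indexes[p], partition_indexes[p + 1])`. That convention, the helper names, and the sample data are assumptions for illustration and are not stated in the spec above.

```python
def partition_range(partition_indexes, partition_id):
    """Return the half-open [start, end) position range of a partition.

    Assumes `partition_indexes` holds num_partitions + 1 split positions.
    """
    return partition_indexes[partition_id], partition_indexes[partition_id + 1]


def partition_vectors(shuffled_vectors, partition_indexes, partition_id):
    """Return the consecutive slice of vectors that belongs to a partition."""
    start, end = partition_range(partition_indexes, partition_id)
    return shuffled_vectors[start:end]


# Five vectors split into three partitions of sizes 2, 1 and 2.
shuffled_vectors = [[0.0, 0.1], [0.2, 0.1], [5.0, 5.1], [9.0, 9.1], [9.2, 9.0]]
partition_indexes = [0, 2, 3, 5]
```

Because each partition occupies a consecutive range, reading a partition amounts to one contiguous slice of `shuffled_vectors` and the matching slice of `shuffled_ids`.
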
| 184 | +#### `shuffled_vectors` |
| 185 | + |
| 186 | +This is a 2D dense array that holds all the vectors. Each vector partition is stored in a consecutive index range of this array. |
| 187 | + |
| 188 | +#### Basic schema parameters |
| 189 | + |
| 190 | +| **Parameter** | **Value** | |
| 191 | +|:--------------|:----------| |
| 192 | +| Array type | Dense | |
| 193 | +| Rank | 2D | |
| 194 | +| Cell order | Col-major | |
| 195 | +| Tile order | Col-major | |
| 196 | + |
| 197 | +#### Dimensions |
| 198 | + |
| 199 | +| Dimension Name | TileDB Datatype | Domain | Description | |
| 200 | +| :------------- | :-------------------- | :-----------------| :--------------------------------------------------------- | |
| 201 | +| `rows` | `int32_t` | `[0, dimensions]` | Corresponds to the vector dimensions. | |
| 202 | +| `cols` | `int32_t` | `[0, MAX_INT32]` | Corresponds to the vector position in the set of vectors. | |
| 203 | + |
| 204 | +#### Attributes |
| 205 | + |
| 206 | +| Attribute Name | TileDB Datatype | Description | |
| 207 | +| :--------------- | :-------------- | :---------------------------------------------------------------------------| |
| 208 | +| `values` | `dtype` | Contains the vector value at the specific dimension. | |
| 209 | + |
| 210 | +#### `shuffled_ids` |
| 211 | + |
| 212 | +This is a 1D dense array that maps vector positions in the `shuffled_vectors` array to the `external_id` of each vector. |
| 213 | + |
| 214 | +#### Basic schema parameters |
| 215 | + |
| 216 | +| **Parameter** | **Value** | |
| 217 | +|:--------------|:----------| |
| 218 | +| Array type | Dense | |
| 219 | +| Rank | 1D | |
| 220 | +| Cell order | Col-major | |
| 221 | +| Tile order | Col-major | |
| 222 | + |
| 223 | +#### Dimensions |
| 224 | + |
| 225 | +| Dimension Name | TileDB Datatype | Domain | Description | |
| 226 | +| :------------- | :-------------------- | :-----------------| :--------------------------------------------------------- | |
| 227 | +| `rows` | `int32_t` | `[0, MAX_INT32]` | Corresponds to the vector position in `shuffled_vectors`. | |
| 228 | + |
| 229 | +#### Attributes |
| 230 | + |
| 231 | +| Attribute Name | TileDB Datatype | Description | |
| 232 | +| :--------------- | :-------------- | :---------------------------------------------------------------------------| |
| 233 | +| `values` | `uint64_t` | Contains the vector's `external_id`. | |
| 234 | + |
| 235 | + |
| 236 | +### IVF_PQ |
| 237 | + |
| 238 | +TODO |
| 239 | + |
| 240 | +### VAMANA |
| 241 | + |
| 242 | +TODO |