Commit f021d44

Lums/sc 40408/complete index vamana (#230)
This PR implements the coding portion of sc-40408.

* Brought the implementation of `vamana_index` to use the new `index_group` and `index_metadata` classes (via CRTP).
* Defined the group directory structure for the index to include an array for the feature vectors and the adjacency structure of the graph, stored in CSR format, using an array for the neighbor ids, for the distance to the neighbors, and a partitioning index.
  * `feature_vectors`
  * `adjacency_scores`
  * `adjacency_ids`
  * `adjacency_row_index`
* Metadata includes
  * datatypes for index, scores, and neighbor ids
  * `L`, `R`, `alpha_min`, `alpha_max`, and `medoid`
  * `num_edges_history` - representing the size of the graph in number of edges. The number of vertices in the graph is the same as the number of vectors and is stored in `base_sizes`.

The following schemas are taken from the group index in `test_data/nano/vamana/vamana_test_index`

```
# adjacency_ids
ArraySchema(
  domain=Domain(*[
    Dim(name='rows', domain=(0, 923), tile=924, dtype='int32', filters=FilterList([ZstdFilter(level=-1), ])),
  ]),
  attrs=[
    Attr(name='values', dtype='uint64', var=False, nullable=False, enum_label=None),
  ],
  cell_order='col-major',
  tile_order='col-major',
  sparse=False,
)

# adjacency_row_index
ArraySchema(
  domain=Domain(*[
    Dim(name='rows', domain=(0, 231), tile=232, dtype='int32', filters=FilterList([ZstdFilter(level=-1), ])),
  ]),
  attrs=[
    Attr(name='values', dtype='uint64', var=False, nullable=False, enum_label=None),
  ],
  cell_order='col-major',
  tile_order='col-major',
  sparse=False,
)

# adjacency_scores
ArraySchema(
  domain=Domain(*[
    Dim(name='rows', domain=(0, 923), tile=924, dtype='int32', filters=FilterList([ZstdFilter(level=-1), ])),
  ]),
  attrs=[
    Attr(name='values', dtype='float32', var=False, nullable=False, enum_label=None),
  ],
  cell_order='col-major',
  tile_order='col-major',
  sparse=False,
)

# feature_vectors
ArraySchema(
  domain=Domain(*[
    Dim(name='rows', domain=(0, 127), tile=128, dtype='int32', filters=FilterList([ZstdFilter(level=-1), ])),
    Dim(name='cols', domain=(0, 230), tile=231, dtype='int32', filters=FilterList([ZstdFilter(level=-1), ])),
  ]),
  attrs=[
    Attr(name='values', dtype='float32', var=False, nullable=False, enum_label=None),
  ],
  cell_order='col-major',
  tile_order='col-major',
  sparse=False,
)
```

The following is a dump of the metadata for the index group:

```
### Array Metadata ###
- Key: adjacency_row_index_datatype
- Value: 10
- Type: DataType.UINT32
### Array Metadata ###
- Key: adjacency_row_index_type
- Value: uint64
- Type: DataType.STRING_UTF8
### Array Metadata ###
- Key: adjacency_scores_datatype
- Value: 2
- Type: DataType.UINT32
### Array Metadata ###
- Key: adjacency_scores_type
- Value: float32
- Type: DataType.STRING_UTF8
### Array Metadata ###
- Key: base_sizes
- Value: [0, 10000]
- Type: DataType.STRING_UTF8
### Array Metadata ###
- Key: dataset_type
- Value: vector_search
- Type: DataType.STRING_UTF8
### Array Metadata ###
- Key: dimension
- Value: 128
- Type: DataType.UINT32
### Array Metadata ###
- Key: dtype
- Value: float32
- Type: DataType.STRING_UTF8
### Array Metadata ###
- Key: feature_datatype
- Value: 2
- Type: DataType.UINT32
### Array Metadata ###
- Key: feature_type
- Value: float32
- Type: DataType.STRING_UTF8
### Array Metadata ###
- Key: id_datatype
- Value: 10
- Type: DataType.UINT32
### Array Metadata ###
- Key: id_type
- Value: uint64
- Type: DataType.STRING_UTF8
### Array Metadata ###
- Key: index_type
- Value: Vamana
- Type: DataType.STRING_UTF8
### Array Metadata ###
- Key: ingestion_timestamps
- Value: [0, 1704946748930]
- Type: DataType.STRING_UTF8
### Array Metadata ###
- Key: num_edges_history
- Value: [0, 40000]
- Type: DataType.STRING_UTF8
### Array Metadata ###
- Key: storage_version
- Value: 0.3
- Type: DataType.STRING_UTF8
### Array Metadata ###
- Key: temp_size
- Value: 0
- Type: DataType.UINT64
```

A future story / PR should

* Test and benchmark -- there seems to have been some bit rot
* Use CLI programs for `vamana_index` for benchmarking
  * The type-erased class would be useful for this
* More fully / formally document the array schemas and metadata
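To make the CSR layout described in this message concrete, here is a minimal Python sketch of reading one vertex's neighbor list from the three adjacency arrays. It is an illustration only, not the code in this PR: it assumes the usual CSR convention in which `adjacency_row_index` holds one more entry than there are vertices and entry `v` is the offset of vertex `v`'s first edge. The helper names `check_csr` and `neighbors` and the toy values are hypothetical.

```python
# Toy CSR adjacency for a 3-vertex graph (values are made up for illustration).
# Assumed convention: row_index has num_vertices + 1 entries, and the edges of
# vertex v occupy positions row_index[v] .. row_index[v + 1] - 1.
adjacency_row_index = [0, 2, 3, 5]
adjacency_ids = [1, 2, 0, 0, 1]
adjacency_scores = [0.5, 1.5, 0.5, 1.5, 2.0]


def check_csr(row_index, ids, scores):
    """Validate the basic CSR invariants assumed in this sketch."""
    if len(ids) != len(scores):
        raise ValueError("ids and scores must have one entry per edge")
    if not row_index or row_index[0] != 0 or row_index[-1] != len(ids):
        raise ValueError("row_index must start at 0 and end at the edge count")
    if any(a > b for a, b in zip(row_index, row_index[1:])):
        raise ValueError("row_index must be non-decreasing")


def neighbors(row_index, ids, scores, vertex):
    """Return the (neighbor id, score) pairs stored for one vertex."""
    start, stop = row_index[vertex], row_index[vertex + 1]
    return list(zip(ids[start:stop], scores[start:stop]))


check_csr(adjacency_row_index, adjacency_ids, adjacency_scores)
num_vertices = len(adjacency_row_index) - 1
for v in range(num_vertices):
    print(v, neighbors(adjacency_row_index, adjacency_ids, adjacency_scores, v))
```

The array sizes in the test index schemas are consistent with this convention: `adjacency_row_index` has 232 entries for 231 feature vectors, and `adjacency_ids` and `adjacency_scores` each have 924 entries.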
1 parent 6acb7ef commit f021d44

45 files changed: +2033 -374 lines changed

external/test_data/nano/vamana/vamana_test_index/__tiledb_group.tdb

Whitespace-only changes.

external/test_data/nano/vamana/vamana_test_index/adjacency_ids/__commits/__1707331479446_1707331479446_8198337176f048a2a119aeac020ed575_21.wrt

Whitespace-only changes.

external/test_data/nano/vamana/vamana_test_index/adjacency_row_index/__commits/__1707331479451_1707331479451_894dde1fc485427d933fe8dd24b067be_21.wrt

Whitespace-only changes.
