feat: add HGraphDiskANNLoader for converting DiskANN index to HGraph#1690
feat: add HGraphDiskANNLoader for converting DiskANN index to HGraph#1690
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a significant feature by enabling the direct loading of DiskANN indexes into an HGraph structure. This new Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces HGraphDiskANNLoader, a significant and well-implemented feature for migrating from a DiskANN index to HGraph's in-memory format. The code is well-structured, with a clear separation of concerns for loading different components of the DiskANN index. The inclusion of comprehensive unit and feature tests is excellent and greatly contributes to the reliability of the new loader. I have a few suggestions for improvement, mainly around performance, design patterns to enhance maintainability, and robustness.
| bool loaded = false; | ||
| if (common_param_.metric_ == MetricType::METRIC_TYPE_L2SQR) { | ||
| auto* pq = static_cast<ProductQuantizer<MetricType::METRIC_TYPE_L2SQR>*>(quantizer_ptr); | ||
| pq->LoadCodebook(vsag_codebook.data()); | ||
| loaded = true; | ||
| } else if (common_param_.metric_ == MetricType::METRIC_TYPE_IP) { | ||
| auto* pq = static_cast<ProductQuantizer<MetricType::METRIC_TYPE_IP>*>(quantizer_ptr); | ||
| pq->LoadCodebook(vsag_codebook.data()); | ||
| loaded = true; | ||
| } else if (common_param_.metric_ == MetricType::METRIC_TYPE_COSINE) { | ||
| auto* pq = static_cast<ProductQuantizer<MetricType::METRIC_TYPE_COSINE>*>(quantizer_ptr); | ||
| pq->LoadCodebook(vsag_codebook.data()); | ||
| loaded = true; | ||
| } | ||
|
|
||
| if (!loaded) { | ||
| throw VsagException(ErrorType::INTERNAL_ERROR, | ||
| fmt::format("Unsupported metric type for PQ: {}", | ||
| static_cast<int>(common_param_.metric_))); | ||
| } |
There was a problem hiding this comment.
This if-else if block for handling different metric types is not easily extensible and relies on static_cast from void*, which is not type-safe. A more robust design would be to use polymorphism:
- Change
FlattenInterface::GetQuantizerPtr()to returnQuantizerInterface*instead ofvoid*. - Add a virtual method
LoadCodebook(const float* codebook_data)to theQuantizerInterfacebase class. - Implement this method in
ProductQuantizer. - Here, you could then simply call
quantizer->LoadCodebook(...)via the base class pointer, eliminating this block entirely.
| // Read tags (now properly load all tags with correct num_points) | ||
| { | ||
| auto tag_reader = reader_set.Get(DISKANN_TAG_FILE); | ||
| auto tag_data = std::make_unique<char[]>(tag_reader->Size()); | ||
| tag_reader->Read(0, tag_reader->Size(), tag_data.get()); | ||
| std::stringstream tag_stream; | ||
| tag_stream.write(tag_data.get(), static_cast<std::streamsize>(tag_reader->Size())); | ||
| tag_stream.seekg(0); | ||
|
|
||
| LoadDiskANNTags(tag_stream); | ||
| } |
There was a problem hiding this comment.
The tag file is read from the ReaderSet for a second time here. It was already read on lines 321-329 to get the number of points. This is inefficient as it involves a redundant I/O operation and memory allocation.
Consider reading the file content into a buffer once, extracting the header information, and then creating a stringstream from that same buffer to be used by LoadDiskANNTags.
| if (cb_npts < 0 || cb_dim < 0 || cb_npts > 100 || cb_dim > 100) { | ||
| throw VsagException(ErrorType::INVALID_ARGUMENT, "Invalid PQ file header"); | ||
| } |
There was a problem hiding this comment.
This check on cb_npts and cb_dim with arbitrary upper bounds of 100 seems like a leftover debug assertion, as indicated by the comment // Debug output. A more precise validation is already performed in SetupProductQuantizer on line 527. This redundant and vague check should be removed to avoid confusion.
| uint64_t sector_len = | ||
| std::max(static_cast<uint64_t>(DISKANN_SECTOR_LEN), | ||
| (max_node_len + DISKANN_SECTOR_LEN - 1) & ~(DISKANN_SECTOR_LEN - 1)); | ||
| uint64_t nnodes_per_sector = sector_len / max_node_len; |
There was a problem hiding this comment.
The division sector_len / max_node_len could lead to a division-by-zero error if max_node_len is 0. Although this is unlikely with valid index files, adding a check for max_node_len > 0 before this line would make the code more robust against corrupted or malformed input.
| uint64_t nnodes_per_sector = sector_len / max_node_len; | |
| if (max_node_len == 0) { | |
| throw VsagException(ErrorType::INVALID_ARGUMENT, "Invalid node length calculated from layout parameters."); | |
| } | |
| uint64_t nnodes_per_sector = sector_len / max_node_len; |
| void | ||
| SetMaximumDegree(uint32_t maximum_degree) override { | ||
| this->maximum_degree_ = maximum_degree; | ||
| this->code_line_size_ = maximum_degree * sizeof(InnerIdType) + sizeof(uint32_t); | ||
| } |
HGraphDiskANNLoader is a specialized HGraph index that can load existing DiskANN format data and convert it to HGraph's in-memory format. Key features: - Load DiskANN PQ compressed vectors and convert to HGraph basic_flatten_codes_ - Load DiskANN Vamana graph and convert to HGraph bottom_graph_ - Load DiskANN tags for ID mapping - Load DiskANN precise vectors for reorder support - Support both BinarySet and ReaderSet deserialization This enables migration from DiskANN to HGraph without rebuilding indexes. Signed-off-by: jinjiabao.jjb <jinjiabao.jjb@antgroup.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: jinjiabao.jjb <jinjiabao.jjb@antgroup.com>
There was a problem hiding this comment.
Pull request overview
Adds a new HGraphDiskANNLoader index type that can deserialize DiskANN artifacts (PQ pivots/codes, Vamana graph, tags, and precise vectors) and convert them into HGraph’s in-memory structures to enable DiskANN→HGraph migration without rebuilding.
Changes:
- Introduces
HGraphDiskANNLoaderand wires it into the engine/factory and string constants. - Extends PQ/flatten/graph components to support loading external PQ codebooks and inserting pre-encoded PQ codes.
- Adds functional + unit tests covering factory creation, deserialization via
BinarySet/ReaderSet, and search behaviors.
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_hgraph_diskann_loader.cpp | Adds functional tests for DiskANN→HGraph loader via BinarySet and ReaderSet plus searches. |
| src/algorithm/hgraph_diskann_loader.h | Declares the new loader index type and DiskANN loading/conversion entry points (plus test helpers). |
| src/algorithm/hgraph_diskann_loader.cpp | Implements DiskANN artifact parsing and conversion to HGraph structures. |
| src/algorithm/hgraph_diskann_loader_test.cpp | Adds Catch2 unit tests for PQ/graph/layout conversion correctness. |
| src/factory/engine.cpp | Registers the new index type in Engine::CreateIndex. |
| src/constants.cpp / include/vsag/constants.h / src/inner_string_params.h | Adds new index type string constant and default mapping. |
| include/vsag/index.h | Extends IndexType enum with HGRAPH_DISKANN_LOADER. |
| src/quantization/product_quantization/product_quantizer.h/.cpp | Adds LoadCodebook() for importing external PQ codebooks. |
| src/datacell/flatten_interface.h | Adds InsertCodes() and GetQuantizerPtr() hooks for advanced loading paths. |
| src/datacell/flatten_datacell.h | Implements InsertCodes() + exposes quantizer pointer via GetQuantizerPtr(). |
| src/datacell/graph_datacell.h | Adds SetMaximumDegree() to adjust graph degree metadata. |
| src/index/diskann.h / src/index/diskann_zparameters.h | Qualifies ::diskann::Metric to avoid namespace ambiguity. |
| src/algorithm/hgraph.h | Adds friend class HGraphDiskANNLoader; for internal access. |
|
|
||
| // Create visited list pool | ||
| pool_ = std::make_shared<VisitedListPool>(1, allocator_, diskann_num_points_, allocator_); | ||
|
|
| // Load graph FIRST to get diskann_max_degree_ before loading precise vectors | ||
| // because precise vectors layout depends on max_degree for calculating node offsets | ||
| if (binary_set.Contains(DISKANN_GRAPH)) { | ||
| auto graph_binary = binary_set.Get(DISKANN_GRAPH); | ||
| std::stringstream graph_stream; | ||
| graph_stream.write(reinterpret_cast<const char*>(graph_binary.data.get()), | ||
| static_cast<std::streamsize>(graph_binary.size)); | ||
| graph_stream.seekg(0); | ||
| LoadDiskANNGraph(graph_stream, diskann_num_points_); | ||
| } | ||
|
|
||
| // Now load precise vectors - diskann_max_degree_ is set correctly | ||
| LoadDiskANNPreciseVectors( | ||
| layout_stream, diskann_num_points_, diskann_dim_, diskann_max_degree_); | ||
|
|
| // Calculate subspace_dim | ||
| subspace_dim = static_cast<uint32_t>(dim) / num_subspaces; | ||
|
|
||
| // Read the entire DiskANN codebook | ||
| // DiskANN format: float[num_centers][dim] where each row is a full-dim centroid | ||
| Vector<float> diskann_codebook(static_cast<uint64_t>(num_centers) * dim, allocator_); | ||
| pq_stream.read(reinterpret_cast<char*>(diskann_codebook.data()), | ||
| static_cast<std::streamsize>(diskann_codebook.size() * sizeof(float))); |
| // Get raw pointer to quantizer for advanced operations (e.g., loading external codebook) | ||
| // Returns nullptr if the operation is not supported | ||
| [[nodiscard]] virtual void* | ||
| GetQuantizerPtr() { | ||
| return nullptr; | ||
| } |
| public: | ||
| // Test-only accessor for unit testing | ||
| void | ||
| TestLoadDiskANNPQData(std::istream& pq_stream, | ||
| std::istream& compressed_stream, | ||
| uint64_t num_points) { | ||
| LoadDiskANNPQData(pq_stream, compressed_stream, num_points); | ||
| } | ||
|
|
||
| // Test-only accessor to get the flatten codes interface | ||
| FlattenInterfacePtr | ||
| TestGetFlattenCodes() const { | ||
| return this->basic_flatten_codes_; | ||
| } | ||
|
|
||
| // Test-only accessor for LoadDiskANNGraph | ||
| void | ||
| TestLoadDiskANNGraph(std::istream& graph_stream, uint64_t num_points) { | ||
| LoadDiskANNGraph(graph_stream, num_points); | ||
| } | ||
|
|
||
| // Test-only accessor to get the bottom graph | ||
| GraphInterfacePtr | ||
| TestGetBottomGraph() const { | ||
| return this->bottom_graph_; | ||
| } | ||
|
|
||
| // Test-only accessor to get entry point | ||
| InnerIdType | ||
| TestGetEntryPoint() const { | ||
| return this->entry_point_id_; | ||
| } | ||
|
|
||
| // Test-only accessor to get max degree from loading | ||
| uint64_t | ||
| TestGetDiskANNMaxDegree() const { | ||
| return this->diskann_max_degree_; | ||
| } | ||
|
|
||
| // Test-only accessor for LoadDiskANNPreciseVectors | ||
| void | ||
| TestLoadDiskANNPreciseVectors(std::istream& layout_stream, | ||
| uint64_t num_points, | ||
| uint64_t dim, | ||
| uint64_t max_degree) { | ||
| LoadDiskANNPreciseVectors(layout_stream, num_points, dim, max_degree); | ||
| } | ||
|
|
||
| // Test-only accessor to get precise codes | ||
| FlattenInterfacePtr | ||
| TestGetPreciseCodes() const { | ||
| return this->high_precise_codes_; | ||
| } | ||
|
|
||
| // Test-only accessor to set use_reorder flag | ||
| void | ||
| TestSetUseReorder(bool use_reorder) { | ||
| this->use_reorder_ = use_reorder; | ||
| } | ||
|
|
| // Serialize and load | ||
| auto binary_set = diskann_index->Serialize().value(); | ||
| auto loader_param = | ||
| GenerateHGraphDiskANNLoaderBuildParametersString(metric_type, dim, 16, dim / 4); | ||
| auto loader_index = TestFactory(loader_name, loader_param, true); | ||
| loader_index->Deserialize(binary_set); | ||
|
|
| #include "hgraph_diskann_loader.h" | ||
|
|
||
| #include <catch2/catch_approx.hpp> | ||
| #include <catch2/catch_test_macros.hpp> | ||
| #include <catch2/generators/catch_generators.hpp> | ||
| #include <fstream> | ||
| #include <memory> | ||
| #include <sstream> | ||
|
|
…r private methods - Extract helper functions to reduce code duplication - Add finalize_loading() to consolidate common initialization - Use macros to simplify MetricType switch statements - Rename all private functions to snake_case naming convention Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: jinjiabao.jjb <jinjiabao.jjb@antgroup.com>
link: #1689
HGraphDiskANNLoader is a specialized HGraph index that can load existing DiskANN format data and convert it to HGraph's in-memory format.
Key features:
This enables migration from DiskANN to HGraph without rebuilding indexes.