Skip to content

feat: add HGraphDiskANNLoader for converting DiskANN index to HGraph#1690

Open
inabao wants to merge 4 commits intomainfrom
diskann-adpter
Open

feat: add HGraphDiskANNLoader for converting DiskANN index to HGraph#1690
inabao wants to merge 4 commits intomainfrom
diskann-adpter

Conversation

@inabao
Copy link
Collaborator

@inabao inabao commented Mar 12, 2026

link: #1689
HGraphDiskANNLoader is a specialized HGraph index that can load existing DiskANN format data and convert it to HGraph's in-memory format.

Key features:

  • Load DiskANN PQ compressed vectors and convert to HGraph basic_flatten_codes_
  • Load DiskANN Vamana graph and convert to HGraph bottom_graph_
  • Load DiskANN tags for ID mapping
  • Load DiskANN precise vectors for reorder support
  • Support both BinarySet and ReaderSet deserialization

This enables migration from DiskANN to HGraph without rebuilding indexes.

@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant feature by enabling the direct loading of DiskANN indexes into an HGraph structure. This new HGraphDiskANNLoader acts as a bridge, allowing users to seamlessly transition their existing DiskANN-formatted vector indexes to the HGraph system without the overhead of re-indexing. The implementation handles the conversion of DiskANN's quantized vectors, graph topology, and ID mappings, ensuring compatibility and preserving search capabilities within the HGraph framework.

Highlights

  • New HGraphDiskANNLoader Index Type: Introduced a new index type, HGRAPH_DISKANN_LOADER, which is a specialized HGraph index capable of loading and converting existing DiskANN format data into HGraph's in-memory structures. This facilitates migration from DiskANN to HGraph without requiring index rebuilding.
  • DiskANN Data Conversion: The loader supports converting various DiskANN components: PQ compressed vectors to HGraph's basic_flatten_codes_, Vamana graph to HGraph's bottom_graph_, DiskANN tags for ID mapping to HGraph's label_table_, and precise vectors for reorder support to HGraph's high_precise_codes_.
  • Parameter Mapping and Validation: Implemented logic to map DiskANN-specific parameters (like max_degree, ef_construction, pq_dims) to corresponding HGraph parameters, including validation checks for these parameters.
  • Flexible Deserialization: The HGraphDiskANNLoader can deserialize DiskANN data from both BinarySet and ReaderSet formats, offering flexibility in how the index data is provided.
  • Core Data Structure Enhancements: Added InsertCodes and GetQuantizerPtr methods to FlattenInterface and FlattenDataCell to allow direct insertion of pre-encoded PQ codes and access to the underlying quantizer for external codebook loading. Also, SetMaximumDegree was added to GraphDataCell.
  • Comprehensive Testing: New unit and functional tests were added to validate the correct loading of PQ data, graph structures, precise vectors, and various search functionalities (Knn, Range, Filter, CalcDistanceById) across different dimensions and metric types.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog
  • include/vsag/constants.h
    • Added a new constant INDEX_HGRAPH_DISKANN_LOADER to identify the new index type.
  • include/vsag/index.h
    • Included HGRAPH_DISKANN_LOADER in the IndexType enum.
  • src/algorithm/hgraph.h
    • Declared HGraphDiskANNLoader as a friend class to HGraph to allow access to protected members during conversion.
  • src/algorithm/hgraph_diskann_loader.cpp
    • Implemented the core logic for HGraphDiskANNLoader, including methods to check and map external parameters, build HGraph parameters from DiskANN parameters, and deserialize DiskANN PQ data, graph, tags, and precise vectors.
    • Added logic to convert DiskANN's PQ codebook format to VSAG's internal format and to load compressed vectors directly.
    • Included error handling for unsupported DiskANN features like OPQ and BSA.
  • src/algorithm/hgraph_diskann_loader.h
    • Defined the HGraphDiskANNLoader class, inheriting from HGraph, with methods for deserialization and accessors for testing internal states.
  • src/algorithm/hgraph_diskann_loader_test.cpp
    • Added extensive unit tests for HGraphDiskANNLoader, covering PQ data loading, graph structure conversion, precise vector extraction, and various edge cases and topologies.
  • src/constants.cpp
    • Defined the string literal for INDEX_HGRAPH_DISKANN_LOADER.
  • src/datacell/flatten_datacell.h
    • Added InsertCodes method to directly insert pre-encoded codes.
    • Exposed GetQuantizerPtr to allow external access to the underlying quantizer object.
  • src/datacell/flatten_interface.h
    • Added virtual InsertCodes method for direct insertion of pre-encoded codes.
    • Added virtual GetQuantizerPtr method to retrieve a pointer to the quantizer for advanced operations.
  • src/datacell/graph_datacell.h
    • Added SetMaximumDegree method to dynamically adjust the maximum degree of the graph.
  • src/factory/engine.cpp
    • Updated Engine::CreateIndex to recognize and instantiate HGraphDiskANNLoader when the INDEX_HGRAPH_DISKANN_LOADER type is requested.
  • src/index/diskann.h
    • Corrected namespace references for PQFlashIndex, Index, and Metric to use the global ::diskann namespace.
  • src/index/diskann_zparameters.h
    • Corrected namespace reference for Metric to use the global ::diskann namespace.
  • src/inner_string_params.h
    • Added INDEX_TYPE_HGRAPH_DISKANN_LOADER to the list of index type constants and the DEFAULT_MAP.
  • src/quantization/product_quantization/product_quantizer.cpp
    • Implemented LoadCodebook to allow external codebook data to be loaded into the product quantizer.
    • Added ReleaseComputerImpl for proper memory deallocation.
  • src/quantization/product_quantization/product_quantizer.h
    • Declared LoadCodebook method in the ProductQuantizer class interface.
  • tests/test_hgraph_diskann_loader.cpp
    • Added functional tests for HGraphDiskANNLoader, verifying factory creation, deserialization from BinarySet and ReaderSet, and search functionalities (Knn, Range, Filter, CalcDistanceById) with various parameters and dimensions.
Activity
  • No human activity (comments, reviews, progress updates) was provided in the context for this pull request.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces HGraphDiskANNLoader, a significant and well-implemented feature for migrating from a DiskANN index to HGraph's in-memory format. The code is well-structured, with a clear separation of concerns for loading different components of the DiskANN index. The inclusion of comprehensive unit and feature tests is excellent and greatly contributes to the reliability of the new loader. I have a few suggestions for improvement, mainly around performance, design patterns to enhance maintainability, and robustness.

Comment on lines +592 to +611
bool loaded = false;
if (common_param_.metric_ == MetricType::METRIC_TYPE_L2SQR) {
auto* pq = static_cast<ProductQuantizer<MetricType::METRIC_TYPE_L2SQR>*>(quantizer_ptr);
pq->LoadCodebook(vsag_codebook.data());
loaded = true;
} else if (common_param_.metric_ == MetricType::METRIC_TYPE_IP) {
auto* pq = static_cast<ProductQuantizer<MetricType::METRIC_TYPE_IP>*>(quantizer_ptr);
pq->LoadCodebook(vsag_codebook.data());
loaded = true;
} else if (common_param_.metric_ == MetricType::METRIC_TYPE_COSINE) {
auto* pq = static_cast<ProductQuantizer<MetricType::METRIC_TYPE_COSINE>*>(quantizer_ptr);
pq->LoadCodebook(vsag_codebook.data());
loaded = true;
}

if (!loaded) {
throw VsagException(ErrorType::INTERNAL_ERROR,
fmt::format("Unsupported metric type for PQ: {}",
static_cast<int>(common_param_.metric_)));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This if-else if block for handling different metric types is not easily extensible and relies on static_cast from void*, which is not type-safe. A more robust design would be to use polymorphism:

  1. Change FlattenInterface::GetQuantizerPtr() to return QuantizerInterface* instead of void*.
  2. Add a virtual method LoadCodebook(const float* codebook_data) to the QuantizerInterface base class.
  3. Implement this method in ProductQuantizer.
  4. Here, you could then simply call quantizer->LoadCodebook(...) via the base class pointer, eliminating this block entirely.

Comment on lines +358 to +368
// Read tags (now properly load all tags with correct num_points)
{
auto tag_reader = reader_set.Get(DISKANN_TAG_FILE);
auto tag_data = std::make_unique<char[]>(tag_reader->Size());
tag_reader->Read(0, tag_reader->Size(), tag_data.get());
std::stringstream tag_stream;
tag_stream.write(tag_data.get(), static_cast<std::streamsize>(tag_reader->Size()));
tag_stream.seekg(0);

LoadDiskANNTags(tag_stream);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The tag file is read from the ReaderSet for a second time here. It was already read on lines 321-329 to get the number of points. This is inefficient as it involves a redundant I/O operation and memory allocation.

Consider reading the file content into a buffer once, extracting the header information, and then creating a stringstream from that same buffer to be used by LoadDiskANNTags.

Comment on lines +456 to +458
if (cb_npts < 0 || cb_dim < 0 || cb_npts > 100 || cb_dim > 100) {
throw VsagException(ErrorType::INVALID_ARGUMENT, "Invalid PQ file header");
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This check on cb_npts and cb_dim with arbitrary upper bounds of 100 seems like a leftover debug assertion, as indicated by the comment // Debug output. A more precise validation is already performed in SetupProductQuantizer on line 527. This redundant and vague check should be removed to avoid confusion.

uint64_t sector_len =
std::max(static_cast<uint64_t>(DISKANN_SECTOR_LEN),
(max_node_len + DISKANN_SECTOR_LEN - 1) & ~(DISKANN_SECTOR_LEN - 1));
uint64_t nnodes_per_sector = sector_len / max_node_len;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The division sector_len / max_node_len could lead to a division-by-zero error if max_node_len is 0. Although this is unlikely with valid index files, adding a check for max_node_len > 0 before this line would make the code more robust against corrupted or malformed input.

Suggested change
uint64_t nnodes_per_sector = sector_len / max_node_len;
if (max_node_len == 0) {
throw VsagException(ErrorType::INVALID_ARGUMENT, "Invalid node length calculated from layout parameters.");
}
uint64_t nnodes_per_sector = sector_len / max_node_len;

Comment on lines +108 to +112
void
SetMaximumDegree(uint32_t maximum_degree) override {
this->maximum_degree_ = maximum_degree;
this->code_line_size_ = maximum_degree * sizeof(InnerIdType) + sizeof(uint32_t);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This SetMaximumDegree method appears to be unused within this pull request. The graph's max_degree is set via parameters during construction. If this method is not necessary for the current functionality, it should be removed to avoid dead code.

HGraphDiskANNLoader is a specialized HGraph index that can load existing
DiskANN format data and convert it to HGraph's in-memory format.

Key features:
- Load DiskANN PQ compressed vectors and convert to HGraph basic_flatten_codes_
- Load DiskANN Vamana graph and convert to HGraph bottom_graph_
- Load DiskANN tags for ID mapping
- Load DiskANN precise vectors for reorder support
- Support both BinarySet and ReaderSet deserialization

This enables migration from DiskANN to HGraph without rebuilding indexes.

Signed-off-by: jinjiabao.jjb <jinjiabao.jjb@antgroup.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: jinjiabao.jjb <jinjiabao.jjb@antgroup.com>
@wxyucs wxyucs requested a review from Copilot March 13, 2026 06:52
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new HGraphDiskANNLoader index type that can deserialize DiskANN artifacts (PQ pivots/codes, Vamana graph, tags, and precise vectors) and convert them into HGraph’s in-memory structures to enable DiskANN→HGraph migration without rebuilding.

Changes:

  • Introduces HGraphDiskANNLoader and wires it into the engine/factory and string constants.
  • Extends PQ/flatten/graph components to support loading external PQ codebooks and inserting pre-encoded PQ codes.
  • Adds functional + unit tests covering factory creation, deserialization via BinarySet/ReaderSet, and search behaviors.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
tests/test_hgraph_diskann_loader.cpp Adds functional tests for DiskANN→HGraph loader via BinarySet and ReaderSet plus searches.
src/algorithm/hgraph_diskann_loader.h Declares the new loader index type and DiskANN loading/conversion entry points (plus test helpers).
src/algorithm/hgraph_diskann_loader.cpp Implements DiskANN artifact parsing and conversion to HGraph structures.
src/algorithm/hgraph_diskann_loader_test.cpp Adds Catch2 unit tests for PQ/graph/layout conversion correctness.
src/factory/engine.cpp Registers the new index type in Engine::CreateIndex.
src/constants.cpp / include/vsag/constants.h / src/inner_string_params.h Adds new index type string constant and default mapping.
include/vsag/index.h Extends IndexType enum with HGRAPH_DISKANN_LOADER.
src/quantization/product_quantization/product_quantizer.h/.cpp Adds LoadCodebook() for importing external PQ codebooks.
src/datacell/flatten_interface.h Adds InsertCodes() and GetQuantizerPtr() hooks for advanced loading paths.
src/datacell/flatten_datacell.h Implements InsertCodes() + exposes quantizer pointer via GetQuantizerPtr().
src/datacell/graph_datacell.h Adds SetMaximumDegree() to adjust graph degree metadata.
src/index/diskann.h / src/index/diskann_zparameters.h Qualifies ::diskann::Metric to avoid namespace ambiguity.
src/algorithm/hgraph.h Adds friend class HGraphDiskANNLoader; for internal access.


// Create visited list pool
pool_ = std::make_shared<VisitedListPool>(1, allocator_, diskann_num_points_, allocator_);

Comment on lines +265 to +279
// Load graph FIRST to get diskann_max_degree_ before loading precise vectors
// because precise vectors layout depends on max_degree for calculating node offsets
if (binary_set.Contains(DISKANN_GRAPH)) {
auto graph_binary = binary_set.Get(DISKANN_GRAPH);
std::stringstream graph_stream;
graph_stream.write(reinterpret_cast<const char*>(graph_binary.data.get()),
static_cast<std::streamsize>(graph_binary.size));
graph_stream.seekg(0);
LoadDiskANNGraph(graph_stream, diskann_num_points_);
}

// Now load precise vectors - diskann_max_degree_ is set correctly
LoadDiskANNPreciseVectors(
layout_stream, diskann_num_points_, diskann_dim_, diskann_max_degree_);

Comment on lines +554 to +561
// Calculate subspace_dim
subspace_dim = static_cast<uint32_t>(dim) / num_subspaces;

// Read the entire DiskANN codebook
// DiskANN format: float[num_centers][dim] where each row is a full-dim centroid
Vector<float> diskann_codebook(static_cast<uint64_t>(num_centers) * dim, allocator_);
pq_stream.read(reinterpret_cast<char*>(diskann_codebook.data()),
static_cast<std::streamsize>(diskann_codebook.size() * sizeof(float)));
Comment on lines +106 to +111
// Get raw pointer to quantizer for advanced operations (e.g., loading external codebook)
// Returns nullptr if the operation is not supported
[[nodiscard]] virtual void*
GetQuantizerPtr() {
return nullptr;
}
Comment on lines +84 to +143
public:
// Test-only accessor for unit testing
void
TestLoadDiskANNPQData(std::istream& pq_stream,
std::istream& compressed_stream,
uint64_t num_points) {
LoadDiskANNPQData(pq_stream, compressed_stream, num_points);
}

// Test-only accessor to get the flatten codes interface
FlattenInterfacePtr
TestGetFlattenCodes() const {
return this->basic_flatten_codes_;
}

// Test-only accessor for LoadDiskANNGraph
void
TestLoadDiskANNGraph(std::istream& graph_stream, uint64_t num_points) {
LoadDiskANNGraph(graph_stream, num_points);
}

// Test-only accessor to get the bottom graph
GraphInterfacePtr
TestGetBottomGraph() const {
return this->bottom_graph_;
}

// Test-only accessor to get entry point
InnerIdType
TestGetEntryPoint() const {
return this->entry_point_id_;
}

// Test-only accessor to get max degree from loading
uint64_t
TestGetDiskANNMaxDegree() const {
return this->diskann_max_degree_;
}

// Test-only accessor for LoadDiskANNPreciseVectors
void
TestLoadDiskANNPreciseVectors(std::istream& layout_stream,
uint64_t num_points,
uint64_t dim,
uint64_t max_degree) {
LoadDiskANNPreciseVectors(layout_stream, num_points, dim, max_degree);
}

// Test-only accessor to get precise codes
FlattenInterfacePtr
TestGetPreciseCodes() const {
return this->high_precise_codes_;
}

// Test-only accessor to set use_reorder flag
void
TestSetUseReorder(bool use_reorder) {
this->use_reorder_ = use_reorder;
}

Comment on lines +188 to +194
// Serialize and load
auto binary_set = diskann_index->Serialize().value();
auto loader_param =
GenerateHGraphDiskANNLoaderBuildParametersString(metric_type, dim, 16, dim / 4);
auto loader_index = TestFactory(loader_name, loader_param, true);
loader_index->Deserialize(binary_set);

Comment on lines +15 to +23
#include "hgraph_diskann_loader.h"

#include <catch2/catch_approx.hpp>
#include <catch2/catch_test_macros.hpp>
#include <catch2/generators/catch_generators.hpp>
#include <fstream>
#include <memory>
#include <sstream>

inabao and others added 2 commits March 13, 2026 16:15
…r private methods

- Extract helper functions to reduce code duplication
- Add finalize_loading() to consolidate common initialization
- Use macros to simplify MetricType switch statements
- Rename all private functions to snake_case naming convention

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: jinjiabao.jjb <jinjiabao.jjb@antgroup.com>
Signed-off-by: jinjiabao.jjb <jinjiabao.jjb@antgroup.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants