feat: add HNSW index deserialization NOT READY FOR REVIEW #6531
Conversation
Augment PR Summary: This PR adds initial support for deserializing the global HNSW (vector) index graph structure from RDB snapshots.
Technical notes: restoration is gated on matching shard counts, since GlobalDocId encodes the shard id.
Pull request overview
This pull request implements HNSW (Hierarchical Navigable Small World) vector index deserialization from RDB snapshots. The feature allows HNSW graph structures to be saved and restored, improving startup times by avoiding full index rebuilds. The implementation uses a two-phase approach: first restoring the graph structure from the RDB file, then populating vector data during the rebuild phase.
Changes:
- Added HNSW graph serialization/deserialization infrastructure with metadata and node data structures
- Implemented graph restoration logic that reconstructs HNSW links without rebuilding from scratch
- Modified RDB load/save to handle HNSW index data as a new opcode (RDB_OPCODE_VECTOR_INDEX)
- Added tracking mechanism to distinguish between restored and newly-created indices during rebuild
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 14 comments.
| File | Description |
|---|---|
| src/server/search/global_hnsw_index.h | Added methods to track restored indices and set metadata on HNSW indices |
| src/server/search/global_hnsw_index.cc | Implemented restored indices tracking, metadata setting, and registry cleanup |
| src/server/search/doc_index.cc | Modified rebuild to use UpdateVectorData for restored indices instead of Add |
| src/server/rdb_save.cc | Changed CollectSearchIndices parameter from pointer to const reference |
| src/server/rdb_load.cc | Implemented HNSW graph deserialization, metadata parsing from JSON, and post-load cleanup |
| src/core/search/hnsw_index.h | Added RestoreFromNodes, UpdateVectorData, SetMetadata methods to support restoration |
| src/core/search/hnsw_index.cc | Implemented low-level graph restoration with direct memory manipulation and vector data updates |
Force-pushed 4f6d9c1 → cf80534
Force-pushed cf80534 → 0a2fcd4
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
Force-pushed 0a2fcd4 → 3f7185b
Force-pushed 3f7185b → f6c9227
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 10 comments.
Force-pushed f6c9227 → e95885b
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated no new comments.
Force-pushed e6ba2c4 → ca6ef7d
Force-pushed ca6ef7d → 5398cfa
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 9 comments.
Force-pushed 5398cfa → ac2ca71
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 6 comments.
Force-pushed 4e4f7b1 → 624cace
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.
Force-pushed 1d82eae → b91d174
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.
src/core/search/hnsw_index.cc (outdated)
```cpp
// Set the metadata for the graph - use actual restored count for consistency
world_.cur_element_count.store(restored_count);
world_.maxlevel_ = metadata.maxlevel;
world_.enterpoint_node_ = metadata.enterpoint_node;
```
Copilot AI · Feb 10, 2026
RestoreFromNodes sets world_.cur_element_count to restored_count without checking consistency with metadata.cur_element_count or validating that restored internal_ids are contiguous [0..N). If the nodes vector is missing entries or contains out-of-range/sparse internal_ids, the index state can become inconsistent. Consider validating ordering/contiguity and using metadata.cur_element_count (or max_internal_id + 1) only when consistent, otherwise abort restoration and rebuild from data.
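A validation pass along these lines could run before any graph memory is touched (a sketch only: `HnswNodeData` and its `internal_id` field follow this PR's structures, while `HnswGraphMetadata` and the boolean error convention are assumptions):

```cpp
// Sketch: confirm restored internal_ids form exactly [0..N) and agree with metadata
// before committing cur_element_count. Returns false on any sparse, duplicate, or
// out-of-range id so the caller can abort restoration and rebuild from data.
bool ValidateRestoredIds(const std::vector<search::HnswNodeData>& nodes,
                         const HnswGraphMetadata& metadata) {
  std::vector<bool> seen(nodes.size(), false);
  for (const auto& node : nodes) {
    if (node.internal_id >= nodes.size() || seen[node.internal_id])
      return false;  // out-of-range or duplicate id => sparse or corrupt snapshot
    seen[node.internal_id] = true;
  }
  // Every id in [0..nodes.size()) appeared exactly once; the serialized count must match.
  return metadata.cur_element_count == nodes.size();
}
```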
src/server/rdb_load.cc (outdated)
```cpp
// Wait for all mappings to be applied
shard_set->RunBriefInParallel([](EngineShard*) {});

// Rebuild all search indices - for restored HNSW indices, this will populate vectors
```
Copilot AI · Feb 10, 2026
shard_set->RunBriefInParallel([](EngineShard*) {}) does not wait for callbacks enqueued via shard_set->Add(...) (it dispatches independently via DispatchBrief). If the intent is to wait until the RestoreKeyIndex callbacks run before continuing, use a shard-queue-based barrier (e.g., AwaitRunningOnShardQueue with a no-op) or rely on the ordering guarantee of subsequently enqueued shard-queue work and remove this misleading call.
Suggested change:

```diff
- // Wait for all mappings to be applied
- shard_set->RunBriefInParallel([](EngineShard*) {});
- // Rebuild all search indices - for restored HNSW indices, this will populate vectors
+ // Rebuild all search indices - for restored HNSW indices, this will populate vectors.
+ // AwaitRunningOnShardQueue runs after all previously enqueued shard-queue work (including
+ // RestoreKeyIndex callbacks scheduled via shard_set->Add).
```
src/core/search/hnsw_index.cc (outdated)
```cpp
std::vector<size_t> restored_internal_ids;  // Track for cleanup on abort
restored_internal_ids.reserve(nodes.size());
```
Copilot AI · Feb 10, 2026
restored_internal_ids is populated but never used (comment suggests cleanup on abort, but there is no abort path). Either implement the intended rollback/cleanup on partial restore failures or remove the unused tracking to reduce confusion.
Suggested change:

```diff
- std::vector<size_t> restored_internal_ids;  // Track for cleanup on abort
- restored_internal_ids.reserve(nodes.size());
```
Force-pushed b91d174 → 8da0111
Force-pushed 8da0111 → 0587533
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.
```cpp
// Restore links for layer 0
if (!node.levels_links.empty()) {
  auto* ll0 = world_.get_linklist0(internal_id);
  world_.setListCount(ll0, node.levels_links[0].size());
  auto* links0 = reinterpret_cast<uint32_t*>(ll0 + 1);
  std::copy(node.levels_links[0].begin(), node.levels_links[0].end(), links0);
}

// Restore links for upper layers
for (int lvl = 1; lvl <= node.level && lvl < static_cast<int>(node.levels_links.size());
     ++lvl) {
  auto* ll = world_.get_linklist(internal_id, lvl);
  world_.setListCount(ll, node.levels_links[lvl].size());
  auto* links = reinterpret_cast<uint32_t*>(ll + 1);
  std::copy(node.levels_links[lvl].begin(), node.levels_links[lvl].end(), links);
}
```
Copilot AI · Feb 11, 2026
RestoreFromNodes() writes restored neighbor lists directly into HNSW link-list memory (setListCount + std::copy) without validating that levels_links[lvl].size() fits the per-layer capacity (maxM0_ for level 0, maxM_ for upper layers) and that level values are sane. With corrupted input this can overflow the allocated link-list buffers and corrupt memory. Add strict validation/clamping for level/links_num before copying, and prefer returning an RDB corruption error instead of crashing/overrunning.
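A bounds check before each copy could look like the following (a sketch: `maxM0_` and `maxM_` are hnswlib's per-layer capacity fields, while the boolean abort convention and log wording are assumptions, not this PR's API):

```cpp
// Sketch: validate the serialized link count against the fixed per-layer capacity
// before setListCount/copy. Level-0 lists hold up to maxM0_ links, upper layers maxM_.
const size_t cap = (lvl == 0) ? world_.maxM0_ : world_.maxM_;
if (node.levels_links[lvl].size() > cap) {
  LOG(ERROR) << "Corrupt HNSW snapshot: node " << internal_id << " has "
             << node.levels_links[lvl].size() << " links at level " << lvl
             << ", exceeding capacity " << cap;
  return false;  // abort restoration; caller falls back to rebuilding from data
}
```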
src/core/search/hnsw_index.cc (outdated)
```cpp
// Initialize vector data to a safe default. This prevents crashes during KNN search
// if UpdateVectorData is not called for this node (e.g., document was deleted).
// In copy mode: zero the vector memory. In borrowed mode: set pointer to nullptr.
if (world_.copy_vector_) {
  char* data_ptr = world_.data_vector_memory_ + internal_id * world_.data_size_;
  memset(data_ptr, 0, world_.data_size_);
} else {
  char* ptr_location = world_.getDataPtrByInternalId(internal_id);
  char* null_ptr = nullptr;
  memcpy(ptr_location, &null_ptr, sizeof(void*));
}
```
Copilot AI · Feb 11, 2026
In borrowed-vector mode (!world_.copy_vector_), RestoreFromNodes() initializes each node’s vector pointer to nullptr. HierarchicalNSW::searchKnn() (in hnsw_alg.h) calls the distance function with getDataByInternalId() results without null checks, so any node left with a null vector pointer can segfault during KNN search. Use a safe non-null fallback (e.g. point to a zero-vector buffer owned by the index) and/or explicitly mark nodes as deleted until UpdateVectorData() succeeds (and ensure missing documents are removed).
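One safe-fallback shape for the borrowed-vector branch (a sketch: `zero_vector_` is a hypothetical index-owned buffer, not a member this PR defines, and the mark-deleted step assumes hnswlib's delete-marking by label):

```cpp
// Sketch: point unpopulated borrowed-mode nodes at an index-owned zero buffer so the
// distance function never dereferences nullptr. zero_vector_ would be sized once to
// world_.data_size_ bytes of zeros when the index is created.
char* fallback = zero_vector_.data();
char* ptr_location = world_.getDataPtrByInternalId(internal_id);
memcpy(ptr_location, &fallback, sizeof(void*));
// Optionally also hide the node from search until its real vector arrives (e.g. via
// hnswlib's delete-marking), and unmark it inside UpdateVectorData on success.
```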
```cpp
for (const auto& [key, local_id] : key_index_.GetDocKeysMap()) {
  auto it = db_slice.FindMutable(op_args.db_cntx, key, base_->GetObjCode());
  if (!it || !IsValid(it->it)) {
```
Copilot AI · Feb 11, 2026
When a key from the restored key-index mapping is missing from the DB (!it || !IsValid), the code just increments missing_documents and continues. For restored HNSW graphs this leaves nodes in the index with unpopulated vector data (and in borrowed-vector mode can lead to crashes during KNN). Consider removing/marking those nodes as deleted in the HNSW index when the backing document is missing, or otherwise ensuring they can’t participate in search.
Suggested change:

```diff
  if (!it || !IsValid(it->it)) {
+   // Backing document is missing or invalid. Ensure corresponding HNSW nodes
+   // are removed/marked-deleted so they cannot participate in search.
+   GlobalDocId global_id =
+       search::CreateGlobalDocId(EngineShard::tlocal()->shard_id(), local_id);
+   for (const auto& [field_ident, field_info] : GetIndexedHnswFields(base_->schema)) {
+     if (auto index =
+             GlobalHnswIndexRegistry::Instance().Get(index_name, field_info.short_name);
+         index) {
+       index->RemoveVectorData(global_id);
+     }
+   }
```
```cpp
void ShardDocIndex::RebuildGlobalVectorIndices(std::string_view index_name, const OpArgs& op_args) {
  // Don't run loop if no vector fields are present
  if (std::ranges::empty(GetIndexedHnswFields(base_->schema)))
    return;

  // Check if any vector index was restored from RDB - if so, iterate by index keys
  // which is more efficient than traversing the entire database
  bool any_restored = false;
  for (const auto& [field_ident, field_info] : GetIndexedHnswFields(base_->schema)) {
    if (auto index = GlobalHnswIndexRegistry::Instance().Get(index_name, field_info.short_name);
        index && index->IsRestored()) {
      any_restored = true;
      break;
    }
  }

  if (any_restored) {
    // Iterate by index keys - more efficient for restored indices
    LOG(INFO) << "Restoring vector index '" << index_name << "' from serialized graph on shard "
              << EngineShard::tlocal()->shard_id();
    RebuildGlobalVectorIndicesByIndexKeys(index_name, op_args);
  } else {
    // Iterate by database - needed when building new index
    LOG(INFO) << "Rebuilding vector index '" << index_name << "' from database on shard "
              << EngineShard::tlocal()->shard_id() << " (no serialized graph available)";
    RebuildGlobalVectorIndicesByDatabase(index_name, op_args);
  }
```
Copilot AI · Feb 11, 2026
RebuildGlobalVectorIndices()/...ByIndexKeys() implement the intended post-RDB vector population via UpdateVectorData, but there’s no call site for RebuildGlobalVectorIndices() in the codebase (a repo-wide search only finds this definition). The current index build flow (search::IndexBuilder::VectorLoop) still calls AddDocToGlobalVectorIndex() which uses HnswVectorIndex::Add() even when an index was restored. As a result, restored graphs won’t get their vector data populated via UpdateVectorData. Wire this function into the rebuild/indexing path (or teach AddDocToGlobalVectorIndex/VectorLoop to call UpdateVectorData when IsRestored() is set).
src/server/rdb_load.cc (outdated)
```cpp
// Apply pending index key-to-DocId mappings before rebuilding indices
// Group mappings by shard_id for efficient dispatch
absl::flat_hash_map<uint32_t, std::vector<const PendingIndexMapping*>> mappings_by_shard;
for (const auto& pim : index_mappings) {
  mappings_by_shard[pim.shard_id].push_back(&pim);
}

// Apply mappings on each shard (assuming same shard count as when snapshot was taken)
for (const auto& [shard_id, shard_mappings] : mappings_by_shard) {
  if (shard_id >= shard_set->size()) {
    LOG(ERROR) << "Invalid shard_id in RDB: " << shard_id << " (max: " << shard_set->size() - 1
               << "). Skipping index mappings for this shard.";
    continue;
  }
  shard_set->Add(shard_id, [shard_mappings]() {
    EngineShard* es = EngineShard::tlocal();
    for (const auto* pim : shard_mappings) {
      if (auto* index = es->search_indices()->GetIndex(pim->index_name); index) {
        index->RestoreKeyIndex(pim->mappings);
        VLOG(1) << "Restored " << pim->mappings.size() << " key mappings for index "
                << pim->index_name << " on shard " << pim->shard_id;
      }
    }
  });
}
// Wait for all mappings to be applied
shard_set->RunBriefInParallel([](EngineShard*) {});

// Rebuild all search indices - for restored HNSW indices, this will populate vectors
shard_set->AwaitRunningOnShardQueue([](EngineShard* es) {
  OpArgs op_args{es, nullptr,
                 DbContext{&namespaces->GetDefaultNamespace(), 0, GetCurrentTimeMs()}};
  es->search_indices()->RebuildAllIndices(op_args);
});
```
Copilot AI · Feb 11, 2026
RestoreKeyIndex() is applied just before calling RebuildAllIndices(), but ShardDocIndex::Rebuild() currently resets key_index_ (key_index_ = DocKeyIndex{}), discarding the restored key->DocId mapping. This breaks the core requirement for HNSW restoration (DocIds must match the serialized graph). Either rebuild indices while preserving the restored DocIds (iterate restored mappings and add docs with those IDs), or avoid calling the rebuild path that reassigns DocIds when restoration is enabled.
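A minimal way to keep the restored mapping intact is to guard the reset (a sketch: `key_index_restored_` is a hypothetical flag that `RestoreKeyIndex()` would set; this PR does not define it):

```cpp
// Sketch: inside ShardDocIndex::Rebuild(), skip reassigning DocIds when a key->DocId
// mapping was restored from the RDB, so ids stay aligned with the serialized HNSW graph.
if (!key_index_restored_) {
  key_index_ = DocKeyIndex{};  // fresh ids only when nothing was restored
}
// Documents found during the rebuild scan then reuse their restored ids via
// key_index_ lookups instead of receiving newly allocated ones.
```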
Force-pushed cf806b8 → c72b1f9
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 12 comments.
```cpp
CHECK(metadata);

// Restore the HNSW graph directly and mark as restored
hnsw_index->RestoreFromNodes(nodes, *metadata);

LOG(INFO) << "Restored HNSW index " << index_key << " with " << nodes.size() << " nodes";
```
Copilot AI · Feb 12, 2026
Using CHECK(metadata) will crash the server on load if the AUX metadata is missing/invalid (or if the RDB contains vector index opcodes without matching metadata). Prefer a graceful fallback: log a warning, skip restoring this HNSW graph, and let the index rebuild from data (or return a load error) instead of aborting the process.
Suggested change:

```diff
- CHECK(metadata);
- // Restore the HNSW graph directly and mark as restored
- hnsw_index->RestoreFromNodes(nodes, *metadata);
- LOG(INFO) << "Restored HNSW index " << index_key << " with " << nodes.size() << " nodes";
+ if (!metadata) {
+   LOG(WARNING) << "Missing HNSW metadata for index \"" << index_key
+                << "\" (index_name=\"" << index_name << "\", field_name=\"" << field_name
+                << "\") while restoring " << nodes.size()
+                << " HNSW nodes from RDB. Skipping graph restore; index will be rebuilt "
+                << "from base data.";
+ } else {
+   // Restore the HNSW graph directly and mark as restored
+   hnsw_index->RestoreFromNodes(nodes, *metadata);
+   LOG(INFO) << "Restored HNSW index " << index_key << " with " << nodes.size()
+             << " nodes";
+ }
```
```cpp
auto* ll = world_.get_linklist(internal_id, lvl);
world_.setListCount(ll, node.levels_links[lvl].size());
auto* links = reinterpret_cast<uint32_t*>(ll + 1);
std::copy(node.levels_links[lvl].begin(), node.levels_links[lvl].end(), links);
}
```
Copilot AI · Feb 12, 2026
RestoreFromNodes copies upper-layer links without validating the link count against the per-layer capacity (maxM_). This is another potential buffer overflow with malformed input. Validate counts and fail restoration (or return an error) instead of copying blindly.
```cpp
OpArgs op_args{EngineShard::tlocal(), nullptr, db_cntx};
index_->RebuildGlobalVectorIndices(index_->base_->name, op_args);
```
Copilot AI · Feb 12, 2026
VectorLoop calls RebuildGlobalVectorIndices directly when an index is restored, bypassing the shard FiberQueue serialization used below via shard_set->Await. This contradicts the mutex-ordering rationale in the comment (and can reintroduce deadlocks if multiple builders/updates touch multiple global vector indices concurrently). Consider dispatching the rebuild/update work through the shard queue as well, similar to the non-restored path.
Suggested change:

```diff
- OpArgs op_args{EngineShard::tlocal(), nullptr, db_cntx};
- index_->RebuildGlobalVectorIndices(index_->base_->name, op_args);
+ // Serialize rebuild of global vector indices through the shard queue to preserve
+ // a single linear mutex acquisition order across all global indices.
+ auto shard_cb = [this, db_cntx] {
+   OpArgs op_args{EngineShard::tlocal(), nullptr, db_cntx};
+   index_->RebuildGlobalVectorIndices(index_->base_->name, op_args);
+ };
+ shard_set->Await(EngineShard::tlocal()->shard_id(), std::move(shard_cb));
```
```cpp
auto it = db_slice.FindMutable(op_args.db_cntx, key, base_->GetObjCode());
if (!it || !IsValid(it->it)) {
  ++missing_documents;
  continue;
}

PrimeValue& pv = it->it->second;
auto doc = GetAccessor(op_args.db_cntx, pv);
GlobalDocId global_id = search::CreateGlobalDocId(EngineShard::tlocal()->shard_id(), local_id);
```
Copilot AI · Feb 12, 2026
In the restored-index path, missing/invalid documents are counted and skipped, but their corresponding HNSW nodes remain in the restored graph. For HASH (borrowed vector) indices, those nodes still have nullptr vector pointers from RestoreFromNodes, and KNN can dereference nullptr via getDataByInternalId(). Consider removing/mark-deleting the node(s) in the global HNSW index when a document is missing/invalid to avoid crashes/ghost results.
Suggested change:

```diff
- auto it = db_slice.FindMutable(op_args.db_cntx, key, base_->GetObjCode());
- if (!it || !IsValid(it->it)) {
-   ++missing_documents;
-   continue;
- }
- PrimeValue& pv = it->it->second;
- auto doc = GetAccessor(op_args.db_cntx, pv);
- GlobalDocId global_id = search::CreateGlobalDocId(EngineShard::tlocal()->shard_id(), local_id);
+ GlobalDocId global_id =
+     search::CreateGlobalDocId(EngineShard::tlocal()->shard_id(), local_id);
+ auto it = db_slice.FindMutable(op_args.db_cntx, key, base_->GetObjCode());
+ if (!it || !IsValid(it->it)) {
+   ++missing_documents;
+   // Ensure any existing HNSW nodes for this document are removed/marked-deleted so that
+   // restored graphs do not retain nodes with nullptr vector pointers.
+   for (const auto& [field_ident, field_info] : GetIndexedHnswFields(base_->schema)) {
+     if (auto index =
+             GlobalHnswIndexRegistry::Instance().Get(index_name, field_info.short_name);
+         index) {
+       index->MarkDeleted(global_id);
+     }
+   }
+   continue;
+ }
+ PrimeValue& pv = it->it->second;
+ auto doc = GetAccessor(op_args.db_cntx, pv);
```
```cpp
uint64_t elements_number;
SET_OR_RETURN(LoadLen(nullptr), elements_number);

// Only restore if flag enabled and shard count matches (GlobalDocId encodes shard_id)
bool should_restore = GetFlag(FLAGS_deserialize_hnsw_index) && shard_count_ > 0 &&
```
Copilot AI · Feb 12, 2026
elements_number is read from the RDB and then used to reserve memory and drive loops. With a corrupted/untrusted snapshot this can cause huge allocations or very long loops (DoS/OOM). Add sanity checks before reserving/iterating (e.g. cap by metadata.max_elements, by a hard limit, and/or return rdb_file_corrupted on unreasonable values).
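A sanity check in that spirit (a sketch: the hard cap is an arbitrary assumed constant, and the `RdbError(errc::rdb_file_corrupted)` return mirrors the loader's error convention as the reviewer describes it):

```cpp
// Sketch: bound elements_number before reserving memory or driving loops with it.
constexpr uint64_t kMaxRestorableElements = 1ULL << 30;  // assumed hard limit
if (elements_number > metadata.max_elements || elements_number > kMaxRestorableElements) {
  LOG(ERROR) << "HNSW snapshot claims " << elements_number
             << " elements, exceeding max_elements=" << metadata.max_elements;
  return RdbError(errc::rdb_file_corrupted);
}
```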
```cpp
for (const auto& [shard_id, shard_mappings] : index_mappings) {
  if (shard_id >= shard_set->size()) {
    LOG(ERROR) << "Invalid shard_id in RDB: " << shard_id << " (max: " << shard_set->size() - 1
               << "). Skipping index mappings for this shard.";
    continue;
  }
  shard_set->Add(shard_id, [&shard_mappings]() {
    EngineShard* es = EngineShard::tlocal();
    for (const auto& pim : shard_mappings) {
      if (auto* index = es->search_indices()->GetIndex(pim.index_name); index) {
        index->RestoreKeyIndex(pim.mappings);
        VLOG(1) << "Restored " << pim.mappings.size() << " key mappings for index "
                << pim.index_name << " on shard " << es->shard_id();
      }
    }
  });
}
// Wait for all mappings to be applied
shard_set->RunBriefInParallel([](EngineShard*) {});
```
Copilot AI · Feb 12, 2026
RunBriefInParallel() does not guarantee that FiberQueue work enqueued via shard_set->Add() has been drained. As written, the code may proceed to RebuildAllIndices while mapping-restore tasks are still pending, leading to races and inconsistent key_index_ state. Use a proper barrier on the shard FiberQueues (e.g. AwaitRunningOnShardQueue / per-shard sentinel callbacks with a BlockingCounter) to ensure all Add() tasks completed before continuing.
Suggested change:

```diff
- for (const auto& [shard_id, shard_mappings] : index_mappings) {
-   if (shard_id >= shard_set->size()) {
-     LOG(ERROR) << "Invalid shard_id in RDB: " << shard_id << " (max: " << shard_set->size() - 1
-                << "). Skipping index mappings for this shard.";
-     continue;
-   }
-   shard_set->Add(shard_id, [&shard_mappings]() {
-     EngineShard* es = EngineShard::tlocal();
-     for (const auto& pim : shard_mappings) {
-       if (auto* index = es->search_indices()->GetIndex(pim.index_name); index) {
-         index->RestoreKeyIndex(pim.mappings);
-         VLOG(1) << "Restored " << pim.mappings.size() << " key mappings for index "
-                 << pim.index_name << " on shard " << es->shard_id();
-       }
-     }
-   });
- }
- // Wait for all mappings to be applied
- shard_set->RunBriefInParallel([](EngineShard*) {});
+ shard_set->AwaitRunningOnShardQueue([&index_mappings, shard_set](EngineShard* es) {
+   const uint32_t shard_id = es->shard_id();
+   if (shard_id >= shard_set->size()) {
+     LOG(ERROR) << "Invalid shard_id in RDB: " << shard_id << " (max: " << shard_set->size() - 1
+                << "). Skipping index mappings for this shard.";
+     return;
+   }
+   auto it = index_mappings.find(shard_id);
+   if (it == index_mappings.end()) {
+     return;
+   }
+   const auto& shard_mappings = it->second;
+   for (const auto& pim : shard_mappings) {
+     if (auto* index = es->search_indices()->GetIndex(pim.index_name); index) {
+       index->RestoreKeyIndex(pim.mappings);
+       VLOG(1) << "Restored " << pim.mappings.size() << " key mappings for index "
+               << pim.index_name << " on shard " << es->shard_id();
+     }
+   }
+ });
```
```cpp
if (!node.levels_links.empty()) {
  auto* ll0 = world_.get_linklist0(internal_id);
  world_.setListCount(ll0, node.levels_links[0].size());
  auto* links0 = reinterpret_cast<uint32_t*>(ll0 + 1);
  std::copy(node.levels_links[0].begin(), node.levels_links[0].end(), links0);
```
Copilot AI · Feb 12, 2026
RestoreFromNodes copies level-0 links without validating the link count against the fixed linklist capacity (maxM0_) for the index. With corrupted snapshots this can overflow the linklist buffer and corrupt memory. Add bounds checks (and treat violations as snapshot corruption) before setListCount()/copy.
```cpp
// Find max doc_id to size the keys_ vector appropriately
DocId max_id = 0;
for (const auto& [key, doc_id] : mappings) {
  max_id = std::max(max_id, doc_id);
}

// Resize keys_ to accommodate all doc_ids
keys_.resize(max_id + 1);
last_id_ = max_id + 1;

// Restore the mappings
for (const auto& [key, doc_id] : mappings) {
  keys_[doc_id] = key;
  ids_[key] = doc_id;
}

// Build free_ids_ list for any gaps in the id sequence
```
Copilot AI · Feb 12, 2026
DocKeyIndex::Restore computes max_id starting from 0 and then unconditionally does keys_.resize(max_id + 1). If mappings is empty, this leaves keys_ sized to 1 and last_id_=1, which makes subsequent Add() start at id=1 and leaves id=0 as a permanent gap. Handle the empty-mappings case explicitly (clear state, keep keys_ empty, last_id_=0).
Suggested change:

```diff
- // Find max doc_id to size the keys_ vector appropriately
- DocId max_id = 0;
- for (const auto& [key, doc_id] : mappings) {
-   max_id = std::max(max_id, doc_id);
- }
- // Resize keys_ to accommodate all doc_ids
- keys_.resize(max_id + 1);
- last_id_ = max_id + 1;
- // Restore the mappings
- for (const auto& [key, doc_id] : mappings) {
-   keys_[doc_id] = key;
-   ids_[key] = doc_id;
- }
- // Build free_ids_ list for any gaps in the id sequence
+ // Reset current state before restoring.
+ keys_.clear();
+ ids_.clear();
+ free_ids_.clear();
+ last_id_ = 0;
+ // If there are no mappings to restore, keep the index empty.
+ if (mappings.empty()) {
+   return;
+ }
+ // Find max doc_id to size the keys_ vector appropriately.
+ DocId max_id = 0;
+ for (const auto& [key, doc_id] : mappings) {
+   max_id = std::max(max_id, doc_id);
+ }
+ // Resize keys_ to accommodate all doc_ids.
+ keys_.resize(max_id + 1);
+ last_id_ = max_id + 1;
+ // Restore the mappings.
+ for (const auto& [key, doc_id] : mappings) {
+   keys_[doc_id] = key;
+   ids_[key] = doc_id;
+ }
+ // Build free_ids_ list for any gaps in the id sequence.
```
```cpp
// Restore the mappings
for (const auto& [key, doc_id] : mappings) {
  keys_[doc_id] = key;
  ids_[key] = doc_id;
}
```
Copilot AI · Feb 12, 2026
DocKeyIndex::Restore does not clear existing ids_/free_ids_/keys_ before applying mappings, and it never resets free_ids_ prior to pushing gaps. If Restore() is called on a non-empty instance (e.g. due to duplicate AUX/restore flow), this will leave stale entries and duplicate free_ids_. Consider starting Restore() with a full reset (ids_.clear(), keys_.clear(), free_ids_.clear(), last_id_=0) and then rebuilding from mappings.
Force-pushed c72b1f9 → fb1648f
Force-pushed fb1648f → adda422
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 7 comments.
Comments suppressed due to low confidence (1)
src/server/rdb_save.cc:1456
CollectSearchIndices only emits HNSW metadata for the first vector field, but SliceSnapshot::SerializeGlobalHnswIndices() serializes all global HNSW indices (one per index_name:field_name). On load, vector-index restoration expects metadata per serialized index_key (currently CHECK(metadata)), so additional vector fields will crash or skip restore. Emit metadata for every vector field (or include metadata alongside the VECTOR_INDEX opcode) and align loader expectations accordingly.
```cpp
// Collect HNSW metadata for vector field (first one found), for now we don't support multiple
// vector fields per index serialization
for (const auto& [fident, finfo] : index_info.base_index.schema.fields) {
  if (finfo.type == search::SchemaField::VECTOR &&
      !(finfo.flags & search::SchemaField::NOINDEX)) {
    if (auto hnsw_index = GlobalHnswIndexRegistry::Instance().Get(index_name, finfo.short_name);
        hnsw_index) {
      auto meta = hnsw_index->GetMetadata();
      TmpJson meta_json;
      meta_json["index_name"] = index_name;
      meta_json["field_name"] = finfo.short_name;
      meta_json["max_elements"] = meta.max_elements;
      meta_json["cur_element_count"] = meta.cur_element_count;
      meta_json["maxlevel"] = meta.maxlevel;
      meta_json["enterpoint_node"] = meta.enterpoint_node;
      hnsw_index_metadata->emplace_back(meta_json.to_string());
      break;
    }
```
```cpp
  shard_set->Add(shard_id, [&shard_mappings]() {
    EngineShard* es = EngineShard::tlocal();
    for (const auto& pim : shard_mappings) {
      if (auto* index = es->search_indices()->GetIndex(pim.index_name); index) {
        index->RestoreKeyIndex(pim.mappings);
        VLOG(1) << "Restored " << pim.mappings.size() << " key mappings for index "
                << pim.index_name << " on shard " << es->shard_id();
      }
    }
  });
}
// Wait for all mappings to be applied
shard_set->RunBriefInParallel([](EngineShard*) {});
}

// Rebuild all search indices - for restored HNSW indices, VectorLoop will
// use UpdateVectorData instead of Add
shard_set->AwaitRunningOnShardQueue([](EngineShard* es) {
  OpArgs op_args{es, nullptr,
                 DbContext{&namespaces->GetDefaultNamespace(), 0, GetCurrentTimeMs()}};
  es->search_indices()->RebuildAllIndices(op_args);
});
```
Copilot AI · Feb 12, 2026
RestoreKeyIndex() is applied before RebuildAllIndices(), but ShardDocIndex::Rebuild() currently resets key_index_ to a fresh DocKeyIndex{} (src/server/search/doc_index.cc:324). That discards the restored key→DocId mapping, so restored HNSW graphs (which are keyed by GlobalDocId) won’t match the DocIds assigned during rebuild. The rebuild path needs to preserve/restored DocIds (e.g., rebuild indices by iterating the restored mapping and adding docs with fixed ids) rather than reassigning them.
```cpp
// Clear restored flags on all HNSW indices now that all shards finished
for (const auto& [key, index] : GlobalHnswIndexRegistry::Instance().GetAll()) {
  index->SetRestored(false);
}
```
Copilot AI · Feb 12, 2026
SetRestored(false) is executed immediately after calling RebuildAllIndices(), but rebuild starts async builders and returns right away. This can clear the restored flag before IndexBuilder::VectorLoop() runs, causing restored graphs to be treated as non-restored (rebuilding instead of UpdateVectorData) and also races with concurrent reads of IsRestored(). Defer clearing until restoration/indexing has completed (or clear per-index when its rebuild finished).
Suggested change:

```diff
- // Clear restored flags on all HNSW indices now that all shards finished
- for (const auto& [key, index] : GlobalHnswIndexRegistry::Instance().GetAll()) {
-   index->SetRestored(false);
- }
```
```diff
 void RdbLoader::LoadSearchIndexDefFromAux(string&& def) {
-  LoadSearchCommandFromAux(service_, std::move(def), "FT.CREATE", "index definition");
+  LoadSearchCommandFromAux(service_, std::move(def), "FT.CREATE", "index definition", true);
```
Copilot AI · Feb 12, 2026
LoadSearchIndexDefFromAux now ignores all FT.CREATE dispatch errors. This can hide real failures (e.g., bad schema/options), leaving the system without an index while later code assumes it exists for restoration. If this is only to tolerate duplicate FT.CREATE across shard files, consider restoring the previous “create-once” guard or selectively ignoring only the expected "Index already exists" error while still logging/handling other errors.
Suggested change (the `parts` variable needs a concrete container type, since absl::StrSplit's lazy range has no size()):

```diff
- LoadSearchCommandFromAux(service_, std::move(def), "FT.CREATE", "index definition", true);
+ // Avoid issuing duplicate FT.CREATE commands across shard files by guarding on index name.
+ // This allows us to treat all other FT.CREATE errors as real failures instead of ignoring them.
+ bool should_create = true;
+ // Try to extract the index name from the FT.CREATE definition: "FT.CREATE <index> ..."
+ // We only use this guard when the string starts with FT.CREATE and has at least two tokens.
+ std::vector<absl::string_view> parts =
+     absl::StrSplit(def, absl::ByAnyChar(" \t\r\n"), absl::SkipWhitespace());
+ if (parts.size() >= 2 && parts[0] == "FT.CREATE") {
+   std::string index_name(parts[1]);
+   {
+     std::lock_guard lk(search_index_mu_);
+     auto [it, inserted] = created_search_indices_.insert(index_name);
+     should_create = inserted;
+   }
+ }
+ if (!should_create) {
+   // Index was already created from another shard's aux data; skip duplicate FT.CREATE.
+   return;
+ }
+ LoadSearchCommandFromAux(service_, std::move(def), "FT.CREATE", "index definition",
+                          /*ignore_errors=*/false);
```
```cpp
for (uint64_t i = 0; i < mapping_count; ++i) {
  string key;
  SET_OR_RETURN(FetchGenericString(), key);
  uint64_t doc_id;
  SET_OR_RETURN(LoadLen(nullptr), doc_id);
  pim.mappings.emplace_back(std::move(key), static_cast<search::DocId>(doc_id));
}
```
Copilot AI · Feb 12, 2026
doc_id is loaded as uint64_t and then cast to search::DocId without bounds checking. A malformed/hostile RDB could wrap/truncate IDs and/or make DocKeyIndex::Restore allocate an enormous keys_ vector (OOM). Validate doc_id <= std::numeric_limits<search::DocId>::max() (and consider also capping max_id / mapping_count) before storing/applying mappings.
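A bounds check along these lines could sit before the cast (a sketch: the error return mirrors the loader's corruption convention as the reviewer describes it, and `<limits>` is assumed available in rdb_load.cc):

```cpp
// Sketch: reject ids that do not fit in search::DocId instead of silently truncating.
if (doc_id > std::numeric_limits<search::DocId>::max()) {
  LOG(ERROR) << "doc_id " << doc_id << " in RDB index mapping exceeds DocId range";
  return RdbError(errc::rdb_file_corrupted);
}
pim.mappings.emplace_back(std::move(key), static_cast<search::DocId>(doc_id));
```

A similar cap on mapping_count itself would also bound the keys_ allocation in DocKeyIndex::Restore.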
```cpp
for (const auto& [field_ident, field_info] : GetIndexedHnswFields(base_->schema)) {
  if (auto index = GlobalHnswIndexRegistry::Instance().Get(index_name, field_info.short_name);
      index) {
    bool success = index->UpdateVectorData(global_id, *doc, field_ident);
    if (success) {
      ++successful_updates;
      if (!index->IsVectorCopied()) {
        pv.SetOmitDefrag(true);
      }
    } else {
      ++failed_updates;
    }
  }
}
```
Copilot AI · Feb 12, 2026
RebuildGlobalVectorIndicesByIndexKeys calls UpdateVectorData for every indexed HNSW field as long as the index exists, even if that particular HnswVectorIndex wasn’t restored (no graph nodes). In that case UpdateVectorData won’t populate anything (label not found) but the current code still counts it as success. Consider branching per-field: if index->IsRestored() use UpdateVectorData, otherwise fall back to Add (or rebuild-by-DB for that field).
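The per-field branch could look like this (a sketch built from this PR's own method names — IsRestored, UpdateVectorData, Add; the assumption that Add shares UpdateVectorData's signature is mine):

```cpp
// Sketch: only restored graphs take the update path; a field whose graph was not
// serialized still gets indexed by inserting the vector normally.
bool success = index->IsRestored()
                   ? index->UpdateVectorData(global_id, *doc, field_ident)  // node exists
                   : index->Add(global_id, *doc, field_ident);              // build node
```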
```cpp
std::vector<search::HnswNodeData> nodes;
if (should_restore) {
  nodes.reserve(elements_number);
}
```
Copilot AI · Feb 12, 2026
When restoration is enabled, nodes.reserve(elements_number) trusts elements_number from the RDB stream. A malformed/hostile snapshot can force huge allocations (and additional allocations from level/links_num). Add sanity limits/validation for these values and fail the load or skip restore safely when limits are exceeded.
```cpp
// Mark index as restored from RDB (should use UpdateVectorData instead of Add)
void SetRestored(bool restored) {
  restored_ = restored;
}

bool IsRestored() const {
  return restored_;
}

private:
 bool copy_vector_;
 size_t dim_;
 VectorSimilarity sim_;
 std::unique_ptr<HnswlibAdapter> adapter_;

 bool restored_ = false;
};
```
Copilot AI · Feb 12, 2026
restored_ is a plain bool that’s written from the loader (RestoreFromNodes/SetRestored) and read from shard builders (IsRestored) across multiple shard threads. This introduces a data race. Make it std::atomic_bool (or guard reads/writes with an existing mutex) and keep the read/write pattern consistent.
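The atomic variant is mechanical (a sketch; relaxed ordering is assumed to suffice because the flag only selects a code path and the shard-queue barrier provides the actual synchronization — `<atomic>` must be included):

```cpp
// Sketch: replace the plain bool with an atomic so loader writes and shard-builder
// reads are race-free under TSan and the C++ memory model.
std::atomic_bool restored_{false};

void SetRestored(bool restored) {
  restored_.store(restored, std::memory_order_relaxed);
}

bool IsRestored() const {
  return restored_.load(std::memory_order_relaxed);
}
```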