Conversation

@BorysTheDev
Contributor

No description provided.

Copilot AI review requested due to automatic review settings February 3, 2026 14:12
@augmentcode

augmentcode bot commented Feb 3, 2026

🤖 Augment PR Summary

Summary: This PR adds initial support for deserializing global HNSW (vector) index graph structure from RDB snapshots.

Changes:

  • Extends HnswVectorIndex/HnswlibAdapter with metadata setters, graph restoration (RestoreFromNodes), and vector payload backfilling (UpdateVectorData).
  • Updates RDB loading to parse RDB_OPCODE_VECTOR_INDEX, rebuild node/link topology in-place, and mark indices as “restored”.
  • Parses optional hnsw_metadata from JSON AUX search-index definitions and applies it to created HNSW indices.
  • Adjusts post-load rebuild so restored indices populate vectors via UpdateVectorData instead of mutating the graph via Add.
  • Adds registry bookkeeping to track/clear restored indices across the load lifecycle.
  • Minor refactor in RDB save path to pass EngineShard by const reference when collecting search index definitions.

Technical Notes: Restoration is gated on matching shard counts (since GlobalDocId encodes shard id) and still relies on the standard rebuild pass to populate vector memory.
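
The shard-count gate follows from how a GlobalDocId is built: the shard id is baked into the document id, so the ids are only valid if the shard layout at load time matches the one at save time. A minimal sketch of such an encoding (the bit width `kShardBits` and the helper names are assumptions for illustration, not Dragonfly's actual layout):

```cpp
#include <cstdint>

// Illustrative shard-encoding scheme for a 64-bit global doc id. The real
// Dragonfly layout may differ; kShardBits and helper names are assumptions.
constexpr unsigned kShardBits = 10;  // assumption: up to 1024 shards
constexpr uint64_t kLocalMask = (uint64_t{1} << (64 - kShardBits)) - 1;

inline uint64_t MakeGlobalDocId(uint32_t shard_id, uint64_t local_id) {
  return (uint64_t{shard_id} << (64 - kShardBits)) | (local_id & kLocalMask);
}

inline uint32_t ShardOf(uint64_t global_id) {
  return static_cast<uint32_t>(global_id >> (64 - kShardBits));
}

inline uint64_t LocalOf(uint64_t global_id) {
  return global_id & kLocalMask;
}
```

With this layout, a snapshot restored on a different shard count would route every decoded id to the wrong shard, which is why restoration is skipped in that case.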


@augmentcode augmentcode bot left a comment

Review completed. 4 suggestions posted.


Copilot AI left a comment

Pull request overview

This pull request implements HNSW (Hierarchical Navigable Small World) vector index deserialization from RDB snapshots. The feature allows HNSW graph structures to be saved and restored, improving startup times by avoiding full index rebuilds. The implementation uses a two-phase approach: first restoring the graph structure from the RDB file, then populating vector data during the rebuild phase.

Changes:

  • Added HNSW graph serialization/deserialization infrastructure with metadata and node data structures
  • Implemented graph restoration logic that reconstructs HNSW links without rebuilding from scratch
  • Modified RDB load/save to handle HNSW index data as a new opcode (RDB_OPCODE_VECTOR_INDEX)
  • Added tracking mechanism to distinguish between restored and newly-created indices during rebuild

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 14 comments.

Show a summary per file
File Description
src/server/search/global_hnsw_index.h Added methods to track restored indices and set metadata on HNSW indices
src/server/search/global_hnsw_index.cc Implemented restored indices tracking, metadata setting, and registry cleanup
src/server/search/doc_index.cc Modified rebuild to use UpdateVectorData for restored indices instead of Add
src/server/rdb_save.cc Changed CollectSearchIndices parameter from pointer to const reference
src/server/rdb_load.cc Implemented HNSW graph deserialization, metadata parsing from JSON, and post-load cleanup
src/core/search/hnsw_index.h Added RestoreFromNodes, UpdateVectorData, SetMetadata methods to support restoration
src/core/search/hnsw_index.cc Implemented low-level graph restoration with direct memory manipulation and vector data updates

@BorysTheDev BorysTheDev force-pushed the HNSW_index_deserialization branch from 4f6d9c1 to cf80534 Compare February 4, 2026 09:32
Copilot AI review requested due to automatic review settings February 4, 2026 09:43
@BorysTheDev BorysTheDev force-pushed the HNSW_index_deserialization branch from cf80534 to 0a2fcd4 Compare February 4, 2026 09:43
Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

@BorysTheDev BorysTheDev force-pushed the HNSW_index_deserialization branch from 0a2fcd4 to 3f7185b Compare February 4, 2026 14:17
Copilot AI review requested due to automatic review settings February 4, 2026 14:57
@BorysTheDev BorysTheDev force-pushed the HNSW_index_deserialization branch from 3f7185b to f6c9227 Compare February 4, 2026 14:57
Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 10 comments.

@BorysTheDev BorysTheDev force-pushed the HNSW_index_deserialization branch from f6c9227 to e95885b Compare February 4, 2026 16:17
Copilot AI review requested due to automatic review settings February 5, 2026 10:35
Copilot AI left a comment

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated no new comments.

@BorysTheDev BorysTheDev force-pushed the HNSW_index_deserialization branch from e6ba2c4 to ca6ef7d Compare February 5, 2026 10:42
Copilot AI review requested due to automatic review settings February 5, 2026 10:47
@BorysTheDev BorysTheDev force-pushed the HNSW_index_deserialization branch from ca6ef7d to 5398cfa Compare February 5, 2026 10:47
Copilot AI left a comment

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 9 comments.

@BorysTheDev BorysTheDev force-pushed the HNSW_index_deserialization branch from 5398cfa to ac2ca71 Compare February 5, 2026 10:57
Copilot AI review requested due to automatic review settings February 5, 2026 11:44
Copilot AI left a comment

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 6 comments.

Copilot AI review requested due to automatic review settings February 6, 2026 07:29
@BorysTheDev BorysTheDev force-pushed the HNSW_index_deserialization branch from 4e4f7b1 to 624cace Compare February 6, 2026 07:29
Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

@BorysTheDev BorysTheDev force-pushed the HNSW_index_deserialization branch 2 times, most recently from 1d82eae to b91d174 Compare February 10, 2026 14:29
Copilot AI review requested due to automatic review settings February 10, 2026 14:29
Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.

Comment on lines 325 to 329
// Set the metadata for the graph - use actual restored count for consistency
world_.cur_element_count.store(restored_count);
world_.maxlevel_ = metadata.maxlevel;
world_.enterpoint_node_ = metadata.enterpoint_node;

Copilot AI Feb 10, 2026

RestoreFromNodes sets world_.cur_element_count to restored_count without checking consistency with metadata.cur_element_count or validating that restored internal_ids are contiguous [0..N). If the nodes vector is missing entries or contains out-of-range/sparse internal_ids, the index state can become inconsistent. Consider validating ordering/contiguity and using metadata.cur_element_count (or max_internal_id + 1) only when consistent, otherwise abort restoration and rebuild from data.
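
The validation suggested here can be sketched as a standalone check (the function name and signature are illustrative, not the HnswlibAdapter API): restored internal ids must form exactly [0..N), and any gap or duplicate means the snapshot is inconsistent and restoration should abort in favor of a rebuild from data.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Returns true iff the restored internal ids are a permutation of [0..N).
// Takes the vector by value because it sorts a scratch copy.
inline bool InternalIdsContiguous(std::vector<size_t> ids) {
  std::sort(ids.begin(), ids.end());
  for (size_t i = 0; i < ids.size(); ++i) {
    if (ids[i] != i)
      return false;  // gap (ids[i] > i) or duplicate (ids[i] < i)
  }
  return true;
}
```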

Comment on lines 3199 to 3202
// Wait for all mappings to be applied
shard_set->RunBriefInParallel([](EngineShard*) {});

// Rebuild all search indices - for restored HNSW indices, this will populate vectors
Copilot AI Feb 10, 2026

shard_set->RunBriefInParallel([](EngineShard*) {}) does not wait for callbacks enqueued via shard_set->Add(...) (it dispatches independently via DispatchBrief). If the intent is to wait until the RestoreKeyIndex callbacks run before continuing, use a shard-queue-based barrier (e.g., AwaitRunningOnShardQueue with a no-op) or rely on the ordering guarantee of subsequently enqueued shard-queue work and remove this misleading call.

Suggested change
- // Wait for all mappings to be applied
- shard_set->RunBriefInParallel([](EngineShard*) {});
- // Rebuild all search indices - for restored HNSW indices, this will populate vectors
+ // Rebuild all search indices - for restored HNSW indices, this will populate vectors.
+ // AwaitRunningOnShardQueue runs after all previously enqueued shard-queue work (including
+ // RestoreKeyIndex callbacks scheduled via shard_set->Add).

Comment on lines 262 to 263
std::vector<size_t> restored_internal_ids; // Track for cleanup on abort
restored_internal_ids.reserve(nodes.size());
Copilot AI Feb 10, 2026

restored_internal_ids is populated but never used (comment suggests cleanup on abort, but there is no abort path). Either implement the intended rollback/cleanup on partial restore failures or remove the unused tracking to reduce confusion.

Suggested change
- std::vector<size_t> restored_internal_ids;  // Track for cleanup on abort
- restored_internal_ids.reserve(nodes.size());

@BorysTheDev BorysTheDev force-pushed the HNSW_index_deserialization branch from b91d174 to 8da0111 Compare February 10, 2026 14:42
Copilot AI review requested due to automatic review settings February 11, 2026 10:11
@BorysTheDev BorysTheDev force-pushed the HNSW_index_deserialization branch from 8da0111 to 0587533 Compare February 11, 2026 10:11
Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.

Comment on lines 303 to 318
// Restore links for layer 0
if (!node.levels_links.empty()) {
auto* ll0 = world_.get_linklist0(internal_id);
world_.setListCount(ll0, node.levels_links[0].size());
auto* links0 = reinterpret_cast<uint32_t*>(ll0 + 1);
std::copy(node.levels_links[0].begin(), node.levels_links[0].end(), links0);
}

// Restore links for upper layers
for (int lvl = 1; lvl <= node.level && lvl < static_cast<int>(node.levels_links.size());
++lvl) {
auto* ll = world_.get_linklist(internal_id, lvl);
world_.setListCount(ll, node.levels_links[lvl].size());
auto* links = reinterpret_cast<uint32_t*>(ll + 1);
std::copy(node.levels_links[lvl].begin(), node.levels_links[lvl].end(), links);
}
Copilot AI Feb 11, 2026

RestoreFromNodes() writes restored neighbor lists directly into HNSW link-list memory (setListCount + std::copy) without validating that levels_links[lvl].size() fits the per-layer capacity (maxM0_ for level 0, maxM_ for upper layers) and that level values are sane. With corrupted input this can overflow the allocated link-list buffers and corrupt memory. Add strict validation/clamping for level/links_num before copying, and prefer returning an RDB corruption error instead of crashing/overrunning.
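
The proposed bounds check can be sketched as a pure validation pass over untrusted snapshot data before any memory is written. Here `maxM0`/`maxM` stand for the per-layer link capacities (hnswlib's `maxM0_`/`maxM_`); the function name and signature are illustrative:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Validate restored neighbor lists before copying them into fixed-size
// link-list buffers. All values are treated as untrusted input.
inline bool LinkCountsValid(const std::vector<std::vector<uint32_t>>& levels_links,
                            size_t maxM0, size_t maxM, size_t num_elements) {
  for (size_t lvl = 0; lvl < levels_links.size(); ++lvl) {
    const size_t cap = (lvl == 0) ? maxM0 : maxM;
    if (levels_links[lvl].size() > cap)
      return false;  // would overflow the fixed-size link-list buffer
    for (uint32_t neighbor : levels_links[lvl]) {
      if (neighbor >= num_elements)
        return false;  // dangling link to a non-existent node
    }
  }
  return true;
}
```

A `false` result would then be surfaced as an RDB corruption error rather than proceeding with the copy.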

Comment on lines 284 to 294
// Initialize vector data to a safe default. This prevents crashes during KNN search
// if UpdateVectorData is not called for this node (e.g., document was deleted).
// In copy mode: zero the vector memory. In borrowed mode: set pointer to nullptr.
if (world_.copy_vector_) {
char* data_ptr = world_.data_vector_memory_ + internal_id * world_.data_size_;
memset(data_ptr, 0, world_.data_size_);
} else {
char* ptr_location = world_.getDataPtrByInternalId(internal_id);
char* null_ptr = nullptr;
memcpy(ptr_location, &null_ptr, sizeof(void*));
}
Copilot AI Feb 11, 2026

In borrowed-vector mode (!world_.copy_vector_), RestoreFromNodes() initializes each node’s vector pointer to nullptr. HierarchicalNSW::searchKnn() (in hnsw_alg.h) calls the distance function with getDataByInternalId() results without null checks, so any node left with a null vector pointer can segfault during KNN search. Use a safe non-null fallback (e.g. point to a zero-vector buffer owned by the index) and/or explicitly mark nodes as deleted until UpdateVectorData() succeeds (and ensure missing documents are removed).
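
The safe-fallback idea can be sketched as follows: the index owns one shared zero vector, and every unpopulated slot points at it instead of at nullptr. Class and member names are illustrative, not the real HnswlibAdapter fields:

```cpp
#include <cstddef>
#include <vector>

// Per-node borrowed-pointer table with a shared zero-vector fallback, so that
// distance functions always receive a valid pointer.
class BorrowedVectorTable {
 public:
  BorrowedVectorTable(size_t dims, size_t capacity)
      : zero_vec_(dims, 0.0f), ptrs_(capacity, zero_vec_.data()) {}

  // Valid even before UpdateVectorData runs for this slot.
  const float* Get(size_t internal_id) const { return ptrs_[internal_id]; }

  void Set(size_t internal_id, const float* data) { ptrs_[internal_id] = data; }

  bool IsPopulated(size_t internal_id) const {
    return ptrs_[internal_id] != zero_vec_.data();
  }

 private:
  std::vector<float> zero_vec_;     // shared fallback buffer owned by the index
  std::vector<const float*> ptrs_;  // per-node borrowed pointers
};
```

Nodes still pointing at the fallback could additionally be marked deleted so they never surface as search results.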


for (const auto& [key, local_id] : key_index_.GetDocKeysMap()) {
auto it = db_slice.FindMutable(op_args.db_cntx, key, base_->GetObjCode());
if (!it || !IsValid(it->it)) {
Copilot AI Feb 11, 2026

When a key from the restored key-index mapping is missing from the DB (!it || !IsValid), the code just increments missing_documents and continues. For restored HNSW graphs this leaves nodes in the index with unpopulated vector data (and in borrowed-vector mode can lead to crashes during KNN). Consider removing/marking those nodes as deleted in the HNSW index when the backing document is missing, or otherwise ensuring they can’t participate in search.

Suggested change
- if (!it || !IsValid(it->it)) {
+ if (!it || !IsValid(it->it)) {
+   // Backing document is missing or invalid. Ensure corresponding HNSW nodes
+   // are removed/marked-deleted so they cannot participate in search.
+   GlobalDocId global_id =
+       search::CreateGlobalDocId(EngineShard::tlocal()->shard_id(), local_id);
+   for (const auto& [field_ident, field_info] : GetIndexedHnswFields(base_->schema)) {
+     if (auto index =
+             GlobalHnswIndexRegistry::Instance().Get(index_name, field_info.short_name);
+         index) {
+       index->RemoveVectorData(global_id);
+     }
+   }

Comment on lines +459 to +485
void ShardDocIndex::RebuildGlobalVectorIndices(std::string_view index_name, const OpArgs& op_args) {
// Don't run loop if no vector fields are present
if (std::ranges::empty(GetIndexedHnswFields(base_->schema)))
return;

// Check if any vector index was restored from RDB - if so, iterate by index keys
// which is more efficient than traversing the entire database
bool any_restored = false;
for (const auto& [field_ident, field_info] : GetIndexedHnswFields(base_->schema)) {
if (auto index = GlobalHnswIndexRegistry::Instance().Get(index_name, field_info.short_name);
index && index->IsRestored()) {
any_restored = true;
break;
}
}

if (any_restored) {
// Iterate by index keys - more efficient for restored indices
LOG(INFO) << "Restoring vector index '" << index_name << "' from serialized graph on shard "
<< EngineShard::tlocal()->shard_id();
RebuildGlobalVectorIndicesByIndexKeys(index_name, op_args);
} else {
// Iterate by database - needed when building new index
LOG(INFO) << "Rebuilding vector index '" << index_name << "' from database on shard "
<< EngineShard::tlocal()->shard_id() << " (no serialized graph available)";
RebuildGlobalVectorIndicesByDatabase(index_name, op_args);
}
Copilot AI Feb 11, 2026

RebuildGlobalVectorIndices()/...ByIndexKeys() implement the intended post-RDB vector population via UpdateVectorData, but there’s no call site for RebuildGlobalVectorIndices() in the codebase (a repo-wide search only finds this definition). The current index build flow (search::IndexBuilder::VectorLoop) still calls AddDocToGlobalVectorIndex() which uses HnswVectorIndex::Add() even when an index was restored. As a result, restored graphs won’t get their vector data populated via UpdateVectorData. Wire this function into the rebuild/indexing path (or teach AddDocToGlobalVectorIndex/VectorLoop to call UpdateVectorData when IsRestored() is set).

Comment on lines 3140 to 3173
// Apply pending index key-to-DocId mappings before rebuilding indices
// Group mappings by shard_id for efficient dispatch
absl::flat_hash_map<uint32_t, std::vector<const PendingIndexMapping*>> mappings_by_shard;
for (const auto& pim : index_mappings) {
mappings_by_shard[pim.shard_id].push_back(&pim);
}

// Apply mappings on each shard (assuming same shard count as when snapshot was taken)
for (const auto& [shard_id, shard_mappings] : mappings_by_shard) {
if (shard_id >= shard_set->size()) {
LOG(ERROR) << "Invalid shard_id in RDB: " << shard_id << " (max: " << shard_set->size() - 1
<< "). Skipping index mappings for this shard.";
continue;
}
shard_set->Add(shard_id, [shard_mappings]() {
EngineShard* es = EngineShard::tlocal();
for (const auto* pim : shard_mappings) {
if (auto* index = es->search_indices()->GetIndex(pim->index_name); index) {
index->RestoreKeyIndex(pim->mappings);
VLOG(1) << "Restored " << pim->mappings.size() << " key mappings for index "
<< pim->index_name << " on shard " << pim->shard_id;
}
}
});
}
// Wait for all mappings to be applied
shard_set->RunBriefInParallel([](EngineShard*) {});

// Rebuild all search indices - for restored HNSW indices, this will populate vectors
shard_set->AwaitRunningOnShardQueue([](EngineShard* es) {
OpArgs op_args{es, nullptr,
DbContext{&namespaces->GetDefaultNamespace(), 0, GetCurrentTimeMs()}};
es->search_indices()->RebuildAllIndices(op_args);
});
Copilot AI Feb 11, 2026

RestoreKeyIndex() is applied just before calling RebuildAllIndices(), but ShardDocIndex::Rebuild() currently resets key_index_ (key_index_ = DocKeyIndex{}), discarding the restored key->DocId mapping. This breaks the core requirement for HNSW restoration (DocIds must match the serialized graph). Either rebuild indices while preserving the restored DocIds (iterate restored mappings and add docs with those IDs), or avoid calling the rebuild path that reassigns DocIds when restoration is enabled.

Copilot AI review requested due to automatic review settings February 12, 2026 08:50
@BorysTheDev BorysTheDev force-pushed the HNSW_index_deserialization branch from cf806b8 to c72b1f9 Compare February 12, 2026 08:50
Copilot AI left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 12 comments.

Comment on lines +2313 to +2318
CHECK(metadata);

// Restore the HNSW graph directly and mark as restored
hnsw_index->RestoreFromNodes(nodes, *metadata);

LOG(INFO) << "Restored HNSW index " << index_key << " with " << nodes.size() << " nodes";
Copilot AI Feb 12, 2026

Using CHECK(metadata) will crash the server on load if the AUX metadata is missing/invalid (or if the RDB contains vector index opcodes without matching metadata). Prefer a graceful fallback: log a warning, skip restoring this HNSW graph, and let the index rebuild from data (or return a load error) instead of aborting the process.

Suggested change
- CHECK(metadata);
-
- // Restore the HNSW graph directly and mark as restored
- hnsw_index->RestoreFromNodes(nodes, *metadata);
-
- LOG(INFO) << "Restored HNSW index " << index_key << " with " << nodes.size() << " nodes";
+ if (!metadata) {
+   LOG(WARNING) << "Missing HNSW metadata for index \"" << index_key
+                << "\" (index_name=\"" << index_name << "\", field_name=\"" << field_name
+                << "\") while restoring " << nodes.size()
+                << " HNSW nodes from RDB. Skipping graph restore; index will be rebuilt "
+                << "from base data.";
+ } else {
+   // Restore the HNSW graph directly and mark as restored
+   hnsw_index->RestoreFromNodes(nodes, *metadata);
+   LOG(INFO) << "Restored HNSW index " << index_key << " with " << nodes.size()
+             << " nodes";
+ }

Comment on lines 314 to 318
auto* ll = world_.get_linklist(internal_id, lvl);
world_.setListCount(ll, node.levels_links[lvl].size());
auto* links = reinterpret_cast<uint32_t*>(ll + 1);
std::copy(node.levels_links[lvl].begin(), node.levels_links[lvl].end(), links);
}
Copilot AI Feb 12, 2026

RestoreFromNodes copies upper-layer links without validating the link count against the per-layer capacity (maxM_). This is another potential buffer overflow with malformed input. Validate counts and fail restoration (or return an error) instead of copying blindly.

Comment on lines +81 to +82
OpArgs op_args{EngineShard::tlocal(), nullptr, db_cntx};
index_->RebuildGlobalVectorIndices(index_->base_->name, op_args);
Copilot AI Feb 12, 2026

VectorLoop calls RebuildGlobalVectorIndices directly when an index is restored, bypassing the shard FiberQueue serialization used below via shard_set->Await. This contradicts the mutex-ordering rationale in the comment (and can reintroduce deadlocks if multiple builders/updates touch multiple global vector indices concurrently). Consider dispatching the rebuild/update work through the shard queue as well, similar to the non-restored path.

Suggested change
- OpArgs op_args{EngineShard::tlocal(), nullptr, db_cntx};
- index_->RebuildGlobalVectorIndices(index_->base_->name, op_args);
+ // Serialize rebuild of global vector indices through the shard queue to preserve
+ // a single linear mutex acquisition order across all global indices.
+ auto shard_cb = [this, db_cntx] {
+   OpArgs op_args{EngineShard::tlocal(), nullptr, db_cntx};
+   index_->RebuildGlobalVectorIndices(index_->base_->name, op_args);
+ };
+ shard_set->Await(EngineShard::tlocal()->shard_id(), std::move(shard_cb));

Comment on lines +499 to +507
auto it = db_slice.FindMutable(op_args.db_cntx, key, base_->GetObjCode());
if (!it || !IsValid(it->it)) {
++missing_documents;
continue;
}

PrimeValue& pv = it->it->second;
auto doc = GetAccessor(op_args.db_cntx, pv);
GlobalDocId global_id = search::CreateGlobalDocId(EngineShard::tlocal()->shard_id(), local_id);
Copilot AI Feb 12, 2026

In the restored-index path, missing/invalid documents are counted and skipped, but their corresponding HNSW nodes remain in the restored graph. For HASH (borrowed vector) indices, those nodes still have nullptr vector pointers from RestoreFromNodes, and KNN can dereference nullptr via getDataByInternalId(). Consider removing/mark-deleting the node(s) in the global HNSW index when a document is missing/invalid to avoid crashes/ghost results.

Suggested change
- auto it = db_slice.FindMutable(op_args.db_cntx, key, base_->GetObjCode());
- if (!it || !IsValid(it->it)) {
-   ++missing_documents;
-   continue;
- }
-
- PrimeValue& pv = it->it->second;
- auto doc = GetAccessor(op_args.db_cntx, pv);
- GlobalDocId global_id = search::CreateGlobalDocId(EngineShard::tlocal()->shard_id(), local_id);
+ GlobalDocId global_id =
+     search::CreateGlobalDocId(EngineShard::tlocal()->shard_id(), local_id);
+ auto it = db_slice.FindMutable(op_args.db_cntx, key, base_->GetObjCode());
+ if (!it || !IsValid(it->it)) {
+   ++missing_documents;
+   // Ensure any existing HNSW nodes for this document are removed/marked-deleted so that
+   // restored graphs do not retain nodes with nullptr vector pointers.
+   for (const auto& [field_ident, field_info] : GetIndexedHnswFields(base_->schema)) {
+     if (auto index =
+             GlobalHnswIndexRegistry::Instance().Get(index_name, field_info.short_name);
+         index) {
+       index->MarkDeleted(global_id);
+     }
+   }
+   continue;
+ }
+
+ PrimeValue& pv = it->it->second;
+ auto doc = GetAccessor(op_args.db_cntx, pv);

Comment on lines 2227 to +2231
uint64_t elements_number;
SET_OR_RETURN(LoadLen(nullptr), elements_number);

// Only restore if flag enabled and shard count matches (GlobalDocId encodes shard_id)
bool should_restore = GetFlag(FLAGS_deserialize_hnsw_index) && shard_count_ > 0 &&
Copilot AI Feb 12, 2026

elements_number is read from the RDB and then used to reserve memory and drive loops. With a corrupted/untrusted snapshot this can cause huge allocations or very long loops (DoS/OOM). Add sanity checks before reserving/iterating (e.g. cap by metadata.max_elements, by a hard limit, and/or return rdb_file_corrupted on unreasonable values).
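
Such a sanity check can be sketched as a small validation helper. `kHardElementCap` is an illustrative limit, not a Dragonfly constant, and `max_elements == 0` here means "no metadata bound available":

```cpp
#include <cstdint>
#include <optional>

// Illustrative hard upper bound on element counts read from a snapshot.
constexpr uint64_t kHardElementCap = 1'000'000'000;

// Returns the count if it looks sane, or nullopt to signal snapshot corruption
// before any memory is reserved or any loop is driven by the value.
inline std::optional<uint64_t> ValidateElementCount(uint64_t from_rdb,
                                                    uint64_t max_elements) {
  if (from_rdb > kHardElementCap)
    return std::nullopt;  // unreasonable: treat as corruption
  if (max_elements != 0 && from_rdb > max_elements)
    return std::nullopt;  // inconsistent with index metadata
  return from_rdb;        // safe to reserve and iterate
}
```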

Comment on lines +3166 to +3184
for (const auto& [shard_id, shard_mappings] : index_mappings) {
if (shard_id >= shard_set->size()) {
LOG(ERROR) << "Invalid shard_id in RDB: " << shard_id << " (max: " << shard_set->size() - 1
<< "). Skipping index mappings for this shard.";
continue;
}
shard_set->Add(shard_id, [&shard_mappings]() {
EngineShard* es = EngineShard::tlocal();
for (const auto& pim : shard_mappings) {
if (auto* index = es->search_indices()->GetIndex(pim.index_name); index) {
index->RestoreKeyIndex(pim.mappings);
VLOG(1) << "Restored " << pim.mappings.size() << " key mappings for index "
<< pim.index_name << " on shard " << es->shard_id();
}
}
});
}
// Wait for all mappings to be applied
shard_set->RunBriefInParallel([](EngineShard*) {});
Copilot AI Feb 12, 2026

RunBriefInParallel() does not guarantee that FiberQueue work enqueued via shard_set->Add() has been drained. As written, the code may proceed to RebuildAllIndices while mapping-restore tasks are still pending, leading to races and inconsistent key_index_ state. Use a proper barrier on the shard FiberQueues (e.g. AwaitRunningOnShardQueue / per-shard sentinel callbacks with a BlockingCounter) to ensure all Add() tasks completed before continuing.

Suggested change
- for (const auto& [shard_id, shard_mappings] : index_mappings) {
-   if (shard_id >= shard_set->size()) {
-     LOG(ERROR) << "Invalid shard_id in RDB: " << shard_id << " (max: " << shard_set->size() - 1
-                << "). Skipping index mappings for this shard.";
-     continue;
-   }
-   shard_set->Add(shard_id, [&shard_mappings]() {
-     EngineShard* es = EngineShard::tlocal();
-     for (const auto& pim : shard_mappings) {
-       if (auto* index = es->search_indices()->GetIndex(pim.index_name); index) {
-         index->RestoreKeyIndex(pim.mappings);
-         VLOG(1) << "Restored " << pim.mappings.size() << " key mappings for index "
-                 << pim.index_name << " on shard " << es->shard_id();
-       }
-     }
-   });
- }
- // Wait for all mappings to be applied
- shard_set->RunBriefInParallel([](EngineShard*) {});
+ shard_set->AwaitRunningOnShardQueue([&index_mappings, shard_set](EngineShard* es) {
+   const uint32_t shard_id = es->shard_id();
+   if (shard_id >= shard_set->size()) {
+     LOG(ERROR) << "Invalid shard_id in RDB: " << shard_id << " (max: " << shard_set->size() - 1
+                << "). Skipping index mappings for this shard.";
+     return;
+   }
+   auto it = index_mappings.find(shard_id);
+   if (it == index_mappings.end()) {
+     return;
+   }
+   const auto& shard_mappings = it->second;
+   for (const auto& pim : shard_mappings) {
+     if (auto* index = es->search_indices()->GetIndex(pim.index_name); index) {
+       index->RestoreKeyIndex(pim.mappings);
+       VLOG(1) << "Restored " << pim.mappings.size() << " key mappings for index "
+               << pim.index_name << " on shard " << es->shard_id();
+     }
+   }
+ });

Comment on lines 304 to 308
if (!node.levels_links.empty()) {
auto* ll0 = world_.get_linklist0(internal_id);
world_.setListCount(ll0, node.levels_links[0].size());
auto* links0 = reinterpret_cast<uint32_t*>(ll0 + 1);
std::copy(node.levels_links[0].begin(), node.levels_links[0].end(), links0);
Copilot AI Feb 12, 2026

RestoreFromNodes copies level-0 links without validating the link count against the fixed linklist capacity (maxM0_) for the index. With corrupted snapshots this can overflow the linklist buffer and corrupt memory. Add bounds checks (and treat violations as snapshot corruption) before setListCount()/copy.

Comment on lines +269 to +285
// Find max doc_id to size the keys_ vector appropriately
DocId max_id = 0;
for (const auto& [key, doc_id] : mappings) {
max_id = std::max(max_id, doc_id);
}

// Resize keys_ to accommodate all doc_ids
keys_.resize(max_id + 1);
last_id_ = max_id + 1;

// Restore the mappings
for (const auto& [key, doc_id] : mappings) {
keys_[doc_id] = key;
ids_[key] = doc_id;
}

// Build free_ids_ list for any gaps in the id sequence
Copilot AI Feb 12, 2026

DocKeyIndex::Restore computes max_id starting from 0 and then unconditionally does keys_.resize(max_id + 1). If mappings is empty, this leaves keys_ sized to 1 and last_id_=1, which makes subsequent Add() start at id=1 and leaves id=0 as a permanent gap. Handle the empty-mappings case explicitly (clear state, keep keys_ empty, last_id_=0).

Suggested change
- // Find max doc_id to size the keys_ vector appropriately
- DocId max_id = 0;
- for (const auto& [key, doc_id] : mappings) {
-   max_id = std::max(max_id, doc_id);
- }
-
- // Resize keys_ to accommodate all doc_ids
- keys_.resize(max_id + 1);
- last_id_ = max_id + 1;
-
- // Restore the mappings
- for (const auto& [key, doc_id] : mappings) {
-   keys_[doc_id] = key;
-   ids_[key] = doc_id;
- }
-
- // Build free_ids_ list for any gaps in the id sequence
+ // Reset current state before restoring.
+ keys_.clear();
+ ids_.clear();
+ free_ids_.clear();
+ last_id_ = 0;
+
+ // If there are no mappings to restore, keep the index empty.
+ if (mappings.empty()) {
+   return;
+ }
+
+ // Find max doc_id to size the keys_ vector appropriately.
+ DocId max_id = 0;
+ for (const auto& [key, doc_id] : mappings) {
+   max_id = std::max(max_id, doc_id);
+ }
+
+ // Resize keys_ to accommodate all doc_ids.
+ keys_.resize(max_id + 1);
+ last_id_ = max_id + 1;
+
+ // Restore the mappings.
+ for (const auto& [key, doc_id] : mappings) {
+   keys_[doc_id] = key;
+   ids_[key] = doc_id;
+ }
+
+ // Build free_ids_ list for any gaps in the id sequence.

Comment on lines +279 to +283
```cpp
// Restore the mappings
for (const auto& [key, doc_id] : mappings) {
  keys_[doc_id] = key;
  ids_[key] = doc_id;
}
```
Copilot AI Feb 12, 2026

DocKeyIndex::Restore does not clear existing ids_/free_ids_/keys_ before applying mappings, and it never resets free_ids_ prior to pushing gaps. If Restore() is called on a non-empty instance (e.g. due to duplicate AUX/restore flow), this will leave stale entries and duplicate free_ids_. Consider starting Restore() with a full reset (ids_.clear(), keys_.clear(), free_ids_.clear(), last_id_=0) and then rebuilding from mappings.

@BorysTheDev BorysTheDev force-pushed the HNSW_index_deserialization branch from c72b1f9 to fb1648f Compare February 12, 2026 14:41
Copilot AI review requested due to automatic review settings February 12, 2026 19:18
@BorysTheDev BorysTheDev force-pushed the HNSW_index_deserialization branch from fb1648f to adda422 Compare February 12, 2026 19:18
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 7 comments.

Comments suppressed due to low confidence (1)

src/server/rdb_save.cc:1456

  • CollectSearchIndices only emits HNSW metadata for the first vector field, but SliceSnapshot::SerializeGlobalHnswIndices() serializes all global HNSW indices (one per index_name:field_name). On load, vector-index restoration expects metadata per serialized index_key (currently CHECK(metadata)), so additional vector fields will crash or skip restore. Emit metadata for every vector field (or include metadata alongside the VECTOR_INDEX opcode) and align loader expectations accordingly.
```cpp
// Collect HNSW metadata for vector field (first one found), for now we don't support multiple
// vector fields per index serialization
for (const auto& [fident, finfo] : index_info.base_index.schema.fields) {
  if (finfo.type == search::SchemaField::VECTOR &&
      !(finfo.flags & search::SchemaField::NOINDEX)) {
    if (auto hnsw_index = GlobalHnswIndexRegistry::Instance().Get(index_name, finfo.short_name);
        hnsw_index) {
      auto meta = hnsw_index->GetMetadata();
      TmpJson meta_json;
      meta_json["index_name"] = index_name;
      meta_json["field_name"] = finfo.short_name;
      meta_json["max_elements"] = meta.max_elements;
      meta_json["cur_element_count"] = meta.cur_element_count;
      meta_json["maxlevel"] = meta.maxlevel;
      meta_json["enterpoint_node"] = meta.enterpoint_node;
      hnsw_index_metadata->emplace_back(meta_json.to_string());
      break;
    }
```

Comment on lines +3148 to 3169
```cpp
    shard_set->Add(shard_id, [&shard_mappings]() {
      EngineShard* es = EngineShard::tlocal();
      for (const auto& pim : shard_mappings) {
        if (auto* index = es->search_indices()->GetIndex(pim.index_name); index) {
          index->RestoreKeyIndex(pim.mappings);
          VLOG(1) << "Restored " << pim.mappings.size() << " key mappings for index "
                  << pim.index_name << " on shard " << es->shard_id();
        }
      }
    });
  }
  // Wait for all mappings to be applied
  shard_set->RunBriefInParallel([](EngineShard*) {});
}

// Rebuild all search indices - for restored HNSW indices, VectorLoop will
// use UpdateVectorData instead of Add
shard_set->AwaitRunningOnShardQueue([](EngineShard* es) {
  OpArgs op_args{es, nullptr,
                 DbContext{&namespaces->GetDefaultNamespace(), 0, GetCurrentTimeMs()}};
  es->search_indices()->RebuildAllIndices(op_args);
});
```
Copilot AI Feb 12, 2026

RestoreKeyIndex() is applied before RebuildAllIndices(), but ShardDocIndex::Rebuild() currently resets key_index_ to a fresh DocKeyIndex{} (src/server/search/doc_index.cc:324). That discards the restored key→DocId mapping, so restored HNSW graphs (which are keyed by GlobalDocId) won't match the DocIds assigned during rebuild. The rebuild path needs to preserve the restored DocIds (e.g., rebuild indices by iterating the restored mapping and adding docs with fixed ids) rather than reassigning them.

Comment on lines +3171 to +3175
```cpp
// Clear restored flags on all HNSW indices now that all shards finished
for (const auto& [key, index] : GlobalHnswIndexRegistry::Instance().GetAll()) {
  index->SetRestored(false);
}
```
Copilot AI Feb 12, 2026

SetRestored(false) is executed immediately after calling RebuildAllIndices(), but rebuild starts async builders and returns right away. This can clear the restored flag before IndexBuilder::VectorLoop() runs, causing restored graphs to be treated as non-restored (rebuilding instead of UpdateVectorData) and also races with concurrent reads of IsRestored(). Defer clearing until restoration/indexing has completed (or clear per-index when its rebuild finished).

Suggested change — remove these lines:

```cpp
// Clear restored flags on all HNSW indices now that all shards finished
for (const auto& [key, index] : GlobalHnswIndexRegistry::Instance().GetAll()) {
  index->SetRestored(false);
}
```


```diff
 void RdbLoader::LoadSearchIndexDefFromAux(string&& def) {
-  LoadSearchCommandFromAux(service_, std::move(def), "FT.CREATE", "index definition");
+  LoadSearchCommandFromAux(service_, std::move(def), "FT.CREATE", "index definition", true);
```
Copilot AI Feb 12, 2026

LoadSearchIndexDefFromAux now ignores all FT.CREATE dispatch errors. This can hide real failures (e.g., bad schema/options), leaving the system without an index while later code assumes it exists for restoration. If this is only to tolerate duplicate FT.CREATE across shard files, consider restoring the previous “create-once” guard or selectively ignoring only the expected "Index already exists" error while still logging/handling other errors.

Suggested change — replace

```cpp
LoadSearchCommandFromAux(service_, std::move(def), "FT.CREATE", "index definition", true);
```

with:

```cpp
// Avoid issuing duplicate FT.CREATE commands across shard files by guarding on index name.
// This allows us to treat all other FT.CREATE errors as real failures instead of ignoring them.
bool should_create = true;
// Try to extract the index name from the FT.CREATE definition: "FT.CREATE <index> ...".
// We only use this guard when the string starts with FT.CREATE and has at least two tokens.
std::vector<absl::string_view> parts =
    absl::StrSplit(def, absl::ByAnyChar(" \t\r\n"), absl::SkipWhitespace());
if (parts.size() >= 2 && parts[0] == "FT.CREATE") {
  std::string index_name(parts[1]);
  {
    std::lock_guard lk(search_index_mu_);
    auto [it, inserted] = created_search_indices_.insert(index_name);
    should_create = inserted;
  }
}
if (!should_create) {
  // Index was already created from another shard's aux data; skip duplicate FT.CREATE.
  return;
}
LoadSearchCommandFromAux(service_, std::move(def), "FT.CREATE", "index definition",
                         /*ignore_errors=*/false);
```
Comment on lines 2335 to 2341
```cpp
for (uint64_t i = 0; i < mapping_count; ++i) {
  string key;
  SET_OR_RETURN(FetchGenericString(), key);
  uint64_t doc_id;
  SET_OR_RETURN(LoadLen(nullptr), doc_id);
  pim.mappings.emplace_back(std::move(key), static_cast<search::DocId>(doc_id));
}
```
Copilot AI Feb 12, 2026

doc_id is loaded as uint64_t and then cast to search::DocId without bounds checking. A malformed/hostile RDB could wrap/truncate IDs and/or make DocKeyIndex::Restore allocate an enormous keys_ vector (OOM). Validate doc_id <= std::numeric_limits<search::DocId>::max() (and consider also capping max_id / mapping_count) before storing/applying mappings.

Comment on lines +510 to +523
```cpp
for (const auto& [field_ident, field_info] : GetIndexedHnswFields(base_->schema)) {
  if (auto index = GlobalHnswIndexRegistry::Instance().Get(index_name, field_info.short_name);
      index) {
    bool success = index->UpdateVectorData(global_id, *doc, field_ident);
    if (success) {
      ++successful_updates;
      if (!index->IsVectorCopied()) {
        pv.SetOmitDefrag(true);
      }
    } else {
      ++failed_updates;
    }
  }
}
```
Copilot AI Feb 12, 2026

RebuildGlobalVectorIndicesByIndexKeys calls UpdateVectorData for every indexed HNSW field as long as the index exists, even if that particular HnswVectorIndex wasn’t restored (no graph nodes). In that case UpdateVectorData won’t populate anything (label not found) but the current code still counts it as success. Consider branching per-field: if index->IsRestored() use UpdateVectorData, otherwise fall back to Add (or rebuild-by-DB for that field).

Comment on lines +2251 to +2254
```cpp
std::vector<search::HnswNodeData> nodes;
if (should_restore) {
  nodes.reserve(elements_number);
}
```
Copilot AI Feb 12, 2026

When restoration is enabled, nodes.reserve(elements_number) trusts elements_number from the RDB stream. A malformed/hostile snapshot can force huge allocations (and additional allocations from level/links_num). Add sanity limits/validation for these values and fail the load or skip restore safely when limits are exceeded.

Comment on lines +81 to 97
```cpp
  // Mark index as restored from RDB (should use UpdateVectorData instead of Add)
  void SetRestored(bool restored) {
    restored_ = restored;
  }

  bool IsRestored() const {
    return restored_;
  }

 private:
  bool copy_vector_;
  size_t dim_;
  VectorSimilarity sim_;
  std::unique_ptr<HnswlibAdapter> adapter_;

  bool restored_ = false;
};
```
Copilot AI Feb 12, 2026

restored_ is a plain bool that’s written from the loader (RestoreFromNodes/SetRestored) and read from shard builders (IsRestored) across multiple shard threads. This introduces a data race. Make it std::atomic_bool (or guard reads/writes with an existing mutex) and keep the read/write pattern consistent.
