@Thejas-bhat Thejas-bhat commented Jun 17, 2025

  • The main purpose of this PR is to avoid unnecessary re-training of the vector indexes during the merge process.
  • Going by the numbers, we need roughly 156K training vectors for a 1M-vector dataset (min_num_vectors_per_centroid * num_centroids = 39 * (4 * sqrt(1M)) = 156K), as per the recommendation.
  • Data ingestion is now split into two phases. The first phase builds a centroid index using the Train() API, and the bolt store records progress in terms of the number of samples trained on. The second phase is the normal indexing of data through the Batch() or Index() APIs.
  • Later, when the vector indexes are merged, the merger uses the centroid index to merge the inverted lists (one list per centroid) in a block-wise fashion, without re-training or reconstructing the layout.
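The sizing arithmetic in the second bullet can be sketched as a small helper. This is purely illustrative: the function name and parameters are not from the PR, only the constants (39 vectors per centroid, 4 * sqrt(N) centroids) are.

```python
import math

def training_sample_size(num_vectors: int,
                         min_vectors_per_centroid: int = 39,
                         centroid_factor: int = 4) -> int:
    """Recommended training-sample size for an IVF-style vector index.

    The number of centroids is taken as centroid_factor * sqrt(num_vectors),
    and each centroid wants at least min_vectors_per_centroid training samples.
    """
    num_centroids = centroid_factor * math.isqrt(num_vectors)
    return min_vectors_per_centroid * num_centroids

print(training_sample_size(1_000_000))  # 39 * (4 * 1000) = 156000
```

For the 1M-vector dataset in the bullet this yields 156,000 training vectors, matching the ~156K figure quoted above.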
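The two-phase ingestion described above can be sketched in miniature: phase one trains centroids on a sample (here a plain k-means loop stands in for the real Train() API), and phase two assigns incoming vectors to the trained centroids, building one inverted list per centroid. All names here are hypothetical; the actual implementation lives in the Go index and its FAISS bindings.

```python
import random

def _nearest(centroids, v):
    # Index of the centroid closest to v (squared Euclidean distance).
    return min(range(len(centroids)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(centroids[i], v)))

def _mean(vecs):
    return [sum(xs) / len(vecs) for xs in zip(*vecs)]

def train_centroids(sample, k, iters=10):
    """Phase 1 (stand-in for Train()): learn k centroids from a sample."""
    centroids = random.sample(sample, k)
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in sample:
            buckets[_nearest(centroids, v)].append(v)
        centroids = [_mean(b) if b else c for b, c in zip(buckets, centroids)]
    return centroids

def index_vectors(centroids, vectors):
    """Phase 2 (stand-in for Batch()/Index()): build inverted lists by
    assigning each vector to its nearest trained centroid."""
    lists = {i: [] for i in range(len(centroids))}
    for doc_id, v in enumerate(vectors):
        lists[_nearest(centroids, v)].append(doc_id)
    return lists

random.seed(0)
data = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
centroids = train_centroids(data, k=2)
inverted_lists = index_vectors(centroids, data)
print(inverted_lists)
```

The point of the split is that the (expensive) phase-one training happens once, and the centroids it produces are reused for all subsequent indexing and, per the last bullet, for merging.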
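The block-wise merge in the last bullet can be illustrated with a toy sketch: because every segment was indexed against the same trained centroids, inverted list i in one segment corresponds to inverted list i in every other, so merging reduces to concatenating per-centroid posting lists with a doc-id offset, with no re-training. The data layout here is invented for illustration and is not the zap file format.

```python
def merge_inverted_lists(segments):
    """Merge per-centroid posting lists from multiple segments.

    Each segment is a pair (lists, num_docs), where lists maps a
    centroid id to a list of segment-local doc ids. Doc ids from later
    segments are shifted by the running doc count of earlier segments.
    """
    merged = {}
    base = 0
    for lists, num_docs in segments:
        for cid, docs in lists.items():
            merged.setdefault(cid, []).extend(d + base for d in docs)
        base += num_docs
    return merged

seg_a = ({0: [0, 2], 1: [1]}, 3)  # 3 docs in segment A
seg_b = ({0: [0], 1: [1, 2]}, 3)  # 3 docs in segment B
merged = merge_inverted_lists([seg_a, seg_b])
print(merged)  # {0: [0, 2, 3], 1: [1, 4, 5]}
```

Without a shared centroid index, the merger would have to re-train on the combined data and re-assign every vector; with it, the merge is a linear pass over the lists.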

@Thejas-bhat Thejas-bhat changed the title WIP fast merge [WIP] MB-62182: Avoid re-training vector indexes during merge Jan 15, 2026