@Thejas-bhat Thejas-bhat commented Jun 17, 2025

  • The main purpose of this PR is to avoid unnecessary re-training of the vector indexes during the merge process.
  • Going by the numbers, we need roughly 156K training vectors for a 1M-vector dataset (min_num_vectors_per_centroid * num_centroids = 39 * (4 * sqrt(1M)) = 156K), as per the recommendation.
  • Data ingestion is now split into two phases. The first phase builds a centroid index using the Train() API, and the bolt store records progress in terms of the number of samples trained on. The second phase is the normal indexing of data through the Batch() or Index() APIs.
  • Later, when the vector indexes are merged, the merger uses the centroid index to merge the inverted lists (one list per centroid) in a block-wise fashion, without re-training or reconstructing the layout.
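The sizing arithmetic in the second bullet can be sketched as a small helper. This is purely illustrative: the function name and parameters are not from the PR, only the constants (39 vectors per centroid, 4 * sqrt(N) centroids) are.

```python
import math

def training_sample_size(num_vectors: int,
                         min_vectors_per_centroid: int = 39,
                         centroid_factor: int = 4) -> int:
    """Recommended training-sample size for an IVF-style vector index.

    The number of centroids is taken as centroid_factor * sqrt(num_vectors),
    and each centroid wants at least min_vectors_per_centroid training samples.
    """
    num_centroids = centroid_factor * math.isqrt(num_vectors)
    return min_vectors_per_centroid * num_centroids

print(training_sample_size(1_000_000))  # 39 * (4 * 1000) = 156000
```

For the 1M-vector dataset in the bullet this yields 156,000 training vectors, matching the ~156K figure quoted above.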
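The two-phase ingestion described above can be sketched in miniature: phase one trains centroids on a sample (here a plain k-means loop stands in for the real Train() API), and phase two assigns incoming vectors to the trained centroids, building one inverted list per centroid. All names here are hypothetical; the actual implementation lives in the Go index and its FAISS bindings.

```python
import random

def _nearest(centroids, v):
    # Index of the centroid closest to v (squared Euclidean distance).
    return min(range(len(centroids)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(centroids[i], v)))

def _mean(vecs):
    return [sum(xs) / len(vecs) for xs in zip(*vecs)]

def train_centroids(sample, k, iters=10):
    """Phase 1 (stand-in for Train()): learn k centroids from a sample."""
    centroids = random.sample(sample, k)
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in sample:
            buckets[_nearest(centroids, v)].append(v)
        centroids = [_mean(b) if b else c for b, c in zip(buckets, centroids)]
    return centroids

def index_vectors(centroids, vectors):
    """Phase 2 (stand-in for Batch()/Index()): build inverted lists by
    assigning each vector to its nearest trained centroid."""
    lists = {i: [] for i in range(len(centroids))}
    for doc_id, v in enumerate(vectors):
        lists[_nearest(centroids, v)].append(doc_id)
    return lists

random.seed(0)
data = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
centroids = train_centroids(data, k=2)
inverted_lists = index_vectors(centroids, data)
print(inverted_lists)
```

The point of the split is that the (expensive) phase-one training happens once, and the centroids it produces are reused for all subsequent indexing and, per the last bullet, for merging.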
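The block-wise merge in the last bullet can be illustrated with a toy sketch: because every segment was indexed against the same trained centroids, inverted list i in one segment corresponds to inverted list i in every other, so merging reduces to concatenating per-centroid posting lists with a doc-id offset, with no re-training. The data layout here is invented for illustration and is not the zap file format.

```python
def merge_inverted_lists(segments):
    """Merge per-centroid posting lists from multiple segments.

    Each segment is a pair (lists, num_docs), where lists maps a
    centroid id to a list of segment-local doc ids. Doc ids from later
    segments are shifted by the running doc count of earlier segments.
    """
    merged = {}
    base = 0
    for lists, num_docs in segments:
        for cid, docs in lists.items():
            merged.setdefault(cid, []).extend(d + base for d in docs)
        base += num_docs
    return merged

seg_a = ({0: [0, 2], 1: [1]}, 3)  # 3 docs in segment A
seg_b = ({0: [0], 1: [1, 2]}, 3)  # 3 docs in segment B
merged = merge_inverted_lists([seg_a, seg_b])
print(merged)  # {0: [0, 2, 3], 1: [1, 4, 5]}
```

Without a shared centroid index, the merger would have to re-train on the combined data and re-assign every vector; with it, the merge is a linear pass over the lists.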

@Thejas-bhat Thejas-bhat changed the title WIP fast merge [WIP] MB-62182: Avoid re-training vector indexes during merge Jan 15, 2026