Conversation

@Pulkitg64
Contributor

@Pulkitg64 Pulkitg64 commented Jul 29, 2025

Description

This is a draft PR to optimize HNSW graph merging during singleton merges. When merging a single segment with deletions, the current implementation reconstructs the entire graph with only live nodes, which is a time-consuming process. This PR avoids full graph reconstruction by dropping deleted nodes and renumbering the remaining live nodes.
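For illustration only, here is a minimal sketch of the drop-and-renumber idea using plain arrays rather than the actual Lucene graph classes (`live` and `oldNeighbors` are hypothetical stand-ins for the segment's live docs and the on-disk adjacency lists):

```java
// Sketch: renumber surviving graph ordinals and drop edges that point at deleted ones.
// oldNeighbors[ord] is ord's neighbor list; live[ord] says whether that ordinal survives.
static int[][] pruneAndRenumber(int[][] oldNeighbors, boolean[] live) {
  int[] oldToNew = new int[oldNeighbors.length];
  int next = 0;
  for (int ord = 0; ord < oldNeighbors.length; ord++) {
    oldToNew[ord] = live[ord] ? next++ : -1; // -1 marks a deleted ordinal
  }
  int[][] pruned = new int[next][];
  for (int ord = 0; ord < oldNeighbors.length; ord++) {
    if (!live[ord]) continue;
    int count = 0;
    int[] buf = new int[oldNeighbors[ord].length];
    for (int n : oldNeighbors[ord]) {
      if (live[n]) buf[count++] = oldToNew[n]; // keep only live neighbors, remapped
    }
    pruned[oldToNew[ord]] = java.util.Arrays.copyOf(buf, count);
  }
  return pruned;
}
```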

TODOs:

  • Add specific unit tests
  • Benchmarks (luceneutil)

@Pulkitg64 Pulkitg64 changed the title Avoid reconstructing HNSW graph during singleton merging Avoid reconstructing HNSW graph during singleton merges Jul 29, 2025
@github-actions
Contributor

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

@jpountz
Contributor

jpountz commented Aug 1, 2025

I don't feel qualified to do the review, but I agree with the motivation. I wonder if this optimization could also be applied when there is more than one segment to merge, by first applying deletions on the biggest segment and then adding vectors from the other segments?

Contributor

@msokolov msokolov left a comment

this seems like a promising direction! I left a bunch of comments. My main one is about whether we should do this on-heap to make it more flexible (eg so we could use it when merging multiple graphs, too).


private static final long SHALLOW_RAM_BYTES_USED =
RamUsageEstimator.shallowSizeOfInstance(Lucene99HnswVectorsWriter.class);
static final int DELETE_THRESHOLD_PERCENT = 30;
Contributor

I'm curious if we have done any testing to motivate this choice? I guess as the number of gaps in the neighborhoods left behind by removing the deleted nodes in the graph increases we would expect to see a drop-off in recall, or maybe performance? but I don't have a good intuition about whether there is a knee in the curve, or how strong the effect is

* @throws IOException If an error occurs while writing to the vector index
*/
private HnswGraph deleteNodesWriteGraph(
Lucene99HnswVectorsReader.OffHeapHnswGraph graph,
Contributor

Could we change the signature to accept an HnswGraph?

// Count and collect valid nodes
int validNodeCount = 0;
for (int node : sortedNodes) {
if (docMap.get(node) != -1) {
Contributor

we might be able to pass in the size of the new graph? At least in the main case of merging we should know (I think?)

}

// Special case for top level with no valid nodes
if (level == numLevels - 1 && validNodeCount == 0 && level > 0) {
Contributor

if level == 0 and validNodeCount == 0, the new graph should be empty. I'm not sure how that case will get handled here?

Contributor

In this case though (the top level would be empty) -- isn't it also possible that a lower level is empty?

Contributor Author

if level == 0 and validNodeCount == 0, the new graph should be empty. I'm not sure how that case will get handled here?

This means 100% of the nodes are deleted, right? I think we will never reach this case, as the entry condition to this function checks that deletes are less than 30%.

validNodeCount = 1; // We'll create one connection to lower level
}

validNodesPerLevel[level] = new int[validNodeCount];
Contributor

I wonder if we could avoid the up-front counting, allocate a full-sized array and then use only the part of it that we fill up
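A tiny sketch of that allocate-then-trim idea (illustrative only; `sortedNodes` and `oldToNew` are stand-ins for the PR's variables):

```java
// Single pass over a level: allocate for the worst case, then trim with Arrays.copyOf.
static int[] collectLiveNodes(int[] sortedNodes, int[] oldToNew) {
  int[] buf = new int[sortedNodes.length]; // full-sized scratch array, no up-front counting
  int count = 0;
  for (int node : sortedNodes) {
    if (oldToNew[node] != -1) {            // -1 marks a deleted ordinal
      buf[count++] = node;
    }
  }
  return java.util.Arrays.copyOf(buf, count); // only the filled prefix is kept
}
```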

Contributor Author

Sure

Math.toIntExact(vectorIndex.getFilePointer() - offsetStart);
}

// Special case for empty top level
Contributor

Perhaps we should special-case the first empty level we find and make that the top level, unless it is the bottom level, in which case the whole graph is empty.


/** Writes neighbors with delta encoding to the vector index. */
private void writeNeighbors(
Lucene99HnswVectorsReader.OffHeapHnswGraph graph,
Contributor

can we delegate to an existing method (maybe with a refactor) to ensure we write in the same format? EG what if we switch to GroupVarInt encoding - we want to make sure this method tracks that change

Contributor Author

Sure.

try {
long vectorIndexOffset = vectorIndex.getFilePointer();

if (mergeState.liveDocs.length == 1
Contributor

have you seen IncrementalHnswGraphMerge and MergingHnswGraphBuilder? They select the biggest graph with no deletions and merge the other segments' graphs into it. Could we expose a utility method here for rewriting a graph (in memory) to drop deletions, and then use it there?

Here we are somewhat mixing the on-disk graph format with the logic of dropping deleted nodes, which I think we could abstract out into the util.hnsw realm?

Contributor Author

Just saw that class. I think this is a good idea. Will do it in next revision.

@Pulkitg64
Contributor Author

I wonder if this optimization could be applied when there are more than 1 segment to merge by first applying deletions on the bigger segment to merge and then adding vectors from other segments?

@jpountz Yes, good idea. Let me try doing that in this PR itself.

this seems like a promising direction! I left a bunch of comments. My main one is about whether we should do this on-heap to make it more flexible (eg so we could use it when merging multiple graphs, too).

Thanks @msokolov. Yes, I think that would be the best way forward for this optimization. Working on it.


// Process nodes at this level
for (int node : sortedNodes) {
if (docMap.get(node) == -1) {
Contributor Author

@Pulkitg64 Pulkitg64 Aug 5, 2025

This is incorrect: the graph does not store docIDs, it stores ordinals, whereas docMap maps old docIDs to new docIDs. The correct implementation is to create a map from old ordinals to new ordinals.

Will fix this in the next revision.
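To make the distinction concrete, a hedged sketch of the intended ord map (plain arrays and a functional stand-in for MergeState's doc map; `ordToDoc` represents the reader's ordinal-to-docID lookup):

```java
// Graph nodes are ordinals, so liveness must go ordinal -> old docID -> merged docID,
// and the renumbering itself is an old-ordinal -> new-ordinal map.
static int[] buildOrdMap(int[] ordToDoc, java.util.function.IntUnaryOperator docMap) {
  int[] oldOrdToNewOrd = new int[ordToDoc.length];
  int newOrd = 0;
  for (int oldOrd = 0; oldOrd < ordToDoc.length; oldOrd++) {
    int mappedDoc = docMap.applyAsInt(ordToDoc[oldOrd]); // -1 means the doc was deleted
    oldOrdToNewOrd[oldOrd] = (mappedDoc == -1) ? -1 : newOrd++;
  }
  return oldOrdToNewOrd; // index this with graph ordinals, never with docIDs
}
```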


@Pulkitg64
Contributor Author

Pulkitg64 commented Aug 11, 2025

The failing test runs fine on my macOS desktop, and I have not changed anything in the related classes. Even with the same failing seed I am unable to reproduce the issue. Not sure why this test is failing in CI.

TestBPReorderingMergePolicy > testReorderOnAddIndexes FAILED
    java.lang.AssertionError: Called on the wrong instance
        at org.apache.lucene.test_framework@11.0.0-SNAPSHOT/org.apache.lucene.tests.codecs.asserting.AssertingKnnVectorsFormat$AssertingKnnVectorsReader.getFloatVectorValues(AssertingKnnVectorsFormat.java:140)
        at org.apache.lucene.core@11.0.0-SNAPSHOT/org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsReader.getFloatVectorValues(PerFieldKnnVectorsFormat.java:289)
        at org.apache.lucene.core@11.0.0-SNAPSHOT/org.apache.lucene.index.CodecReader.getFloatVectorValues(CodecReader.java:244)
        at org.apache.lucene.core@11.0.0-SNAPSHOT/org.apache.lucene.index.SlowCompositeCodecReaderWrapper$SlowCompositeKnnVectorsReaderWrapper.getFloatVectorValues(SlowCompositeCodecReaderWrapper.java:842)
        at org.apache.lucene.core@11.0.0-SNAPSHOT/org.apache.lucene.index.CodecReader.getFloatVectorValues(CodecReader.java:244)
        at org.apache.lucene.misc.index.BpVectorReorderer.computeDocMap(BpVectorReorderer.java:590)
        at org.apache.lucene.misc.index.BPReorderingMergePolicy$1.reorder(BPReorderingMergePolicy.java:138)
        at org.apache.lucene.core@11.0.0-SNAPSHOT/org.apache.lucene.index.IndexWriter.addIndexesReaderMerge(IndexWriter.java:3426)
        at org.apache.lucene.core@11.0.0-SNAPSHOT/org.apache.lucene.index.IndexWriter$AddIndexesMergeSource.merge(IndexWriter.java:3334)
        at org.apache.lucene.core@11.0.0-SNAPSHOT/org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:664)
        at org.apache.lucene.core@11.0.0-SNAPSHOT/org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:726)

@Pulkitg64 Pulkitg64 changed the title Avoid reconstructing HNSW graph during singleton merges Avoid reconstructing HNSW graphs during segment merging. Aug 11, 2025
@Pulkitg64
Contributor Author

I created a pull request against my own repository, where the tests pass: Pulkitg64#1

The failure looks transient; the next commit should resolve it.

@Pulkitg64
Contributor Author

Pulkitg64 commented Aug 13, 2025

Adding some KnnPerfTest results where I simulated deletes while indexing docs. We are seeing consistent improvement in indexing time and indexing rate (except for one odd case at 40% deletes) without impacting recall.

Num Docs: 1MM
Max-Conn: 32
Beam-Width: 250
Quantize Bits: 32
Topk: 100

| % Deletes | Baseline Recall | Baseline Indexing Time (s) | Baseline Indexing Rate (docs/s) | Candidate Recall | Candidate Indexing Time (s) | Candidate Indexing Rate (docs/s) | Indexing Time Change | Indexing Rate Change |
|---|---|---|---|---|---|---|---|---|
| 25 | 0.952 | 692 | 1443 | 0.955 | 576 | 1734 | -17% | 20% |
| 30 | 0.952 | 581 | 1719 | 0.958 | 517 | 1932 | -11% | 12% |
| 40 | 0.951 | 560 | 1782 | 0.945 | 553 | 1805 | -1% | 1% |
| 50 | 0.96 | 446 | 2241 | 0.953 | 421 | 2371 | -6% | 6% |
| 60 | 0.974 | 234 | 4265 | 0.972 | 208 | 4804 | -11% | 13% |

@msokolov
Contributor

I am confused! This PR suddenly got so much simpler, which is great, but I feel like it dropped a few things that seemed important. EG we are no longer checking the largest graph to see if its delete % is below a threshold? Also I think we are now ignoring the various edge cases around upper-level graph layers possibly becoming empty?

@Pulkitg64
Contributor Author

With MaxConn = 16 I am seeing much better results. But in one odd case, with 25% deletes, I am seeing a regression in indexing rate. Trying maxConn=8 in the next benchmark run.

| % Deletes | Baseline Recall | Baseline Indexing Time (s) | Baseline Indexing Rate (docs/s) | Candidate Recall | Candidate Indexing Time (s) | Candidate Indexing Rate (docs/s) | Indexing Time Change | Indexing Rate Change |
|---|---|---|---|---|---|---|---|---|
| 25 | 0.922 | 453 | 2205 | 0.914 | 484 | 2063 | 7% | -6% |
| 30 | 0.918 | 470 | 2125 | 0.94 | 279 | 3581 | -41% | 69% |
| 40 | 0.903 | 494 | 2022 | 0.942 | 258 | 3867 | -48% | 91% |
| 50 | 0.915 | 421 | 2372 | 0.946 | 223 | 4466 | -47% | 88% |
| 60 | 0.934 | 301 | 3303 | 0.947 | 214 | 4658 | -29% | 41% |

@benwtrent
Member

@Pulkitg64 what exactly are you benchmarking? It seems like the latest version of this PR does nothing to actually correct the graph nodes?

We should handle:

  • If layers get completely removed (do we promote new nodes?)
  • Removing deleted nodes and reconnecting their neighbors to the nearest non-deleted nodes
  • Completely throwing away the graph if the deletion percentage is above a certain threshold (the first commit of this PR had that at 30%; I think it can maybe be as high as 50%).

@benwtrent
Member

Ah, maybe I don't fully grok the current impl. It seems like it's doing the "largest graph" thing, but now it's more clever and doing the initialized-graph thing, and that is where the deletes are being removed?

@Pulkitg64
Contributor Author

I am confused! This PR suddenly got so much simpler, which is great,

Yeah, the initGraph implementation in InitializedHnswGraphBuilder.java simplifies a lot of things for us, as it already supports creating an OnHeapHnswGraph from the off-heap graph of the older segment.

we are no longer checking the largest graph to see if its delete % is below a threshold?

Yes, in the first revision I added an arbitrary percentage without doing any testing. But this time I wanted to see the impact of the merge policy that kicks in when the delete % is higher than a certain threshold. I thought we may not need an explicit check of the largest graph's delete % because the merge policy will automatically take care of this.

Also I think we are now ignoring the various edge cases around upper-level graph layers possibly becoming empty?

The initGraph implementation takes care of it. We start with the top level, and if there is no live node in that level the new entry node is never set; when we iterate to the next level and it has some live nodes, we set the new entry node there. In this way we remove the risk of an empty upper layer in the graph. But there is still a risk of a middle layer being completely deleted, which I believe we need to take care of.
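Roughly, the entry-point handling described above amounts to something like this (a plain-array sketch, not the actual InitializedHnswGraphBuilder code):

```java
// Walk levels from the top down; the first live node becomes the new entry node and its
// level the new top level, so fully deleted upper layers simply disappear.
// nodesPerLevel[level] lists the old ordinals on that level; oldToNew marks deleted ones with -1.
static int[] pickEntryNode(int[][] nodesPerLevel, int[] oldToNew) {
  for (int level = nodesPerLevel.length - 1; level >= 0; level--) {
    for (int node : nodesPerLevel[level]) {
      if (oldToNew[node] != -1) {
        return new int[] {oldToNew[node], level}; // {new entry ordinal, new top level}
      }
    }
  }
  return new int[] {-1, -1}; // every node was deleted: the resulting graph is empty
}
```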

@Pulkitg64
Contributor Author

Ah, maybe I don't fully grok the current impl. It seems like its doing the "largest graph" thing, but now its more clever and doing the initialized graph thing and that is where the deletes are being removed?

That's right @benwtrent, we are skipping deleted nodes from the largest graph in the initGraph implementation.

@benwtrent
Member

@Pulkitg64 pretty damn clever ;). I gotta think through this. Intuitively, it SHOULD work, even for singleton merges

@msokolov
Contributor

It's fascinating that we actually see recall improving in many cases! Intuitively, I think when we merge more segments in we have an opportunity to patch up the holes left by the deleted docs, and maybe we somehow end up doing that in an even better way the second time around?

I do wonder what recall will look like for graphs with high deletion rates that are singleton-merged only? I wonder if we could test that with luceneutil by creating a single-segment index (with force-merge), deleting 50% of the docs, and then force-merging again?
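As a rough outline of that experiment, a hedged sketch using real Lucene classes but made-up field names, paths, and dimensions (luceneutil drives the same steps with its own tooling):

```java
import java.nio.file.Paths;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class SingletonMergeExperiment {
  public static void main(String[] args) throws Exception {
    try (Directory dir = FSDirectory.open(Paths.get("/tmp/hnsw-singleton"));
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
      int numDocs = 1_000_000;
      for (int i = 0; i < numDocs; i++) {
        Document doc = new Document();
        doc.add(new StringField("id", Integer.toString(i), Field.Store.NO));
        doc.add(new KnnFloatVectorField("vec", randomVector(128), VectorSimilarityFunction.EUCLIDEAN));
        writer.addDocument(doc);
      }
      writer.forceMerge(1);                  // single segment, single HNSW graph
      for (int i = 0; i < numDocs; i += 2) { // delete ~50% of the docs
        writer.deleteDocuments(new Term("id", Integer.toString(i)));
      }
      writer.forceMerge(1);                  // singleton merge: the path this PR optimizes
    }
  }

  private static float[] randomVector(int dim) {
    float[] v = new float[dim];
    for (int i = 0; i < dim; i++) {
      v[i] = (float) Math.random();
    }
    return v;
  }
}
```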

@Pulkitg64
Contributor Author

Based on @msokolov's suggestion, I ran the benchmarks simulating singleton merging. For this I indexed 1M docs, force-merged the segments, deleted documents, and then force-merged again.

I am seeing consistent improvement (about 50x speedup) in force-merge time after deletes, but also degradation in recall (about 10%). It's probably because of a graph-disconnectedness issue (let me try to measure the connectedness of these graphs as well).

| Delete % | Baseline Recall | Baseline Force-Merge Time (s) | Candidate Recall | Candidate Force-Merge Time (s) | Recall Change | Force-Merge Speedup |
|---|---|---|---|---|---|---|
| 50% | 0.892 | 417.52 | 0.763 | 8.43 | -14% | 50x |
| 40% | 0.887 | 505.74 | 0.799 | 9.91 | -10% | 50x |
| 30% | 0.88 | 585 | 0.822 | 10.98 | -7% | 53x |
| 20% | 0.878 | 677 | 0.802 | 12.4 | -9% | 54x |
| 10% | 0.874 | 772.42 | 0.856 | 13.5 | -2% | 59x |

@benwtrent
Member

It's probably because of disconnectedness issue (Let me try to find connectedness number of these graphs as well.)

I would think so. My gut is that we don't actually go through and "fixup" anything when there is just one graph. We just pick the biggest one, and since there are no more vectors to add, we just drop connections on the ground.

I would expect us to have to iterate through the graph and, for every vector that is significantly disconnected, attempt to reconnect it with NNDescent starting at its original place in the graph (initializing with its neighbors' neighbors if all its connections were removed).
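A loose sketch of that candidate-seeding step (plain arrays; a real repair would run an actual graph search / NN-descent from these seeds, and `Scorer` is a stand-in for the vector similarity):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class DisconnectRepairSketch {
  interface Scorer { float score(int a, int b); } // stand-in for the real vector similarity

  // For a node that lost most of its connections, seed repair candidates from the surviving
  // neighbors of its former (now deleted) neighbors, rank by similarity, trim to maxConn.
  static int[] repairCandidates(int node, int[][] oldNeighbors, boolean[] live,
                                int maxConn, Scorer scorer) {
    Set<Integer> candidates = new LinkedHashSet<>();
    for (int n : oldNeighbors[node]) {
      if (live[n]) {
        candidates.add(n);                 // surviving direct neighbor stays a candidate
      } else {
        for (int nn : oldNeighbors[n]) {   // deleted neighbor: borrow its surviving neighbors
          if (live[nn] && nn != node) candidates.add(nn);
        }
      }
    }
    List<Integer> ranked = new ArrayList<>(candidates);
    ranked.sort(Comparator.comparingDouble((Integer c) -> scorer.score(node, c)).reversed());
    return ranked.stream().limit(maxConn).mapToInt(Integer::intValue).toArray();
  }
}
```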


@benwtrent
Member

Thank you for all the work @Pulkitg64

OK, I think your plan from this comment (50% check) is the way to go for now. My intuition is that the relative threshold is likely better, but we would need to make the percentage lower (like 25%, or 20%, like your other benchmarks show).

However, the 50% shows really nice improvements already, so let's move forward with that.

As for "Reconstruct completely", yes, let's focus on 40% or fewer deleted total docs? Does that align with your benchmarking results?

If so, I think we have our two thresholds and we can move forward with this PR.

Thank you for this practical, and powerful improvement for vector indexing!

@Pulkitg64
Contributor Author

Pulkitg64 commented Nov 7, 2025

OK, I think your plan from this comment (50% check) is the way to go for now. My intuition is that the relative threshold is likely better, but we would need to make the percentage lower (like 25%, or 20%, like your other benchmarks show).

Since in my approach the merge will try to reconnect nodes even when the graph has no deletes, which makes merging slower, I agree with you that the relative-threshold approach is likely better. I'll share benchmark results in some time.

As for "Reconstruct completely", yes, lets focus on 40% or fewer deleted total docs? Does that align with your benchmarking results?

Keeping this threshold at 40% now.

@Pulkitg64
Contributor Author

To find a sweet spot between recall and indexing performance, I used 15% as the threshold for relative reduction in connections before considering a node disconnected. I am getting the results below:

| Delete % | MaxConn | Baseline Recall | Baseline Force-Merge Time (s) | Candidate Recall | Candidate Force-Merge Time (s) | Recall Change | Force-Merge Time Change |
|---|---|---|---|---|---|---|---|
| 10 | 8 | 0.796 | 73.97 | 0.755 | 18.82 | -5.15% | 3.9x |
| 10 | 16 | 0.89 | 109.05 | 0.853 | 28.09 | -4.16% | 3.8x |
| 10 | 32 | 0.926 | 136.24 | 0.923 | 31.81 | -0.32% | 4.2x |
| 10 | 64 | 0.935 | 148.34 | 0.935 | 35.64 | 0.00% | 4.1x |
| 20 | 8 | 0.8 | 63.94 | 0.783 | 38.02 | -2.13% | 1.6x |
| 20 | 16 | 0.895 | 95.34 | 0.881 | 63.26 | -1.56% | 1.5x |
| 20 | 32 | 0.931 | 116.78 | 0.916 | 90.51 | -1.61% | 1.29x |
| 20 | 64 | 0.938 | 130.33 | 0.928 | 102 | -1.07% | 1.27x |
| 30 | 8 | 0.808 | 55.11 | 0.797 | 43.72 | -1.36% | 1.26x |
| 30 | 16 | 0.9 | 83.11 | 0.889 | 71.38 | -1.22% | 1.16x |
| 30 | 32 | 0.935 | 100.64 | 0.927 | 94.53 | -0.86% | 1.06x |
| 30 | 64 | 0.942 | 111.51 | 0.935 | 108.52 | -0.74% | 1.02x |
| 40 | 8 | 0.814 | 46.78 | 0.795 | 40.57 | -2.33% | 1.15x |
| 40 | 16 | 0.906 | 74.6 | 0.897 | 64.11 | -0.99% | 1.16x |
| 40 | 32 | 0.94 | 85.76 | 0.932 | 81.03 | -0.85% | 1.05x |
| 40 | 64 | 0.947 | 96.93 | 0.94 | 98.95 | -0.74% | 0.97x |
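Putting the two thresholds from this thread together as a sketch (the exact constants and helper names here are illustrative; the merged change may differ):

```java
// Above ~40% total deletes, rebuilding the graph from scratch wins; below that, reuse the
// largest graph and only repair nodes whose connection count dropped by more than ~15%.
static boolean reuseLargestGraph(int deletedDocs, int maxDoc) {
  return deletedDocs * 100L / maxDoc <= 40;
}

static boolean isDisconnected(int oldConnectionCount, int liveConnectionCount) {
  return liveConnectionCount < 0.85 * oldConnectionCount;
}
```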


@github-actions github-actions bot added this to the 10.4.0 milestone Nov 11, 2025
Member

@benwtrent benwtrent left a comment

OK, reading through it all, I really like where it ended up.

My main concern is testing. Maybe HnswGraphTestCase.testRandomReadWriteAndMerge is enough? Could you add some logic there for "delete few, delete many"? Just to make sure we adequately exercise this logic.

This really is great work! Thank you for all the iteration. The numbers are great!

OnHeapHnswGraph initializedGraph,
BitSet initializedNodes)
/**
* Fixes disconnected nodes at a specific level by performing graph searches from their existing
Contributor

https://www.vldb.org/pvldb/vol18/p5166-upreti.pdf has an in-place delete algorithm (search for algorithm 6). It may be worth investigating for this use case as it could be less expensive than repairing disconnected nodes, and the paper indicates they've successfully used this algorithm with no observed loss in accuracy.

Worth noting that this algorithm does not guarantee the removal of all in-edges to the deleted node in a directed graph, so even after performing this kind of deletion we'd have to sweep the graph and remove any dead edges.

Contributor Author

Thanks @mccullocht for the idea and for sharing this paper. Using the neighbors and neighbors-of-neighbors of the deleted nodes for reconnection seems promising (and definitely faster), and I think it's worth pursuing.
But I wonder if we should do it as part of this PR or a separate one. I can work on this idea immediately as a separate issue if that makes sense to you.

Contributor

It looks like this PR has been thoroughly tested and I didn't want to block it, but trying this algorithm would be a good follow-up.

@Pulkitg64
Contributor Author

Hi @benwtrent
Is there anything else we should add in this PR before merging it?

@benwtrent
Member

@Pulkitg64 nope! I am back from eating a bunch of 🦃 . I will merge and backport this. Super excited for this change. It's a practical and very useful optimization for HNSW!

@benwtrent benwtrent merged commit ef10476 into apache:main Dec 1, 2025
12 checks passed
benwtrent pushed a commit that referenced this pull request Dec 1, 2025
This optimizes HNSW graph merging: instead of ruling out a large graph with deletions as the merge base, deleted nodes are now removed and the related connections repaired. Additionally, nodes might be promoted to higher levels to account for loss of connectivity at the higher layers.

Especially in singleton merges, this improves throughput significantly.
@mikemccand
Member

Note that this caused some CI failures: #15467 -- let's hold off on backporting until we resolve that.

@benwtrent
Member

@mikemccand I already backported, however a fix is in flight.

@mikemccand
Member

Wow, fast! Thanks @benwtrent.

@Pulkitg64 Pulkitg64 deleted the singleton branch December 9, 2025 07:01