
Conversation

@iverase (Contributor) commented Sep 4, 2025

When the number of centroids increases, the bottleneck of the k-means implementation becomes the computation of neighbours. For example, force merging 10 million GloVe vectors with 200 dimensions and a target size of 64 ends up with around 150k centroids. When we look at where the time goes, we see that 80% of it is spent computing the neighbours:

[profiler screenshot]

We are using brute force to compute the neighbours, so it is expected not to scale well with the number of centroids. This PR proposes using an HNSW structure to compute the neighbours when the number of centroids is over a threshold. After this change, merging becomes much faster, as computing the neighbours now takes less than 10% of the time:

[profiler screenshot]
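To make the scaling concrete: the brute-force approach scores every centroid against every other centroid, which is O(n² · d) work. A minimal, self-contained sketch of that baseline (class and method names are hypothetical, not the actual Elasticsearch code):

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical sketch of brute-force neighbour computation: for each centroid,
// score it against every other centroid and keep the k closest. This is
// O(n^2 * d) work, which explains why it dominates once n reaches ~150k.
class BruteForceNeighbours {

    static float squaredDistance(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    // Returns, for each centroid, the ordinals of its k nearest neighbours.
    static int[][] neighbours(float[][] centers, int k) {
        int n = centers.length;
        int[][] result = new int[n][];
        for (int i = 0; i < n; i++) {
            final int query = i;
            Integer[] ords = new Integer[n];
            for (int j = 0; j < n; j++) {
                ords[j] = j;
            }
            // Full scan: sort all ordinals by distance to the query centroid.
            Arrays.sort(ords, Comparator.comparingDouble(o -> squaredDistance(centers[query], centers[o])));
            result[i] = new int[k];
            for (int j = 0; j < k; j++) {
                result[i][j] = ords[j + 1]; // skip ords[0], the centroid itself
            }
        }
        return result;
    }
}
```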

I ran some experiments indicating that the overhead of building the HNSW graph is not worth it until we have around 15,000 centroids; past that point we benefit from using the HNSW structure.

For example, indexing and force merging 10 million GloVe vectors with 200 dimensions and a target size of 64 on main looks like:

index_name                         index_type  num_docs  index_time(ms)  force_merge_time(ms)  num_segments
---------------------------------  ----------  --------  --------------  --------------------  ------------  
enwiki-20120502-lines-1k-200d.vec         ivf  10000000         1302495               2990796             1

index_name                         index_type  visit_percentage(%)  latency(ms)  net_cpu_time(ms)  avg_cpu_count     QPS  recall    visited  filter_selectivity
---------------------------------  ----------  -------------------  -----------  ----------------  -------------  ------  ------  ---------  ------------------  
enwiki-20120502-lines-1k-200d.vec         ivf                 0.25         2.59              0.00           0.00  386.10    0.85   50551.41                1.00
enwiki-20120502-lines-1k-200d.vec         ivf                 0.50         3.84              0.00           0.00  260.42    0.88  100560.13                1.00
enwiki-20120502-lines-1k-200d.vec         ivf                 0.75         5.08              0.00           0.00  196.85    0.89  150554.77                1.00
enwiki-20120502-lines-1k-200d.vec         ivf                 1.00         6.27              0.00           0.00  159.49    0.90  200553.17                1.00
enwiki-20120502-lines-1k-200d.vec         ivf                 1.50         8.78              0.00           0.00  113.90    0.92  300555.43                1.00
enwiki-20120502-lines-1k-200d.vec         ivf                 2.00        11.20              0.00           0.00   89.29    0.92  400556.21                1.00
enwiki-20120502-lines-1k-200d.vec         ivf                 2.50        13.70              0.00           0.00   72.99    0.93  500555.61                1.00
enwiki-20120502-lines-1k-200d.vec         ivf                 0.00         7.40              0.00           0.00  135.14    0.91  245552.82                1.00

With this PR it looks like:

index_name                         index_type  num_docs  index_time(ms)  force_merge_time(ms)  num_segments
---------------------------------  ----------  --------  --------------  --------------------  ------------  
enwiki-20120502-lines-1k-200d.vec         ivf  10000000          451825                939612             1

index_name                         index_type  visit_percentage(%)  latency(ms)  net_cpu_time(ms)  avg_cpu_count     QPS  recall    visited  filter_selectivity
---------------------------------  ----------  -------------------  -----------  ----------------  -------------  ------  ------  ---------  ------------------  
enwiki-20120502-lines-1k-200d.vec         ivf                 0.25         2.58              0.00           0.00  387.60    0.84   50556.29                1.00
enwiki-20120502-lines-1k-200d.vec         ivf                 0.50         3.85              0.00           0.00  259.74    0.88  100551.64                1.00
enwiki-20120502-lines-1k-200d.vec         ivf                 0.75         5.13              0.00           0.00  194.93    0.90  150557.25                1.00
enwiki-20120502-lines-1k-200d.vec         ivf                 1.00         6.45              0.00           0.00  155.04    0.90  200555.97                1.00
enwiki-20120502-lines-1k-200d.vec         ivf                 1.50         8.74              0.00           0.00  114.42    0.92  300552.92                1.00
enwiki-20120502-lines-1k-200d.vec         ivf                 2.00        11.14              0.00           0.00   89.77    0.92  400554.12                1.00
enwiki-20120502-lines-1k-200d.vec         ivf                 2.50        13.67              0.00           0.00   73.15    0.93  500553.61                1.00
enwiki-20120502-lines-1k-200d.vec         ivf                 0.00         7.28              0.00           0.00  137.36    0.91  245558.59                1.00

That is an increase of around 3x in indexing and merging throughput with no change in recall.

@elasticsearchmachine (Collaborator)

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@elasticsearchmachine added the Team:Search Relevance label (meta label for the Search Relevance team in Elasticsearch) on Sep 4, 2025
@iverase (Contributor, Author) commented Sep 4, 2025

Here are the processing times for both algorithms with different numbers of vectors and different dimensions:

Benchmark                              (dims)  (numVectors)  Mode  Cnt    Score     Error  Units
ComputeNeighboursBenchmark.bruteForce     384          1000  avgt    3    0.038 ±   0.003   s/op
ComputeNeighboursBenchmark.bruteForce     384          2000  avgt    3    0.132 ±   0.042   s/op
ComputeNeighboursBenchmark.bruteForce     384          3000  avgt    3    0.272 ±   0.193   s/op
ComputeNeighboursBenchmark.bruteForce     384          5000  avgt    3    0.900 ±   0.479   s/op
ComputeNeighboursBenchmark.bruteForce     384         10000  avgt    3    4.274 ±   1.535   s/op
ComputeNeighboursBenchmark.bruteForce     384         20000  avgt    3   15.414 ±  35.316   s/op
ComputeNeighboursBenchmark.bruteForce     384         50000  avgt    3  123.440 ±  68.797   s/op
ComputeNeighboursBenchmark.bruteForce     782          1000  avgt    3    0.047 ±   0.016   s/op
ComputeNeighboursBenchmark.bruteForce     782          2000  avgt    3    0.172 ±   0.059   s/op
ComputeNeighboursBenchmark.bruteForce     782          3000  avgt    3    0.390 ±   0.128   s/op
ComputeNeighboursBenchmark.bruteForce     782          5000  avgt    3    1.299 ±   0.518   s/op
ComputeNeighboursBenchmark.bruteForce     782         10000  avgt    3    6.091 ±   4.378   s/op
ComputeNeighboursBenchmark.bruteForce     782         20000  avgt    3   30.728 ±   6.527   s/op
ComputeNeighboursBenchmark.bruteForce     782         50000  avgt    3  236.107 ± 100.472   s/op
ComputeNeighboursBenchmark.bruteForce    1024          1000  avgt    3    0.054 ±   0.005   s/op
ComputeNeighboursBenchmark.bruteForce    1024          2000  avgt    3    0.205 ±   0.043   s/op
ComputeNeighboursBenchmark.bruteForce    1024          3000  avgt    3    0.521 ±   0.352   s/op
ComputeNeighboursBenchmark.bruteForce    1024          5000  avgt    3    1.634 ±   0.762   s/op
ComputeNeighboursBenchmark.bruteForce    1024         10000  avgt    3    7.679 ±   1.254   s/op
ComputeNeighboursBenchmark.bruteForce    1024         20000  avgt    3   42.399 ±   4.083   s/op
ComputeNeighboursBenchmark.bruteForce    1024         50000  avgt    3  308.647 ±  67.909   s/op
ComputeNeighboursBenchmark.graph          384          1000  avgt    3    0.200 ±   0.008   s/op
ComputeNeighboursBenchmark.graph          384          2000  avgt    3    0.567 ±   0.044   s/op
ComputeNeighboursBenchmark.graph          384          3000  avgt    3    1.036 ±   0.529   s/op
ComputeNeighboursBenchmark.graph          384          5000  avgt    3    2.151 ±   2.361   s/op
ComputeNeighboursBenchmark.graph          384         10000  avgt    3    5.732 ±   1.222   s/op
ComputeNeighboursBenchmark.graph          384         20000  avgt    3   15.093 ±   1.793   s/op
ComputeNeighboursBenchmark.graph          384         50000  avgt    3   55.819 ±  27.741   s/op
ComputeNeighboursBenchmark.graph          782          1000  avgt    3    0.284 ±   0.043   s/op
ComputeNeighboursBenchmark.graph          782          2000  avgt    3    0.856 ±   0.620   s/op
ComputeNeighboursBenchmark.graph          782          3000  avgt    3    1.632 ±   0.983   s/op
ComputeNeighboursBenchmark.graph          782          5000  avgt    3    3.654 ±   2.807   s/op
ComputeNeighboursBenchmark.graph          782         10000  avgt    3    9.702 ±   1.897   s/op
ComputeNeighboursBenchmark.graph          782         20000  avgt    3   25.971 ±   7.712   s/op
ComputeNeighboursBenchmark.graph          782         50000  avgt    3   97.879 ± 179.753   s/op
ComputeNeighboursBenchmark.graph         1024          1000  avgt    3    0.396 ±   1.442   s/op
ComputeNeighboursBenchmark.graph         1024          2000  avgt    3    1.365 ±   5.740   s/op
ComputeNeighboursBenchmark.graph         1024          3000  avgt    3    3.392 ±   8.409   s/op
ComputeNeighboursBenchmark.graph         1024          5000  avgt    3    5.816 ±  17.216   s/op
ComputeNeighboursBenchmark.graph         1024         10000  avgt    3   12.104 ±   9.293   s/op
ComputeNeighboursBenchmark.graph         1024         20000  avgt    3   33.329 ±  28.191   s/op
ComputeNeighboursBenchmark.graph         1024         50000  avgt    3  120.856 ±  14.384   s/op

@benwtrent (Member)

Love this idea. I think it will really help with a significant bottleneck at larger scale.

It's interesting how the actual hierarchical k-means is still so cheap, but this "fix up phase" ends up being expensive.

You will like this @jimczi & @tveasey ;)

@iverase (Contributor, Author) commented Sep 4, 2025

It's interesting how the actual hierarchical k-means is still so cheap, but this "fix up phase" ends up being expensive.

What I have observed is that in low-memory scenarios, hierarchical k-means becomes the expensive part because of the random access pattern during slicing.

@jimczi (Contributor) left a comment:

This is great and something we'll rely on for many other steps, since we rely on this centroid search everywhere. I left some comments on the parametrization; it would be nice to publish your macro benchmark too.

        return this;
    }
};
final OnHeapHnswGraph graph = HnswGraphBuilder.create(supplier, 16, 100, 42L).build(centers.length);
Contributor

I think it's worth spending a bit more time optimising this. In my testing, M=8 had the best ratio of recall to visited percentage, so it might be beneficial to publish your macro benchmark.

Contributor Author

I refactored the code in 5034bca to publish the benchmark.

Member

I do think that 8 with a larger beam width is worth trying (8, 150).

private NeighborHood[] computeNeighborhoods(float[][] centers, int clustersPerNeighborhood) throws IOException {
    assert centers.length > clustersPerNeighborhood;
    // experiments show that below 15k we better use brute force, otherwise hnsw gives us a nice speed up
    if (centers.length < 15_000) {
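The size-based dispatch in the snippet above can be sketched in isolation (class, interface, and constant names are hypothetical): below the empirically determined threshold, the quadratic brute-force scan is cheaper than paying the HNSW construction cost; above it, the graph wins.

```java
// Hypothetical sketch of the threshold dispatch between the two neighbour
// computation strategies. The 15k value comes from the benchmarks in this PR.
class NeighbourDispatch {
    static final int HNSW_THRESHOLD = 15_000;

    // A strategy returns, for each centroid, the ordinals of its k neighbours.
    interface Strategy {
        int[][] neighbours(float[][] centers, int k);
    }

    // Pick brute force below the threshold, the HNSW-backed strategy above it.
    static Strategy choose(int numCenters, Strategy bruteForce, Strategy graph) {
        return numCenters < HNSW_THRESHOLD ? bruteForce : graph;
    }
}
```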
Contributor

I think we can optimise the graph to work better at lower scale, but this is good as a first threshold. That corresponds to segments of more than 1M vectors at 64 vectors per centroid.

Member

Reducing the number of connections could make this threshold smaller.

@iverase (Contributor, Author) commented Sep 4, 2025

Agree. I didn't spend too much time on it because it seems pretty fast for low values (a few seconds), so I wonder if there is a need to optimize those cases.

Member

I think just picking something "good enough" is alright. It provides a nice improvement and any optimizations we make won't be "format breaking" :)

for (int i = 0; i < centers.length; i++) {
    scorer.setScoringOrdinal(i);
    singleBit.indexSet = i;
    final KnnCollector collector = HnswGraphSearcher.search(scorer, clustersPerNeighborhood, graph, singleBit, Integer.MAX_VALUE);
Contributor

Did you test multiple sizes? I guess recall is important here, so we should aim for a recall of 1?

Contributor Author

We always use 128 for clustersPerNeighborhood. While ideally recall should be close to 1, the tests do not show a loss of quality in the centroids.

Contributor Author

Do you mean we should oversample here, for example use 2 * clustersPerNeighborhood to make sure we always get the top clustersPerNeighborhood?

Member

Do you mean we should oversample here, for example use 2 * clustersPerNeighborhood to make sure we always get the top clustersPerNeighborhood?

Generally, the approximation knob for HNSW is efSearch, which in this case amounts to an oversample. I am not sure 2x is required, but it should probably be more than just the number of nearest neighbours we care about.
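The oversampling idea can be sketched as follows (names hypothetical, not the PR's code): widen the approximate search budget by a factor, then truncate the returned candidates back to the requested neighbourhood size. The wider frontier makes it more likely that the true top-k survive the approximation.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical sketch of efSearch-style oversampling: ask the approximate
// search for more candidates than needed, then keep only the best k.
class Oversample {
    static final int OVERSAMPLE = 2;

    // Budget handed to the approximate search instead of the bare k.
    static int searchBudget(int clustersPerNeighborhood) {
        return OVERSAMPLE * clustersPerNeighborhood;
    }

    // candidates: ordinals returned by the approximate search, unordered.
    // scores: score per ordinal, higher is better. Returns the best k ordinals.
    static int[] topK(int[] candidates, float[] scores, int k) {
        Integer[] boxed = Arrays.stream(candidates).boxed().toArray(Integer[]::new);
        Arrays.sort(boxed, Comparator.comparingDouble(o -> -scores[o]));
        int[] result = new int[Math.min(k, boxed.length)];
        for (int i = 0; i < result.length; i++) {
            result[i] = boxed[i];
        }
        return result;
    }
}
```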

Contributor

Do we really need 128 no matter what number of centroids we have? Reducing this value when we have a small number of centroids could make the graph strategy applicable earlier.

Member

I do think it should be scaled down for a lower value when there are fewer centroids. I do not know what that value would be.

The number is coupled to the recursive cluster splits to help capture potentially mis-assigned vectors along the split edges.

@benwtrent (Member) left a comment:

Great stuff! I too have concerns about the graph parameters: lower M, higher ef_construction. Then I think we need a "custom collector" that can behave semi-optimally for us (at a minimum, allow resource reuse).

@john-wagster (Contributor) left a comment:

It's impressive; with the results posted it seems valuable as is. I didn't notice any glaring issues. LGTM.

@iverase (Contributor, Author) commented Sep 4, 2025

Pushed a new version where:

  1. We construct the graph with M=8 and EF_CONSTRUCTION=150.
  2. We introduce a ReusableKnnCollector and drop the SingleBit implementation.
  3. We oversample the collector by 2.

Interestingly, this has not lowered the number of centroids required to amortise the construction of the graph; it is still around 15k vectors. On the other hand, it is faster for high numbers of vectors:

Benchmark                         (dims)  (numVectors)  Mode  Cnt   Score    Error  Units
ComputeNeighboursBenchmark.graph     384          1000  avgt    3   0.227 ±  0.126   s/op
ComputeNeighboursBenchmark.graph     384          2000  avgt    3   0.591 ±  0.038   s/op
ComputeNeighboursBenchmark.graph     384          3000  avgt    3   1.059 ±  0.354   s/op
ComputeNeighboursBenchmark.graph     384          5000  avgt    3   2.112 ±  0.576   s/op
ComputeNeighboursBenchmark.graph     384         10000  avgt    3   5.347 ±  9.749   s/op
ComputeNeighboursBenchmark.graph     384         20000  avgt    3  12.969 ±  1.577   s/op
ComputeNeighboursBenchmark.graph     384         50000  avgt    3  46.449 ± 33.010   s/op
ComputeNeighboursBenchmark.graph     782          1000  avgt    3   0.293 ±  0.016   s/op
ComputeNeighboursBenchmark.graph     782          2000  avgt    3   0.810 ±  0.036   s/op
ComputeNeighboursBenchmark.graph     782          3000  avgt    3   1.481 ±  0.706   s/op
ComputeNeighboursBenchmark.graph     782          5000  avgt    3   3.125 ±  1.140   s/op
ComputeNeighboursBenchmark.graph     782         10000  avgt    3   8.380 ±  1.535   s/op
ComputeNeighboursBenchmark.graph     782         20000  avgt    3  21.029 ±  3.780   s/op
ComputeNeighboursBenchmark.graph     782         50000  avgt    3  73.378 ± 44.400   s/op
ComputeNeighboursBenchmark.graph    1024          1000  avgt    3   0.356 ±  0.415   s/op
ComputeNeighboursBenchmark.graph    1024          2000  avgt    3   0.952 ±  0.268   s/op
ComputeNeighboursBenchmark.graph    1024          3000  avgt    3   1.839 ±  0.657   s/op
ComputeNeighboursBenchmark.graph    1024          5000  avgt    3   3.823 ±  5.224   s/op
ComputeNeighboursBenchmark.graph    1024         10000  avgt    3  10.132 ±  2.812   s/op
ComputeNeighboursBenchmark.graph    1024         20000  avgt    3  25.402 ±  3.602   s/op
ComputeNeighboursBenchmark.graph    1024         50000  avgt    3  85.614 ±  9.996   s/op

Indexing the GloVe data is faster too:

index_name                         index_type  num_docs  index_time(ms)  force_merge_time(ms)  num_segments
---------------------------------  ----------  --------  --------------  --------------------  ------------  
enwiki-20120502-lines-1k-200d.vec         ivf  10000000          179285                702674             1

index_name                         index_type  visit_percentage(%)  latency(ms)  net_cpu_time(ms)  avg_cpu_count     QPS  recall    visited  filter_selectivity
---------------------------------  ----------  -------------------  -----------  ----------------  -------------  ------  ------  ---------  ------------------  
enwiki-20120502-lines-1k-200d.vec         ivf                 0.25         2.66              0.00           0.00  375.94    0.85   50552.81                1.00
enwiki-20120502-lines-1k-200d.vec         ivf                 0.50         4.41              0.00           0.00  226.76    0.88  100560.75                1.00
enwiki-20120502-lines-1k-200d.vec         ivf                 0.75         5.58              0.00           0.00  179.21    0.90  150558.51                1.00
enwiki-20120502-lines-1k-200d.vec         ivf                 1.00         6.55              0.00           0.00  152.67    0.91  200559.44                1.00
enwiki-20120502-lines-1k-200d.vec         ivf                 1.50         9.26              0.00           0.00  107.99    0.92  300557.77                1.00
enwiki-20120502-lines-1k-200d.vec         ivf                 2.00        11.88              0.00           0.00   84.18    0.92  400557.32                1.00
enwiki-20120502-lines-1k-200d.vec         ivf                 2.50        14.61              0.00           0.00   68.45    0.93  500552.98                1.00
enwiki-20120502-lines-1k-200d.vec         ivf                 0.00         7.70              0.00           0.00  129.87    0.91  245554.05                1.00
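The reusable-collector idea from the list above can be sketched like this (hypothetical class, not the PR's ReusableKnnCollector): one bounded min-heap that is cleared between searches, so the per-centroid search loop allocates nothing.

```java
import java.util.PriorityQueue;

// Hypothetical sketch of a reusable top-k collector: a single bounded
// min-heap by score, reset between searches instead of allocating a fresh
// collector for every centroid.
class ReusableTopK {
    private final int k;
    private final PriorityQueue<float[]> heap; // entries are {score, ordinal}

    ReusableTopK(int k) {
        this.k = k;
        this.heap = new PriorityQueue<>(k, (a, b) -> Float.compare(a[0], b[0]));
    }

    // Clear state so the collector can be reused for the next search.
    void reset() {
        heap.clear();
    }

    // Offer a candidate; evict the current worst if the heap is full.
    void collect(int ordinal, float score) {
        if (heap.size() < k) {
            heap.add(new float[] { score, ordinal });
        } else if (score > heap.peek()[0]) {
            heap.poll();
            heap.add(new float[] { score, ordinal });
        }
    }

    // Ordinals of the collected top-k (sorted ascending for determinism).
    int[] ordinals() {
        return heap.stream().mapToInt(e -> (int) e[1]).sorted().toArray();
    }
}
```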

@benwtrent (Member) left a comment:

I think this is great!

@iverase iverase merged commit 45842f8 into elastic:main Sep 5, 2025
33 checks passed
@iverase iverase deleted the computeNeighbours branch September 5, 2025 06:38

Labels

>non-issue, :Search Relevance/Vectors (Vector search), Team:Search Relevance (meta label for the Search Relevance team in Elasticsearch), v9.2.0


5 participants