
Conversation

@iverase (Contributor) commented Sep 4, 2025

When the number of centroids increases, the bottleneck of the k-means implementation becomes the computation of neighbours. For example, force merging 10 million GloVe vectors with 200 dimensions and a target size of 64 ends up with around 150k centroids. When we look at where the time goes, we see that 80% of it is spent computing the neighbours:

[profiler screenshot]

We are using brute force to compute the neighbours, so it is expected not to scale well with the number of centroids. This PR proposes using an HNSW structure to compute the neighbours when the number of centroids is over a threshold. After this change, merging becomes much faster, as computing the neighbours now takes less than 10% of the time:

[profiler screenshot]
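To make the scaling concrete: the brute-force approach scores every centroid against every other centroid, which is O(n² · d) work. A minimal, self-contained sketch of that baseline (class and method names are hypothetical, not the actual Elasticsearch code):

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical sketch of brute-force neighbour computation: for each centroid,
// score it against every other centroid and keep the k closest. This is
// O(n^2 * d) work, which explains why it dominates once n reaches ~150k.
class BruteForceNeighbours {

    static float squaredDistance(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    // Returns, for each centroid, the ordinals of its k nearest neighbours.
    static int[][] neighbours(float[][] centers, int k) {
        int n = centers.length;
        int[][] result = new int[n][];
        for (int i = 0; i < n; i++) {
            final int query = i;
            Integer[] ords = new Integer[n];
            for (int j = 0; j < n; j++) {
                ords[j] = j;
            }
            // Full scan: sort all ordinals by distance to the query centroid.
            Arrays.sort(ords, Comparator.comparingDouble(o -> squaredDistance(centers[query], centers[o])));
            result[i] = new int[k];
            for (int j = 0; j < k; j++) {
                result[i][j] = ords[j + 1]; // skip ords[0], the centroid itself
            }
        }
        return result;
    }
}
```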

I ran some experiments indicating that the overhead of building the HNSW graph is not worth it until we have around 15,000 centroids; past that point we benefit from using the HNSW structure.

For example, indexing and force merging 10 million GloVe vectors with 200 dimensions and a target size of 64 on main looks like:

index_name                         index_type  num_docs  index_time(ms)  force_merge_time(ms)  num_segments
---------------------------------  ----------  --------  --------------  --------------------  ------------  
enwiki-20120502-lines-1k-200d.vec         ivf  10000000         1302495               2990796             1

index_name                         index_type  visit_percentage(%)  latency(ms)  net_cpu_time(ms)  avg_cpu_count     QPS  recall    visited  filter_selectivity
---------------------------------  ----------  -------------------  -----------  ----------------  -------------  ------  ------  ---------  ------------------  
enwiki-20120502-lines-1k-200d.vec         ivf                 0.25         2.59              0.00           0.00  386.10    0.85   50551.41                1.00
enwiki-20120502-lines-1k-200d.vec         ivf                 0.50         3.84              0.00           0.00  260.42    0.88  100560.13                1.00
enwiki-20120502-lines-1k-200d.vec         ivf                 0.75         5.08              0.00           0.00  196.85    0.89  150554.77                1.00
enwiki-20120502-lines-1k-200d.vec         ivf                 1.00         6.27              0.00           0.00  159.49    0.90  200553.17                1.00
enwiki-20120502-lines-1k-200d.vec         ivf                 1.50         8.78              0.00           0.00  113.90    0.92  300555.43                1.00
enwiki-20120502-lines-1k-200d.vec         ivf                 2.00        11.20              0.00           0.00   89.29    0.92  400556.21                1.00
enwiki-20120502-lines-1k-200d.vec         ivf                 2.50        13.70              0.00           0.00   72.99    0.93  500555.61                1.00
enwiki-20120502-lines-1k-200d.vec         ivf                 0.00         7.40              0.00           0.00  135.14    0.91  245552.82                1.00

With this PR it looks like:

index_name                         index_type  num_docs  index_time(ms)  force_merge_time(ms)  num_segments
---------------------------------  ----------  --------  --------------  --------------------  ------------  
enwiki-20120502-lines-1k-200d.vec         ivf  10000000          451825                939612             1

index_name                         index_type  visit_percentage(%)  latency(ms)  net_cpu_time(ms)  avg_cpu_count     QPS  recall    visited  filter_selectivity
---------------------------------  ----------  -------------------  -----------  ----------------  -------------  ------  ------  ---------  ------------------  
enwiki-20120502-lines-1k-200d.vec         ivf                 0.25         2.58              0.00           0.00  387.60    0.84   50556.29                1.00
enwiki-20120502-lines-1k-200d.vec         ivf                 0.50         3.85              0.00           0.00  259.74    0.88  100551.64                1.00
enwiki-20120502-lines-1k-200d.vec         ivf                 0.75         5.13              0.00           0.00  194.93    0.90  150557.25                1.00
enwiki-20120502-lines-1k-200d.vec         ivf                 1.00         6.45              0.00           0.00  155.04    0.90  200555.97                1.00
enwiki-20120502-lines-1k-200d.vec         ivf                 1.50         8.74              0.00           0.00  114.42    0.92  300552.92                1.00
enwiki-20120502-lines-1k-200d.vec         ivf                 2.00        11.14              0.00           0.00   89.77    0.92  400554.12                1.00
enwiki-20120502-lines-1k-200d.vec         ivf                 2.50        13.67              0.00           0.00   73.15    0.93  500553.61                1.00
enwiki-20120502-lines-1k-200d.vec         ivf                 0.00         7.28              0.00           0.00  137.36    0.91  245558.59                1.00

That is an increase of around 3x in indexing and merging throughput with no change in recall.

@elasticsearchmachine (Collaborator)

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@elasticsearchmachine added the Team:Search Relevance label (meta label for the Search Relevance team in Elasticsearch) on Sep 4, 2025
@iverase (Contributor, Author) commented Sep 4, 2025

Here are the processing times for both algorithms with different numbers of vectors and different dimensions:

Benchmark                              (dims)  (numVectors)  Mode  Cnt    Score     Error  Units
ComputeNeighboursBenchmark.bruteForce     384          1000  avgt    3    0.038 ±   0.003   s/op
ComputeNeighboursBenchmark.bruteForce     384          2000  avgt    3    0.132 ±   0.042   s/op
ComputeNeighboursBenchmark.bruteForce     384          3000  avgt    3    0.272 ±   0.193   s/op
ComputeNeighboursBenchmark.bruteForce     384          5000  avgt    3    0.900 ±   0.479   s/op
ComputeNeighboursBenchmark.bruteForce     384         10000  avgt    3    4.274 ±   1.535   s/op
ComputeNeighboursBenchmark.bruteForce     384         20000  avgt    3   15.414 ±  35.316   s/op
ComputeNeighboursBenchmark.bruteForce     384         50000  avgt    3  123.440 ±  68.797   s/op
ComputeNeighboursBenchmark.bruteForce     782          1000  avgt    3    0.047 ±   0.016   s/op
ComputeNeighboursBenchmark.bruteForce     782          2000  avgt    3    0.172 ±   0.059   s/op
ComputeNeighboursBenchmark.bruteForce     782          3000  avgt    3    0.390 ±   0.128   s/op
ComputeNeighboursBenchmark.bruteForce     782          5000  avgt    3    1.299 ±   0.518   s/op
ComputeNeighboursBenchmark.bruteForce     782         10000  avgt    3    6.091 ±   4.378   s/op
ComputeNeighboursBenchmark.bruteForce     782         20000  avgt    3   30.728 ±   6.527   s/op
ComputeNeighboursBenchmark.bruteForce     782         50000  avgt    3  236.107 ± 100.472   s/op
ComputeNeighboursBenchmark.bruteForce    1024          1000  avgt    3    0.054 ±   0.005   s/op
ComputeNeighboursBenchmark.bruteForce    1024          2000  avgt    3    0.205 ±   0.043   s/op
ComputeNeighboursBenchmark.bruteForce    1024          3000  avgt    3    0.521 ±   0.352   s/op
ComputeNeighboursBenchmark.bruteForce    1024          5000  avgt    3    1.634 ±   0.762   s/op
ComputeNeighboursBenchmark.bruteForce    1024         10000  avgt    3    7.679 ±   1.254   s/op
ComputeNeighboursBenchmark.bruteForce    1024         20000  avgt    3   42.399 ±   4.083   s/op
ComputeNeighboursBenchmark.bruteForce    1024         50000  avgt    3  308.647 ±  67.909   s/op
ComputeNeighboursBenchmark.graph          384          1000  avgt    3    0.200 ±   0.008   s/op
ComputeNeighboursBenchmark.graph          384          2000  avgt    3    0.567 ±   0.044   s/op
ComputeNeighboursBenchmark.graph          384          3000  avgt    3    1.036 ±   0.529   s/op
ComputeNeighboursBenchmark.graph          384          5000  avgt    3    2.151 ±   2.361   s/op
ComputeNeighboursBenchmark.graph          384         10000  avgt    3    5.732 ±   1.222   s/op
ComputeNeighboursBenchmark.graph          384         20000  avgt    3   15.093 ±   1.793   s/op
ComputeNeighboursBenchmark.graph          384         50000  avgt    3   55.819 ±  27.741   s/op
ComputeNeighboursBenchmark.graph          782          1000  avgt    3    0.284 ±   0.043   s/op
ComputeNeighboursBenchmark.graph          782          2000  avgt    3    0.856 ±   0.620   s/op
ComputeNeighboursBenchmark.graph          782          3000  avgt    3    1.632 ±   0.983   s/op
ComputeNeighboursBenchmark.graph          782          5000  avgt    3    3.654 ±   2.807   s/op
ComputeNeighboursBenchmark.graph          782         10000  avgt    3    9.702 ±   1.897   s/op
ComputeNeighboursBenchmark.graph          782         20000  avgt    3   25.971 ±   7.712   s/op
ComputeNeighboursBenchmark.graph          782         50000  avgt    3   97.879 ± 179.753   s/op
ComputeNeighboursBenchmark.graph         1024          1000  avgt    3    0.396 ±   1.442   s/op
ComputeNeighboursBenchmark.graph         1024          2000  avgt    3    1.365 ±   5.740   s/op
ComputeNeighboursBenchmark.graph         1024          3000  avgt    3    3.392 ±   8.409   s/op
ComputeNeighboursBenchmark.graph         1024          5000  avgt    3    5.816 ±  17.216   s/op
ComputeNeighboursBenchmark.graph         1024         10000  avgt    3   12.104 ±   9.293   s/op
ComputeNeighboursBenchmark.graph         1024         20000  avgt    3   33.329 ±  28.191   s/op
ComputeNeighboursBenchmark.graph         1024         50000  avgt    3  120.856 ±  14.384   s/op

@benwtrent (Member)

Love this idea. I think it will really help with a significant bottleneck at larger scale.

It's interesting how the actual hierarchical k-means is still so cheap, but this "fix up phase" ends up being expensive.

You will like this @jimczi & @tveasey ;)

@iverase (Contributor, Author) commented Sep 4, 2025

It's interesting how the actual hierarchical k-means is still so cheap, but this "fix up phase" ends up being expensive.

What I have observed is that in low-memory scenarios, hierarchical k-means becomes the expensive part because of the random access pattern during slicing.

@jimczi (Contributor) left a comment:

This is great and something we'll rely on for many other steps, since we rely on this centroid search everywhere. I left some comments on the parametrization; it would be nice to publish your macro benchmark too.

        return this;
    }
};
final OnHeapHnswGraph graph = HnswGraphBuilder.create(supplier, 16, 100, 42L).build(centers.length);
Contributor

I think it's worth spending a bit more time optimising this. In my testing, M=8 had the best ratio of recall to visited percentage, so it might be beneficial to publish your macro benchmark.

Contributor Author

I refactored the code in 5034bca to publish the benchmark.

Member

I do think that 8 with a larger beam width is worth trying (8, 150).

private NeighborHood[] computeNeighborhoods(float[][] centers, int clustersPerNeighborhood) throws IOException {
    assert centers.length > clustersPerNeighborhood;
    // experiments show that below 15k we better use brute force, otherwise hnsw gives us a nice speed up
    if (centers.length < 15_000) {
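The size-based dispatch in the snippet above can be sketched in isolation (class, interface, and constant names are hypothetical): below the empirically determined threshold, the quadratic brute-force scan is cheaper than paying the HNSW construction cost; above it, the graph wins.

```java
// Hypothetical sketch of the threshold dispatch between the two neighbour
// computation strategies. The 15k value comes from the benchmarks in this PR.
class NeighbourDispatch {
    static final int HNSW_THRESHOLD = 15_000;

    // A strategy returns, for each centroid, the ordinals of its k neighbours.
    interface Strategy {
        int[][] neighbours(float[][] centers, int k);
    }

    // Pick brute force below the threshold, the HNSW-backed strategy above it.
    static Strategy choose(int numCenters, Strategy bruteForce, Strategy graph) {
        return numCenters < HNSW_THRESHOLD ? bruteForce : graph;
    }
}
```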
Contributor

I think we can optimise the graph to work better at lower scale, but this is good as a first threshold. That corresponds to segments of more than 1M vectors at 64 vectors per centroid.

Member

Reducing the number of connections could make this threshold smaller.

@iverase (Contributor, Author) commented Sep 4, 2025

Agree. I didn't spend too much time on it because it seems pretty fast for low values (a few seconds), so I wonder if there is a need to optimize those cases.

Member

I think just picking something "good enough" is alright. It provides a nice improvement and any optimizations we make won't be "format breaking" :)

for (int i = 0; i < centers.length; i++) {
    scorer.setScoringOrdinal(i);
    singleBit.indexSet = i;
    final KnnCollector collector = HnswGraphSearcher.search(scorer, clustersPerNeighborhood, graph, singleBit, Integer.MAX_VALUE);
Contributor

Did you test multiple sizes? I guess recall is important here, so we should aim for a recall of 1?

Contributor Author

We always use 128 for clustersPerNeighborhood. While ideally recall should be close to 1, the tests do not show a loss of quality in the centroids.

Contributor Author

Do you mean we should oversample here, for example use 2 * clustersPerNeighborhood to make sure we always get the top clustersPerNeighborhood?

Member

Do you mean we should oversample here, for example use 2 * clustersPerNeighborhood to make sure we always get the top clustersPerNeighborhood?

Generally, the approximation knob for HNSW is efSearch, which in this case amounts to an oversample. I am not sure 2x is required, but it should probably be more than just the number of nearest neighbours we care about.
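The oversampling idea can be sketched as follows (names hypothetical, not the PR's code): widen the approximate search budget by a factor, then truncate the returned candidates back to the requested neighbourhood size. The wider frontier makes it more likely that the true top-k survive the approximation.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical sketch of efSearch-style oversampling: ask the approximate
// search for more candidates than needed, then keep only the best k.
class Oversample {
    static final int OVERSAMPLE = 2;

    // Budget handed to the approximate search instead of the bare k.
    static int searchBudget(int clustersPerNeighborhood) {
        return OVERSAMPLE * clustersPerNeighborhood;
    }

    // candidates: ordinals returned by the approximate search, unordered.
    // scores: score per ordinal, higher is better. Returns the best k ordinals.
    static int[] topK(int[] candidates, float[] scores, int k) {
        Integer[] boxed = Arrays.stream(candidates).boxed().toArray(Integer[]::new);
        Arrays.sort(boxed, Comparator.comparingDouble(o -> -scores[o]));
        int[] result = new int[Math.min(k, boxed.length)];
        for (int i = 0; i < result.length; i++) {
            result[i] = boxed[i];
        }
        return result;
    }
}
```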

Contributor

Do we really need 128 no matter what number of centroids we have? Reducing this value when we have a small number of centroids could make the graph strategy applicable earlier.

Member

I do think it should be scaled down for a lower value when there are fewer centroids. I do not know what that value would be.

The number is coupled to the recursive cluster splits to help capture potentially mis-assigned vectors along the split edges.

@benwtrent (Member) left a comment:

Great stuff! I too have concerns about the graph parameters: lower M, higher ef_construction. Then I think we need a "custom collector" that can behave semi-optimally for us (at a minimum, allow resource reuse).

@john-wagster (Contributor) left a comment:

It's impressive; with the results posted it seems valuable as is. I didn't notice any glaring issues. LGTM.

@iverase (Contributor, Author) commented Sep 4, 2025

Pushed a new version where:

  1. We construct the graph with M=8 and EF_CONSTRUCTION=150.
  2. We introduce a ReusableKnnCollector and drop the SingleBit implementation.
  3. We oversample the collector by 2.

Interestingly, this has not lowered the number of centroids required to amortise the construction of the graph; it is still around 15k vectors. On the other hand, it is faster for high numbers of vectors:

Benchmark                         (dims)  (numVectors)  Mode  Cnt   Score    Error  Units
ComputeNeighboursBenchmark.graph     384          1000  avgt    3   0.227 ±  0.126   s/op
ComputeNeighboursBenchmark.graph     384          2000  avgt    3   0.591 ±  0.038   s/op
ComputeNeighboursBenchmark.graph     384          3000  avgt    3   1.059 ±  0.354   s/op
ComputeNeighboursBenchmark.graph     384          5000  avgt    3   2.112 ±  0.576   s/op
ComputeNeighboursBenchmark.graph     384         10000  avgt    3   5.347 ±  9.749   s/op
ComputeNeighboursBenchmark.graph     384         20000  avgt    3  12.969 ±  1.577   s/op
ComputeNeighboursBenchmark.graph     384         50000  avgt    3  46.449 ± 33.010   s/op
ComputeNeighboursBenchmark.graph     782          1000  avgt    3   0.293 ±  0.016   s/op
ComputeNeighboursBenchmark.graph     782          2000  avgt    3   0.810 ±  0.036   s/op
ComputeNeighboursBenchmark.graph     782          3000  avgt    3   1.481 ±  0.706   s/op
ComputeNeighboursBenchmark.graph     782          5000  avgt    3   3.125 ±  1.140   s/op
ComputeNeighboursBenchmark.graph     782         10000  avgt    3   8.380 ±  1.535   s/op
ComputeNeighboursBenchmark.graph     782         20000  avgt    3  21.029 ±  3.780   s/op
ComputeNeighboursBenchmark.graph     782         50000  avgt    3  73.378 ± 44.400   s/op
ComputeNeighboursBenchmark.graph    1024          1000  avgt    3   0.356 ±  0.415   s/op
ComputeNeighboursBenchmark.graph    1024          2000  avgt    3   0.952 ±  0.268   s/op
ComputeNeighboursBenchmark.graph    1024          3000  avgt    3   1.839 ±  0.657   s/op
ComputeNeighboursBenchmark.graph    1024          5000  avgt    3   3.823 ±  5.224   s/op
ComputeNeighboursBenchmark.graph    1024         10000  avgt    3  10.132 ±  2.812   s/op
ComputeNeighboursBenchmark.graph    1024         20000  avgt    3  25.402 ±  3.602   s/op
ComputeNeighboursBenchmark.graph    1024         50000  avgt    3  85.614 ±  9.996   s/op

Indexing the GloVe data is faster too:

index_name                         index_type  num_docs  index_time(ms)  force_merge_time(ms)  num_segments
---------------------------------  ----------  --------  --------------  --------------------  ------------  
enwiki-20120502-lines-1k-200d.vec         ivf  10000000          179285                702674             1

index_name                         index_type  visit_percentage(%)  latency(ms)  net_cpu_time(ms)  avg_cpu_count     QPS  recall    visited  filter_selectivity
---------------------------------  ----------  -------------------  -----------  ----------------  -------------  ------  ------  ---------  ------------------  
enwiki-20120502-lines-1k-200d.vec         ivf                 0.25         2.66              0.00           0.00  375.94    0.85   50552.81                1.00
enwiki-20120502-lines-1k-200d.vec         ivf                 0.50         4.41              0.00           0.00  226.76    0.88  100560.75                1.00
enwiki-20120502-lines-1k-200d.vec         ivf                 0.75         5.58              0.00           0.00  179.21    0.90  150558.51                1.00
enwiki-20120502-lines-1k-200d.vec         ivf                 1.00         6.55              0.00           0.00  152.67    0.91  200559.44                1.00
enwiki-20120502-lines-1k-200d.vec         ivf                 1.50         9.26              0.00           0.00  107.99    0.92  300557.77                1.00
enwiki-20120502-lines-1k-200d.vec         ivf                 2.00        11.88              0.00           0.00   84.18    0.92  400557.32                1.00
enwiki-20120502-lines-1k-200d.vec         ivf                 2.50        14.61              0.00           0.00   68.45    0.93  500552.98                1.00
enwiki-20120502-lines-1k-200d.vec         ivf                 0.00         7.70              0.00           0.00  129.87    0.91  245554.05                1.00
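The reusable-collector idea from the list above can be sketched like this (hypothetical class, not the PR's ReusableKnnCollector): one bounded min-heap that is cleared between searches, so the per-centroid search loop allocates nothing.

```java
import java.util.PriorityQueue;

// Hypothetical sketch of a reusable top-k collector: a single bounded
// min-heap by score, reset between searches instead of allocating a fresh
// collector for every centroid.
class ReusableTopK {
    private final int k;
    private final PriorityQueue<float[]> heap; // entries are {score, ordinal}

    ReusableTopK(int k) {
        this.k = k;
        this.heap = new PriorityQueue<>(k, (a, b) -> Float.compare(a[0], b[0]));
    }

    // Clear state so the collector can be reused for the next search.
    void reset() {
        heap.clear();
    }

    // Offer a candidate; evict the current worst if the heap is full.
    void collect(int ordinal, float score) {
        if (heap.size() < k) {
            heap.add(new float[] { score, ordinal });
        } else if (score > heap.peek()[0]) {
            heap.poll();
            heap.add(new float[] { score, ordinal });
        }
    }

    // Ordinals of the collected top-k (sorted ascending for determinism).
    int[] ordinals() {
        return heap.stream().mapToInt(e -> (int) e[1]).sorted().toArray();
    }
}
```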

@benwtrent (Member) left a comment:

I think this is great!

@iverase iverase merged commit 45842f8 into elastic:main Sep 5, 2025
33 checks passed
@iverase iverase deleted the computeNeighbours branch September 5, 2025 06:38

Labels

>non-issue, :Search Relevance/Vectors (Vector search), Team:Search Relevance (meta label for the Search Relevance team in Elasticsearch), v9.2.0


5 participants