[DiskBBQ] Use a HNSW graph to compute neighbours #134109
Conversation
Pinging @elastic/es-search-relevance (Team:Search Relevance)
Here are the results of the processing time for both algorithms with different numbers of vectors and different dimensions:

What I have observed is that in low-memory scenarios, hierarchical k-means becomes the expensive part because of the random access pattern during slicing.
This is great and something we'll rely on for many other steps, since we use this centroid search everywhere. I left some comments on the parametrization; it would be nice to publish your macro benchmark too.
        return this;
    }
};
final OnHeapHnswGraph graph = HnswGraphBuilder.create(supplier, 16, 100, 42L).build(centers.length);
I think it's worth spending a bit more time on optimising this. In my testing, M=8 had the best ratio of recall to visited percentage, so it might be beneficial to publish your macro benchmark.
I refactored the code in 5034bca to publish the benchmark.
I do think that M=8 with a larger beamWidth is worth trying, e.g. (8, 150).
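Not from the PR, but a minimal sketch of how the construction call could be parameterised so that (M, beamWidth) pairs such as (16, 100) and (8, 150) can be compared in the macro benchmark. The CentroidGraphSketch class and buildCentroidGraph helper are hypothetical and the import paths assume a recent Lucene version; only the HnswGraphBuilder.create(...).build(...) call mirrors the hunk quoted above.

    import java.io.IOException;
    import org.apache.lucene.util.hnsw.HnswGraphBuilder;
    import org.apache.lucene.util.hnsw.OnHeapHnswGraph;
    import org.apache.lucene.util.hnsw.RandomVectorScorerSupplier;

    final class CentroidGraphSketch {
        // Hypothetical helper: same construction call as the PR, with the two tuning
        // knobs (M and beamWidth) lifted into parameters for benchmarking.
        static OnHeapHnswGraph buildCentroidGraph(
            RandomVectorScorerSupplier supplier,
            int numCentroids,
            int maxConn,   // M: neighbours kept per node, 16 in the current code, 8 suggested above
            int beamWidth  // construction beam width, 100 in the current code, 150 suggested above
        ) throws IOException {
            return HnswGraphBuilder.create(supplier, maxConn, beamWidth, 42L).build(numCentroids);
        }
    }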
private NeighborHood[] computeNeighborhoods(float[][] centers, int clustersPerNeighborhood) throws IOException {
    assert centers.length > clustersPerNeighborhood;
    // experiments show that below 15k we better use brute force, otherwise hnsw gives us a nice speed up
    if (centers.length < 15_000) {
I think we can optimise the graph to work better at lower scale, but this is good as a first threshold. With 64 vectors per centroid, 15k centroids corresponds to segments of more than roughly 1M vectors.
Reducing the number of connections could make this threshold smaller.
Agree. I didn't spend too much time on it because it seems pretty fast for low values (a few seconds), so I wonder whether there is a need to optimize those cases.
I think just picking something "good enough" is alright. It provides a nice improvement and any optimizations we make won't be "format breaking" :)
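For context on why a crossover point exists at all: the brute-force path (not shown in this hunk) scores every centroid against every other centroid, which is quadratic in the number of centroids. A rough sketch of that approach, not the PR's actual implementation, with squareDistance as a local stand-in for the real vector scorer:

    import java.util.PriorityQueue;

    final class BruteForceNeighborhoodsSketch {
        // For each centroid, keep the clustersPerNeighborhood closest other centroids.
        // Cost is O(numCentroids^2) distance computations, which is why it stops being
        // competitive once the centroid count grows into the tens of thousands.
        static int[][] bruteForceNeighborhoods(float[][] centers, int clustersPerNeighborhood) {
            record Scored(int ord, float dist) {}
            int[][] neighborhoods = new int[centers.length][];
            for (int i = 0; i < centers.length; i++) {
                // max-heap on distance so the farthest of the current top-k is evicted first
                PriorityQueue<Scored> topK = new PriorityQueue<>((a, b) -> Float.compare(b.dist(), a.dist()));
                for (int j = 0; j < centers.length; j++) {
                    if (j == i) continue;
                    topK.offer(new Scored(j, squareDistance(centers[i], centers[j])));
                    if (topK.size() > clustersPerNeighborhood) {
                        topK.poll(); // drop the farthest candidate
                    }
                }
                // order within a neighbourhood is not significant for this sketch
                neighborhoods[i] = topK.stream().mapToInt(Scored::ord).toArray();
            }
            return neighborhoods;
        }

        // plain squared Euclidean distance, used only for this sketch
        static float squareDistance(float[] a, float[] b) {
            float sum = 0f;
            for (int d = 0; d < a.length; d++) {
                float diff = a[d] - b[d];
                sum += diff * diff;
            }
            return sum;
        }
    }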
for (int i = 0; i < centers.length; i++) {
    scorer.setScoringOrdinal(i);
    singleBit.indexSet = i;
    final KnnCollector collector = HnswGraphSearcher.search(scorer, clustersPerNeighborhood, graph, singleBit, Integer.MAX_VALUE);
Did you test multiple sizes? I guess recall is important here, so we should aim for a recall of 1?
We always use 128 for clustersPerNeighborhood. While ideally recall should be close to 1, the tests do not show a loss of quality in the centroids.
Do you mean we should oversample here, for example use 2 * clustersPerNeighborhood to make sure we always get the top clustersPerNeighborhood?
> Do you mean we should oversample here, for example use 2 * clustersPerNeighborhood to make sure we always get the top clustersPerNeighborhood?
Generally, your approximate measure for HNSW is efSearch, which would be in this case an oversample. I am not sure 2x is required, but possibly more than just the number of nearest neighbors we care about.
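A minimal sketch of that oversampling idea, continuing the loop quoted above (scorer, graph, singleBit and clustersPerNeighborhood are the names from the hunk; the 2x factor is just the value floated here, not something settled in the PR):

    // Search with a larger ef and keep only the best clustersPerNeighborhood hits,
    // trading a bit more graph traversal for recall closer to 1.
    final int ef = 2 * clustersPerNeighborhood;  // hypothetical oversample factor
    final KnnCollector collector = HnswGraphSearcher.search(scorer, ef, graph, singleBit, Integer.MAX_VALUE);
    final TopDocs topDocs = collector.topDocs(); // hits ordered best first
    final int k = Math.min(clustersPerNeighborhood, topDocs.scoreDocs.length);
    final int[] neighborhood = new int[k];
    for (int n = 0; n < k; n++) {
        neighborhood[n] = topDocs.scoreDocs[n].doc; // centroid ordinal of the n-th closest neighbour
    }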
Do we really need 128 no matter what number of centroids we have? Reducing this value when we have a small number of centroids could make the graph strategy applicable earlier.
I do think it should be scaled down to a lower value when there are fewer centroids. I do not know what that value would be.
The number is coupled to the recursive cluster splits to help capture potentially mis-assigned vectors along the split edges.
Great stuff! I too have concerns about the graph parameters (lower M, higher ef_construction). Then I think we need a "custom collector" that can behave semi-optimally for us (at a minimum, allowing resource reuse).
        
          
server/src/main/java/org/elasticsearch/index/codec/vectors/cluster/NeighborHood.java: three outdated review threads (resolved).
It's impressive, and with the results posted it seems valuable as is. I didn't notice any glaring issues; lgtm.
Pushed a new version where:

Interestingly, this has not dropped the number of centroids required to amortise the construction of the graph; it is still around 15k centroids. On the other hand, it is faster for a high number of vectors: Indexing glove data is faster too:
I think this is great!
When the number of centroids increases, the bottleneck in the k-means implementation becomes the computation of neighbours. For example, force merging 10 million glove vectors with 200 dimensions and a target size of 64 ends up with around 150k centroids. When we look at where the time is spent, we see that 80% of it goes into computing the neighbours:
We are using brute force to compute the neighbours, so it is expected not to scale well with the number of centroids. This PR proposes to use an HNSW structure to compute the neighbours when the number of centroids is over a threshold. After this change, merging becomes much faster, as computing the neighbours now takes less than 10% of the time:
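As a rough back-of-the-envelope (not a measurement from this PR): with ~150k centroids, brute force performs on the order of 150,000 × 150,000 ≈ 2.25 × 10^10 centroid-to-centroid comparisons, while an HNSW search typically visits on the order of hundreds to low thousands of candidates per query, i.e. roughly 150,000 × 10^3 ≈ 1.5 × 10^8 comparisons in total, about two orders of magnitude less work for the neighbour step.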
I ran some experiments that indicate that the overhead of building the HNSW graph is not worth it until we have around 15,000 centroids; from that point on we benefit from using the HNSW structure.
For example, indexing and force merging 10 million glove vectors with 200 dimensions and a target size of 64 in main looks like:
With this PR it looks like:
That is an increase of around 3x in indexing and merging throughput with no change in recall.