
Conversation

@irevoire
Contributor

Pull Request

Related issue

Fixes #81

What does this PR do?

  • After looking into annoy, it turns out we were ignoring the bias for the euclidean and manhattan distances; using the bias, the recall improved a little on small-dimension datasets:
    On main:
    image
    After this PR:
    image

Warning

Currently, this doesn't apply to the angular distance; we need to read the annoy code further to understand what they're doing with the norm and whether we're doing everything correctly.

And for the binary quantization?

Currently, I don't really know how we should implement that for the binary quantization. I did a few tests here, and some of them improved the recall by almost 10 points in some cases:
image

@irevoire added the enhancement, tech exploration, indexing, relevancy, and db-breaking labels on Sep 24, 2024
@nnethercott
Contributor

Yo @irevoire, I revived this PR in a fork because it's mathematically concerning that the bias still isn't being used in arroy.

I'll drop my thoughts below (sorry for the wall of text)

TL;DR: for large datasets with a moderate n_trees, the performance gaps are big. If you use tons of trees, though, you can generally overcome that gap (which is what arroy currently does, so it might be OK)

understanding the bias term

When the arroy::Writer is indexing, it builds separating hyperplanes to split leaves into two clusters. First, two_means finds the centroids $\{c_{0}, c_{1}\}$, and then we define a plane passing through their midpoint with unit normal $n$ such that $n \cdot c_{0}>0$ and $n \cdot c_{1}<0$.

A hyperplane is the set of points satisfying $n \cdot x+b=0$. Since our plane passes through the midpoint of the centroids, we find the bias by plugging that point in:

$$\begin{align*} n \cdot (c_{0} + c_{1})/2 + b &= 0 \\ \implies - n \cdot (c_{0} + c_{1})/2 &= b \end{align*}$$

That's the same as what's in arroy (but currently it's ignored)

Here's what the plane should look like (bias = $- n \cdot (c_{0} + c_{1})/2$):
scene

Here's what it currently looks like (bias=0):
scene_no_bias
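The construction above can be sketched in code (hypothetical helper names, not arroy's actual API):

```rust
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Returns (normal, bias) such that normal . x + bias = 0 on the plane.
/// Hypothetical helper, not arroy's actual code.
fn hyperplane(c0: &[f32], c1: &[f32]) -> (Vec<f32>, f32) {
    // Unit normal pointing from c1 towards c0, so n . c0 > 0 and n . c1 < 0.
    let diff: Vec<f32> = c0.iter().zip(c1).map(|(a, b)| a - b).collect();
    let norm = dot(&diff, &diff).sqrt();
    let n: Vec<f32> = diff.iter().map(|x| x / norm).collect();
    // Plug the midpoint (c0 + c1)/2 into n . x + b = 0 and solve for b.
    let midpoint: Vec<f32> = c0.iter().zip(c1).map(|(a, b)| (a + b) / 2.0).collect();
    let b = -dot(&n, &midpoint);
    (n, b)
}

fn main() {
    // Toy centroids on the x-axis: the separating plane should be x = 2.
    let (n, b) = hyperplane(&[1.0, 0.0], &[3.0, 0.0]);
    assert!((n[0] + 1.0).abs() < 1e-6 && (b - 2.0).abs() < 1e-6);
    // The centroids land on opposite sides of the plane.
    assert!(dot(&n, &[1.0, 0.0]) + b > 0.0);
    assert!(dot(&n, &[3.0, 0.0]) + b < 0.0);
}
```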

consequences

  • the margin for two points belonging to separate clusters can be the same
  • we potentially get poor splits at every iteration since the hyperplane doesn't necessarily cut the leaf subset in half
    • leads to substantially longer indexing times due to more recursive calls until fit_in_descendant evaluates to true
  • pointless computation in two_means since we're only keeping angular info; the "separating" plane we build may put all elements on the same side!
    • random projections are totally valid as a hash; we could accomplish this with less compute than what we're doing here
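To make the same-side failure mode concrete, here's a toy example (made-up numbers, not taken from arroy) where dropping the bias leaves both centroids on the same side of the plane:

```rust
fn dot(a: [f32; 2], b: [f32; 2]) -> f32 {
    a[0] * b[0] + a[1] * b[1]
}

fn main() {
    // Two cluster centroids far from the origin, along the x-axis.
    let (c0, c1) = ([1.0, 0.0], [3.0, 0.0]);
    let n = [-1.0, 0.0]; // unit normal pointing from c1 towards c0
    let bias = -dot(n, [2.0, 0.0]); // plane through the midpoint (2, 0)

    // With the bias, the centroids land on opposite sides of the plane...
    assert!(dot(n, c0) + bias > 0.0 && dot(n, c1) + bias < 0.0);
    // ...with bias = 0, the plane passes through the origin and both
    // centroids land on the same (negative) side: the "split" separates nothing.
    assert!(dot(n, c0) < 0.0 && dot(n, c1) < 0.0);
}
```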

Note

This is generally OK for Cosine and Dot distances since the data is normalized to the unit circle before indexing so the bias is close to 0 anyways. We're more worried about Euclidean and Manhattan...

results

impact on recall@k

We'll use the vector store benchmark for this. In the case where |dataset| >> n_trees, using the bias generally outperforms the status quo. We purposely limit the number of trees to assess the quality of the learned SplitPlaneNormals.
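For reference, recall@k here means the fraction of the exact top-k neighbours that the approximate search also returned; a minimal sketch (hypothetical function, not the benchmark's code):

```rust
/// Fraction of the exact top-k ids that also appear in the approximate
/// result set. Hypothetical helper for illustration only.
fn recall_at_k(exact: &[u32], approx: &[u32]) -> f64 {
    let hits = exact.iter().filter(|id| approx.contains(id)).count();
    hits as f64 / exact.len() as f64
}

fn main() {
    // Made-up ids: 4 of the 5 true neighbours were retrieved.
    let exact = [1, 2, 3, 4, 5];
    let approx = [1, 2, 3, 4, 9];
    assert_eq!(recall_at_k(&exact, &approx), 0.8);
}
```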

|dataset|=100_000, n_trees = 1


On main:
main_100k_1

After this PR:
pr_100k_1

|dataset|=100_000, n_trees = 100


On main:
main_100k_100

After this PR:
pr_100k_100

|dataset|=10_000, n_trees = 1


On main:
main_10k_1

After this PR:
pr_10k_1

|dataset|=10_000, n_trees = 100


On main:
main_10k_100

After this PR:
pr_10k_100

impact on indexing time

We'll just look at one dataset and the Euclidean distance since the idea is more or less the same for other datasets/distances. We expect main to be slower since imbalanced splits (e.g. those arising from not using the bias) require more recursive branching.

Here are the results for indexing 100000 vectors of dimension 768:

|      | n_trees=1 | n_trees=10 | n_trees=100 | n_trees=500 |
|------|-----------|------------|-------------|-------------|
| main | 2.08s     | 5.93s      | 55.94s      | 649.31s     |
| pr   | 3.44s     | 5.12s      | 48.82s      | 225.18s     |

a weird thing

If we run the benchmark with the default number of trees, the approach that properly uses the bias actually underperforms what's currently in arroy:

On main:
main_10k_768

After this PR:
pr_10k_768

My guess is that arroy overcomes poor-quality trees with brute-force search (high default number of trees, high oversampling), so the weak learners ensemble better than the trees produced by using the bias correctly.

Warning

The target_n_trees is proportional to the embedding dimension, irrespective of the number of vectors being indexed.

from the code:

```rust
// 1. How many descendants do we need:
let descendant_required = item_indices.len() / dimensions;
// 2. Find the number of tree nodes required per tree:
let tree_nodes_per_tree = descendant_required + 1;
// 3. Find the number of trees required to get as many tree nodes as items:
let mut nb_trees = item_indices.len() / tree_nodes_per_tree;
```

In the above, nb_trees = n/((n/d) + 1) = d/(1 + d/n) ≈ d for any collection we're indexing with n >> d. So for my test above with 10k vectors of dimension 768, we're using over 700 trees by default. This might explain the performance difference.
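The snippet above reduces to a small function we can check numerically (hypothetical function name, same integer arithmetic as the quoted code):

```rust
/// Default tree count as computed in the quoted arroy snippet.
/// Hypothetical wrapper for illustration.
fn default_nb_trees(n_items: usize, dimensions: usize) -> usize {
    let descendant_required = n_items / dimensions; // 1. descendants needed
    let tree_nodes_per_tree = descendant_required + 1; // 2. nodes per tree
    n_items / tree_nodes_per_tree // 3. number of trees
}

fn main() {
    // n >> d: nb_trees approaches d (here 714 out of a possible 768).
    assert_eq!(default_nb_trees(10_000, 768), 714);
    // For much larger n it gets even closer to d.
    assert_eq!(default_nb_trees(1_000_000, 768), 767);
}
```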

conclusions

  • for large datasets, using the bias generally yields significant performance improvements in terms of both recall and indexing time
  • in cases where increasing n_trees is not feasible, using the bias properly should produce higher-quality trees
  • performance gains depend on the dataset and the structure of the embeddings

Very curious to hear your thoughts on this

@irevoire
Contributor Author

Oh, awesome investigation. Thanks again for your work, @nnethercott.
It also looks like it would solve #59 (maybe not, but it could, ahah).

The truth is, we only use cosine and its binary-quantized version in Meilisearch, so I didn't spend much time on the issue. However, I'm totally up for merging your PR.

leads to substantially longer indexing times due to more recursive calls until fit_in_descendant evaluates to true

It also leads to longer search requests, which is the biggest issue for us here 😭

pointless computation in two_means since we're only keeping angular info; the "separating" plane we build may put all elements on the same side!
random projections are totally valid as a hash; we could accomplish this with less compute than what we're doing here

Not sure I understood these two points, sorry 🤔

The target_n_trees is proportional to the embedding dimension, irrespective of the number of vectors being indexed.

Yep, that's an issue as well. I think there is a bug somewhere, but I couldn't find it.
For context, when we released the first version of arroy, we did a "simplification" of what annoy was doing.
And a few weeks ago I tried to compute ahead of time what we were doing and noticed that the more documents we had, the more everything simplified to the number of dimensions.
This is a big issue for large datasets, I believe, but I didn't take the time to go back to what annoy was doing and see how it could be computed ahead of time 😔
Anyway, if you think you can fix it, go ahead, and I'd gladly merge a PR on that.

---

TL;DR: there was an issue which I knew about but never had the time to work on.
Experimentally, I didn't see any large gain for the cosine distance, either in terms of indexing time or relevancy, so I gave up on this lead pretty quickly, but I'm still very interested in merging these changes.

@nnethercott
Contributor

Experimentally, I didn't see any large gain for the cosine distance, either in terms of indexing time or relevancy, so I gave up on this lead pretty quickly, but I'm still very interested in merging these changes.

ok nice, I'll open the PR then :)

And a few weeks ago I tried to compute ahead of time what we were doing and noticed that the more documents we had, everything was simplifying to the number of dimensions.

I'll take a look here too. I think the easiest fix is just to make nb_trees proportional to sqrt(data.len()) or something like that, but then again, choosing hyperparams is always more of an art than a science haha
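That hypothetical heuristic would look something like this (the sqrt choice is a guess under discussion, not a tested default):

```rust
/// Tree count growing with the dataset size instead of the embedding
/// dimension. Hypothetical alternative heuristic, not arroy's code.
fn sqrt_nb_trees(n_items: usize) -> usize {
    (n_items as f64).sqrt().round() as usize
}

fn main() {
    // Unlike the current default, this stays small for small datasets
    // regardless of dimension.
    assert_eq!(sqrt_nb_trees(10_000), 100);
    assert_eq!(sqrt_nb_trees(100_000), 316); // sqrt(100_000) ≈ 316.23
}
```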



Successfully merging this pull request may close these issues.

Dead code to investigate
