
Conversation

@irevoire
Contributor

Pull Request

Related issue

Fixes #81

What does this PR do?

  • After looking into annoy, it turns out we were ignoring the bias for the euclidean and manhattan distances; using the bias, the recall improved a little on small-dimension datasets:
    On main:
    image
    After this PR:
    image

Warning

Currently, this doesn't apply to the angular distance; we need to read the annoy code further to understand what they're doing with the norm and whether we're doing everything correctly.

And for the binary quantization?

Currently, I don't really know how we should implement that for the binary quantization. I did a few tests here, and some of them improved the recall by almost 10 points in some cases:
image

@irevoire added the enhancement, tech exploration, indexing, relevancy, and db-breaking labels on Sep 24, 2024
@nnethercott
Contributor

Yo @irevoire, I revived this PR in a fork because it's mathematically concerning that the bias still isn't being used in arroy.

I'll drop my thoughts below (sorry for the wall of text)

TL;DR: for large datasets with a moderate n_trees, the performance gaps are big. If you use tons of trees, though, you can generally overcome that gap (which is what arroy currently does, so it might be OK)

understanding the bias term

When the arroy::Writer is indexing, it builds separating hyperplanes to split leaves into two clusters. First, two_means finds the centroids $\{c_{0}, c_{1}\}$, and then we define a plane passing through their midpoint with unit normal $n$ such that $n \cdot c_{0}>0$ and $n \cdot c_{1}<0$.

A hyperplane is the set of points satisfying $n \cdot x+b=0$. Since our plane passes through the midpoint of the centroids, we find the bias by plugging that point in:

$$\begin{align*} n \cdot (c_{0} + c_{1})/2 + b &= 0 \\ \implies - n \cdot (c_{0} + c_{1})/2 &= b \end{align*}$$

That's the same as what's in arroy (but currently it's ignored)

Here's what the plane should look like (bias = $- n \cdot (c_{0} + c_{1})/2$):
scene

Here's what it currently looks like (bias=0):
scene_no_bias
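The construction above can be sketched in code (hypothetical helper names, not arroy's actual API):

```rust
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Returns (normal, bias) such that normal . x + bias = 0 on the plane.
/// Hypothetical helper, not arroy's actual code.
fn hyperplane(c0: &[f32], c1: &[f32]) -> (Vec<f32>, f32) {
    // Unit normal pointing from c1 towards c0, so n . c0 > 0 and n . c1 < 0.
    let diff: Vec<f32> = c0.iter().zip(c1).map(|(a, b)| a - b).collect();
    let norm = dot(&diff, &diff).sqrt();
    let n: Vec<f32> = diff.iter().map(|x| x / norm).collect();
    // Plug the midpoint (c0 + c1)/2 into n . x + b = 0 and solve for b.
    let midpoint: Vec<f32> = c0.iter().zip(c1).map(|(a, b)| (a + b) / 2.0).collect();
    let b = -dot(&n, &midpoint);
    (n, b)
}

fn main() {
    // Toy centroids on the x-axis: the separating plane should be x = 2.
    let (n, b) = hyperplane(&[1.0, 0.0], &[3.0, 0.0]);
    assert!((n[0] + 1.0).abs() < 1e-6 && (b - 2.0).abs() < 1e-6);
    // The centroids land on opposite sides of the plane.
    assert!(dot(&n, &[1.0, 0.0]) + b > 0.0);
    assert!(dot(&n, &[3.0, 0.0]) + b < 0.0);
}
```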

consequences

  • the margin for two points belonging to separate clusters can be the same
  • we potentially get poor splits at every iteration since the hyperplane doesn't necessarily cut the leaf subset in half
    • leads to substantially longer indexing times due to more recursive calls until fit_in_descendant evaluates to true
  • pointless computation in two_means since we're only keeping angular info; the "separating" plane we build may put all elements on the same side!
    • random projections are totally valid as a hash; we could accomplish this with less compute than what we're doing here
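To make the same-side failure mode concrete, here's a toy example (made-up numbers, not taken from arroy) where dropping the bias leaves both centroids on the same side of the plane:

```rust
fn dot(a: [f32; 2], b: [f32; 2]) -> f32 {
    a[0] * b[0] + a[1] * b[1]
}

fn main() {
    // Two cluster centroids far from the origin, along the x-axis.
    let (c0, c1) = ([1.0, 0.0], [3.0, 0.0]);
    let n = [-1.0, 0.0]; // unit normal pointing from c1 towards c0
    let bias = -dot(n, [2.0, 0.0]); // plane through the midpoint (2, 0)

    // With the bias, the centroids land on opposite sides of the plane...
    assert!(dot(n, c0) + bias > 0.0 && dot(n, c1) + bias < 0.0);
    // ...with bias = 0, the plane passes through the origin and both
    // centroids land on the same (negative) side: the "split" separates nothing.
    assert!(dot(n, c0) < 0.0 && dot(n, c1) < 0.0);
}
```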

Note

This is generally OK for Cosine and Dot distances since the data is normalized to the unit circle before indexing so the bias is close to 0 anyways. We're more worried about Euclidean and Manhattan...

results

impact on recall@k

We'll use the vector store benchmark for this. In the case where |dataset| >> n_trees, using the bias generally outperforms the status quo. We purposely limit the number of trees to assess the quality of the learned SplitPlaneNormals.
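For reference, recall@k here means the fraction of the exact top-k neighbours that the approximate search also returned; a minimal sketch (hypothetical function, not the benchmark's code):

```rust
/// Fraction of the exact top-k ids that also appear in the approximate
/// result set. Hypothetical helper for illustration only.
fn recall_at_k(exact: &[u32], approx: &[u32]) -> f64 {
    let hits = exact.iter().filter(|id| approx.contains(id)).count();
    hits as f64 / exact.len() as f64
}

fn main() {
    // Made-up ids: 4 of the 5 true neighbours were retrieved.
    let exact = [1, 2, 3, 4, 5];
    let approx = [1, 2, 3, 4, 9];
    assert_eq!(recall_at_k(&exact, &approx), 0.8);
}
```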

|dataset|=100_000, n_trees = 1


On main:
main_100k_1

After this PR:
pr_100k_1

|dataset|=100_000, n_trees = 100


On main:
main_100k_100

After this PR:
pr_100k_100

|dataset|=10_000, n_trees = 1


On main:
main_10k_1

After this PR:
pr_10k_1

|dataset|=10_000, n_trees = 100


On main:
main_10k_100

After this PR:
pr_10k_100

impact on indexing time

We'll just look at one dataset and the Euclidean distance since the idea is more or less the same for other datasets/distances. We expect main to be slower since imbalanced splits (e.g. those arising from not using the bias) require more recursive branching.

Here are the results for indexing 100000 vectors of dimension 768:

|      | n_trees=1 | n_trees=10 | n_trees=100 | n_trees=500 |
|------|-----------|------------|-------------|-------------|
| main | 2.08s     | 5.93s      | 55.94s      | 649.31s     |
| pr   | 3.44s     | 5.12s      | 48.82s      | 225.18s     |

a weird thing

If we run the benchmark with the default number of trees, the approach that properly uses the bias actually underperforms what's currently in arroy:

On main:
main_10k_768

After this PR:
pr_10k_768

My guess is that arroy overcomes poor-quality trees with brute-force search (high default number of trees, high oversampling), so the weak learners ensemble better than the trees produced by using the bias correctly.

Warning

The target_n_trees is proportional to the embedding dimension, irrespective of the number of vectors being indexed.

from the code:

```rust
// 1. How many descendants do we need:
let descendant_required = item_indices.len() / dimensions;
// 2. Find the number of tree nodes required per tree:
let tree_nodes_per_tree = descendant_required + 1;
// 3. Find the number of trees required to get as many tree nodes as items:
let mut nb_trees = item_indices.len() / tree_nodes_per_tree;
```

In the above, nb_trees = n/((n/d) + 1) = d/(1 + d/n) ≈ d for any collection we're indexing with n >> d. So for my test above with 10k vectors of dimension 768, we're using over 700 trees by default. This might explain the performance difference.
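The snippet above reduces to a small function we can check numerically (hypothetical function name, same integer arithmetic as the quoted code):

```rust
/// Default tree count as computed in the quoted arroy snippet.
/// Hypothetical wrapper for illustration.
fn default_nb_trees(n_items: usize, dimensions: usize) -> usize {
    let descendant_required = n_items / dimensions; // 1. descendants needed
    let tree_nodes_per_tree = descendant_required + 1; // 2. nodes per tree
    n_items / tree_nodes_per_tree // 3. number of trees
}

fn main() {
    // n >> d: nb_trees approaches d (here 714 out of a possible 768).
    assert_eq!(default_nb_trees(10_000, 768), 714);
    // For much larger n it gets even closer to d.
    assert_eq!(default_nb_trees(1_000_000, 768), 767);
}
```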

conclusions

  • for large datasets, using the bias generally yields significant performance improvements in terms of both recall and indexing time
  • in cases where increasing n_trees is not feasible, using the bias properly should produce higher-quality trees
  • performance gains depend on the dataset and the structure of the embeddings

Very curious to hear your thoughts on this

@irevoire
Contributor Author

Oh, awesome investigation. Thanks again for your work, @nnethercott.
It also looks like it would solve #59 (maybe not, but it could, ahah).

The truth is, we only use cosine and its binary-quantized version in Meilisearch, so I didn't spend much time on the issue. However, I'm totally up for merging your PR.

leads to substantially longer indexing times due to more recursive calls until fit_in_descendant evaluates to true

It also leads to longer search requests, which is the biggest issue for us here 😭

pointless computation in two_means since we're only keeping angular info; the "separating" plane we build may put all elements on the same side!
random projections are totally valid as a hash; we could accomplish this with less compute than what we're doing here

Not sure I understood these two points, sorry 🤔

The target_n_trees is proportional to the embedding dimension, irrespective of the number of vectors being indexed.

Yep, that's an issue as well. I think there is a bug somewhere, but I couldn't find it.
For context, when we released the first version of arroy, we did a "simplification" of what annoy was doing.
And a few weeks ago I tried to compute ahead of time what we were doing and noticed that the more documents we had, the more everything simplified to the number of dimensions.
This is a big issue for large datasets, I believe, but I didn't take the time to go back to what annoy was doing and see how it could be computed ahead of time 😔
Anyway, if you think you can fix it, go ahead, and I'd gladly merge a PR on that.

---

TL;DR: there was an issue which I knew about but never had the time to work on.
Experimentally, I didn't see any large gain for the cosine distance, either in terms of indexing time or relevancy, so I gave up on this lead pretty quickly, but I'm still very interested in merging these changes.

@nnethercott
Contributor

Experimentally, I didn't see any large gain for the cosine distance, either in terms of indexing time or relevancy, so I gave up on this lead pretty quickly, but I'm still very interested in merging these changes.

ok nice, I'll open the PR then :)

And a few weeks ago I tried to compute ahead of time what we were doing and noticed that the more documents we had, everything was simplifying to the number of dimensions.

I'll take a look here too. I think the easiest fix is just to make nb_trees proportional to sqrt(data.len()) or something like that, but then again, choosing hyperparams is always more of an art than a science haha
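That hypothetical heuristic would look something like this (the sqrt choice is a guess under discussion, not a tested default):

```rust
/// Tree count growing with the dataset size instead of the embedding
/// dimension. Hypothetical alternative heuristic, not arroy's code.
fn sqrt_nb_trees(n_items: usize) -> usize {
    (n_items as f64).sqrt().round() as usize
}

fn main() {
    // Unlike the current default, this stays small for small datasets
    // regardless of dimension.
    assert_eq!(sqrt_nb_trees(10_000), 100);
    assert_eq!(sqrt_nb_trees(100_000), 316); // sqrt(100_000) ≈ 316.23
}
```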



Successfully merging this pull request may close these issues.

Dead code to investigate
