@nnethercott commented May 31, 2025

Pull Request

Currently arroy does not use the bias of the hyperplane in create_split, even though it is calculated. This is OK for Cosine and Dot distances, where the bias is close to zero (vectors are normalized), but it is problematic for Euclidean and Manhattan.

Check comments for more details/results.

Caution

This PR is DB-breaking.

Fun fact: by serializing the node header you gain a lot more freedom in how splits are defined. For example, the hamming distance (#124) only requires a random index to define a split -- this can be chucked in the node header and the resulting leaf is much cheaper to store than a sparse vector (12 bytes vs [u8; n_dimensions])
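As a rough sketch of that fun fact (hypothetical types, not the actual #124 implementation): a Hamming-style split could be described entirely by a tiny header holding the chosen bit index, with the side test reading a single bit of the vector.

```rust
// Hypothetical sketch: a Hamming split node that stores only a bit index
// in its header instead of a full (sparse) normal vector.
#[derive(Debug, Clone, Copy)]
struct NodeHeaderHamming {
    bit_index: u32, // which bit of the binary vector decides the split
}

// Which side of the split a packed binary vector falls on.
fn side(header: NodeHeaderHamming, vector: &[u8]) -> bool {
    let byte = (header.bit_index / 8) as usize;
    let bit = header.bit_index % 8;
    (vector[byte] >> bit) & 1 == 1
}

fn main() {
    let h = NodeHeaderHamming { bit_index: 9 };
    // bit 9 is bit 1 of byte 1
    assert!(side(h, &[0x00, 0x02]));
    assert!(!side(h, &[0xFF, 0x00]));
}
```

A few bytes of header versus `[u8; n_dimensions]` for a stored normal is exactly the storage win described above.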

Related issue

Fixes #81

What does this PR do?

  • Stores the normal as a Leaf<'_, Distance> in the SplitPlaneNormal and updates heed trait impls for serialization/deserialization
  • Removes margin_no_header from Distance trait in favour of margin
  • Calculates the bias for BinaryQuantizedManhattan and BinaryQuantizedEuclidean the same way as their non-quantized counterparts
  • Adds new std::fmt::Debug impl for all node headers to fix testing on different runners
  • Updates insta snapshots for tests

Impacts

  • Faster indexing and lookup times due to smaller tree depth on average as splits are generally more balanced (see pic below)
  • recall@k improves substantially for large datasets using Euclidean/Hamming distance

A more "balanced" split:
Screenshot from 2025-05-31 16-17-03

PR checklist

Please check if your PR fulfills the following requirements:

  • Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
  • Have you read the contributing guidelines?
  • Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!

nnethercott added 5 commits May 29, 2025 17:48
margin_no_header is a footgun here: since the signature for margin is the
same, it's easy to confuse the two. This is bad since we may build a tree
without using the hyperplane bias but then configure the reader to use it
(for instance)
@nnethercott commented:

Here's a bit more context on why this PR is needed, both from a theoretical and practical perspective.

understanding the bias term

When the arroy::Writer is indexing, it builds separating hyperplanes to split leaves into two clusters. First two_means finds the centroids $\{c_{0}, c_{1}\}$, then we define a plane passing through their midpoint with unit normal $n$ such that $n \cdot c_{0} > 0$ and $n \cdot c_{1} < 0$.

A hyperplane is the set of points satisfying $n \cdot x + b = 0$. Since our plane passes through the midpoint of the centroids, we can find the bias by plugging that point in:

$$\begin{align*} n \cdot (c_{0} + c_{1})/2 + b &= 0 \\ \implies b &= -n \cdot (c_{0} + c_{1})/2 \end{align*}$$

That's the same as what's already computed in arroy (but currently it's ignored)
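A small standalone sketch of that computation (toy 2-D vectors, not arroy's actual code), checking that the midpoint sits on the plane and the centroids fall on opposite sides:

```rust
// Sketch: build a split plane from two centroids c0, c1 and verify the
// bias b = -n . (c0 + c1) / 2 makes the plane pass through the midpoint.

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

// Unit normal pointing from c1 towards c0, plus the bias.
fn split_plane(c0: &[f32], c1: &[f32]) -> (Vec<f32>, f32) {
    let diff: Vec<f32> = c0.iter().zip(c1).map(|(a, b)| a - b).collect();
    let norm = dot(&diff, &diff).sqrt();
    let n: Vec<f32> = diff.iter().map(|x| x / norm).collect();
    let mid: Vec<f32> = c0.iter().zip(c1).map(|(a, b)| (a + b) / 2.0).collect();
    let bias = -dot(&n, &mid);
    (n, bias)
}

// Signed distance of x from the plane.
fn margin(n: &[f32], bias: f32, x: &[f32]) -> f32 {
    dot(n, x) + bias
}

fn main() {
    let (c0, c1) = (vec![4.0, 0.0], vec![0.0, 2.0]);
    let (n, bias) = split_plane(&c0, &c1);
    // the midpoint is on the plane, the centroids lie on opposite sides
    assert!(margin(&n, bias, &[2.0, 1.0]).abs() < 1e-6);
    assert!(margin(&n, bias, &c0) > 0.0);
    assert!(margin(&n, bias, &c1) < 0.0);
}
```

With bias forced to 0, the same plane would pass through the origin instead of the midpoint, which is exactly the skewed-split picture below.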

Here's what the plane should look like (bias = $- n \cdot (c_{0} + c_{1})/2$):
scene

Here's what it currently looks like (bias=0):
scene_no_bias

consequences

  • the margin for two points belonging to separate clusters can be the same
  • we potentially get poor splits at every iteration since the hyperplane doesn't necessarily cut the leaf subset in half
    • leads to substantially longer indexing times due to more recursive calls until fit_in_descendant evaluates to true

Note

This is generally OK for Cosine and Dot distances since the data is normalized to the unit sphere before indexing, so the bias is close to 0 anyway. We're more worried about Euclidean and Manhattan...

results

impact on recall@k

We'll use the vector store benchmark for this. In the case where |dataset|>>n_trees using the bias generally outperforms the status quo. We purposely limit the number of trees to assess the quality of learned SplitPlaneNormals.

|dataset|=100_000, n_trees = 1

On main:
main_100k_1

After this PR:
pr_100k_1

|dataset|=100_000, n_trees = 100

On main:
main_100k_100

After this PR:
pr_100k_100

|dataset|=10_000, n_trees = 1

On main:
main_10k_1

After this PR:
pr_10k_1

|dataset|=10_000, n_trees = 100

On main:
main_10k_100

After this PR:
pr_10k_100

impact on indexing time

We'll just look at one dataset and the Euclidean distance since the idea is more or less the same for other datasets/distances. We expect main to be slower since imbalanced splits (e.g. those arising from not using the bias) require more recursive branching.

Here are the results for indexing 100000 vectors of dimension 768:

       n_trees=1   n_trees=10   n_trees=100   n_trees=500
main   2.08s       5.93s        55.94s        649.31s
pr     3.44s       5.12s        48.82s        225.18s

a weird thing

If we run the benchmark and use the default number of trees, the approach with the proper use of the bias actually underperforms what's in arroy currently:

On main:
main_10k_768

After this PR:
pr_10k_768

My guess is that arroy overcomes poor quality trees with brute force search (high default number of trees, high oversampling) so the weak learners can ensemble better than the trees we produced by using the bias correctly.

Warning

The target_n_trees is proportional to the embedding dimension, irrespective of the number of vectors being indexed.

from the code:

// 1. How many descendants do we need:
let descendant_required = item_indices.len() / dimensions;
// 2. Find the number of tree nodes required per trees
let tree_nodes_per_tree = descendant_required + 1;
// 3. Find the number of tree required to get as many tree nodes as item:
let mut nb_trees = item_indices.len() / tree_nodes_per_tree;

In the above, nb_trees = n/((n/d)+1) = d/(1+d/n) ≈ d whenever n ≫ d. So for my test above with 10k vectors of dimension 768 we're using roughly 700 trees by default. This might explain the performance difference.
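To sanity-check that approximation, here's the quoted arithmetic wrapped in a standalone function (a sketch mirroring the code above, not arroy's API; integer division throughout, as in the original):

```rust
// Reproducing the default-tree-count arithmetic quoted above:
// nb_trees = n / ((n / d) + 1) ≈ d once the item count n dwarfs d.
fn default_nb_trees(n_items: usize, dimensions: usize) -> usize {
    let descendant_required = n_items / dimensions; // step 1
    let tree_nodes_per_tree = descendant_required + 1; // step 2
    n_items / tree_nodes_per_tree // step 3
}

fn main() {
    // 10k vectors of dimension 768 already get ~714 trees by default
    assert_eq!(default_nb_trees(10_000, 768), 714);
    // and with n >> d the count approaches d itself
    assert_eq!(default_nb_trees(1_000_000, 768), 767);
}
```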

conclusions

  • for large datasets using the bias generally yields significant performance improvements in terms of recall and indexing time
  • In cases where increasing n_trees is not feasible, using the bias properly should create higher quality trees
  • Performance gains depend on dataset & embeddings structure

nnethercott added 3 commits May 31, 2025 18:32
different platforms show varying numbers of digits in the f32 Debug impl,
causing failing tests on Linux vs macOS runners. To fix this we ensure
only 4 digits are printed
@irevoire left a comment:

My guess is that arroy overcomes poor quality trees with brute force search (high default number of trees, high oversampling) so the weak learners can ensemble better than the trees we produced by using the bias correctly.

The strange thing is that your PR doesn't reduce the number of trees, right? So you actually brute-force as much as on main, but it doesn't give the right results 🤔

I was on my phone last time and didn't notice how much recall we were losing, that's huge, we can't really merge the PR in this state 😱
Maybe by updating the number of trees, you could try to get something that doesn't impact the recall that much?
Also, did you try to run the benchmark with more documents? Like 100_000 and maybe 1_000_000 just to give us an idea of how it "scales"

  pub left: ItemId,
  pub right: ItemId,
- pub normal: Option<Cow<'a, UnalignedVector<D::VectorCodec>>>,
+ pub normal: Option<Leaf<'a, D>>,
@irevoire:

Since this is a breaking change, we must upgrade all the SplitPlaneNormals in the previous version of the database.
The good thing is we're already doing it in this version, so it's the best version to do this breaking change: with three breaking changes on the SplitPlaneNormal, all of its fields must be rewritten 👌

@nnethercott:

can't tell if this is just a comment or a request for changes ahah

@irevoire:

Just a comment! If you had updated another kind of node we would have had work to do somewhere else, but since you're updating the one we're already updating, you don't have anything to do in your PR and it's noice 🔥

      None
  } else {
-     Some(normal)
+     let header = D::new_header(&vector);
@irevoire:

Since we don't have the actual values that have been used in two_means, what will happen here?
The bias will always be zero, from what I see in the test, and the only way to retrieve the actual bias is to re-index everything, right?

@nnethercott:

D::new_header() creates a NodeHeader with bias=0.0, so it's effectively a no-op on indexes built with earlier versions of arroy (which implicitly assume the bias is 0 anyway); performance will remain the same.
If users want to opt into the "improved" indexes, they'll have to re-index everything, like you were saying.

@irevoire:

It's a bit sad we don't have any way to recompute it later on, but I don't see how we could either since a split node doesn't know the full list of items it stores 🤔

@nnethercott commented Jun 3, 2025

@irevoire I think I have an answer for the performance issues. Hold onto your socks.

An idea I had was that when using the bias you end up with a lot of similar trees which filter down to the same descendants. If that were true then increasing the number of trees wouldn't yield any gains past a certain point cause we're just re-exploring the same sets.

Okay, but why do we end up with the same trees when using the bias?

Let's look at two_means real quick, specifically at this:

const ITERATION_STEPS: usize = 200;

When ITERATION_STEPS goes to infinity you end up with the same two centroids for any random seed. That's where the problem is for us: since ITERATION_STEPS is big, the centroids barely vary from seed to seed => you end up with many trees that have roughly the same splits => performance is capped.

This is also the case to a lesser degree for the main branch, but there's more randomness cause you ignore the bias for each splitting plane => you get more diverse trees.
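To see why a large step count erases seed-to-seed diversity, here's a stripped-down toy (a running mean over random samples standing in for the centroid updates; the LCG and data are made up, this is not arroy's two_means):

```rust
// Toy illustration: the centroid updates in two_means are running means
// over randomly sampled points, so as the step count grows the result
// converges to the same value for every seed; seed-to-seed diversity
// only survives when the step count is small.

// Tiny deterministic LCG so the sketch needs no external crates.
struct Lcg(u64);
impl Lcg {
    fn next(&mut self) -> usize {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (self.0 >> 33) as usize
    }
}

// Running mean of `steps` random picks, standing in for a centroid.
fn sampled_mean(points: &[f64], steps: usize, seed: u64) -> f64 {
    let mut rng = Lcg(seed);
    let mut mean = 0.0;
    for i in 0..steps {
        let p = points[rng.next() % points.len()];
        mean += (p - mean) / (i as f64 + 1.0); // incremental mean update
    }
    mean
}

fn main() {
    // two well-separated 1-D "clusters" around 0 and around 10
    let points: Vec<f64> = (0..50)
        .map(|i| (i % 2) as f64 * 10.0 + i as f64 * 0.01)
        .collect();
    let true_mean = points.iter().sum::<f64>() / points.len() as f64;

    // with many steps, two different seeds land near the same value
    let a = sampled_mean(&points, 10_000, 1);
    let b = sampled_mean(&points, 10_000, 99);
    assert!((a - true_mean).abs() < 0.5 && (b - true_mean).abs() < 0.5);
    println!("steps=10_000: {a:.2} vs {b:.2} (true mean {true_mean:.2})");
}
```

With steps=1 the "centroid" is just one random point, so every seed gives a different answer; that is the diversity the fix below exploits.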

what's the fix?
Make your trees more diverse by decreasing ITERATION_STEPS

show me the proof
Here's the main branch with default n_trees and ITERATION_STEPS=200:
main_768_10k

This PR with default n_trees and ITERATION_STEPS=1
pr_768_10k

This PR with default n_trees, ITERATION_STEPS=1, but no bias (it's worse, nice!)
Screenshot from 2025-06-03 16-36-26

Only downside to making ITERATION_STEPS small is increased search times... But wait! Since the splits are better we can use fewer trees and still beat the recall on main!

This PR with n_trees=400 and ITERATION_STEPS=1 -- faster and more accurate than main, nice!

Screenshot from 2025-06-03 17-09-16

making sense of everything
Here's my take on this whole subject:

  • when |dataset|>>n_trees using the bias is better than not using it
  • when n_trees can be big this PR underperformed cause the trees weren't diverse. This was fixed by decreasing ITERATION_STEPS in two_means
  • ITERATION_STEPS big works well when you have |dataset|>>n_trees

There are a lot of moving parts here, but they're all related by the principle that using the bias increases performance and reduces both indexing and search times across all dataset sizes.

@irevoire commented Jun 3, 2025

An idea I had was that when using the bias you end up with a lot of similar trees which filter down to the same descendants. If that were true then increasing the number of trees wouldn't yield any gains past a certain point cause we're just re-exploring the same sets.

Yeah, I don't need anything more than that to understand the issue, and it's something I already thought about a long time ago and decided I should not delve into because I didn't have the time. And now I'm sock-less 😂

But yeah, Annoy was designed for Spotify in 2010-ish; they already had tons of documents, and the comment above the 200 seems to indicate that they just did a bunch of manual tests and found that this value was working well for them (and their number of documents).

This is also the case to a lesser degree for the main branch, but there's more randomness cause you ignore the bias for each splitting plane => you get more diverse trees.

I think I made this worse in the past 10 PRs. Before, when we could not make a split, we were picking the items randomly, and now I'm just taking the first half (relying on their IDs).
Maybe we could put that back again, but it should not be that impactful since you're making better trees now.

when n_trees can be big this PR underperformed cause the trees weren't diverse. This was fixed by decreasing ITERATION_STEPS in two_means

In our case, if we want to make that work in the general case, shouldn't the number of steps directly depend on the number of documents in the split?
Like 1 to 10% of all the documents should be inspected to create the split? (That will require some manual testing again with the famous "wet finger" (doigt mouillé) technique 😭 )


Also, big thanks for that investigation, because I was thinking of two different ways to parallelize arroy, and one of them would have been terrible for the tree generation, so that's super cool 🙏

@nnethercott commented Jun 3, 2025

In our case, if we want to make that work in the general case, shouldn't the number of steps directly depend on the number of documents in the split?

That's an interesting idea; we could use some fraction of ImmutableSubsetLeafs.len() instead of ITERATION_STEPS to that end.
Another approach would be to just keep ITERATION_STEPS = 1 and scale the number of trees accordingly (easy, but wastes memory)
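A minimal sketch of that scaling idea, with a made-up 5% ratio and a made-up clamp range (the real values would need benchmarking; this is not arroy's code):

```rust
// Hypothetical: derive the two_means step count from the size of the
// leaf subset instead of a fixed ITERATION_STEPS constant.
// The 5% ratio and the 1..=200 clamp are illustrative values only.
fn iteration_steps(subset_len: usize) -> usize {
    (subset_len / 20).clamp(1, 200)
}

fn main() {
    assert_eq!(iteration_steps(10), 1); // tiny splits: maximal diversity
    assert_eq!(iteration_steps(2_000), 100); // mid-size: scale with the subset
    assert_eq!(iteration_steps(100_000), 200); // huge splits: capped
}
```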

Not sure if this whole topic deserves a PR of its own, or if it should be included here...

@irevoire commented Jun 3, 2025

I would say we include it here and maybe update it later in another PR except if you have concerns about something?
But I would like to keep the relevancy good on main

@nnethercott commented:

I would say we include it here and maybe update it later in another PR except if you have concerns about something? But I would like to keep the relevancy good on main

Changed ITERATION_STEPS from 200 -> 10, which yields better performance but probably isn't the globally best value. I think an ablation study is needed in a separate PR later to find the best way to autoscale this, but in any case this PR now won't regress perf on main.

Ideally that next PR should be the last time we touch ITERATION_STEPS, since updating test snapshots all the time is a recipe for disaster.

@nnethercott nnethercott requested a review from irevoire June 4, 2025 09:48
@nnethercott nnethercott force-pushed the add-header-to-normal branch from ee42f1c to 939ae4a Compare June 4, 2025 10:21
@irevoire left a comment:

Thanks a lot for your investigation, merging this for now and will investigate to find a better value before the release if you don't do it before me 😁


@irevoire irevoire added this pull request to the merge queue Jun 10, 2025
Merged via the queue into meilisearch:main with commit 693f77d Jun 10, 2025
8 checks passed
@irevoire irevoire added this to the v1.7.0 milestone Jun 10, 2025
@nnethercott nnethercott deleted the add-header-to-normal branch June 10, 2025 09:15
Merging this pull request may close: Dead code to investigate