@nnethercott commented May 31, 2025

Pull Request

Currently arroy does not use the bias of the hyperplane in create_split, even though it is calculated. This is OK for Cosine and Dot distances, where the bias is close to zero (vectors are normalized), but it is problematic for Euclidean and Manhattan.

Check comments for more details/results.

Caution

This PR is DB-breaking.

Fun fact: by serializing the node header you gain a lot more freedom in how splits are defined. For example, the hamming distance (#124) only requires a random index to define a split -- this can be chucked in the node header and the resulting leaf is much cheaper to store than a sparse vector (12 bytes vs [u8; n_dimensions])
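As a rough sketch of that fun fact (hypothetical types, not the actual #124 implementation): a Hamming-style split could be described entirely by a tiny header holding the chosen bit index, with the side test reading a single bit of the vector.

```rust
// Hypothetical sketch: a Hamming split node that stores only a bit index
// in its header instead of a full (sparse) normal vector.
#[derive(Debug, Clone, Copy)]
struct NodeHeaderHamming {
    bit_index: u32, // which bit of the binary vector decides the split
}

// Which side of the split a packed binary vector falls on.
fn side(header: NodeHeaderHamming, vector: &[u8]) -> bool {
    let byte = (header.bit_index / 8) as usize;
    let bit = header.bit_index % 8;
    (vector[byte] >> bit) & 1 == 1
}

fn main() {
    let h = NodeHeaderHamming { bit_index: 9 };
    // bit 9 is bit 1 of byte 1
    assert!(side(h, &[0x00, 0x02]));
    assert!(!side(h, &[0xFF, 0x00]));
}
```

A few bytes of header versus `[u8; n_dimensions]` for a stored normal is exactly the storage win described above.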

Related issue

Fixes #81

What does this PR do?

  • Stores the normal as a Leaf<'_, Distance> in the SplitPlaneNormal and updates heed trait impls for serialization/deserialization
  • Removes margin_no_header from Distance trait in favour of margin
  • Calculates the bias for BinaryQuantizedManhattan and BinaryQuantizedEuclidean the same way as their non-quantized counterparts
  • Adds new std::fmt::Debug impl for all node headers to fix testing on different runners
  • Updates insta snapshots for tests

Impacts

  • Faster indexing and lookup times due to smaller tree depth on average as splits are generally more balanced (see pic below)
  • recall@k improves substantially for large datasets using Euclidean/Hamming distance

A more "balanced" split:
Screenshot from 2025-05-31 16-17-03

PR checklist

Please check if your PR fulfills the following requirements:

  • Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
  • Have you read the contributing guidelines?
  • Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!

nnethercott added 5 commits May 29, 2025 17:48
margin_no_header is a footgun here: since the signature for margin is the
same, it's easy to confuse the two. This is bad since we may build a tree
without using the hyperplane bias but then configure the reader to use it
(for instance)
@nnethercott commented:

Here's a bit more context on why this PR is needed, both from a theoretical and practical perspective.

understanding the bias term

When the arroy::Writer is indexing, it builds separating hyperplanes to split leaves into two clusters. First two_means finds the centroids $\{c_{0}, c_{1}\}$, then we define a plane passing through their midpoint with unit normal $n$ such that $n \cdot c_{0} > 0$ and $n \cdot c_{1} < 0$.

A hyperplane is the set of points satisfying $n \cdot x + b = 0$. Since our plane passes through the midpoint of the centroids, we can find the bias by plugging that point in:

$$\begin{align*} n \cdot (c_{0} + c_{1})/2 + b &= 0 \\ \implies b &= -n \cdot (c_{0} + c_{1})/2 \end{align*}$$

That's the same as what's already computed in arroy (but currently it's ignored)
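A small standalone sketch of that computation (toy 2-D vectors, not arroy's actual code), checking that the midpoint sits on the plane and the centroids fall on opposite sides:

```rust
// Sketch: build a split plane from two centroids c0, c1 and verify the
// bias b = -n . (c0 + c1) / 2 makes the plane pass through the midpoint.

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

// Unit normal pointing from c1 towards c0, plus the bias.
fn split_plane(c0: &[f32], c1: &[f32]) -> (Vec<f32>, f32) {
    let diff: Vec<f32> = c0.iter().zip(c1).map(|(a, b)| a - b).collect();
    let norm = dot(&diff, &diff).sqrt();
    let n: Vec<f32> = diff.iter().map(|x| x / norm).collect();
    let mid: Vec<f32> = c0.iter().zip(c1).map(|(a, b)| (a + b) / 2.0).collect();
    let bias = -dot(&n, &mid);
    (n, bias)
}

// Signed distance of x from the plane.
fn margin(n: &[f32], bias: f32, x: &[f32]) -> f32 {
    dot(n, x) + bias
}

fn main() {
    let (c0, c1) = (vec![4.0, 0.0], vec![0.0, 2.0]);
    let (n, bias) = split_plane(&c0, &c1);
    // the midpoint is on the plane, the centroids lie on opposite sides
    assert!(margin(&n, bias, &[2.0, 1.0]).abs() < 1e-6);
    assert!(margin(&n, bias, &c0) > 0.0);
    assert!(margin(&n, bias, &c1) < 0.0);
}
```

With bias forced to 0, the same plane would pass through the origin instead of the midpoint, which is exactly the skewed-split picture below.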

Here's what the plane should look like (bias = $- n \cdot (c_{0} + c_{1})/2$):
scene

Here's what it currently looks like (bias=0):
scene_no_bias

consequences

  • the margin for two points belonging to separate clusters can be the same
  • we potentially get poor splits at every iteration since the hyperplane doesn't necessarily cut the leaf subset in half
    • leads to substantially longer indexing times due to more recursive calls until fit_in_descendant evaluates to true

Note

This is generally OK for Cosine and Dot distances since the data is normalized to the unit sphere before indexing, so the bias is close to 0 anyway. We're more worried about Euclidean and Manhattan...

results

impact on recall@k

We'll use the vector store benchmark for this. In the case where |dataset|>>n_trees using the bias generally outperforms the status quo. We purposely limit the number of trees to assess the quality of learned SplitPlaneNormals.

|dataset|=100_000, n_trees = 1

On main:
main_100k_1

After this PR:
pr_100k_1

|dataset|=100_000, n_trees = 100

On main:
main_100k_100

After this PR:
pr_100k_100

|dataset|=10_000, n_trees = 1

On main:
main_10k_1

After this PR:
pr_10k_1

|dataset|=10_000, n_trees = 100

On main:
main_10k_100

After this PR:
pr_10k_100

impact on indexing time

We'll just look at one dataset and the Euclidean distance since the idea is more or less the same for other datasets/distances. We expect main to be slower since imbalanced splits (e.g. those arising from not using the bias) require more recursive branching.

Here are the results for indexing 100000 vectors of dimension 768:

       n_trees=1   n_trees=10   n_trees=100   n_trees=500
main   2.08s       5.93s        55.94s        649.31s
pr     3.44s       5.12s        48.82s        225.18s

a weird thing

If we run the benchmark and use the default number of trees, the approach with the proper use of the bias actually underperforms what's in arroy currently:

On main:
main_10k_768

After this PR:
pr_10k_768

My guess is that arroy overcomes poor quality trees with brute force search (high default number of trees, high oversampling) so the weak learners can ensemble better than the trees we produced by using the bias correctly.

Warning

The target_n_trees is proportional to the embedding dimension, irrespective of the number of vectors being indexed.

from the code:

// 1. How many descendants do we need:
let descendant_required = item_indices.len() / dimensions;
// 2. Find the number of tree nodes required per trees
let tree_nodes_per_tree = descendant_required + 1;
// 3. Find the number of tree required to get as many tree nodes as item:
let mut nb_trees = item_indices.len() / tree_nodes_per_tree;

In the above, nb_trees = n/((n/d)+1) = d/(1+d/n) ≈ d whenever n ≫ d. So for my test above with 10k vectors of dimension 768 we're using roughly 700 trees by default. This might explain the performance difference.
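To sanity-check that approximation, here's the quoted arithmetic wrapped in a standalone function (a sketch mirroring the code above, not arroy's API; integer division throughout, as in the original):

```rust
// Reproducing the default-tree-count arithmetic quoted above:
// nb_trees = n / ((n / d) + 1) ≈ d once the item count n dwarfs d.
fn default_nb_trees(n_items: usize, dimensions: usize) -> usize {
    let descendant_required = n_items / dimensions; // step 1
    let tree_nodes_per_tree = descendant_required + 1; // step 2
    n_items / tree_nodes_per_tree // step 3
}

fn main() {
    // 10k vectors of dimension 768 already get ~714 trees by default
    assert_eq!(default_nb_trees(10_000, 768), 714);
    // and with n >> d the count approaches d itself
    assert_eq!(default_nb_trees(1_000_000, 768), 767);
}
```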

conclusions

  • for large datasets using the bias generally yields significant performance improvements in terms of recall and indexing time
  • In cases where increasing n_trees is not feasible, using the bias properly should create higher quality trees
  • Performance gains depend on dataset & embeddings structure

nnethercott added 3 commits May 31, 2025 18:32
different platforms show varying numbers of digits in the f32 Debug impl,
causing failing tests on Linux vs macOS runners. To fix this we ensure
only 4 digits are printed
@irevoire left a comment:

My guess is that arroy overcomes poor quality trees with brute force search (high default number of trees, high oversampling) so the weak learners can ensemble better than the trees we produced by using the bias correctly.

The strange thing is that your PR doesn't reduce the number of trees, right? So you actually brute-force as much as on main, but it doesn't give the right results 🤔

I was on my phone last time and didn't notice how much recall we were losing, that's huge, we can't really merge the PR in this state 😱
Maybe by updating the number of trees, you could try to get something that doesn't impact the recall that much?
Also, did you try to run the benchmark with more documents? Like 100_000 and maybe 1_000_000 just to give us an idea of how it "scales"

  pub left: ItemId,
  pub right: ItemId,
- pub normal: Option<Cow<'a, UnalignedVector<D::VectorCodec>>>,
+ pub normal: Option<Leaf<'a, D>>,
@irevoire:

Since this is a breaking change, we must upgrade all the SplitPlaneNormals in the previous version of the database.
The good thing is we're already doing it in this version, so it's the best version to do this breaking change: with three breaking changes on the SplitPlaneNormal, all of its fields must be rewritten 👌

@nnethercott:

can't tell if this is just a comment or a request for changes ahah

@irevoire:

Just a comment! If you had updated another kind of node we would have had work to do somewhere else, but since you're updating the one we're already updating, you don't have anything to do in your PR and it's noice 🔥

      None
  } else {
-     Some(normal)
+     let header = D::new_header(&vector);
@irevoire:

Since we don't have the actual values that have been used in two_means, what will happen here?
The bias will always be zero, from what I see in the test, and the only way to retrieve the actual bias is to re-index everything, right?

@nnethercott:

D::new_header() creates a NodeHeader with bias=0.0, so it's effectively a no-op on indexes built with earlier versions of arroy (which implicitly assume the bias is 0 anyway); performance will remain the same.
If users want to opt into the "improved" indexes, they'll have to re-index everything, like you were saying.

@irevoire:

It's a bit sad we don't have any way to recompute it later on, but I don't see how we could either since a split node doesn't know the full list of items it stores 🤔

@nnethercott commented Jun 3, 2025

@irevoire I think I have an answer for the performance issues. Hold onto your socks.

An idea I had was that when using the bias you end up with a lot of similar trees which filter down to the same descendants. If that were true then increasing the number of trees wouldn't yield any gains past a certain point cause we're just re-exploring the same sets.

Okay, but why do we end up with the same trees when using the bias?

Let's look at two_means real quick, specifically at this:

const ITERATION_STEPS: usize = 200;

When ITERATION_STEPS goes to infinity you end up with the same two centroids for any random seed. That's where the problem is for us: since ITERATION_STEPS is big, the centroids barely vary from seed to seed => you end up with many trees that have roughly the same splits => performance is capped.

This is also the case to a lesser degree for the main branch, but there's more randomness cause you ignore the bias for each splitting plane => you get more diverse trees.
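To see why a large step count erases seed-to-seed diversity, here's a stripped-down toy (a running mean over random samples standing in for the centroid updates; the LCG and data are made up, this is not arroy's two_means):

```rust
// Toy illustration: the centroid updates in two_means are running means
// over randomly sampled points, so as the step count grows the result
// converges to the same value for every seed; seed-to-seed diversity
// only survives when the step count is small.

// Tiny deterministic LCG so the sketch needs no external crates.
struct Lcg(u64);
impl Lcg {
    fn next(&mut self) -> usize {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (self.0 >> 33) as usize
    }
}

// Running mean of `steps` random picks, standing in for a centroid.
fn sampled_mean(points: &[f64], steps: usize, seed: u64) -> f64 {
    let mut rng = Lcg(seed);
    let mut mean = 0.0;
    for i in 0..steps {
        let p = points[rng.next() % points.len()];
        mean += (p - mean) / (i as f64 + 1.0); // incremental mean update
    }
    mean
}

fn main() {
    // two well-separated 1-D "clusters" around 0 and around 10
    let points: Vec<f64> = (0..50)
        .map(|i| (i % 2) as f64 * 10.0 + i as f64 * 0.01)
        .collect();
    let true_mean = points.iter().sum::<f64>() / points.len() as f64;

    // with many steps, two different seeds land near the same value
    let a = sampled_mean(&points, 10_000, 1);
    let b = sampled_mean(&points, 10_000, 99);
    assert!((a - true_mean).abs() < 0.5 && (b - true_mean).abs() < 0.5);
    println!("steps=10_000: {a:.2} vs {b:.2} (true mean {true_mean:.2})");
}
```

With steps=1 the "centroid" is just one random point, so every seed gives a different answer; that is the diversity the fix below exploits.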

what's the fix?
Make your trees more diverse by decreasing ITERATION_STEPS

show me the proof
Here's the main branch with default n_trees and ITERATION_STEPS=200:
main_768_10k

This PR with default n_trees and ITERATION_STEPS=1
pr_768_10k

This PR with default n_trees, ITERATION_STEPS=1, but no bias (it's worse, nice!)
Screenshot from 2025-06-03 16-36-26

Only downside to making ITERATION_STEPS small is increased search times... But wait! Since the splits are better we can use fewer trees and still beat the recall on main!

This PR with n_trees=400 and ITERATION_STEPS=1 -- faster and more accurate than main, nice!

Screenshot from 2025-06-03 17-09-16

making sense of everything
Here's my take on this whole subject:

  • when |dataset|>>n_trees using the bias is better than not using it
  • when n_trees can be big this PR underperformed cause the trees weren't diverse. This was fixed by decreasing ITERATION_STEPS in two_means
  • ITERATION_STEPS big works well when you have |dataset|>>n_trees

There are a lot of moving parts here, but they're all related by the principle that using the bias increases performance and reduces both indexing and search times across all dataset sizes.

@irevoire commented Jun 3, 2025

An idea I had was that when using the bias you end up with a lot of similar trees which filter down to the same descendants. If that were true then increasing the number of trees wouldn't yield any gains past a certain point cause we're just re-exploring the same sets.

Yeah, I don't need anything more than that to understand the issue, and it's something I already thought about a long time ago and decided I should not delve into because I didn't have the time. And now I'm sock-less 😂

But yeah, Annoy was designed for Spotify in 2010-ish; they already had tons of documents, and the comment above the 200 seems to indicate that they just did a bunch of manual tests and found that this value was working well for them (and their number of documents).

This is also the case to a lesser degree for the main branch, but there's more randomness cause you ignore the bias for each splitting plane => you get more diverse trees.

I think I made this worse in the past 10 PRs. Before, when we could not make a split, we were picking the items randomly, and now I'm just taking the first half (relying on their IDs).
Maybe we could put that back again, but it should not be that impactful since you're making better trees now.

when n_trees can be big this PR underperformed cause the trees weren't diverse. This was fixed by decreasing ITERATION_STEPS in two_means

In our case, if we want to make that work in the general case, shouldn't the number of steps directly depend on the number of documents in the split?
Like 1 to 10% of all the documents should be inspected to create the split? (That will require some manual testing again with the famous "wet finger" (doigt mouillé) technique 😭 )


Also, big thanks for that investigation, because I was thinking of two different ways to parallelize arroy, and one of them would have been terrible for the tree generation, so that's super cool 🙏

@nnethercott commented Jun 3, 2025

In our case, if we want to make that work in the general case, shouldn't the number of steps directly depend on the number of documents in the split?

That's an interesting idea; we could use some fraction of ImmutableSubsetLeafs.len() instead of ITERATION_STEPS to that end.
Another approach would be to just keep ITERATION_STEPS = 1 and scale the number of trees accordingly (easy, but wastes memory)
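A minimal sketch of that scaling idea, with a made-up 5% ratio and a made-up clamp range (the real values would need benchmarking; this is not arroy's code):

```rust
// Hypothetical: derive the two_means step count from the size of the
// leaf subset instead of a fixed ITERATION_STEPS constant.
// The 5% ratio and the 1..=200 clamp are illustrative values only.
fn iteration_steps(subset_len: usize) -> usize {
    (subset_len / 20).clamp(1, 200)
}

fn main() {
    assert_eq!(iteration_steps(10), 1); // tiny splits: maximal diversity
    assert_eq!(iteration_steps(2_000), 100); // mid-size: scale with the subset
    assert_eq!(iteration_steps(100_000), 200); // huge splits: capped
}
```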

Not sure if this whole topic deserves a PR of its own, or if it should be included here...

@irevoire commented Jun 3, 2025

I would say we include it here and maybe update it later in another PR except if you have concerns about something?
But I would like to keep the relevancy good on main

@nnethercott commented:

I would say we include it here and maybe update it later in another PR except if you have concerns about something? But I would like to keep the relevancy good on main

Changed ITERATION_STEPS from 200 -> 10, which yields better performance but probably isn't the globally best value. I think an ablation study is needed in a separate PR later to find the best way to autoscale this, but in any case this PR now won't regress perf on main.

Ideally that next PR should be the last time we touch ITERATION_STEPS, since updating test snapshots all the time is a recipe for disaster.

@nnethercott nnethercott requested a review from irevoire June 4, 2025 09:48
@nnethercott nnethercott force-pushed the add-header-to-normal branch from ee42f1c to 939ae4a Compare June 4, 2025 10:21
@irevoire left a comment:

Thanks a lot for your investigation, merging this for now and will investigate to find a better value before the release if you don't do it before me 😁


@irevoire irevoire added this pull request to the merge queue Jun 10, 2025
Merged via the queue into meilisearch:main with commit 693f77d Jun 10, 2025
8 checks passed
@irevoire irevoire added this to the v1.7.0 milestone Jun 10, 2025
@nnethercott nnethercott deleted the add-header-to-normal branch June 10, 2025 09:15
Merging this pull request may close: Dead code to investigate