Add bias info to SplitPlaneNormal #132
Conversation
margin_no_header is a footgun here: since its signature is the same as margin's, it's easy to confound the two. This is bad because we may build a tree without using the hyperplane bias but then configure the reader to use it (for instance).
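For illustration, a rough sketch of the footgun; the types and signatures here are assumptions, not arroy's actual `Distance` trait:

```rust
// Illustrative only (not arroy's actual types): both helpers take the same
// arguments, but only one of them reads the bias out of the header, which
// makes it easy to build a tree with one and query it with the other.
struct Leaf {
    vector: Vec<f32>,
    bias: f32,
}

fn margin(p: &Leaf, q: &[f32]) -> f32 {
    p.vector.iter().zip(q).map(|(n, x)| n * x).sum::<f32>() + p.bias
}

fn margin_no_header(p: &Leaf, q: &[f32]) -> f32 {
    p.vector.iter().zip(q).map(|(n, x)| n * x).sum::<f32>()
}
```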
Here's a bit more context on why this PR is needed, both from a theoretical and a practical perspective.

**understanding the bias term**

A hyperplane is the set of points `x` satisfying `⟨n, x⟩ + b = 0`, where `n` is the normal and `b` is the bias. That's the same as what's in arroy (but currently the bias is ignored).

Here's what the plane should look like (bias = …) versus what it currently looks like (bias = 0):
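For intuition, here is an illustrative, Annoy-style construction of a split (a sketch, not arroy's exact code): the plane is the perpendicular bisector of the two centroids returned by `two_means`, and the bias is what places it at their midpoint instead of forcing it through the origin.

```rust
// Given centroids `p` and `q`, the normal is n = (p - q) / ‖p - q‖ and the
// bias is b = -⟨n, (p + q) / 2⟩, so the midpoint satisfies ⟨n, x⟩ + b = 0.
// Dropping the bias (b = 0) forces every splitting plane through the origin.
fn create_split(p: &[f32], q: &[f32]) -> (Vec<f32>, f32) {
    let mut normal: Vec<f32> = p.iter().zip(q).map(|(pi, qi)| pi - qi).collect();
    let norm = normal.iter().map(|v| v * v).sum::<f32>().sqrt();
    normal.iter_mut().for_each(|v| *v /= norm);
    let bias = -normal
        .iter()
        .zip(p.iter().zip(q))
        .map(|(ni, (pi, qi))| ni * (pi + qi) / 2.0)
        .sum::<f32>();
    (normal, bias)
}
```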
**consequences**

Note
This is generally OK for Cosine and Dot distances since the data is normalized to the unit circle before indexing, so the bias is close to 0 anyways. We're more worried about Euclidean and Manhattan...

**results**

**impact on recall@k**

We'll use the vector store benchmark for this. In the case where …
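For reference, recall@k here means the fraction of the true k nearest neighbours that the approximate search returns; a minimal hypothetical helper (not the benchmark's actual code):

```rust
// `exact` holds the ids of the true top-k neighbours, `approx` the ids
// returned by the approximate search; recall@k is the overlap divided by k.
fn recall_at_k(exact: &[u32], approx: &[u32]) -> f32 {
    let hits = approx.iter().filter(|&id| exact.contains(id)).count();
    hits as f32 / exact.len() as f32
}
```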
|  | n_trees=1 | n_trees=10 | n_trees=100 | n_trees=500 |
|---|---|---|---|---|
| main | 2.08s | 5.93s | 55.94s | 649.31s |
| pr | 3.44s | 5.12s | 48.82s | 225.18s |
**a weird thing**
If we run the benchmark and use the default number of trees, the approach with the proper use of the bias actually underperforms what's in arroy currently:
My guess is that arroy overcomes poor-quality trees with brute-force search (high default number of trees, high oversampling), so the weak learners ensemble better than the trees we produce by using the bias correctly.
Warning
The `target_n_trees` is proportional to the embedding dimension, irrespective of the number of vectors being indexed.
from the code:

```rust
// 1. How many descendants do we need:
let descendant_required = item_indices.len() / dimensions;
// 2. Find the number of tree nodes required per tree:
let tree_nodes_per_tree = descendant_required + 1;
// 3. Find the number of trees required to get as many tree nodes as items:
let mut nb_trees = item_indices.len() / tree_nodes_per_tree;
```

In the above, `nb_trees = n / ((n / d) + 1) = d / (1 + d / n) ≈ d` for any collection we're indexing. So for my test above with 10k vectors we're using nearly 1k trees by default. This might explain the performance difference.
**conclusions**

- for large datasets, using the bias generally yields significant performance improvements in terms of recall and indexing time
- in cases where increasing `n_trees` is not feasible, using the bias properly should create higher-quality trees
- performance gains depend on the dataset & embedding structure
Different platforms print varying numbers of digits in the f32 Debug impl, causing failing tests on Linux vs macOS runners. To fix this, we ensure only 4 digits are printed.
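A sketch of that fix (the struct name and field are assumptions): pinning the Debug output to a fixed precision so both runners print the same string.

```rust
use std::fmt;

// Hypothetical node header; the important part is the fixed `{:.4}` width,
// which avoids platform-dependent digit counts in f32's default Debug output.
struct NodeHeaderEuclidean {
    bias: f32,
}

impl fmt::Debug for NodeHeaderEuclidean {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.debug_struct("NodeHeaderEuclidean")
            .field("bias", &format_args!("{:.4}", self.bias))
            .finish()
    }
}
```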
irevoire left a comment:
> My guess is that arroy overcomes poor-quality trees with brute-force search (high default number of trees, high oversampling), so the weak learners ensemble better than the trees we produce by using the bias correctly.
The strange thing is that your PR doesn't reduce the number of trees, right? So you actually brute-force as much as on main, but it doesn't give the right results 🤔
I was on my phone last time and didn't notice how much recall we were losing, that's huge, we can't really merge the PR in this state 😱
Maybe by updating the number of trees, you could try to get something that doesn't impact the recall that much?
Also, did you try to run the benchmark with more documents? Like 100_000 and maybe 1_000_000 just to give us an idea of how it "scales"
```diff
 pub left: ItemId,
 pub right: ItemId,
-pub normal: Option<Cow<'a, UnalignedVector<D::VectorCodec>>>,
+pub normal: Option<Leaf<'a, D>>,
```
Since this is a breaking change, we must upgrade all the SplitPlaneNormal nodes from the previous version of the database.
The good thing is we're already doing it in this version, so it's the best version to do this breaking change: with three breaking changes on the SplitPlaneNormal, all of its fields must be rewritten anyway 👌
can't tell if this is just a comment or a request for changes ahah
Just a comment. If you had updated another kind of node we would have work to do somewhere else, but since you're updating the one we're already updating, you don't have anything to do in your PR, and it's noice 🔥
```diff
     None
 } else {
-    Some(normal)
+    let header = D::new_header(&vector);
```
Since we don't have the actual values that have been used in two_means, what will happen here?
The bias will always be zero, from what I see in the test, and the only way to retrieve the actual bias is to re-index everything, right?
`D::new_header()` creates a `NodeHeader` with `bias = 0.0`, so it's effectively a no-op on indexes built with earlier versions of arroy (which implicitly assume the bias is 0 anyway); performance will remain the same.
If users want to opt into the "improved" indexes, they'll have to re-index everything, like you were saying.
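So the migration amounts to something like this sketch (names assumed; not the exact arroy code):

```rust
struct NodeHeaderEuclidean {
    bias: f32,
}

// An old SplitPlaneNormal stored only the raw normal vector; wrapping it in
// a leaf with a fresh header keeps the old implicit assumption that bias = 0.
fn upgrade_normal(old_normal: Vec<f32>) -> (NodeHeaderEuclidean, Vec<f32>) {
    let header = NodeHeaderEuclidean { bias: 0.0 }; // what D::new_header produces
    (header, old_normal)
}
```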
It's a bit sad we don't have any way to recompute it later on, but I don't see how we could either since a split node doesn't know the full list of items it stores 🤔
@irevoire I think I have an answer for the performance issues. Hold onto your socks.

An idea I had was that when using the bias you end up with a lot of similar trees, which all filter down to the same descendants. If that were true, then increasing the number of trees wouldn't yield any gains past a certain point, because we're just re-exploring the same sets.

Ok, but why do we end up with the same trees when using the bias? Let's look at

```rust
const ITERATION_STEPS: usize = 200;
```

When `ITERATION_STEPS` is large relative to the number of items in a split, `two_means` converges to nearly the same pair of centroids on every run, so all the trees end up with nearly the same splitting planes. This is also the case to a lesser degree for the main branch, but there's more randomness because you ignore the bias for each splitting plane, so you get more diverse trees.

**what's the fix?**

…

**show me the proof**

This PR with default … This PR with default … Only downside to making … This PR with …
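To make the convergence argument concrete, a minimal fixed-step `two_means` sketch (an assumed shape, not arroy's exact implementation): with the step count pinned at 200, a small split revisits each item many times, so the centroids, and hence the splitting plane, come out nearly identical across trees.

```rust
use rand::seq::SliceRandom;
use rand::Rng;

const ITERATION_STEPS: usize = 200;

// Assumed shape: sample a point per step and pull the closer centroid
// towards it with a decreasing weight. For splits much smaller than 200
// items the result barely depends on the random seed.
fn two_means<R: Rng>(points: &[Vec<f32>], rng: &mut R) -> (Vec<f32>, Vec<f32>) {
    let mut a = points.choose(rng).unwrap().clone();
    let mut b = points.choose(rng).unwrap().clone();
    for i in 0..ITERATION_STEPS {
        let p = points.choose(rng).unwrap();
        let closer = if dist2(&a, p) < dist2(&b, p) { &mut a } else { &mut b };
        let w = 1.0 / (i as f32 + 2.0);
        closer.iter_mut().zip(p).for_each(|(c, x)| *c += (x - *c) * w);
    }
    (a, b)
}

fn dist2(x: &[f32], y: &[f32]) -> f32 {
    x.iter().zip(y).map(|(xi, yi)| (xi - yi) * (xi - yi)).sum()
}
```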
**making sense of everything**

There are a lot of moving parts here, but they're all tied together by the principle that using the bias increases performance and reduces both indexing and search times across all dataset sizes.
Yeah, I don't need anything more than that to understand the issue; it's something I already thought about a long time ago and decided I should not delve into because I didn't have the time. And now I'm sock-less 😂 But yeah, Annoy was designed for Spotify in 2010-ish; they already had tons of documents, and the comment above the …
I think I made this worse in the past 10 PRs. Before, when we could not make a split, we were picking the items randomly, and now I'm just taking the first half (relying on their IDs).
In our case, if we want to make that work in the general case, shouldn't the number of steps depend directly on the number of documents in the split? Also, big thanks for the investigation, because I was thinking of two different ways to parallelize arroy, and one of them would have been terrible for tree generation, so that's super cool 🙏
Co-authored-by: Tamo <[email protected]>
that's an interesting idea, we could use some fraction of the number of documents in the split. Not sure if this whole topic deserves a PR of its own, or if it should be included here...
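Something like this sketch, where the fraction and the bounds are made-up values just to show the shape of the change:

```rust
// Hypothetical replacement for the fixed constant: scale the number of
// two_means steps with the split size, clamped to reasonable bounds.
fn iteration_steps(split_len: usize) -> usize {
    (split_len / 20).clamp(20, 200)
}
```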
I would say we include it here and maybe update it later in another PR, unless you have concerns about something?
Changed `ITERATION_STEPS` here. Ideally that next PR^ should be the last time we touch …
irevoire left a comment:
Thanks a lot for your investigation! Merging this for now; I will investigate to find a better value before the release if you don't do it before me 😁
















Pull Request
Currently arroy does not use the bias of the hyperplane in `create_split`, even though it is calculated. This is OK for Cosine and Dot distances, where the bias is close to zero (vectors are normalized), but it is problematic for Euclidean and Manhattan. Check the comments for more details/results.
Caution
PR is db breaking.
Fun fact: by serializing the node header you gain a lot more freedom in how splits are defined. For example, the Hamming distance (#124) only requires a random index to define a split; this can be chucked in the node header, and the resulting leaf is much cheaper to store than a sparse vector (12 bytes vs `[u8; n_dimensions]`).
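A sketch of what that could look like for Hamming (assumed types, building on the fun fact above; not an actual arroy node):

```rust
// The whole split is defined by a single random bit index: items whose bit
// is set go to one side, the rest to the other. Serializing this header is
// far cheaper than serializing a normal vector of n_dimensions bytes.
struct NodeHeaderHamming {
    bit_index: u32,
}

fn side(header: &NodeHeaderHamming, item: &[u8]) -> bool {
    let byte = (header.bit_index / 8) as usize;
    let bit = header.bit_index % 8;
    (item[byte] >> bit) & 1 == 1
}
```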
Related issue
Fixes #81
What does this PR do?
- Stores a `Leaf<'_, Distance>` in the `SplitPlaneNormal` and updates the heed trait impls for se/deserialization (see the sketch after this list)
- Removes `margin_no_header` from the `Distance` trait in favour of `margin`
- Handles `BinaryQuantizedManhattan` and `BinaryQuantizedEuclidean` the same way as their non-quantized counterparts
- Adds a `std::fmt::Debug` impl for all node headers to fix testing on different runners
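For reference, the node shape after the first item, reconstructed from the review diff above:

```rust
// `normal` now carries a full Leaf (vector + header, hence the bias) instead
// of a bare unaligned vector.
pub struct SplitPlaneNormal<'a, D: Distance> {
    pub left: ItemId,
    pub right: ItemId,
    pub normal: Option<Leaf<'a, D>>,
}
```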
Impacts

A more "balanced" split:
