use the bias as intended #93
Conversation
Yo @irevoire I revived this PR in a fork because it's concerning mathematically that the bias still isn't being used in arroy. I'll drop my thoughts below (sorry for the wall of text). TL;DR: for large datasets, using the bias as intended generally yields significant improvements in recall and indexing time.
understanding the bias term
A hyperplane is the set of points x satisfying w · x + b = 0, where w is the normal and b is the bias. That's the same as what's in arroy (but currently the bias is ignored). Here's what the plane should look like with the learned bias versus what it currently looks like with bias = 0: the former actually passes through the data it splits, while the latter is forced through the origin (plots omitted).
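To make that concrete, here's a minimal sketch of the side test with and without the bias (hypothetical helper names, not arroy's actual types or API):

```rust
/// Plain dot product of two equal-length slices.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Side test using the bias as intended: the hyperplane is the set of
/// points x with dot(normal, x) + bias == 0, so the sign of that
/// expression decides which subtree an item falls into.
fn side_with_bias(normal: &[f32], bias: f32, point: &[f32]) -> bool {
    dot(normal, point) + bias >= 0.0
}

/// What effectively happens today when the stored bias is ignored:
/// every split is forced through the origin, regardless of where the
/// data actually lives.
fn side_without_bias(normal: &[f32], point: &[f32]) -> bool {
    dot(normal, point) >= 0.0
}
```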
consequences
Note
This is generally OK for Cosine and Dot distances, since the data is normalized to the unit sphere before indexing, so the bias is close to 0 anyway. We're more worried about Euclidean and Manhattan...
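To see why normalization makes the bias negligible, assume an Annoy-style two-means split: the hyperplane has (unnormalized) normal p − q and passes through the midpoint of the two centroids p and q. Under that assumption (a sketch, not arroy's actual code), the bias reduces to (‖q‖² − ‖p‖²)/2, which is exactly 0 for unit-norm centroids:

```rust
/// For a hyperplane with normal w = p - q through the midpoint
/// m = (p + q) / 2, the bias is b = -dot(w, m) = (|q|^2 - |p|^2) / 2.
/// With unit-normalized data (|p| = |q| = 1) this vanishes, which is
/// why ignoring the bias is mostly harmless for Cosine and Dot.
fn split_bias(p: &[f32], q: &[f32]) -> f32 {
    let norm_sq = |v: &[f32]| v.iter().map(|x| x * x).sum::<f32>();
    (norm_sq(q) - norm_sq(p)) / 2.0
}
```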
results
impact on recall@k
We'll use the vector store benchmark for this (recall plots omitted). In the case where we vary n_trees, the indexing times are:
| | n_trees=1 | n_trees=10 | n_trees=100 | n_trees=500 |
|---|---|---|---|---|
| main | 2.08s | 5.93s | 55.94s | 649.31s |
| pr | 3.44s | 5.12s | 48.82s | 225.18s |
a weird thing
If we run the benchmark with the default number of trees, the approach that properly uses the bias actually underperforms what's in arroy currently (plot omitted):
My guess is that arroy overcomes poor-quality trees with brute-force search (a high default number of trees plus high oversampling), so these ensembled weak learners end up beating the trees we produce by using the bias correctly.
Warning
The target_n_trees is proportional to the embedding dimension, irrespective of the number of vectors being indexed.
from the code:
```rust
// 1. How many descendants do we need:
let descendant_required = item_indices.len() / dimensions;
// 2. Find the number of tree nodes required per tree:
let tree_nodes_per_tree = descendant_required + 1;
// 3. Find the number of trees required to get as many tree nodes as items:
let mut nb_trees = item_indices.len() / tree_nodes_per_tree;
```
In the above, nb_trees = n/((n/d) + 1) = nd/(n + d) = d/(1 + d/n) ≈ d whenever n ≫ d, i.e. for any collection we're realistically indexing. So for my test above with 10k vectors we're using nearly 1k trees by default. This might explain the performance difference.
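As a quick sanity check of that approximation (hypothetical numbers: 10k vectors, 768-dimensional embeddings; this just re-derives the snippet above, it's not new arroy code):

```rust
/// Reproduces the default tree-count computation from the snippet above.
fn default_nb_trees(n_items: usize, dimensions: usize) -> usize {
    let descendant_required = n_items / dimensions;
    let tree_nodes_per_tree = descendant_required + 1;
    n_items / tree_nodes_per_tree
}

fn main() {
    // n * d / (n + d) = 10_000 * 768 / 10_768 ≈ 713, and the exact
    // integer computation below prints 714 -- already close to d.
    println!("{}", default_nb_trees(10_000, 768));
}
```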
conclusions
- for large datasets using the bias generally yields significant performance improvements in terms of recall and indexing time
- In cases where increasing n_trees is not feasible, using the bias properly should create higher quality trees
- Performance gains depend on the dataset & embedding structure
Very curious to hear your thoughts on this
Oh, awesome investigation. Thanks again for your work, @nnethercott. The truth is we only use cosine and its binary quantized version on Meilisearch, so I didn't spend much time on the issue. However, I'm totally up for merging your PR.
It also leads to longer search requests, which is the biggest issue for us here 😭
Not sure I understood these two points, sorry 🤔
Yep, that's an issue as well. I think there is a bug somewhere, but I couldn't find it. TL;DR: there was an issue which I knew about but never had the time to work on.
ok nice, I'll open the PR then :)
I'll take a look here too, I think the easiest fix is just to take …












Pull Request
Related issue
Fixes #81
What does this PR do?
On main: (figure omitted)
After this PR: (figure omitted)
And for the binary quantization?
Currently, I don't really know how we should implement that for the binary quantization. I did a few tests here, and some of them improved the recall by almost 10 points in some cases (results figure omitted).