DiskBBQ - Adapt visited_ratio based on query - segment affinity in multi segment scenario #132396
…csearch into diskbbq_segment_affinity
…, minor improvements
…affinity as visited ratio modifier
I benchmarked this change, just using regular indexing and merge policy. I wonder if there is a bug? For very low visit percentages (1% or less), it's worse. For higher visit percentages, it's only marginally better. I also benchmarked with a 10% selectivity filter, and there is zero difference between the two implementations.
BASELINE
CANDIDATE:
This makes sense. As the PR currently stands, we cap the lowest threshold at 1%. So at a configured 1%, we likely explore every segment at a ratio of at least 1%, and possibly more, which means we're exploring too much. I can try to deal with this by skipping this logic entirely when the ratio is below, say, 5%. At 1% it would then revert to the behavior on main.
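To make the proposed guard concrete, here is a minimal sketch of how it might look. It is not the PR's code: the class name, method name, and constant names are hypothetical, the 1% floor and 5% cutoff are simply the figures mentioned in this comment, and the affinity multiplier is treated as an opaque input.

```java
// Hypothetical sketch, not the PR's code: all names and the structure are assumptions.
final class VisitRatioGuard {
    // Lowest per-segment ratio the adjustment may produce (the 1% cap mentioned above).
    static final float MIN_ADJUSTED_RATIO = 0.01f;
    // Proposed cutoff: below this configured ratio, skip the affinity logic entirely.
    static final float BYPASS_BELOW = 0.05f;

    /**
     * @param configuredRatio    the visit ratio requested for the query, in (0, 1]
     * @param affinityMultiplier above 1 for high-affinity segments, below 1 for low-affinity ones
     * @return the ratio to use for this segment
     */
    static float effectiveRatio(float configuredRatio, float affinityMultiplier) {
        if (configuredRatio < BYPASS_BELOW) {
            // Small ratios are left untouched, i.e. the behavior on main.
            return configuredRatio;
        }
        float adjusted = configuredRatio * affinityMultiplier;
        // Clamp into [MIN_ADJUSTED_RATIO, 1].
        return Math.min(1.0f, Math.max(MIN_ADJUSTED_RATIO, adjusted));
    }
}
```

With this shape, a configured 1% ratio passes through unchanged, so it can no longer be inflated by the floor or by the multiplier.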
That surprises me given what @tteofili was showing me. Let me see if I can replicate and dig into those numbers.
I'll try to run with this as well.
…tio, lower all thresholds to favor smaller visit ratios
I cleaned up the magic numbers a bit. I'm seeing decent improvements with dbpedia. I'll run some additional datasets here later tonight. I set the cutoff threshold to below half a percent and lowered the min affinity from 1% to 0.1%, so you should see an actual impact at the levels you were testing, @benwtrent.
I ran dbpedia again with the same ingest this time.
However, when I started to run cohere I got results similar to what you originally reported, @benwtrent. When filtering, I'm seeing very little difference between candidate and baseline. When not filtering, I'm seeing the baseline sometimes doing better. I'm going to run it a few more times tomorrow. I see a lot of variation with the default merge policy across the few runs I did, which was a little surprising. I'll see whether anything can be done to improve this and post some additional numbers.
The concurrent merge scheduler may or may not kick off a merge, depending on thread availability and memory pressure. So it's not totally surprising that the number of segments jumps between tier sizes (e.g. 5-10 in count).
…rease rather than increase
…csearch into diskbbq_segment_affinity
I managed to stabilize the results so that this works with both the default (tiered) merge policy and no merge policy, without breaking recall.
Cohere Main
Candidate
DBPedia Main
Candidate
the final result is that the reduction in |
As discussed here, we might favor / penalize exploration of certain segments based on query vs segment affinity.
This does that by leveraging information about segment density (vectors per cluster) and query-to-global-centroid similarity.
Segments with higher affinity get an increased `visited_ratio`, whereas segments with lower affinity see their `visited_ratio` decreased. Optionally, segments with very small affinity might not get explored.
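As a rough illustration of the idea, the sketch below scores each segment from the two signals named above and scales a base ratio by how the segment compares to the average. Everything in it is an assumption made for illustration: the class and method names, the cosine similarity, the `log1p` density weighting, the normalization by the mean score, and the `SKIP` marker are hypothetical and are not taken from the PR.

```java
// Hypothetical sketch: the scoring formula, names, and constants are assumptions, not the PR's code.
final class SegmentAffinitySketch {
    /** Per-segment summary: the segment's global centroid and its density (vectors per cluster). */
    static final class SegmentInfo {
        final float[] globalCentroid;
        final float vectorsPerCluster;

        SegmentInfo(float[] globalCentroid, float vectorsPerCluster) {
            this.globalCentroid = globalCentroid;
            this.vectorsPerCluster = vectorsPerCluster;
        }
    }

    /** Marker value meaning "do not explore this segment". */
    static final float SKIP = -1f;

    static float cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0f;
        }
        return (float) (dot / Math.sqrt(normA * normB));
    }

    /** Returns one visited ratio per segment, or SKIP for segments below minAffinity. */
    static float[] adjustRatios(float[] query, SegmentInfo[] segments, float baseRatio, float minAffinity) {
        double[] scores = new double[segments.length];
        double sum = 0;
        for (int i = 0; i < segments.length; i++) {
            // Map cosine from [-1, 1] to [0, 1], then weight by log density (assumed weighting).
            double similarity = (cosine(query, segments[i].globalCentroid) + 1) / 2;
            scores[i] = similarity * Math.log1p(segments[i].vectorsPerCluster);
            sum += scores[i];
        }
        double mean = segments.length == 0 ? 0 : sum / segments.length;
        float[] ratios = new float[segments.length];
        for (int i = 0; i < segments.length; i++) {
            // An affinity of 1.0 means "average"; above 1 raises the ratio, below 1 lowers it.
            double affinity = mean == 0 ? 1.0 : scores[i] / mean;
            if (affinity < minAffinity) {
                ratios[i] = SKIP;
            } else {
                ratios[i] = (float) Math.min(1.0, baseRatio * affinity);
            }
        }
        return ratios;
    }
}
```

Normalizing by the mean keeps the average ratio across segments close to the configured one, which is one possible way to shift work toward high-affinity segments without increasing total work; whether the PR does this is not stated here.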