Remove soar duplicate checking #132617

Merged

Conversation

benwtrent
Member

@benwtrent benwtrent commented Aug 9, 2025

Across our various benchmarking runs, I noticed that we do a silly amount of work just handling duplicate vectors for overspill. With block scoring, it is likely much better to score the duplicates and deduplicate later. The benchmarks below confirm this, and the performance gain grows as the number of vector ops increases.
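
To illustrate the trade-off, here is a minimal sketch; it is not the code changed in this PR. It assumes a toy posting-list layout in which overspill assigns a vector to more than one cluster. `OverspillDedupSketch`, `scoreWithVisitedSet`, `scoreThenDedupe`, and the dot-product scorer are hypothetical names chosen for illustration. The first variant consults a `BitSet` before every score; the second scores each posting list wholesale and leaves duplicate removal to the collection step.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy comparison of duplicate handling for overspilled posting lists (all names hypothetical). */
final class OverspillDedupSketch {

    /** Dot product, standing in for the real (bulk, possibly quantized) scorer. */
    static float score(float[] query, float[] vector) {
        float sum = 0f;
        for (int i = 0; i < query.length; i++) {
            sum += query[i] * vector[i];
        }
        return sum;
    }

    /** Baseline idea: check a visited set per doc so each doc is scored at most once. */
    static Map<Integer, Float> scoreWithVisitedSet(float[] query, float[][] vectors, int[][] postingLists) {
        BitSet visited = new BitSet(vectors.length);
        Map<Integer, Float> scores = new HashMap<>();
        for (int[] postingList : postingLists) {
            for (int doc : postingList) {
                if (visited.get(doc)) {
                    continue; // per-doc branch that interrupts block scoring
                }
                visited.set(doc);
                scores.put(doc, score(query, vectors[doc]));
            }
        }
        return scores;
    }

    /** Candidate idea: score every entry, duplicates included, and deduplicate on collection. */
    static Map<Integer, Float> scoreThenDedupe(float[] query, float[][] vectors, int[][] postingLists) {
        Map<Integer, Float> scores = new HashMap<>();
        for (int[] postingList : postingLists) {
            for (int doc : postingList) {
                // A duplicate produces the same score, so overwriting the entry is harmless.
                scores.put(doc, score(query, vectors[doc]));
            }
        }
        return scores;
    }

    /** Returns the ids of the k highest-scoring docs, best first. */
    static List<Integer> topK(Map<Integer, Float> scores, int k) {
        List<Map.Entry<Integer, Float>> entries = new ArrayList<>(scores.entrySet());
        entries.sort((a, b) -> Float.compare(b.getValue(), a.getValue()));
        List<Integer> top = new ArrayList<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        float[][] vectors = { { 1f, 0f }, { 0f, 3f }, { 2f, 2f }, { 0.25f, 0.25f } };
        int[][] postingLists = { { 0, 2 }, { 1, 2, 3 } }; // doc 2 is overspilled into both clusters
        float[] query = { 1f, 1f };
        System.out.println(topK(scoreWithVisitedSet(query, vectors, postingLists), 2));
        System.out.println(topK(scoreThenDedupe(query, vectors, postingLists), 2));
    }
}
```

Both variants return the same top hits; the second simply performs one extra scoring operation for the overspilled document instead of branching on every document.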

Multi-segment Cohere-wiki-768 8M

I ran each n_probe value 5 times and picked the fastest run.

CANDIDATE

index_name                      index_type  n_probe  latency(ms)  net_cpu_time(ms)  avg_cpu_count     QPS  recall     visited  filter_selectivity
------------------------------  ----------  -------  -----------  ----------------  -------------  ------  ------  ----------  ------------------
cohere-wikipedia-docs-768d.vec         ivf       10         7.12              0.00           0.00  140.45    0.80    83108.96                1.00
cohere-wikipedia-docs-768d.vec         ivf       20        10.47              0.00           0.00   95.51    0.86   169324.80                1.00
cohere-wikipedia-docs-768d.vec         ivf       50        19.86              0.00           0.00   50.35    0.91   461667.04                1.00
cohere-wikipedia-docs-768d.vec         ivf      100        33.65              0.00           0.00   29.72    0.94   950007.20                1.00
cohere-wikipedia-docs-768d.vec         ivf      200        57.04              0.00           0.00   17.53    0.95  1797631.04                1.00
cohere-wikipedia-docs-768d.vec         ivf      500       124.30              0.00           0.00    8.05    0.96  4334902.24                1.00
cohere-wikipedia-docs-768d.vec         ivf     1000       236.78              0.00           0.00    4.22    0.96  8521820.48                1.00

BASELINE

index_name                      index_type  n_probe  latency(ms)  net_cpu_time(ms)  avg_cpu_count     QPS  recall     visited  filter_selectivity
------------------------------  ----------  -------  -----------  ----------------  -------------  ------  ------  ----------  ------------------
cohere-wikipedia-docs-768d.vec         ivf       10         7.21              0.00           0.00  138.70    0.81    74077.53                1.00
cohere-wikipedia-docs-768d.vec         ivf       20        10.83              0.00           0.00   92.34    0.86   144966.33                1.00
cohere-wikipedia-docs-768d.vec         ivf       50        21.75              0.00           0.00   45.98    0.91   365150.68                1.00
cohere-wikipedia-docs-768d.vec         ivf      100        38.25              0.00           0.00   26.14    0.93   698105.96                1.00
cohere-wikipedia-docs-768d.vec         ivf      200        65.61              0.00           0.00   15.24    0.95  1278157.01                1.00
cohere-wikipedia-docs-768d.vec         ivf      500       148.98              0.00           0.00    6.71    0.95  2890457.27                1.00
cohere-wikipedia-docs-768d.vec         ivf     1000       281.02              0.00           0.00    3.56    0.95  4939370.44                1.00

Single segment Cohere-wiki-1024 1M

My thought was that larger vectors might make block scoring more expensive, in which case picking individual vectors would be better. Same methodology as above.

Candidate

index_name        index_type  n_probe  latency(ms)  net_cpu_time(ms)  avg_cpu_count      QPS  recall    visited  filter_selectivity
----------------  ----------  -------  -----------  ----------------  -------------  -------  ------  ---------  ------------------
wiki1024en.train         ivf       10         0.63              0.00           0.00  1587.30    0.81    6389.60                1.00
wiki1024en.train         ivf       20         0.86              0.00           0.00  1162.79    0.88   12528.48                1.00
wiki1024en.train         ivf       50         1.43              0.00           0.00   699.30    0.93   30627.04                1.00
wiki1024en.train         ivf      100         2.30              0.00           0.00   434.78    0.95   61259.84                1.00
wiki1024en.train         ivf      200         4.12              0.00           0.00   242.72    0.97  122569.44                1.00
wiki1024en.train         ivf      500         9.64              0.00           0.00   103.73    0.98  307816.80                1.00
wiki1024en.train         ivf     1000        18.79              0.00           0.00    53.22    0.98  618772.32                1.00

Baseline

index_name        index_type  n_probe  latency(ms)  net_cpu_time(ms)  avg_cpu_count      QPS  recall    visited  filter_selectivity
----------------  ----------  -------  -----------  ----------------  -------------  -------  ------  ---------  ------------------
wiki1024en.train         ivf       10         0.65              0.00           0.00  1538.46    0.82    5680.72                1.00
wiki1024en.train         ivf       20         0.84              0.00           0.00  1190.48    0.88   10677.40                1.00
wiki1024en.train         ivf       50         1.49              0.00           0.00   671.14    0.94   24431.26                1.00
wiki1024en.train         ivf      100         2.41              0.00           0.00   414.94    0.96   47000.85                1.00
wiki1024en.train         ivf      200         4.56              0.00           0.00   219.30    0.97   91284.42                1.00
wiki1024en.train         ivf      500        10.56              0.00           0.00    94.70    0.98  218185.33                1.00
wiki1024en.train         ivf     1000        20.81              0.00           0.00    48.05    0.98  412137.05                1.00
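
For reference, the relative changes can be derived directly from the tables above. The helper below is only an arithmetic sketch: `BenchmarkDeltaSketch`, `latencyReductionPct`, and `visitedGrowth` are hypothetical names, and the inputs are the n_probe = 1000 rows copied from the four tables.

```java
import java.util.Locale;

/** Arithmetic on the n_probe = 1000 rows of the benchmark tables (hypothetical helper). */
final class BenchmarkDeltaSketch {

    /** Percentage by which the candidate latency is lower than the baseline latency. */
    static double latencyReductionPct(double baselineMs, double candidateMs) {
        return 100.0 * (baselineMs - candidateMs) / baselineMs;
    }

    /** Factor by which the candidate's visited count exceeds the baseline's. */
    static double visitedGrowth(double baselineVisited, double candidateVisited) {
        return candidateVisited / baselineVisited;
    }

    public static void main(String[] args) {
        // Multi-segment Cohere-wiki-768 8M, n_probe = 1000
        System.out.println(String.format(Locale.ROOT, "8M: latency -%.1f%%, visited x%.2f",
            latencyReductionPct(281.02, 236.78), visitedGrowth(4939370.44, 8521820.48)));
        // Single segment Cohere-wiki-1024 1M, n_probe = 1000
        System.out.println(String.format(Locale.ROOT, "1M: latency -%.1f%%, visited x%.2f",
            latencyReductionPct(20.81, 18.79), visitedGrowth(412137.05, 618772.32)));
    }
}
```

At n_probe = 1000 this works out to roughly 15.7% lower latency with about 1.73x as many visited vectors on the 8M dataset, and roughly 9.7% and 1.50x on the 1M dataset.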

@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@elasticsearchmachine elasticsearchmachine added the Team:Search Relevance label Aug 9, 2025
Contributor

@iverase iverase left a comment

I like that the change removes the need for the visitedDocs bitset, but I worry that some calls to common APIs will no longer be correct. For example, the following call:

            if (scoredDocs > 0) {
                knnCollector.incVisitedCount(scoredDocs);
            }

Won't that be incorrect, because we are counting some documents twice? Is that a problem?

@benwtrent
Member Author

> Won't that be incorrect, because we are counting some documents twice? Is that a problem?

We may visit the same doc twice, but I think that is ok.

We are using "visited" as a stand-in for "number of vector ops", which remains correct and is exposed via profiling. The top-hit count is still exposed as the k returned by the query, which is unaffected.

What do you think?

@iverase
Contributor

iverase commented Aug 11, 2025

I saw in another PR that we might move from visiting x nProbes to a visited ratio (which I think is the right approach, as clusters are not balanced). This change will have an effect on that ratio.

Anyway, I do like removing the visitedDocs BitSet, so I am good with this and with trying to make the ratio work given that we might visit a document twice.
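
As a rough illustration of this concern, and not the implementation from the other PR, the sketch below computes a visited ratio two ways: from scoring operations, which count an overspilled document once per cluster it appears in, and from distinct documents. `VisitedRatioSketch`, `opsRatio`, and `uniqueRatio` are hypothetical names, and the posting lists are made-up toy data.

```java
import java.util.BitSet;

/** Toy comparison of an operation-based and a unique-document visited ratio (hypothetical names). */
final class VisitedRatioSketch {

    /** Ratio based on vector operations: every posting-list entry counts, duplicates included. */
    static double opsRatio(int[][] probedLists, int numDocs) {
        long ops = 0;
        for (int[] list : probedLists) {
            ops += list.length;
        }
        return (double) ops / numDocs;
    }

    /** Ratio based on distinct documents; this requires tracking a visited set. */
    static double uniqueRatio(int[][] probedLists, int numDocs) {
        BitSet seen = new BitSet(numDocs);
        for (int[] list : probedLists) {
            for (int doc : list) {
                seen.set(doc);
            }
        }
        return (double) seen.cardinality() / numDocs;
    }

    public static void main(String[] args) {
        int numDocs = 10;
        int[][] probedLists = { { 0, 1, 2 }, { 2, 3, 4 } }; // doc 2 appears in both clusters
        System.out.println("ops ratio:    " + opsRatio(probedLists, numDocs));
        System.out.println("unique ratio: " + uniqueRatio(probedLists, numDocs));
    }
}
```

In this toy case the operation-based ratio is 0.6 while the unique-document ratio is 0.5, so a budget expressed as a ratio would be reached sooner when duplicates are counted.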

Contributor

@iverase iverase left a comment

LGTM

@benwtrent benwtrent added the auto-merge-without-approval label Aug 11, 2025
@elasticsearchmachine elasticsearchmachine merged commit bfefe03 into elastic:main Aug 11, 2025
33 checks passed
@benwtrent benwtrent deleted the remove-soar-duplicate-checking branch August 11, 2025 19:06
Labels
auto-merge-without-approval, >non-issue, :Search Relevance/Vectors, Team:Search Relevance, v9.2.0
3 participants