Skip to content

Conversation

@viclafargue
Copy link
Contributor

Closes #1632

@jinsolp
Copy link
Contributor

jinsolp commented Dec 11, 2025

Looks like for the example in #1632 doing this fixes the issue. Results in 0.99 recall with correct distances.

However, I have concerns (like you mentioned in the meeting) about self-distances not getting 0 if we add this fix.
This is more pronounced when we have DataT != AccT such as DataT=fp16 and AccT=fp32.

For example, running this script with this fix

from sklearn.datasets import make_blobs
import numpy as np
from cuvs.neighbors import brute_force
import cupy as cp

X, y = make_blobs(n_samples=100_000, n_features=256, centers=10, random_state=42)
X = X.astype(np.float16)
metric = "l2"
k = 15

index = brute_force.build(cp.array(X), metric=metric)
gpu_bf_distances, gpu_bf_indices = brute_force.search(index, cp.array(X), k)
gpu_bf_indices = gpu_bf_indices.copy_to_host()
gpu_bf_distances = gpu_bf_distances.copy_to_host()
print(f"gpu_bf_indices[2]: {gpu_bf_indices[2]}")
print(f"gpu_bf_distances[2]: {gpu_bf_distances[2]}")

prints

gpu_bf_indices[2]: [    2 27250 16513 23365 76060 90416 42661 18825 59056 90326 45336 52750
  6836 45806 74146]
gpu_bf_distances[2]: [ 0.0625   18.050602 18.181753 18.21438  18.220385 18.227512 18.356602
 18.42224  18.495619 18.515512 18.53328  18.58474  18.637842 18.666742
 18.668154]

we can see the self-distance for vector 2 is 0.0625 😔

@viclafargue
Copy link
Contributor Author

cuML does a second pass to recompute the exact distances, and reorder the set nearest neighbors. We could do the same in the Python layer of cuVS, but it wont fix the C++ calls. Ideally the fix should be implemented in C++.

@cjnolet cjnolet added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels Jan 5, 2026
@cjnolet cjnolet moved this from Todo to In Progress in Vector Search, ML, & Data Mining Release Board Jan 5, 2026
@viclafargue viclafargue changed the title Fix L2 self-neighbor clamping Bruteforce search imprecision Jan 6, 2026
@viclafargue viclafargue changed the title Bruteforce search imprecision Fix bruteforce search imprecision Jan 6, 2026
@viclafargue
Copy link
Contributor Author

viclafargue commented Jan 6, 2026

I updated the PR to move away from updating the mechanism clamping the self-neighbor, and instead implemented a refining step on the search results of bruteforce. Is this something we would like to keep? Or should we look for an other solution? We could instead implement the refining in the cuVS Python layer and after C++ calls in consumers of the cuVS bruteforce algorithm (like UMAP).

@cjnolet
Copy link
Member

cjnolet commented Jan 6, 2026

I would prefer that we not slow this algorithm down further by having to run multiple passes if it can at all be avoided. I'd much rather expend some additonal time thinking through a more performant and robust method to do this.

We have really been suffering on the perf side lately by making changes that, when done in isolation don't affect perf by all that much, but when taken all together have had a huge impact...

@viclafargue viclafargue requested a review from a team as a code owner January 6, 2026 16:58
@viclafargue
Copy link
Contributor Author

I removed the refinement from the C++ side and replaced it with an optional second pass available in the Python layer through a two_pass_precision argument similar to cuML.

@jinsolp jinsolp assigned viclafargue and unassigned jinsolp Jan 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

improvement Improves an existing functionality non-breaking Introduces a non-breaking change

Projects

Development

Successfully merging this pull request may close these issues.

[BUG] Brute Force returns wrong distances

3 participants