[DISKBBQ] Don't spill vectors that are numerically equivalent to the centroid #132706
Currently, if a vector is numerically equivalent to the centroid (the distance between the vector and the centroid is lower than SOAR_MIN_DISTANCE), we spill the vector to the second nearest centroid. We think this is not needed: if the vector is really a near neighbour, we expect its centroid to be one of the centroids searched, so spilling it to the second nearest provides little value. In addition, in many cases this indicates a degenerate situation where the centroid is populated with copies of the same vector, so spilling all of these vectors to the nearest other centroid is not good.

Therefore this commit proposes that in the degenerate case, where the vector is equivalent to the centroid, the vector does not get a SOAR assignment, which is recorded as -1 in the SOAR assignments array. This commit adds a couple of tests with degenerate distributions of vectors to make sure we handle the situation downstream.
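
To illustrate the proposed behaviour, here is a minimal Python sketch; it is not the code in this PR. The function name `assign_with_soar`, the `soar_lambda` parameter, the threshold value, and the exact cost formula (squared distance plus a penalty on the component parallel to the primary residual) are assumptions made for illustration only. The only parts taken from the description above are the comparison against SOAR_MIN_DISTANCE and the use of -1 for "no SOAR assignment".

```python
# Assumed value for illustration; the real threshold is not stated in this PR description.
SOAR_MIN_DISTANCE = 1e-16
# Sentinel described in the PR: -1 means the vector has no SOAR assignment.
NO_SOAR_ASSIGNMENT = -1


def _sub(a, b):
    return [x - y for x, y in zip(a, b)]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def assign_with_soar(vectors, centroids, soar_lambda=1.0):
    """Return (primary, soar) centroid indices for each vector.

    Hypothetical helper: a vector whose squared distance to its nearest
    centroid is below SOAR_MIN_DISTANCE is not spilled and gets -1.
    """
    primary, soar = [], []
    for v in vectors:
        residuals = [_sub(v, c) for c in centroids]
        sq_dists = [_dot(r, r) for r in residuals]
        best = min(range(len(centroids)), key=sq_dists.__getitem__)
        primary.append(best)

        r1, r1_sq = residuals[best], sq_dists[best]
        if r1_sq < SOAR_MIN_DISTANCE or len(centroids) < 2:
            # Degenerate case: the vector is numerically the centroid itself.
            soar.append(NO_SOAR_ASSIGNMENT)
            continue

        def cost(j):
            # Assumed SOAR-style cost: penalise candidates whose residual
            # is parallel to the primary residual r1.
            proj = _dot(residuals[j], r1)
            return sq_dists[j] + soar_lambda * proj * proj / r1_sq

        others = (j for j in range(len(centroids)) if j != best)
        soar.append(min(others, key=cost))
    return primary, soar


centroids = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]
vectors = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0]]
primary, soar = assign_with_soar(vectors, centroids)
print(primary)  # [0, 0, 0]
print(soar)     # [-1, -1, 2]
```

Anything that consumes the SOAR assignments array then has to treat -1 as "skip" rather than as a centroid index, which is the downstream handling the new tests are meant to cover.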