Add the current count of vectors in a cluster in hierarchical k-means #132587
+187
−109
This commit adds a new parameter to the k-means result: an array containing the current count of vectors in each cluster. The array is always up to date, so at any time it contains the number of vectors assigned to each cluster. It is used wherever we count the assigned vectors, both in the codec and in the algorithm itself. More importantly, it will allow us to limit the number of vectors in a cluster if we wish to, in order to build more balanced clusters.
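To illustrate the idea of a count array that stays in sync with the assignments, here is a minimal Python sketch. It is not the code from this PR: the names `assign_step`, `squared_distance`, `assignments`, and `counts` are hypothetical, and using `-1` to mark an unassigned vector is an assumption made for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_step(vectors, centroids, assignments, counts):
    """Reassign each vector to its nearest centroid, keeping counts in sync.

    assignments[i] is the current cluster of vector i, or -1 if unassigned
    (an assumption of this sketch). counts[c] is the number of vectors
    currently assigned to cluster c. Both lists are updated in place.
    Returns True if any assignment changed.
    """
    changed = False
    for i, v in enumerate(vectors):
        # Nearest centroid; ties resolve to the lowest cluster index.
        best = min(range(len(centroids)),
                   key=lambda c: squared_distance(v, centroids[c]))
        prev = assignments[i]
        if best != prev:
            if prev != -1:
                counts[prev] -= 1  # vector leaves its previous cluster
            counts[best] += 1      # vector joins its new cluster
            assignments[i] = best
            changed = True
    return changed
```

Because each move decrements one entry and increments another, `counts` reflects the assignments after every single reassignment and never has to be recomputed by scanning all vectors. A cap on cluster size could be checked against `counts[best]` before a move, though that is not part of this sketch.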
I did not notice any performance regression or change in recall after this change. The only difference from the previous version is that when we update the centroids after an assignment step, we now use all the assigned vectors, whereas before we used only the sampled vectors.
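The centroid update described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the PR's implementation: `update_centroids` and its arguments are hypothetical names, `-1` marks an unassigned vector, and leaving the centroid of an empty cluster unchanged is a choice made only for this example.

```python
def update_centroids(vectors, assignments, counts, centroids):
    """Recompute each centroid as the mean of all vectors assigned to it.

    counts[c] must equal the number of entries in assignments equal to c,
    so it can serve directly as the divisor. Clusters with no vectors keep
    their previous centroid (an assumption of this sketch).
    """
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    for v, c in zip(vectors, assignments):
        if c == -1:
            continue  # unassigned vectors do not contribute
        for d in range(dim):
            sums[c][d] += v[d]
    for c, n in enumerate(counts):
        if n > 0:
            centroids[c] = [s / n for s in sums[c]]
    return centroids
```

Passing every assigned vector, rather than only a sample, is what makes `counts` a valid divisor here: the sums and the counts then describe the same set of vectors.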