Contributor

@tarang-jain tarang-jain commented Jan 9, 2026

[01/09/2026]

  1. Mathematically equivalent to the current kmeans. This PR adds a batch-size parameter so that the dataset is loaded in batches, with cluster assignments and (weighted) centroid sums computed per batch. A single kmeans iteration, i.e. the final centroid update, completes only once the whole dataset pass has finished and the accumulated sums are averaged. Distinction from miniBatchKmeans: there, the centroids are updated after every batch (faster to converge).
  2. Binary size:
    Although the batched approach has its own header, I've put the batched fit functions into the same TU as kmeans_fit. Common functions such as minClusterAndDistance, when instantiated with the same template params, SHOULD not be compiled a second time.
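
To illustrate point 1, here is a minimal CPU-only sketch of one batched iteration. It is not the code in this PR: the names `Matrix`, `nearest_centroid` and `batched_kmeans_iteration` are hypothetical, sample weights are omitted, and the distance is assumed to be squared L2. It only shows that per-cluster sums and counts are accumulated across batches and that the centroids change once, after the full pass.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Row-major matrix stored flat: row i occupies [i * d, (i + 1) * d).
using Matrix = std::vector<double>;

// Hypothetical helper: index of the centroid closest to x (squared L2).
std::size_t nearest_centroid(const double* x, const Matrix& centroids,
                             std::size_t k, std::size_t d) {
  std::size_t best = 0;
  double best_dist = std::numeric_limits<double>::max();
  for (std::size_t c = 0; c < k; ++c) {
    double dist = 0.0;
    for (std::size_t j = 0; j < d; ++j) {
      const double diff = x[j] - centroids[c * d + j];
      dist += diff * diff;
    }
    if (dist < best_dist) {
      best_dist = dist;
      best = c;
    }
  }
  return best;
}

// Hypothetical sketch of ONE kmeans iteration that reads the data in
// batches of `batch_size` rows. Unweighted for brevity.
Matrix batched_kmeans_iteration(const Matrix& data, const Matrix& centroids,
                                std::size_t n, std::size_t d, std::size_t k,
                                std::size_t batch_size) {
  assert(batch_size > 0);
  Matrix sums(k * d, 0.0);               // per-cluster coordinate sums
  std::vector<std::size_t> counts(k, 0); // per-cluster point counts

  for (std::size_t start = 0; start < n; start += batch_size) {
    const std::size_t end = std::min(n, start + batch_size);
    // Assignments always use the centroids from the start of the
    // iteration; nothing is updated between batches.
    for (std::size_t i = start; i < end; ++i) {
      const std::size_t c = nearest_centroid(&data[i * d], centroids, k, d);
      ++counts[c];
      for (std::size_t j = 0; j < d; ++j) sums[c * d + j] += data[i * d + j];
    }
  }

  // The averaging step runs once, after the whole dataset has been seen.
  Matrix updated = centroids; // empty clusters keep their old centroid
  for (std::size_t c = 0; c < k; ++c) {
    if (counts[c] == 0) continue;
    for (std::size_t j = 0; j < d; ++j) {
      updated[c * d + j] = sums[c * d + j] / static_cast<double>(counts[c]);
    }
  }
  return updated;
}
```

In this sketch, calling `batched_kmeans_iteration` with `batch_size` equal to `n` or with any smaller value gives the same centroids, since every point is assigned against the same fixed centroids. A mini-batch variant would instead move the averaging step inside the batch loop, so later batches would be assigned against already-updated centroids.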


copy-pr-bot bot commented Jan 9, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@tarang-jain tarang-jain added feature request New feature or request non-breaking Introduces a non-breaking change cpp labels Jan 9, 2026
@tarang-jain tarang-jain self-assigned this Jan 9, 2026
