Commit 45a54fa
derrickburns and claude committed
feat: Add Mini-Batch K-Means implementation
Implements Mini-Batch K-Means algorithm for efficient clustering of large datasets:
- MiniBatchKMeans estimator with Spark ML API compatibility
- Incremental center updates using η = 1/(count+1) learning rate
- Support for all Bregman divergences (SE, KL, Itakura-Saito, L1, spherical)
- Early stopping based on no improvement for N consecutive batches
- Configurable batch size, reassignment ratio, and convergence tolerance

Key parameters:
- batchSize: samples per mini-batch (default: 1024)
- maxNoImprovement: early stopping patience (default: 10)
- reassignmentRatio: for empty cluster handling (default: 0.01)

Test suite includes 13 tests covering:
- Basic clustering with various divergences
- Early stopping behavior
- Deterministic results with fixed seed
- Parameter validation

Reference: Sculley (2010) "Web-Scale K-Means Clustering"

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
1 parent fc4d8c0 commit 45a54fa

3 files changed: +925 -1 lines changed

ROADMAP.md

Lines changed: 45 additions & 1 deletion
@@ -1,6 +1,6 @@
 # Roadmap: Generalized K-Means Clustering

-> **Last Updated:** 2025-12-15 (RDD API Removed)
+> **Last Updated:** 2025-12-15 (New Algorithm Roadmap)
 > **Status:** Active planning document
 > **Maintainer Note:** Claude should inspect and update this file as changes are made.

@@ -162,6 +162,50 @@ This document tracks planned improvements, technical debt, and future directions
 - **Reference:** Kulis & Jordan (2012): "Revisiting k-means: New Algorithms via Bayesian Nonparametrics"
 - **Status:** Completed 2025-12-15

+### 3.4 Add Mini-Batch K-Means (P1)
+- **Motivation:** Orders of magnitude faster for very large datasets; standard in scikit-learn
+- **Algorithm:** Process random mini-batches instead of full data per iteration
+  - Sample batch of size `batchSize` at each iteration
+  - Update centers using weighted running average
+  - Convergence based on center stability across batches
+- **Key parameters:**
+  - `batchSize`: Number of samples per mini-batch (default: 1024)
+  - `maxNoImprovement`: Early stopping after N batches without improvement (default: 10)
+  - `reassignmentRatio`: Fraction of batch to reassign for empty clusters (default: 0.01)
+- **Files to create:**
+  - `src/main/scala/com/massivedatascience/clusterer/ml/MiniBatchKMeans.scala`
+  - `src/test/scala/com/massivedatascience/clusterer/ml/MiniBatchKMeansSuite.scala`
+- **Reference:** Sculley (2010): "Web-Scale K-Means Clustering"
+- **Status:** In Progress
+
+### 3.5 Add Constrained/Balanced K-Means (P2)
+- **Motivation:** Enforce min/max cluster sizes for workload balancing, equal-sized segments
+- **Algorithm:** Modified Lloyd's with Hungarian algorithm or min-cost flow for assignment
+  - Assignment step solves balanced assignment problem
+  - Update step remains standard centroid computation
+- **Key parameters:**
+  - `minClusterSize`: Minimum points per cluster (default: 1)
+  - `maxClusterSize`: Maximum points per cluster (default: n/k)
+  - `balanceMode`: "soft" (penalty) or "hard" (strict constraint)
+- **Files to create:**
+  - `src/main/scala/com/massivedatascience/clusterer/ml/BalancedKMeans.scala`
+  - `src/test/scala/com/massivedatascience/clusterer/ml/BalancedKMeansSuite.scala`
+- **Reference:** Malinen & Fränti (2014): "Balanced K-Means for Clustering"
+- **Status:** Not Started
+
+### 3.6 Bregman-Native k-means++ Seeding (P2)
+- **Motivation:** Current k-means|| uses SE distances for seeding even with non-SE divergences
+- **Algorithm:** k-means++ probability-proportional seeding using the actual Bregman divergence
+  - Select first center uniformly at random
+  - Select subsequent centers with probability proportional to D(x, nearest_center)
+  - Works for any Bregman divergence (KL, IS, etc.)
+- **Key insight:** Better initialization leads to faster convergence and better local optima
+- **Files to modify:**
+  - `src/main/scala/com/massivedatascience/clusterer/ml/GeneralizedKMeans.scala` (initializeKMeansPP)
+  - Add tests for KL/IS seeding quality
+- **Reference:** Nock, Luosto & Kivinen (2008): "Mixed Bregman Clustering with Approximation Guarantees"
+- **Status:** Not Started
+
 ---

 ## 4. Performance Improvements (P2)
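
The center update named in the commit message and in roadmap item 3.4 (a running average with learning rate η = 1/(count+1)) can be sketched as follows. This is a minimal illustration in plain Python, not the Scala code added by this commit: it assumes squared Euclidean distance only, and the names `nearest_center` and `mini_batch_step` are hypothetical. Batch sampling, early stopping, and empty-cluster reassignment are left out.

```python
def nearest_center(point, centers):
    """Return the index of the center closest to `point` (squared Euclidean)."""
    def sq_dist(center):
        return sum((p - c) ** 2 for p, c in zip(point, center))
    return min(range(len(centers)), key=lambda j: sq_dist(centers[j]))


def mini_batch_step(batch, centers, counts):
    """Update `centers` and `counts` in place from one mini-batch."""
    # Assign every batch point against the centers as they stood before this step.
    assignments = [nearest_center(x, centers) for x in batch]
    for x, j in zip(batch, assignments):
        # Per-center learning rate; `counts[j]` is the number of points seen so far.
        eta = 1.0 / (counts[j] + 1)
        centers[j] = [(1.0 - eta) * c + eta * xi for c, xi in zip(centers[j], x)]
        counts[j] += 1


centers = [[0.0, 0.0], [10.0, 10.0]]
counts = [0, 0]
mini_batch_step([[1.0, 1.0], [3.0, 3.0], [9.0, 9.0]], centers, counts)
print(centers, counts)  # [[2.0, 2.0], [9.0, 9.0]] [2, 1]
```

With this rate each center equals the mean of all points assigned to it so far. One consequence of starting the counts at zero, as assumed here, is that the first assigned point fully replaces the initial center.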

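The seeding procedure proposed in roadmap item 3.6 (first center uniform at random, later centers drawn with probability proportional to D(x, nearest_center)) might look like the sketch below. It is an illustration only, not the planned `initializeKMeansPP` change: the function names are hypothetical, generalized KL divergence stands in for "any Bregman divergence", and the inputs are assumed to be strictly positive vectors.

```python
import math
import random


def kl_divergence(x, y):
    """Generalized KL divergence D(x, y) for strictly positive vectors."""
    total = sum(xi * math.log(xi / yi) - xi + yi for xi, yi in zip(x, y))
    return max(0.0, total)  # clamp tiny negative values caused by rounding


def bregman_kmeans_pp(points, k, divergence, seed=0):
    """Pick k seed centers from `points` using divergence-proportional sampling."""
    rng = random.Random(seed)
    centers = [points[rng.randrange(len(points))]]
    # Divergence from each point to its nearest center chosen so far.
    nearest = [divergence(x, centers[0]) for x in points]
    while len(centers) < k:
        if sum(nearest) <= 0.0:
            # Every point coincides with a chosen center; fall back to uniform.
            idx = rng.randrange(len(points))
        else:
            idx = rng.choices(range(len(points)), weights=nearest, k=1)[0]
        centers.append(points[idx])
        nearest = [min(d, divergence(x, points[idx])) for x, d in zip(points, nearest)]
    return centers


points = [[1.0, 2.0], [1.1, 2.1], [5.0, 0.5], [0.2, 9.0]]
seeds = bregman_kmeans_pp(points, 3, kl_divergence, seed=42)
```

Passing the divergence as a function keeps the sampling loop independent of the choice of divergence, so an Itakura-Saito function could be swapped in without touching the loop. Points already chosen have weight zero and are not drawn again while any other point has positive weight.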