Commit 2cbd3f6

derrickburns and claude committed
feat: Add RobustKMeans, tests for new components, and bug fixes
New Features:
- RobustKMeans estimator with trim, noise_cluster, and m_estimator modes for outlier-resistant clustering with outlier score output
- Comprehensive test suites for previously untested components

Test Suites Added (108 new tests, 716 total):
- OutlierDetectionSuite: 16 tests for distance-based/trimmed detection
- SparseBregmanKernelSuite: 28 tests for sparse SE, KL, L1, Spherical kernels
- ConstraintsSuite: 30 tests for must-link/cannot-link constraints
- ConstrainedKMeansSuite: 17 tests for semi-supervised clustering
- RobustKMeansSuite: 17 tests for robust clustering with persistence
- ExtendedPersistenceSuite: 5 tests for new model persistence

Bug Fixes:
- SparseSEKernel.divergenceSparse missing 0.5 factor
- AgglomerativeBregmanModel persistence serializing IntParam instead of value

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
1 parent d5c40d6 commit 2cbd3f6
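
The `SparseSEKernel.divergenceSparse` fix concerns a constant factor. Below is a minimal Python sketch of the idea, assuming the dense kernel defines the squared-Euclidean divergence as 0.5 · ‖x − y‖² (the Bregman divergence generated by F(x) = 0.5 · ‖x‖²); the changelog only states that the sparse kernel "now matches SquaredEuclideanKernel". The function names and the dict-based sparse format are hypothetical and are not the library's API.

```python
from typing import Dict, List

SparseVec = Dict[int, float]  # hypothetical sparse format: index -> nonzero value


def divergence_dense(x: List[float], y: List[float]) -> float:
    """Squared-Euclidean Bregman divergence, assumed to be 0.5 * ||x - y||^2."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(x, y))


def divergence_sparse(x: SparseVec, y: SparseVec) -> float:
    """Same divergence, visiting only the union of nonzero indices."""
    total = 0.0
    for i in x.keys() | y.keys():
        diff = x.get(i, 0.0) - y.get(i, 0.0)
        total += diff * diff
    return 0.5 * total  # dropping this factor would double every reported distance


def to_sparse(v: List[float]) -> SparseVec:
    """Convert a dense list to the hypothetical sparse format."""
    return {i: a for i, a in enumerate(v) if a != 0.0}
```

A uniform factor does not change which center is nearest, but it does change reported costs and any absolute distance threshold, and it produces inconsistent values if sparse and dense code paths are mixed. A parity check that asserts `divergence_sparse(to_sparse(x), to_sparse(y))` equals `divergence_dense(x, y)` on the same vectors is one way to catch this class of bug.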

30 files changed, +7183 -439 lines changed

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -15,6 +15,7 @@ project/target/
 project/project/
 project/metals.sbt
 .bsp/
+.sbt/

 # Scala-IDE specific
 .scala_dependencies

CHANGELOG.md

Lines changed: 27 additions & 0 deletions
@@ -13,12 +13,39 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - SECURITY.md with vulnerability reporting guidelines
 - CONTRIBUTING.md with development guidelines
 - Test suite fixes for Scala 2.12/2.13 compatibility
+- Spherical K-Means / cosine divergence support across estimators and models
+- New estimators: Mini-Batch K-Means, DP-Means, Balanced K-Means, Constrained (semi-supervised) K-Means, Kernel K-Means, Agglomerative Bregman, Bregman mixture models (EM), and CoClustering following the Spark ML Estimator/Model pattern
+- Bregman-native k-means++ seeding plus executable examples for spherical k-means
+- Outlier detection scaffolding with distance- and trim-based detectors
+- Property-based kernel accuracy suites and a performance benchmark suite with JSON outputs
+- **RobustKMeans estimator** for outlier-resistant clustering with trim, noise_cluster, and m_estimator modes
+- **Test suites for new components** (108 new tests, 716 total):
+  - OutlierDetectionSuite: 16 tests for distance-based and trimmed outlier detection
+  - SparseBregmanKernelSuite: 28 tests for sparse-optimized SE, KL, L1, Spherical kernels
+  - ConstraintsSuite: 30 tests for must-link/cannot-link constraints and penalty computation
+  - ConstrainedKMeansSuite: 17 tests for semi-supervised clustering with soft/hard constraints
+  - RobustKMeansSuite: 17 tests for robust clustering with outlier handling and persistence

 ### Fixed
 - Package name conflicts in StreamingKMeans and XMeans test suites
 - Scala 2.12 compatibility issues with `isFinite` method
+- SparseSEKernel divergenceSparse missing 0.5 factor (now matches SquaredEuclideanKernel)
+- AgglomerativeBregmanModel persistence serializing IntParam object instead of value
 - Spark 3.4 compatibility issues with `model.summary` API
 - CollectionConverters imports for cross-version support
+- BLAS `doMax` comparison, division-by-zero guards in strategies and co-clustering initializer, and invalid javac option in `build.sbt`
+
+### Changed
+- Unified divergence math via `BregmanFunction` and refactored kernel factory for consistency
+- Added Bregman-native initialization path and enriched Scaladoc across major estimators
+- Enhanced clustering iterator and constraint frameworks to support new variants
+
+### Performance
+- Accelerated squared-Euclidean assignment and Elkan-style cross-iteration bounds for Lloyd's iterations
+- Vectorized BLAS helpers for common linear algebra operations
+
+### Removed
+- Legacy RDD API and associated coreset/transform modules (DataFrame/ML API is now the sole surface)

 ## [0.6.0] - 2025-10-18
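
The changelog entry for RobustKMeans names a `trim` mode but does not specify it. One common scheme with that name is trimmed k-means, sketched below in Python under that assumption: each point is assigned to its nearest center, the `trim_fraction` of points with the largest distances are flagged as outliers, and centers are recomputed from the remaining points. All names here are hypothetical, and the estimator's actual behavior may differ.

```python
from typing import List, Tuple

Point = Tuple[float, ...]


def sq_dist(a: Point, b: Point) -> float:
    """Plain squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def trimmed_step(points: List[Point], centers: List[Point], trim_fraction: float):
    """One assign/trim/update pass. Returns (new_centers, labels, scores, outliers)."""
    # Assignment: nearest center, with the distance to it kept as an outlier score.
    labels, scores = [], []
    for p in points:
        dists = [sq_dist(p, c) for c in centers]
        k = min(range(len(centers)), key=dists.__getitem__)
        labels.append(k)
        scores.append(dists[k])

    # Trimming: flag the n_trim points with the largest scores.
    n_trim = int(trim_fraction * len(points))
    by_score = sorted(range(len(points)), key=scores.__getitem__, reverse=True)
    outliers = set(by_score[:n_trim])

    # Update: mean of the non-trimmed members; an empty cluster keeps its old center.
    new_centers = []
    for k, old in enumerate(centers):
        members = [points[i] for i in range(len(points))
                   if labels[i] == k and i not in outliers]
        if members:
            new_centers.append(tuple(sum(m[d] for m in members) / len(members)
                                     for d in range(len(old))))
        else:
            new_centers.append(old)
    return new_centers, labels, scores, outliers
```

In this sketch the per-point `scores` double as the "outlier score output" mentioned in the commit message; whether the real estimator exposes scores in this form is not confirmed by the source.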

CLAUDE.md

Lines changed: 6 additions & 6 deletions
@@ -21,7 +21,8 @@
 3. **Mathematical fidelity first.** Correct Bregman formulations beat micro-perf. Perf changes must not alter semantics.
 4. **Determinism matters.** Same seed ⇒ identical results. Avoid nondeterministic ops in core loops.
 5. **Tight PRs.** Small, test-backed, CI-friendly. No speculative abstractions.
-6. **Maintain the roadmap.** When making changes, update `ROADMAP.md` to reflect completed work, new issues discovered, or priority changes.
+6. **Maintain the roadmap.** Keep `ROADMAP.md` forward-looking; move completed work into `CHANGELOG.md` and leave the roadmap focused on upcoming items and priorities.
+7. **Use shared model helpers.** Clustering models should mix in `HasTrainingSummary` and `CentroidModelHelpers` for consistent summaries/metadata; avoid reintroducing ad-hoc summary fields.

 ---

@@ -37,12 +38,11 @@
 **Claude must:**
 1. **Inspect `ROADMAP.md`** at the start of significant work to understand current priorities and context.
 2. **Update `ROADMAP.md`** when:
-   - Completing a bug fix → mark as ✅ FIXED with date
-   - Discovering a new bug → add to Bug Fixes section with priority
-   - Completing a feature → move to Completed Items section
-   - Identifying technical debt → add to appropriate section
+   - Discovering a new bug or opportunity → add with priority
+   - Changing priorities → update ordering/sections
    - Making architectural decisions → add to Decision Log
-3. **Reference roadmap items** in commit messages and PR descriptions where applicable.
+3. **Move completed work** (features, fixes, docs, perf) into `CHANGELOG.md` with dates and drop it from `ROADMAP.md` so the roadmap stays forward-looking.
+4. **Reference roadmap items** in commit messages and PR descriptions where applicable.

 ---
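
The determinism guideline in the first hunk above (same seed ⇒ identical results) can be illustrated with a small Python sketch. It performs k-means++-style seeding driven by a locally constructed `random.Random(seed)` instead of global random state. The function name and data are hypothetical, and this is not the project's seeding code.

```python
import random
from typing import List, Tuple

Point = Tuple[float, ...]


def seed_centers(points: List[Point], k: int, seed: int) -> List[Point]:
    """k-means++-style seeding with an explicit, local RNG."""
    rng = random.Random(seed)  # no global state, so repeated calls are reproducible
    centers = [points[rng.randrange(len(points))]]
    while len(centers) < k:
        # Weight each point by its squared distance to the nearest chosen center.
        weights = [min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
                   for p in points]
        if sum(weights) == 0.0:
            # Every point coincides with a chosen center; fall back to a uniform draw.
            centers.append(points[rng.randrange(len(points))])
        else:
            centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers
```

Seeding the generator is only part of the requirement: the order in which points are visited must also be stable, since the same seed applied to differently ordered input can select different centers.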
