feat: Major release with new algorithms and performance improvements
New algorithms:
- DPMeans: Bayesian nonparametric clustering with automatic k selection
- CoClustering: ML Estimator/Model pattern for simultaneous row/column clustering
- SphericalKernel: Cosine similarity support for text/embedding clustering
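To illustrate what "automatic k selection" can mean for DP-means, here is a minimal, self-contained sketch in plain Scala. It is not the DPMeans estimator from this commit: the object name `DpMeansSketch`, the `lambda` threshold parameter, and the use of squared Euclidean distance are assumptions made for illustration. The idea shown is that a point farther than `lambda` from every existing center opens a new cluster, so k grows with the data.

```scala
// Hypothetical sketch, not the library's DPMeans API.
object DpMeansSketch {
  type Point = Array[Double]

  /** Squared Euclidean distance between two points of equal dimension. */
  def sqDist(a: Point, b: Point): Double = {
    var sum = 0.0
    var i = 0
    while (i < a.length) {
      val d = a(i) - b(i)
      sum += d * d
      i += 1
    }
    sum
  }

  /** Component-wise mean of a non-empty group of points. */
  def mean(points: Seq[Point]): Point = {
    val acc = new Array[Double](points.head.length)
    for (p <- points; i <- acc.indices) acc(i) += p(i)
    acc.map(_ / points.size)
  }

  /** Index of the nearest center and the squared distance to it. */
  def nearest(p: Point, centers: Vector[Point]): (Int, Double) = {
    var best = 0
    var bestSq = sqDist(p, centers(0))
    var j = 1
    while (j < centers.size) {
      val d = sqDist(p, centers(j))
      if (d < bestSq) { best = j; bestSq = d }
      j += 1
    }
    (best, bestSq)
  }

  /** Returns the final centers; their count is the k chosen by the data. */
  def cluster(points: IndexedSeq[Point], lambda: Double, maxIter: Int = 20): Vector[Point] = {
    require(points.nonEmpty, "need at least one point")
    var centers = Vector(mean(points)) // start with one global cluster
    val assignment = Array.fill(points.size)(-1)
    var changed = true
    var iter = 0
    while (changed && iter < maxIter) {
      changed = false
      for (i <- points.indices) {
        val (best, bestSq) = nearest(points(i), centers)
        // A point farther than lambda from every center opens a new cluster.
        val chosen =
          if (bestSq > lambda) { centers = centers :+ points(i).clone(); centers.size - 1 }
          else best
        if (assignment(i) != chosen) { assignment(i) = chosen; changed = true }
      }
      // Recompute centers as cluster means, dropping clusters that ended up empty.
      val used = centers.indices.filter(j => assignment.contains(j))
      val remap = used.zipWithIndex.toMap
      centers = used.map { j =>
        mean(points.indices.filter(k => assignment(k) == j).map(k => points(k)))
      }.toVector
      for (i <- assignment.indices) assignment(i) = remap(assignment(i))
      iter += 1
    }
    centers
  }
}
```

In this sketch a smaller `lambda` yields more clusters and a larger one yields fewer. For example, the four points (0,0), (0,1), (10,10), (10,11) with `lambda = 4.0` settle on two centers.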
Performance improvements:
- ElkanLloydsIterator: Triangle inequality acceleration for squared Euclidean (SE) distance (10-50x speedup)
- AcceleratedSEAssignment: Center-distance pruning for single iterations
- AdaptiveBroadcastAssignment: Memory-aware broadcast chunk sizing
- Vectorized BLAS: Native nrm2, squaredNorm, asum, normalize operations
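The triangle-inequality and center-distance pruning mentioned above can be sketched in a few lines of plain Scala. This is an illustration under assumptions, not the code of ElkanLloydsIterator or AcceleratedSEAssignment: the object `PrunedAssignmentSketch` and its method names are hypothetical. It relies on the standard bound that if d(c_best, c_j) >= 2 d(x, c_best), then c_j cannot be closer to x than c_best, so the distance from x to c_j never needs to be computed.

```scala
// Hypothetical sketch of center-distance pruning for squared Euclidean distance.
object PrunedAssignmentSketch {
  /** Squared Euclidean distance between two points of equal dimension. */
  def sqDist(a: Array[Double], b: Array[Double]): Double = {
    var sum = 0.0
    var i = 0
    while (i < a.length) {
      val d = a(i) - b(i)
      sum += d * d
      i += 1
    }
    sum
  }

  /** Pairwise squared distances between centers, computed once per iteration. */
  def centerSqDistances(centers: IndexedSeq[Array[Double]]): Array[Array[Double]] =
    Array.tabulate(centers.length, centers.length)((a, b) => sqDist(centers(a), centers(b)))

  /** Returns (index of the nearest center, number of point-to-center distances computed). */
  def assign(
      x: Array[Double],
      centers: IndexedSeq[Array[Double]],
      centerSq: Array[Array[Double]]): (Int, Int) = {
    var best = 0
    var bestSq = sqDist(x, centers(0))
    var computed = 1
    var j = 1
    while (j < centers.length) {
      // Squared form of d(c_best, c_j) >= 2 * d(x, c_best): skip c_j when it holds.
      if (centerSq(best)(j) < 4.0 * bestSq) {
        val d = sqDist(x, centers(j))
        computed += 1
        if (d < bestSq) { best = j; bestSq = d }
      }
      j += 1
    }
    (best, computed)
  }
}
```

The pruning pays off when centers are well separated relative to the point-to-center distances, since most candidate centers are then rejected using only the precomputed center-to-center table.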
Architecture:
- BregmanFunction: Unified trait as single source of truth for divergences
- BregmanFunctionAdapter: Bridges to both RDD and DataFrame APIs
Bug fixes:
- BLAS.doMax comparison operator (was computing minimum)
- Division by zero guards in Strategies.scala and CoClusteringInitializer
- build.sbt javac version format ("17.0" -> "17")
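The first two fixes belong to bug classes that are easy to pin down with small regression checks. The sketch below is hypothetical: `maxOf` and `safeDivide` are illustrative names, not the signatures of `BLAS.doMax` or of the guards in Strategies.scala, and the `eps` threshold and `fallback` value are assumptions.

```scala
// Hypothetical illustration of the two bug classes; not the library's code.
object BugFixSketch {
  /** Largest element of a non-empty array. */
  def maxOf(values: Array[Double]): Double = {
    require(values.nonEmpty, "empty input has no maximum")
    var best = values(0)
    var i = 1
    while (i < values.length) {
      // Using '<' here instead of '>' would silently return the minimum.
      if (values(i) > best) best = values(i)
      i += 1
    }
    best
  }

  /** Division that returns a fallback when the denominator is (near) zero. */
  def safeDivide(num: Double, den: Double, fallback: Double = 0.0, eps: Double = 1e-12): Double =
    if (math.abs(den) < eps) fallback else num / den
}
```

A test input whose maximum and minimum differ, such as `Array(3.0, -1.0, 7.5, 2.0)`, is enough to catch a flipped comparison.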
Documentation:
- Comprehensive Scaladoc for all 5 estimators
- SphericalKMeansExample with executable assertions
- ROADMAP.md tracking planned improvements
Tests: 942 tests passing (added ~200 new tests)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|**Test Matrix**| 730 tests across all Scala/Spark combinations<br/>• 62 kernel accuracy tests (divergence formulas, gradients, inverse gradients)<br/>• 19 Lloyd's iterator tests (core k-means loop)<br/>• Determinism, edge cases, numerical stability | Part of main CI |
-|**Executable Documentation**| All examples run with assertions that verify correctness ([ExamplesSuite](src/test/scala/examples/ExamplesSuite.scala)):<br/>• [BisectingExample](src/main/scala/examples/BisectingExample.scala) - validates cluster count<br/>• [SoftKMeansExample](src/main/scala/examples/SoftKMeansExample.scala) - validates probability columns<br/>• [XMeansExample](src/main/scala/examples/XMeansExample.scala) - validates automatic k selection<br/>• [PersistenceRoundTrip](src/main/scala/examples/PersistenceRoundTrip.scala) - validates save/load with center accuracy<br/>• [PersistenceRoundTripKMedoids](src/main/scala/examples/PersistenceRoundTripKMedoids.scala) - validates medoid preservation | Part of main CI |
+|**Executable Documentation**| All examples run with assertions that verify correctness ([ExamplesSuite](src/test/scala/examples/ExamplesSuite.scala)):<br/>• [BisectingExample](src/main/scala/examples/BisectingExample.scala) - validates cluster count<br/>• [SoftKMeansExample](src/main/scala/examples/SoftKMeansExample.scala) - validates probability columns<br/>• [XMeansExample](src/main/scala/examples/XMeansExample.scala) - validates automatic k selection<br/>• [SphericalKMeansExample](src/main/scala/examples/SphericalKMeansExample.scala) - validates cosine similarity clustering<br/>• [PersistenceRoundTrip](src/main/scala/examples/PersistenceRoundTrip.scala) - validates save/load with center accuracy<br/>• [PersistenceRoundTripKMedoids](src/main/scala/examples/PersistenceRoundTripKMedoids.scala) - validates medoid preservation | Part of main CI |
|**Cross-version Persistence**| Models save/load across Scala 2.12↔2.13 and Spark 3.4↔3.5↔4.0 | Part of main CI |
|**Performance Sanity**| Basic performance regression check (30s budget) | Part of main CI |
|**Python Smoke Test**| PySpark wrapper with both SE and non-SE divergences | Part of main CI |
@@ -92,16 +92,17 @@ Truth-linked to code, tests, and examples for full transparency:
|**Streaming K-Means**| ✅ |[Code](src/main/scala/com/massivedatascience/clusterer/ml/StreamingKMeans.scala)|[Tests](src/test/scala/com/massivedatascience/clusterer/StreamingKMeansSuite.scala)|[Persistence](src/main/scala/examples/PersistenceRoundTripStreamingKMeans.scala)| Real-time with exponential forgetting |