Commit 8877c87

derrickburns and claude committed
feat: Add SparseKMeans estimator for high-dimensional sparse data
Implements SparseKMeans estimator with:
- Auto-sparsity detection based on data sampling
- Support for SE, KL, L1, and Spherical sparse kernels
- sparseMode parameter: "auto", "force", "dense"
- sparseThreshold parameter for auto-mode cutoff
- Full persistence support (save/load)
- 21 comprehensive tests

Also fixes KernelFactory.supportsSparse() case sensitivity for correct
auto-detection of sparse-eligible divergences.

Updates ROADMAP to mark Sparse Bregman clustering as complete.
Test suite now at 737 tests (all passing).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

1 parent a1ae3cf commit 8877c87
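
The commit message describes three cooperating pieces: a sampled sparsity estimate, a `sparseMode` switch, and a case-insensitive `supportsSparse()` check. The sketch below illustrates how such pieces could fit together. It is written in Python for brevity and is not the project's Scala code. The function names, the accepted divergence spellings, the `sample_size` default, the direction of the threshold comparison, and the error raised for `"force"` with an unsupported divergence are all assumptions made for illustration.

```python
import random
from typing import Sequence

# Assumed spellings for the four divergences the commit lists as
# sparse-capable (SE, KL, L1, Spherical). The real constants live in
# KernelFactory.Divergence and may differ.
SPARSE_CAPABLE = {"se", "squaredeuclidean", "kl", "l1", "spherical"}


def supports_sparse(divergence: str) -> bool:
    """Case-insensitive lookup, in the spirit of the supportsSparse() fix."""
    return divergence.strip().lower() in SPARSE_CAPABLE


def estimate_sparsity(rows: Sequence[Sequence[float]],
                      sample_size: int = 100,
                      seed: int = 0) -> float:
    """Return the fraction of zero entries in a random sample of rows."""
    if not rows:
        return 0.0
    rng = random.Random(seed)
    if len(rows) <= sample_size:
        sample = list(rows)
    else:
        sample = rng.sample(list(rows), sample_size)
    total = sum(len(r) for r in sample)
    if total == 0:
        return 0.0
    zeros = sum(1 for r in sample for x in r if x == 0.0)
    return zeros / total


def use_sparse_kernel(mode: str, divergence: str,
                      sparsity: float, threshold: float) -> bool:
    """Resolve sparseMode ("auto", "force", "dense") to a yes/no decision."""
    mode = mode.lower()
    if mode == "dense":
        return False
    if mode == "force":
        # Assumption: forcing a sparse kernel that does not exist is an error.
        if not supports_sparse(divergence):
            raise ValueError(f"no sparse kernel for divergence {divergence!r}")
        return True
    if mode == "auto":
        # Assumption: sparse is chosen when the zero fraction meets the cutoff.
        return supports_sparse(divergence) and sparsity >= threshold
    raise ValueError(f"unknown sparseMode {mode!r}")
```

With a case-sensitive lookup, a caller passing `"KL"` where the table stores `"kl"` would silently fall back to the dense path in auto mode, which appears to be the failure the commit message refers to.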

5 files changed: +1117 -7 lines changed
CHANGELOG.md

Lines changed: 16 additions & 1 deletion
@@ -19,12 +19,27 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Outlier detection scaffolding with distance- and trim-based detectors
 - Property-based kernel accuracy suites and a performance benchmark suite with JSON outputs
 - **RobustKMeans estimator** for outlier-resistant clustering with trim, noise_cluster, and m_estimator modes
-- **Test suites for new components** (108 new tests, 716 total):
+- **SparseKMeans estimator** for high-dimensional sparse data with auto-sparsity detection (21 tests)
+  - Automatic sparse kernel selection based on data sparsity ratio
+  - Support for SE, KL, L1, and Spherical divergences with sparse optimization
+  - `sparseMode` parameter: "auto", "force", or "dense"
+  - `sparseThreshold` parameter for auto-mode sparsity cutoff
+- **KernelFactory** for unified dense/sparse kernel creation with clear API
+  - Single entry point for all 8 Bregman divergences
+  - Auto-selection based on data sparsity with `forSparsity()` method
+  - Canonical divergence name constants in `KernelFactory.Divergence`
+- **Test suites for new components** (129 new tests, 737 total):
   - OutlierDetectionSuite: 16 tests for distance-based and trimmed outlier detection
   - SparseBregmanKernelSuite: 28 tests for sparse-optimized SE, KL, L1, Spherical kernels
   - ConstraintsSuite: 30 tests for must-link/cannot-link constraints and penalty computation
   - ConstrainedKMeansSuite: 17 tests for semi-supervised clustering with soft/hard constraints
   - RobustKMeansSuite: 17 tests for robust clustering with outlier handling and persistence
+  - SparseKMeansSuite: 21 tests for sparse clustering with auto-detection and persistence
+
+### Architecture
+- Moved AcceleratedSEAssignment to `strategies/impl/` subpackage for better organization
+- Added type aliases in package objects for backward compatibility
+- Models now use KernelFactory for kernel creation (reduces code duplication)

 ### Fixed
 - Package name conflicts in StreamingKMeans and XMeans test suites

ROADMAP.md

Lines changed: 6 additions & 4 deletions
@@ -1,6 +1,6 @@
 # Roadmap: Generalized K-Means Clustering

-> **Last Updated:** 2025-12-15
+> **Last Updated:** 2025-12-16
 > **Status:** Forward-looking (completed work now lives in `CHANGELOG.md`)
 > **Maintainer Note:** Keep this document for upcoming work only; ship-ready or finished items belong in the changelog.

@@ -23,8 +23,8 @@ This roadmap tracks upcoming improvements. It is organized by time horizon and p

 Goal: land the highest-demand capabilities and supporting docs.

-- **Robust Bregman clustering + outlier handling** (3.11 / 5.8) — finalize trimmed/noise-cluster strategies, expose `outlierFraction`/`outlierMode`, add scoring column + persistence.
-- **Sparse Bregman clustering** (3.12) — finish `SparseKMeans` estimator and sparse-aware update strategy on top of `SparseBregmanKernel`.
+- ~~**Robust Bregman clustering + outlier handling** (3.11 / 5.8)~~**DONE**: `RobustKMeans` with trim/noise_cluster/m_estimator modes, outlier scoring, persistence.
+- ~~**Sparse Bregman clustering** (3.12)~~**DONE**: `SparseKMeans` estimator with auto-sparsity detection, `KernelFactory` for unified kernel creation.
 - **Multi-view clustering** (3.13 / 5.9) — implement `MultiViewKMeans` with shared `MultiViewAssignment`, per-view weights/divergences.
 - **Docs & notebooks** (6.1) — quick-start notebook, divergence selection guide, X-Means auto-k demo, soft-clustering interpretation examples.

@@ -56,7 +56,7 @@ These frameworks unblock multiple roadmap items; prefer delivering them before d

 | Component | Priority | Enables | Notes |
 |-----------|----------|---------|-------|
-| Outlier Detection (5.8) | P1 | Robust Bregman clustering (3.11) | Trim/noise-cluster strategies, scoring column |
+| ~~Outlier Detection (5.8)~~ | ~~P1~~ | ~~Robust Bregman clustering (3.11)~~ | **DONE**: Trim/noise-cluster strategies, scoring column |
 | Multi-View (5.9) | P1 | Multi-view clustering (3.13) | View specs, weights, divergences |
 | Sequence Kernels (5.10) | P2 | Time-series clustering (3.15) | DTW/shape kernels, barycenters |
 | Consensus (5.11) | P2 | Ensemble clustering (3.16) | Base generator + co-association |
@@ -85,6 +85,8 @@ These frameworks unblock multiple roadmap items; prefer delivering them before d
 | 2025-12-15 | Prioritize robust/sparse/multi-view work next | Highest user demand and unlocks downstream variants |
 | 2025-12-15 | Maintain kernels in a single module (`BregmanKernel.scala`) | Consistency and discoverability |
 | 2025-12-15 | Use phased delivery for accelerations and new iterators | Keep CI stable while iterating |
+| 2025-12-16 | Created `KernelFactory` for unified kernel creation | Single API for dense/sparse kernels, reduces duplication |
+| 2025-12-16 | Moved assignment strategies to `impl/` subpackage | Better organization, backward-compatible via type aliases |

 ---

