Commit ec19f92

derrickburns and claude committed
feat: Add ClusteringMetrics for model selection and diagnostics
- Add ClusteringMetrics object with scalable silhouette score (exact and centroid-based approximation), inertia (WCSS), cluster sizes/balance, and elbow curve helper for finding optimal k
- Add ClusteringMetricsResult case class with silhouetteScore, approximateSilhouetteScore, balanceRatio, sizeStdDev methods
- Update ROADMAP to mark P0 items as complete:
  - Hamerly/Elkan pruning (was already implemented)
  - PySpark wrapper improvements (was already done)
  - Model selection & diagnostics (now complete)
  - Iteration history (was already in TrainingSummary)
- Update CHANGELOG with ClusteringMetrics and test counts (854 total)

12 new tests for clustering metrics validation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
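The commit message names two silhouette variants, but their code is not shown on this page. As a rough illustration only, the pure-Python sketch below contrasts the textbook exact silhouette (mean pairwise distances) with a centroid-based "simplified silhouette" that uses distances to cluster means instead. The function names `exact_silhouette` and `centroid_silhouette` are hypothetical, Euclidean distance is assumed, and it is an assumption that the Scala `ClusteringMetrics` approximation works this way.

```python
import math
from collections import defaultdict


def euclidean(p, q):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def _group(points, labels):
    """Group points by cluster label."""
    clusters = defaultdict(list)
    for p, c in zip(points, labels):
        clusters[c].append(p)
    return clusters


def exact_silhouette(points, labels):
    """Mean silhouette using all pairwise distances; needs >= 2 clusters."""
    clusters = _group(points, labels)
    scores = []
    for p, c in zip(points, labels):
        own = clusters[c]
        if len(own) == 1:
            scores.append(0.0)  # convention: singleton clusters score 0
            continue
        # a: mean distance to the other members (self-distance is 0)
        a = sum(euclidean(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(euclidean(p, q) for q in other) / len(other)
            for k, other in clusters.items()
            if k != c
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


def centroid_silhouette(points, labels):
    """Assumed approximation: distances to cluster means replace pairwise means."""
    clusters = _group(points, labels)
    centroids = {
        c: [sum(col) / len(pts) for col in zip(*pts)]
        for c, pts in clusters.items()
    }
    scores = []
    for p, c in zip(points, labels):
        a = euclidean(p, centroids[c])
        b = min(euclidean(p, m) for k, m in centroids.items() if k != c)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)
```

The exact form needs on the order of n² distance evaluations, while the centroid form needs about n·k, which is the usual reason to offer both.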
1 parent 169329b commit ec19f92

4 files changed: +636 -9 lines changed
CHANGELOG.md

Lines changed: 7 additions & 1 deletion
@@ -58,12 +58,17 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Eigendecomposition via power iteration with deflation
 - Nyström approximation for large-scale clustering (O(nm²) instead of O(n³))
 - Full persistence support (save/load)
+- **ClusteringMetrics** for model selection and diagnostics (12 tests)
+- Scalable silhouette score (exact and centroid-based approximation)
+- Inertia (WCSS) computation
+- Cluster sizes and balance metrics
+- Elbow curve helper for finding optimal k
 - **Documentation guides** in `docs/guides/`:
 - Quick Start Guide - get running in 5 minutes
 - Divergence Selection Guide - comprehensive decision flowchart and examples
 - X-Means Auto-K Demo - automatic cluster count selection with BIC/AIC
 - Soft Clustering Guide - interpreting probabilistic memberships
-- **Test suites for new components** (203 new tests, 842 total):
+- **Test suites for new components** (215 new tests, 854 total):
 - OutlierDetectionSuite: 16 tests for distance-based and trimmed outlier detection
 - SparseBregmanKernelSuite: 28 tests for sparse-optimized SE, KL, L1, Spherical kernels
 - ConstraintsSuite: 30 tests for must-link/cannot-link constraints and penalty computation
@@ -74,6 +79,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - TimeSeriesKMeansSuite: 31 tests for DTW-based time-series clustering with persistence
 - InformationBottleneckSuite: 28 tests for IB clustering with MI utilities and persistence
 - SpectralClusteringSuite: 25 tests for spectral/graph-based clustering with persistence
+- ClusteringMetricsSuite: 12 tests for silhouette, inertia, and cluster balance metrics

 ### Architecture
 - Moved AcceleratedSEAssignment to `strategies/impl/` subpackage for better organization
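
The changelog entries above list inertia (WCSS) and cluster balance metrics without defining them. The sketch below shows one plausible reading in plain Python. The function names are hypothetical, squared Euclidean distance is assumed for inertia, and the definitions of balance ratio (smallest cluster size over largest) and size standard deviation (population form) are assumptions, not the library's confirmed formulas.

```python
import math
from collections import Counter, defaultdict


def inertia(points, labels):
    """WCSS: sum of squared Euclidean distances from points to their cluster mean."""
    clusters = defaultdict(list)
    for p, c in zip(points, labels):
        clusters[c].append(p)
    total = 0.0
    for pts in clusters.values():
        centroid = [sum(col) / len(pts) for col in zip(*pts)]
        total += sum(
            sum((x - m) ** 2 for x, m in zip(p, centroid)) for p in pts
        )
    return total


def cluster_sizes(labels):
    """Number of points assigned to each cluster label."""
    return dict(Counter(labels))


def balance_ratio(sizes):
    """Assumed definition: smallest size / largest size (1.0 means balanced)."""
    return min(sizes.values()) / max(sizes.values())


def size_std_dev(sizes):
    """Assumed definition: population standard deviation of cluster sizes."""
    values = list(sizes.values())
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
```

A low balance ratio or a large size deviation is a hint that one cluster is absorbing most of the data, which inertia alone does not reveal.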

ROADMAP.md

Lines changed: 49 additions & 8 deletions
@@ -19,6 +19,47 @@ This roadmap tracks upcoming improvements. It is organized by time horizon and p

 ---

+## Documentation (P0) — Top Priority
+
+> **Framework:** Follow [Diátaxis](https://diataxis.fr/) — Tutorials, How-to guides, Reference, Explanation. Every code snippet must be runnable (CI-compiled or from `examples/`).
+
+### For Humans
+
+| Doc Type | Purpose | Format |
+|----------|---------|--------|
+| **README** | What it is, when to use it, minimal install + example | 2-5 min read |
+| **Tutorials** | First successful run end-to-end (Scala + PySpark) | 15-30 min, tiny dataset |
+| **How-tos** | Task recipes: cosine k-means, KL on probabilities, streaming, choosing k, sparse vectors, avoiding OOM | One task per page |
+| **Reference** | Every param with type/default/range, complexity, persistence, determinism | Exhaustive, skimmable |
+| **Explanation** | Geometry choices, preprocessing guidance, scaling tradeoffs, failure modes | Why it works / when it fails |
+
+**Actions:**
+- [ ] Restructure docs into Diátaxis categories
+- [ ] Create `docs/tutorials/` with end-to-end examples
+- [ ] Create `docs/howto/` with task-based recipes
+- [ ] Create `docs/reference/` with exhaustive param docs
+- [ ] Create `docs/explanation/` with conceptual guides
+- [ ] Ensure all code snippets compile in CI
+
+### For AI/LLM Consumption
+
+| Item | Purpose |
+|------|---------|
+| `/llms.txt` | Index of key pages for LLM tools ([llmstxt.org](https://llmstxt.org/)) |
+| `params.json` | Machine-readable parameter reference (names, defaults, constraints) |
+| `algorithms.json` | Algorithm catalog with links to canonical docs |
+| Small, single-topic pages | Better retrieval chunking |
+| Consistent headings | `## Parameters`, `## Examples`, `## Performance`, `## Failure modes` |
+
+**Actions:**
+- [ ] Add `/llms.txt` with page descriptions
+- [ ] Generate `docs/reference/params.json` from code
+- [ ] Generate `docs/reference/algorithms.json` catalog
+- [ ] Split large docs into single-topic pages
+- [ ] Standardize section headings across all docs
+
+---
+
 ## Adoption & Distribution (P0) — Highest ROI

 > **Insight:** More algorithms won't drive adoption if users can't easily install or trust the library. These items reduce friction and increase real-world usage more than any algorithm tweak.
@@ -39,7 +80,7 @@ This roadmap tracks upcoming improvements. It is organized by time horizon and p
 **Problem:** Squared-Euclidean at scale is the #1 use case; users care about cost and wall-clock.

 **Actions:**
-- [ ] **Fast exact:** Hamerly/Elkan/Yinyang pruning for Lloyd's iterations
+- [x] **Fast exact:** Hamerly/Elkan/Yinyang pruning for Lloyd's iterations — **DONE**: `ElkanLloydsIterator` with cross-iteration bounds, `AcceleratedSEAssignment` with triangle inequality pruning (13 tests)
 - [ ] **Fast approximate:** ANN-assisted assignment (LSH, KD-tree, ball tree)
 - [ ] Benchmark suite with published numbers (iterations/sec, speedup vs. baseline)

@@ -49,21 +90,21 @@ This roadmap tracks upcoming improvements. It is organized by time horizon and p

 **Actions:**
 - [ ] pip-installable wheel with pinned Spark/Scala compatibility
-- [ ] Type hints and docstrings for IDE support
-- [ ] Native-feeling PySpark examples (not just Scala translations)
-- [ ] Model save/load and full param support matching Scala API
+- [x] Type hints and docstrings for IDE support — **DONE**: Full type hints in `kmeans.py`, comprehensive Google-style docstrings
+- [x] Native-feeling PySpark examples (not just Scala translations) — **DONE**: 5 example files (basic, KL divergence, weighted, finding optimal k, persistence)
+- [x] Model save/load and full param support matching Scala API — **DONE**: Full persistence and TrainingSummary support
 - [ ] CI smoke tests for Python API

 ### 4. Model Selection & Diagnostics

 **Problem:** Users ask "Is this clustering any good?" after fitting.

 **Actions:**
-- [ ] Scalable silhouette score (Spark-native)
-- [ ] Elbow method helper (cost vs. k curve)
+- [x] Scalable silhouette score (Spark-native) — **DONE**: `ClusteringMetrics` with exact and approximate silhouette (12 tests)
+- [x] Elbow method helper (cost vs. k curve) — **DONE**: `ClusteringMetrics.elbowCurve()` method
 - [ ] Stability/bootstrap metrics
-- [ ] Iteration history: objective per iter, convergence reason, cluster sizes
-- [ ] First-class `ModelSummary` with JSON persistence
+- [x] Iteration history: objective per iter, convergence reason, cluster sizes — **DONE**: `TrainingSummary` with distortionHistory, movementHistory, convergenceReport (8 tests)
+- [x] First-class `ModelSummary` with JSON persistence — **DONE**: `TrainingSummary.toDF()` for DataFrame/JSON export

 ### 5. Production Features (Surface Existing Work)

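The roadmap hunk above marks the elbow method helper as done via `ClusteringMetrics.elbowCurve()`, but this page does not show that method's signature or how an elbow is chosen from the curve. The Python sketch below illustrates one common heuristic: given (k, cost) pairs, pick the k whose normalized point lies farthest from the straight line joining the first and last points. The function name `find_elbow` and this selection rule are assumptions for illustration, not the library's API.

```python
import math


def find_elbow(curve):
    """Pick an elbow from (k, cost) pairs sorted by increasing k.

    Heuristic (assumed, not taken from the library): rescale both axes
    to [0, 1] and return the k farthest from the first-to-last chord.
    """
    ks = [k for k, _ in curve]
    costs = [c for _, c in curve]
    k_span = ks[-1] - ks[0]
    c_span = max(costs) - min(costs)
    if len(curve) < 3 or k_span == 0 or c_span == 0:
        return ks[0]  # too little information to locate a bend

    # Normalize so that neither axis dominates the distance.
    xs = [(k - ks[0]) / k_span for k in ks]
    ys = [(c - min(costs)) / c_span for c in costs]

    x0, y0, x1, y1 = xs[0], ys[0], xs[-1], ys[-1]
    chord = math.hypot(x1 - x0, y1 - y0)

    def distance_to_chord(x, y):
        # Perpendicular distance from (x, y) to the line through the endpoints.
        return abs((y1 - y0) * (x - x0) - (x1 - x0) * (y - y0)) / chord

    best = max(range(len(curve)), key=lambda i: distance_to_chord(xs[i], ys[i]))
    return ks[best]
```

For a curve such as `[(1, 1000.0), (2, 400.0), (3, 100.0), (4, 90.0), (5, 85.0), (6, 80.0)]`, the cost flattens after k = 3, and this heuristic selects that value. Real cost curves are often noisier, so the result is best treated as a starting point rather than a final answer.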