feat: Add ClusteringMetrics for model selection and diagnostics

- Add ClusteringMetrics object with scalable silhouette score (exact and
  centroid-based approximation), inertia (WCSS), cluster sizes/balance,
  and elbow curve helper for finding optimal k
- Add ClusteringMetricsResult case class with silhouetteScore,
  approximateSilhouetteScore, balanceRatio, sizeStdDev methods
- Update ROADMAP to mark P0 items as complete:
  - Hamerly/Elkan pruning (was already implemented)
  - PySpark wrapper improvements (was already done)
  - Model selection & diagnostics (now complete)
  - Iteration history (was already in TrainingSummary)
- Update CHANGELOG with ClusteringMetrics and test counts (854 total)

12 new tests for clustering metrics validation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
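
For readers unfamiliar with the metrics named in the commit message, here is a minimal, plain-Python sketch of what such diagnostics typically compute. It is not the library's Scala implementation: every function name below is hypothetical, and the definitions of the balance ratio (smallest cluster size over largest), the size standard deviation (population standard deviation of cluster sizes), and the elbow rule (largest second difference of inertia) are assumptions chosen for illustration.

```python
import math
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inertia(points, labels, centroids):
    """WCSS: sum of squared distances from each point to its assigned centroid."""
    return sum(sq_dist(p, centroids[l]) for p, l in zip(points, labels))


def approximate_silhouette(points, labels, centroids):
    """Centroid-based silhouette approximation (requires at least 2 centroids).

    a = distance to the point's own centroid,
    b = distance to the nearest other centroid,
    score = (b - a) / max(a, b), averaged over all points.
    """
    scores = []
    for p, l in zip(points, labels):
        a = math.sqrt(sq_dist(p, centroids[l]))
        b = min(math.sqrt(sq_dist(p, c))
                for j, c in enumerate(centroids) if j != l)
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return sum(scores) / len(scores)


def balance_ratio(labels):
    """Assumed definition: smallest non-empty cluster size / largest size."""
    sizes = Counter(labels).values()
    return min(sizes) / max(sizes)


def size_std_dev(labels):
    """Assumed definition: population std. dev. of non-empty cluster sizes."""
    sizes = list(Counter(labels).values())
    mean = sum(sizes) / len(sizes)
    return math.sqrt(sum((s - mean) ** 2 for s in sizes) / len(sizes))


def elbow_k(inertia_by_k):
    """One elbow heuristic: the k with the largest second difference of inertia.

    Assumes evenly spaced k values; returns None with fewer than 3 entries.
    """
    ks = sorted(inertia_by_k)
    best_k, best_bend = None, float("-inf")
    for prev, k, nxt in zip(ks, ks[1:], ks[2:]):
        bend = inertia_by_k[prev] - 2 * inertia_by_k[k] + inertia_by_k[nxt]
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k


# Tiny illustrative dataset: two well-separated clusters of two points each.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
labels = [0, 0, 1, 1]
centroids = [(0, 1), (10, 1)]
print(inertia(points, labels, centroids))
print(approximate_silhouette(points, labels, centroids))
print(balance_ratio(labels), size_std_dev(labels))
```

The centroid-based score costs O(n·k) distance evaluations, whereas an exact silhouette needs all pairwise distances, which is presumably why the commit offers both variants.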
ROADMAP.md (49 additions & 8 deletions)
@@ -19,6 +19,47 @@ This roadmap tracks upcoming improvements. It is organized by time horizon and p

---

+## Documentation (P0) — Top Priority
+
+> **Framework:** Follow [Diátaxis](https://diataxis.fr/) — Tutorials, How-to guides, Reference, Explanation. Every code snippet must be runnable (CI-compiled or from `examples/`).
+
+### For Humans
+
+| Doc Type | Purpose | Format |
+|----------|---------|--------|
+| **README** | What it is, when to use it, minimal install + example | 2-5 min read |
+| **Tutorials** | First successful run end-to-end (Scala + PySpark) | 15-30 min, tiny dataset |
+| **How-tos** | Task recipes: cosine k-means, KL on probabilities, streaming, choosing k, sparse vectors, avoiding OOM | One task per page |
+| **Reference** | Every param with type/default/range, complexity, persistence, determinism | Exhaustive, skimmable |
+| **Explanation** | Geometry choices, preprocessing guidance, scaling tradeoffs, failure modes | Why it works / when it fails |
+
+**Actions:**
+- [ ] Restructure docs into Diátaxis categories
+- [ ] Create `docs/tutorials/` with end-to-end examples
+- [ ] Create `docs/howto/` with task-based recipes
+- [ ] Create `docs/reference/` with exhaustive param docs
+- [ ] Create `docs/explanation/` with conceptual guides
+- [ ] Ensure all code snippets compile in CI
+
+### For AI/LLM Consumption
+
+| Item | Purpose |
+|------|---------|
+| `/llms.txt` | Index of key pages for LLM tools ([llmstxt.org](https://llmstxt.org/)) |
> **Insight:** More algorithms won't drive adoption if users can't easily install or trust the library. These items reduce friction and increase real-world usage more than any algorithm tweak.
@@ -39,7 +80,7 @@ This roadmap tracks upcoming improvements. It is organized by time horizon and p
**Problem:** Squared-Euclidean at scale is the #1 use case; users care about cost and wall-clock.

**Actions:**
-- [ ] **Fast exact:** Hamerly/Elkan/Yinyang pruning for Lloyd's iterations
+- [x] **Fast exact:** Hamerly/Elkan/Yinyang pruning for Lloyd's iterations — **DONE**: `ElkanLloydsIterator` with cross-iteration bounds, `AcceleratedSEAssignment` with triangle inequality pruning (13 tests)
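
The triangle inequality pruning mentioned in the changed line can be illustrated with a short plain-Python sketch. It shows only the centre-to-centre test from Elkan's method: if d(c, c') >= 2·d(x, c), then c' cannot be strictly closer to x than c, so d(x, c') need not be computed. It does not model the cross-iteration bounds, and it is not the code of `ElkanLloydsIterator` or `AcceleratedSEAssignment`; the function names are hypothetical. The test uses unsquared Euclidean distance, since the triangle inequality does not hold for squared distances.

```python
import math


def dist(a, b):
    """Euclidean (unsquared) distance; the triangle inequality needs this form."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_with_pruning(points, centroids):
    """Assign each point to its nearest centroid, skipping provably worse ones.

    Returns (labels, skipped), where skipped counts the point-to-centroid
    distance computations avoided by the pruning test.
    """
    k = len(centroids)
    # Centre-to-centre distances are computed once per assignment pass.
    cc = [[dist(centroids[i], centroids[j]) for j in range(k)] for i in range(k)]
    labels, skipped = [], 0
    for p in points:
        best, best_d = 0, dist(p, centroids[0])
        for j in range(1, k):
            # If d(c_best, c_j) >= 2 * d(p, c_best), then d(p, c_j) >= d(p, c_best).
            if cc[best][j] >= 2 * best_d:
                skipped += 1
                continue
            d = dist(p, centroids[j])
            if d < best_d:
                best, best_d = j, d
        labels.append(best)
    return labels, skipped


def assign_brute_force(points, centroids):
    """Reference assignment without pruning (lowest index wins ties)."""
    return [min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points]


centroids = [(0, 0), (10, 0), (20, 0)]
points = [(1, 0), (9, 0), (19, 0)]
labels, skipped = assign_with_pruning(points, centroids)
print(labels, skipped)
```

The pruned result matches the brute-force assignment; the saving grows when clusters are well separated relative to how close points sit to their centroids.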