
Commit a313f4d

derrickburns and claude committed
docs: Add comprehensive Diátaxis-structured documentation
- Add tutorials section with first clustering, PySpark, algorithm selection guides
- Add how-to guides for installation, finding optimal k, clustering probabilities, handling outliers
- Add reference documentation for all 15 algorithms, 8 divergences, and complete parameters
- Add explanation section covering Bregman divergences and performance tuning
- Add llms.txt for AI/LLM documentation consumption
- Add params.json machine-readable parameter reference
- Update Jekyll config with Diátaxis navigation structure

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
1 parent ec19f92 · commit a313f4d

20 files changed: +3825 −15 lines

docs/_config.yml

Lines changed: 57 additions & 2 deletions
```diff
@@ -1,5 +1,60 @@
 title: Generalized K-Means Clustering
-description: A Scala library for generalized K-means clustering using Bregman divergences on Apache Spark
+description: Scalable clustering with Bregman divergences on Apache Spark
 theme: jekyll-theme-minimal
 markdown: kramdown
-highlighter: rouge
+highlighter: rouge
+
+# Navigation
+header_pages:
+  - index.md
+  - tutorials/index.md
+  - howto/index.md
+  - reference/index.md
+  - explanation/index.md
+
+# Collections for organized content
+collections:
+  tutorials:
+    output: true
+    permalink: /tutorials/:name/
+  howto:
+    output: true
+    permalink: /howto/:name/
+  reference:
+    output: true
+    permalink: /reference/:name/
+  explanation:
+    output: true
+    permalink: /explanation/:name/
+
+# Defaults
+defaults:
+  - scope:
+      path: ""
+      type: "tutorials"
+    values:
+      layout: "default"
+  - scope:
+      path: ""
+      type: "howto"
+    values:
+      layout: "default"
+  - scope:
+      path: ""
+      type: "reference"
+    values:
+      layout: "default"
+  - scope:
+      path: ""
+      type: "explanation"
+    values:
+      layout: "default"
+
+# GitHub Pages
+baseurl: "/generalized-kmeans-clustering"
+url: "https://derrickburns.github.io"
+
+# Plugins
+plugins:
+  - jekyll-seo-tag
+  - jekyll-sitemap
```
docs/explanation/bregman-divergences.md

Lines changed: 186 additions & 0 deletions

@@ -0,0 +1,186 @@
---
title: Bregman Divergences
---

# Bregman Divergences

Understanding the mathematical foundation of generalized k-means.

---

## What is a Bregman Divergence?

A Bregman divergence is a measure of "distance" between two points, defined by a strictly convex function φ (called the generator):

```
D_φ(x, y) = φ(x) - φ(y) - ∇φ(y) · (x - y)
```

**Intuition:** The divergence measures the gap between φ(x) and the first-order (linear) approximation of φ taken at y, evaluated at x.

---
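To make the definition concrete, here is a minimal Python sketch (illustrative names, not the library's API) that evaluates D_φ from a generator and its gradient, checked against the squared Euclidean closed form ½||x − y||²:

```python
def bregman(phi, grad_phi, x, y):
    """D_phi(x, y) = phi(x) - phi(y) - grad_phi(y) . (x - y)."""
    return phi(x) - phi(y) - sum(
        g * (xi - yi) for g, xi, yi in zip(grad_phi(y), x, y)
    )

# Generator for squared Euclidean: phi(x) = 1/2 ||x||^2, so grad_phi(x) = x
phi = lambda v: 0.5 * sum(vi * vi for vi in v)
grad_phi = lambda v: list(v)

x, y = [1.0, 2.0], [4.0, 6.0]
d = bregman(phi, grad_phi, x, y)  # 1/2 (3^2 + 4^2) = 12.5
```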
## Why Bregman Divergences?

### 1. Unique Mean Property

For any Bregman divergence, the point that minimizes the sum of divergences from a set of points is the **arithmetic mean**.

```
argmin_c Σᵢ D_φ(xᵢ, c) = (1/n) Σᵢ xᵢ
```

This is why k-means (with any Bregman divergence) uses simple averaging to update centers.
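A quick numerical sanity check of this property, sketched in Python with the generalized KL divergence (whose formula appears later on this page); the names here are illustrative:

```python
import math

def kl(x, y):
    """Generalized KL divergence: sum of x*log(x/y) - x + y over coordinates."""
    return sum(xi * math.log(xi / yi) - xi + yi for xi, yi in zip(x, y))

points = [[0.2, 0.8], [0.5, 0.5], [0.7, 0.3]]
mean = [sum(p[i] for p in points) / len(points) for i in range(2)]

def objective(c):
    # Total divergence from every point to a candidate center c
    return sum(kl(p, c) for p in points)

# The arithmetic mean beats nearby candidates even for a non-Euclidean divergence.
for eps in (-0.05, 0.05):
    assert objective(mean) < objective([mean[0] + eps, mean[1] - eps])
```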
### 2. Natural for Exponential Families

Each Bregman divergence corresponds to a member of the exponential family of distributions:

| Divergence | Distribution | Natural for |
|------------|--------------|-------------|
| Squared Euclidean | Gaussian | Continuous data |
| KL | Multinomial/Poisson | Counts, probabilities |
| Itakura-Saito | Gamma | Power spectra |
| Logistic | Bernoulli | Binary data |

### 3. Consistent Objective

The k-means objective (minimize within-cluster divergence) has the same form regardless of which Bregman divergence you use:

```
minimize Σᵢ Σⱼ wᵢⱼ D_φ(xᵢ, μⱼ)
```

---
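In code, the objective is the same double sum no matter which divergence `d` plugs in; a small sketch with hard assignments (illustrative names, not the library's API):

```python
def weighted_objective(points, centers, weights, d):
    """Sum over i, j of weights[i][j] * d(points[i], centers[j])."""
    return sum(
        weights[i][j] * d(points[i], centers[j])
        for i in range(len(points))
        for j in range(len(centers))
    )

# Squared Euclidean as the plug-in divergence
sq = lambda x, y: sum((a - b) ** 2 for a, b in zip(x, y))

points = [[0.0, 0.0], [2.0, 0.0]]
centers = [[1.0, 0.0]]
weights = [[1.0], [1.0]]  # both points hard-assigned to the single center
total = weighted_objective(points, centers, weights, sq)  # 1 + 1 = 2.0
```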
## The Generator Function

Each divergence is fully specified by its generator φ:

### Squared Euclidean
```
φ(x)     = ½||x||² = ½ Σᵢ xᵢ²
∇φ(x)    = x
D_φ(x,y) = ½||x - y||²
```

### KL Divergence
```
φ(x)     = Σᵢ xᵢ log(xᵢ)    (negative entropy)
∇φ(x)    = log(x) + 1       (elementwise)
D_φ(x,y) = Σᵢ (xᵢ log(xᵢ/yᵢ) - xᵢ + yᵢ)
```

### Itakura-Saito
```
φ(x)     = -Σᵢ log(xᵢ)
∇φ(x)    = -1/x             (elementwise)
D_φ(x,y) = Σᵢ (xᵢ/yᵢ - log(xᵢ/yᵢ) - 1)
```

---
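These formulas translate directly to code; a sketch (not the library's implementations) that also checks D(x, x) = 0 and D(x, y) > 0 for a pair of distinct positive vectors:

```python
import math

def sq_euclidean(x, y):
    return 0.5 * sum((xi - yi) ** 2 for xi, yi in zip(x, y))

def kl(x, y):
    return sum(xi * math.log(xi / yi) - xi + yi for xi, yi in zip(x, y))

def itakura_saito(x, y):
    return sum(xi / yi - math.log(xi / yi) - 1 for xi, yi in zip(x, y))

x, y = [1.0, 2.0], [2.0, 1.0]
for d in (sq_euclidean, kl, itakura_saito):
    assert abs(d(x, x)) < 1e-12  # zero exactly when the arguments coincide
    assert d(x, y) > 0           # strictly positive for x != y
```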
## Properties

### Non-negativity
```
D_φ(x, y) ≥ 0, with equality iff x = y
```

### Convexity
D_φ(x, y) is convex in x (but not necessarily in y).

### Asymmetry
In general, D_φ(x, y) ≠ D_φ(y, x).

Squared Mahalanobis distances (which include squared Euclidean) are the only symmetric Bregman divergences.

### Triangle Inequality
Bregman divergences do not satisfy the triangle inequality; even squared Euclidean violates it (its square root, the ordinary Euclidean distance, is a metric).

---
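Both failure modes are easy to see numerically (a sketch; the divergence definitions match the generator section above):

```python
import math

def kl(x, y):
    return sum(xi * math.log(xi / yi) - xi + yi for xi, yi in zip(x, y))

def sq(x, y):
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

# Asymmetry: KL in one direction differs from the other
p, q = [0.9, 0.1], [0.5, 0.5]
assert abs(kl(p, q) - kl(q, p)) > 1e-6

# Triangle inequality fails even for squared Euclidean: 4 > 1 + 1
a, b, c = [0.0], [1.0], [2.0]
assert sq(a, c) > sq(a, b) + sq(b, c)
```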
## Geometric Interpretation

### Squared Euclidean
- Measures actual geometric distance
- Centers are centroids (center of mass)
- Produces spherical/convex clusters

### KL Divergence
- Measures information difference
- Centers are geometric means (when the center appears as the first argument, D_φ(μ, x))
- Natural for the probability simplex

### Itakura-Saito
- Scale-invariant (relative error matters)
- Centers are harmonic means (in the same first-argument sense)
- Natural for multiplicative noise models

---
## Choosing the Right Divergence

The "right" divergence depends on:

1. **Data type:** What does your data represent?
2. **Noise model:** What errors are expected?
3. **Domain constraints:** Positive? Sums to 1?

### Decision Guide

```
Is your data probability distributions?
  → KL divergence

Is your data power spectra or variances?
  → Itakura-Saito

Is scale important (absolute values matter)?
  → Squared Euclidean

Is angle important (direction matters)?
  → Cosine / Spherical

Do you have counts (not normalized)?
  → Generalized I-divergence

Is your data binary probabilities?
  → Logistic loss
```

---
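The guide above can be encoded as a tiny lookup; this helper and its keys are hypothetical, not part of the library:

```python
def suggest_divergence(data_kind: str) -> str:
    """Hypothetical helper mirroring the decision guide above."""
    guide = {
        "probability_distributions": "KL",
        "power_spectra": "Itakura-Saito",
        "absolute_scale": "Squared Euclidean",
        "directions": "Cosine / Spherical",
        "raw_counts": "Generalized I-divergence",
        "binary_probabilities": "Logistic loss",
    }
    return guide.get(data_kind, "Squared Euclidean")  # default when unsure

choice = suggest_divergence("raw_counts")  # "Generalized I-divergence"
```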
## Mathematical Foundations

### Connection to Exponential Families

For an exponential family with natural parameter θ and log-partition function A(θ):

```
p(x|θ) = h(x) exp(θ·x - A(θ))
```

The corresponding Bregman divergence uses:
```
φ = A*   (the convex conjugate of A)
```

This is why:
- KL matches the multinomial (log-partition = log Σᵢ exp(θᵢ))
- Squared Euclidean matches the Gaussian (log-partition = ½||θ||²)
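As a concrete check of the Gaussian case, the conjugate can be computed by hand:

```
A(θ)  = ½||θ||²
A*(μ) = sup_θ (θ·μ - ½||θ||²) = ½||μ||²       (the sup is attained at θ = μ)
φ = A*  ⇒  D_φ(x, y) = ½||x||² - ½||y||² - y·(x - y) = ½||x - y||²
```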
### Connection to Information Geometry

Bregman divergences define a **dually flat** geometry on the parameter space. The primal coordinates (means) and dual coordinates (natural parameters) are connected by the gradient of φ.

---
## References

1. Banerjee, A., Merugu, S., Dhillon, I. S., & Ghosh, J. (2005). "Clustering with Bregman divergences." *JMLR*, 6, 1705–1749.
2. Bregman, L. M. (1967). "The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming." *USSR Computational Mathematics and Mathematical Physics*, 7(3), 200–217.

---

[Back to Explanation](index.html) | [Home](../)

docs/explanation/index.md

Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
---
title: Explanation
---

# Explanation

Conceptual guides to understand the theory and make informed decisions.

---

## Core Concepts

- [Bregman Divergences](bregman-divergences.html) — The mathematical foundation
- [When to Use What](when-to-use.html) — Decision framework for divergences

## Architecture

- [How Lloyd's Algorithm Works](lloyds-algorithm.html) — The core iteration
- [Assignment Strategies](assignment-strategies.html) — Cross-join vs broadcast

## Performance

- [Performance Tuning](performance.html) — Scaling to billions of points
- [Acceleration Techniques](acceleration.html) — Elkan, Hamerly, mini-batch

## Advanced Topics

- [Soft vs Hard Clustering](soft-vs-hard.html) — Probabilistic assignments
- [Cluster Validity](cluster-validity.html) — Evaluation metrics

---

[Back to Home](../)
