@@ -23,9 +23,6 @@ export KMeans, KMedoids
# Define constants for easy referencing of packages
const MMI = MLJModelInterface
const Cl = Clustering
-
-
-
const PKG = "MLJClusteringInterface"

# ###
@@ -143,6 +140,7 @@ metadata_pkg.(
metadata_model(
KMeans,
+ human_name = "K-means clusterer",
input = MMI.Table(Continuous),
output = MMI.Table(Continuous),
weights = false,
@@ -151,6 +149,7 @@ metadata_model(
metadata_model(
KMedoids,
+ human_name = "K-medoids clusterer",
input = MMI.Table(Continuous),
output = MMI.Table(Continuous),
weights = false,
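The `human_name` keyword added above sets a model trait that `MMI.doc_header` interpolates into the docstrings below. A minimal sketch of checking the registered values, assuming the standard MLJModelInterface trait accessor (not part of this diff):

```julia
using MLJModelInterface

# Assumption: `human_name` is the trait accessor matching the
# `human_name` keyword passed to `metadata_model` above; it should
# round-trip the registered strings.
MLJModelInterface.human_name(KMeans)    # expected: "K-means clusterer"
MLJModelInterface.human_name(KMedoids)  # expected: "K-medoids clusterer"
```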
@@ -159,38 +158,49 @@ metadata_model(
"""
$(MMI.doc_header(KMeans))

+ [K-means](http://en.wikipedia.org/wiki/K_means) is a classical method for
+ clustering or vector quantization. It produces a fixed number of clusters,
+ each associated with a *center* (also known as a *prototype*), and each data
+ point is assigned to a cluster with the nearest center.
+
+ From a mathematical standpoint, K-means is a coordinate descent
+ algorithm that solves the following optimization problem:
+
+ ```math
+ \\text{minimize} \\ \\sum_{i=1}^n \\| \\mathbf{x}_i - \\boldsymbol{\\mu}_{z_i} \\|^2 \\ \\text{w.r.t.} \\ (\\boldsymbol{\\mu}, z)
+ ```
+ Here, ``\\boldsymbol{\\mu}_k`` is the center of the ``k``-th cluster, and
+ ``z_i`` is the index of the cluster for the ``i``-th point ``\\mathbf{x}_i``.

- `KMeans`: The K-Means algorithm finds K centroids corresponding to K clusters in
- the data. The clusters are assumed to be elliptical, should be used with a euclidean distance metric
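For intuition, here is a minimal plain-Julia sketch of that objective (illustration only, not part of the package; assumes points are stored as the columns of a matrix):

```julia
# Sketch: evaluate the K-means objective for given centers and assignments.
# X is d×n (one point per column), centers is d×k, and z[i] ∈ 1:k is the
# cluster index assigned to point i.
function kmeans_objective(X, centers, z)
    # Sum of squared distances from each point to its assigned center.
    sum(sum(abs2, X[:, i] .- centers[:, z[i]]) for i in axes(X, 2))
end
```

K-means alternates between updating `z` (assign each point to its nearest center) and updating `centers` (recompute each cluster's mean); each step can only decrease this sum.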

# Training data

In MLJ or MLJBase, bind an instance `model` to data with

- mach = machine(model, X, y)
-
- Where
+ mach = machine(model, X)

- - `X`: is any table of input features (eg, a `DataFrame`) whose columns
- are of scitype `Continuous`; check the scitype with `schema(X)`
+ Here:

- - `y`: is the target, which can be any `AbstractVector` whose element
- scitype is `Count`; check the scitype with `schema(y)`
+ - `X` is any table of input features (eg, a `DataFrame`) whose columns
+ are of scitype `Continuous`; check column scitypes with `schema(X)`.

Train the machine using `fit!(mach, rows=...)`.

# Hyper-parameters

- `k=3`: The number of centroids to use in clustering.
- - `metric::SemiMetric=SqEuclidean`: The metric used to calculate the clustering distance
- matrix
+
+ - `metric::SemiMetric=Distances.SqEuclidean`: The metric used to calculate the
+ clustering. Must have type `PreMetric` from Distances.jl.
+
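If, as the bullet above suggests, `metric` accepts any Distances.jl premetric, a non-default metric would be supplied at construction. A hypothetical usage sketch (the exact default spelling, instance versus type, may differ):

```julia
using Distances

# Hypothetical: cluster with L1 (city-block) distance instead of the
# default squared Euclidean.
model = KMeans(k=3, metric=Cityblock())
```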

# Operations

- - `predict(mach, Xnew)`: return predictions of the target given new
+ - `predict(mach, Xnew)`: return cluster label assignments, given new
features `Xnew` having the same scitype as `X` above.
+
- `transform(mach, Xnew)`: instead return the mean pairwise distances from
- new samples to the cluster centers
+ new samples to the cluster centers.

# Fitted parameters

@@ -203,72 +213,72 @@ The fields of `fitted_params(mach)` are:
The fields of `report(mach)` are:

- `assignments`: The cluster assignments of each point in the training data.
+
- `cluster_labels`: The labels assigned to each cluster.

# Examples

```
using MLJ
- using Distances
- using Test
KMeans = @load KMeans pkg=Clustering

- X, y = @load_iris
+ table = load_iris()
+ y, X = unpack(table, ==(:target), rng=123)
model = KMeans(k=3)
mach = machine(model, X) |> fit!

yhat = predict(mach, X)
- @test yhat == report(mach).assignments
+ @assert yhat == report(mach).assignments

compare = zip(yhat, y) |> collect;
compare[1:8] # clusters align with classes

center_dists = transform(mach, fitted_params(mach).centers')

- @test center_dists[1][1] == 0.0
- @test center_dists[2][2] == 0.0
- @test center_dists[3][3] == 0.0
+ @assert center_dists[1][1] == 0.0
+ @assert center_dists[2][2] == 0.0
+ @assert center_dists[3][3] == 0.0
```

See also
[`KMedoids`](@ref)
"""
KMeans
+
"""
$(MMI.doc_header(KMedoids))

- `KMedoids`: The K-Medoids algorithm finds K centroids corresponding to K clusters in the
- data. Unlike K-Means, the centroids are found among data points themselves. Clusters
- are not assumed to be elliptical. Should be used with a non-euclidean distance metric
+ [K-medoids](http://en.wikipedia.org/wiki/K-medoids) is a clustering algorithm that works by
+ finding ``k`` data points (called *medoids*) such that the total distance between each data
+ point and the closest *medoid* is minimal.
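As a rough illustration of that criterion (a sketch only, not part of the package; `D` is a hypothetical precomputed pairwise-distance matrix and `medoids` a vector of point indices):

```julia
# Sketch: total K-medoids cost. D[i, j] is the distance between points
# i and j; each point pays the distance to its nearest medoid.
kmedoids_cost(D, medoids) =
    sum(minimum(D[i, j] for j in medoids) for i in axes(D, 1))
```

Because medoids are actual data points, only pairwise distances are needed, which is why K-medoids pairs naturally with arbitrary (including non-Euclidean) metrics.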

# Training data

In MLJ or MLJBase, bind an instance `model` to data with

- mach = machine(model, X, y)
-
- Where
+ mach = machine(model, X)

- - `X`: is any table of input features (eg, a `DataFrame`) whose columns
- are of scitype `Continuous`; check the scitype with `schema(X)`
+ Here:

- - `y`: is the target, which can be any `AbstractVector` whose element
- scitype is `Count`; check the scitype with `schema(y)`
+ - `X` is any table of input features (eg, a `DataFrame`) whose columns
+ are of scitype `Continuous`; check column scitypes with `schema(X)`.

Train the machine using `fit!(mach, rows=...)`.

# Hyper-parameters

- `k=3`: The number of centroids to use in clustering.
- - `metric::SemiMetric=SqEuclidean`: The metric used to calculate the clustering distance
- matrix
+
+ - `metric::SemiMetric=Distances.SqEuclidean`: The metric used to calculate the
+ clustering. Must have type `PreMetric` from Distances.jl.

# Operations

- - `predict(mach, Xnew)`: return predictions of the target given new
+ - `predict(mach, Xnew)`: return cluster label assignments, given new
features `Xnew` having the same scitype as `X` above.
+
- `transform(mach, Xnew)`: instead return the mean pairwise distances from
- new samples to the cluster centers
+ new samples to the cluster centers.

# Fitted parameters
@@ -281,32 +291,31 @@ The fields of `fitted_params(mach)` are:
The fields of `report(mach)` are:

- `assignments`: The cluster assignments of each point in the training data.
+
- `cluster_labels`: The labels assigned to each cluster.

# Examples

```
using MLJ
- using Test
KMedoids = @load KMedoids pkg=Clustering

- X, y = @load_iris
+ table = load_iris()
+ y, X = unpack(table, ==(:target), rng=123)
model = KMedoids(k=3)
mach = machine(model, X) |> fit!

yhat = predict(mach, X)
- @test yhat == report(mach).assignments
+ @assert yhat == report(mach).assignments

compare = zip(yhat, y) |> collect;
compare[1:8] # clusters align with classes

center_dists = transform(mach, fitted_params(mach).medoids')

- @test center_dists[1][1] == 0.0
- @test center_dists[2][2] == 0.0
- @test center_dists[3][3] == 0.0
-
- # we can also
+ @assert center_dists[1][1] == 0.0
+ @assert center_dists[2][2] == 0.0
+ @assert center_dists[3][3] == 0.0
```

See also