
Commit 4650b04

Update documentation (#81)
1 parent d099f66 commit 4650b04

16 files changed: +287 -185 lines

doc/api.rst

Lines changed: 3 additions & 2 deletions
@@ -40,5 +40,6 @@ Robust
     :toctree: generated/
     :template: class.rst

-   robust.RobustWeightedEstimator
-
+   robust.RobustWeightedClassifier
+   robust.RobustWeightedRegressor
+   robust.RobustWeightedKMeans

doc/modules/cluster.rst

Lines changed: 199 additions & 0 deletions
@@ -0,0 +1,199 @@
.. _cluster:

=====================================================
Clustering with KMedoids and Common-nearest-neighbors
=====================================================

.. _k_medoids:

K-Medoids
=========

:class:`KMedoids` is related to the :class:`KMeans` algorithm. While
:class:`KMeans` tries to minimize the within-cluster sum-of-squares,
:class:`KMedoids` tries to minimize the sum of distances between each point and
the medoid of its cluster. The medoid is a data point (unlike the centroid)
which has the least total distance to the other members of its cluster. Using
a data point to represent each cluster's center allows the use of any distance
metric for clustering. This can also be a practical advantage: for instance,
K-Medoids has been used for facial recognition, where the medoid is a typical
photo of the person to recognize, while K-Means would produce a blurry image
obtained by mixing several pictures of that person.

:class:`KMedoids` can be more robust to noise and outliers than :class:`KMeans`
because it chooses one of the cluster members as the medoid, whereas
:class:`KMeans` moves the center of the cluster towards the outlier, which
might in turn move other points away from the cluster center.

:class:`KMedoids` is also different from K-Medians, which is analogous to :class:`KMeans`
except that the Manhattan median is used for each cluster center instead of
the centroid. K-Medians is robust to outliers, but it is limited to the
Manhattan distance metric and, similar to :class:`KMeans`, it does not guarantee
that the center of each cluster will be a member of the original dataset.

The complexity of K-Medoids is :math:`O(N^2 K T)` where :math:`N` is the number
of samples, :math:`T` is the number of iterations and :math:`K` is the number of
clusters. This makes it more suitable for smaller datasets in comparison to
:class:`KMeans`, which is :math:`O(N K T)`.
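
As a quick illustration of the properties above (an arbitrary metric, and medoids
that are actual members of the data set), here is a minimal usage sketch. It
assumes the scikit-learn-style API of ``sklearn_extra.cluster.KMedoids``
(``n_clusters``, ``metric``, ``random_state``, and the fitted attributes
``labels_`` and ``cluster_centers_``)::

    import numpy as np
    from sklearn_extra.cluster import KMedoids

    X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
                  [8.0, 8.0], [8.2, 7.9], [25.0, 25.0]])  # last row is an outlier

    # Any distance metric supported by scikit-learn's pairwise distances can be used.
    km = KMedoids(n_clusters=2, metric="manhattan", random_state=0).fit(X)

    print(km.labels_)           # cluster label of each sample
    print(km.cluster_centers_)  # the medoids: actual rows of X, not averages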

.. topic:: Examples:

  * :ref:`sphx_glr_auto_examples_plot_kmedoids_digits.py`: Applying K-Medoids on digits
    with various distance metrics.

**Algorithm description:**
There are several algorithms to compute K-Medoids. :class:`KMedoids` currently
supports two solvers: one analogous to K-Means, called alternate, and the PAM
algorithm (Partitioning Around Medoids). The alternate solver is used when speed
is an issue; a usage sketch of both follows the method descriptions below.

* The alternate method works as follows:

  * Initialize: select ``n_clusters`` points from the dataset as the medoids using
    a heuristic, random, or k-medoids++ approach (configurable with the ``init``
    parameter).
  * Assignment step: assign each element of the dataset to the closest medoid.
  * Update step: identify the new medoid of each cluster.
  * Repeat the assignment and update steps while the medoids keep changing or
    until the maximum number of iterations ``max_iter`` is reached.

* The PAM method works as follows:

  * Initialize: greedy initialization of the ``n_clusters`` medoids. First select
    the point in the dataset that minimizes the sum of distances to all other
    points; then repeatedly add the point that minimizes the cost until
    ``n_clusters`` points are selected. This corresponds to the ``init`` option
    called ``build``.
  * Swap step: for each medoid already selected, compute the cost of swapping it
    with every non-medoid point. Perform the swap that decreases the cost the most,
    and repeat until no swap decreases the cost anymore.
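
A minimal sketch of choosing between the two solvers (this assumes the solver is
selected with the ``method`` parameter of :class:`KMedoids`, taking the values
``"alternate"`` and ``"pam"``, and that the greedy PAM initialization is requested
with ``init="build"``)::

    from sklearn.datasets import make_blobs
    from sklearn_extra.cluster import KMedoids

    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

    # Fast, K-Means-like solver: preferable when speed is an issue.
    km_alt = KMedoids(n_clusters=3, method="alternate", random_state=0).fit(X)

    # PAM with the greedy "build" initialization: typically a lower final cost,
    # at a higher computational price.
    km_pam = KMedoids(n_clusters=3, method="pam", init="build", random_state=0).fit(X)

    # Sum of distances of the samples to their closest medoid.
    print(km_alt.inertia_, km_pam.inertia_)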

.. topic:: References:

  * Maranzana, F.E., 1963. On the location of supply points to minimize
    transportation costs. IBM Systems Journal, 2(2), pp. 129-135.
  * Park, H.S. and Jun, C.H., 2009. A simple and fast algorithm for K-medoids
    clustering. Expert Systems with Applications, 36(2), pp. 3336-3341.
  * Kaufman, L. and Rousseeuw, P.J., 2008. Partitioning Around Medoids (Program PAM).
    In Finding Groups in Data (eds L. Kaufman and P.J. Rousseeuw).
    doi:10.1002/9780470316801.ch2
  * Bhat, Aruna, 2014. K-medoids clustering using partitioning around medoids
    for performing face recognition. International Journal of Soft Computing,
    Mathematics and Control, 3(3), pp. 1-12.

.. _commonnn:

Common-nearest-neighbors clustering
===================================

:class:`CommonNNClustering <sklearn_extra.cluster.CommonNNClustering>`
provides an interface to density-based
common-nearest-neighbors clustering. Density-based clustering identifies
clusters as dense regions of high point density, separated by sparse
regions of lower density. Common-nearest-neighbors clustering
approximates local density as the number of shared (common) neighbors
between two points with respect to a neighbor search radius. A density
threshold (density criterion) is used – defined by the cluster
parameters ``min_samples`` (number of common neighbors) and ``eps`` (search
radius) – to distinguish high from low density. A high value of
``min_samples`` and a low value of ``eps`` correspond to high density.

As such, the method is related to other density-based clustering algorithms
like :class:`DBSCAN <sklearn.cluster.DBSCAN>` or Jarvis-Patrick. DBSCAN
approximates local density as the number of points in the neighborhood
of a single point, while the Jarvis-Patrick algorithm uses the number of
common neighbors shared by two points among the :math:`k` nearest neighbors.
As these approaches each provide a different notion of how density is
estimated from point samples, they can be used complementarily. Their
relative suitability for a classification problem depends on the nature
of the clustered data. Common-nearest-neighbors clustering (like
density-based clustering in general) has the following advantages over
other clustering techniques:

* The cluster result is deterministic. The same set of cluster
  parameters always leads to the same classification for a data set.
  A different ordering of the data set leads to a different ordering
  of the cluster assignment, but does not change the assignment
  qualitatively.
* Little prior knowledge about the data is required, e.g. the number
  of resulting clusters does not need to be known beforehand (although
  the cluster parameters need to be tuned to obtain a desired result).
* Identified clusters are not restricted in their shape or size.
* Points can be considered noise (outliers) if they do not fulfil
  the density criterion.

The common-nearest-neighbors algorithm tests the density criterion for
pairs of neighbors (do they have at least ``min_samples`` points in the
intersection of their neighborhoods at a radius ``eps``?). Two points that
fulfil this criterion are directly part of the same dense data region,
i.e. they are *density reachable*. A *density connected* network of
density-reachable points (a connected component if density reachability
is viewed as a graph structure) constitutes a separate dense region and
therefore a cluster. Note that, in contrast to
:class:`DBSCAN <sklearn.cluster.DBSCAN>` for example, there is no distinction
between *core* points (dense points) and *edge* points (points that are not
dense themselves but are neighbors of dense points). The assignment of points
on the cluster rims to a cluster is possible, but can be ambiguous. The
cluster result is returned as a 1D container of labels, i.e. a sequence of
zero-based integers of length :math:`n` for a data set of :math:`n` points,
denoting the assignment of points to a specific cluster. Noise is
labeled with ``-1``. Valid clusters have at least two members. The
clusters are not sorted by cluster member count. In some cases the
algorithm tends to identify small clusters that can be filtered out
manually.
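
A minimal usage sketch (this assumes that
:class:`CommonNNClustering <sklearn_extra.cluster.CommonNNClustering>` follows a
DBSCAN-like interface with the ``eps`` and ``min_samples`` parameters described
above and a fitted ``labels_`` attribute)::

    from sklearn.datasets import make_moons
    from sklearn_extra.cluster import CommonNNClustering

    X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

    # min_samples: required number of common neighbors within radius eps.
    # A higher min_samples and a lower eps mean a stricter density criterion.
    clustering = CommonNNClustering(eps=0.3, min_samples=5).fit(X)

    labels = clustering.labels_           # zero-based cluster labels, -1 marks noise
    n_clusters = len(set(labels) - {-1})
    n_noise = int((labels == -1).sum())
    print(n_clusters, n_noise)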

.. topic:: Examples:

  * :ref:`examples/cluster/plot_commonnn.py <sphx_glr_auto_examples_plot_commonnn.py>`:
    Basic usage of the
    :class:`CommonNNClustering <sklearn_extra.cluster.CommonNNClustering>`
  * :ref:`examples/cluster/plot_commonnn_data_sets.py <sphx_glr_auto_examples_plot_commonnn_data_sets.py>`:
    Common-nearest-neighbors clustering of toy data sets

.. topic:: Implementation:

  The present implementation of the common-nearest-neighbors algorithm in
  :class:`CommonNNClustering <sklearn_extra.cluster.CommonNNClustering>`
  shares some commonalities with the current scikit-learn implementation of
  :class:`DBSCAN <sklearn.cluster.DBSCAN>`. It computes neighborhoods from
  points in bulk with
  :class:`NearestNeighbors <sklearn.neighbors.NearestNeighbors>` before the
  actual clustering. Consequently, it requires memory on the order of
  :math:`O(n \cdot n_n)` to store the neighborhoods, for :math:`n` points in
  the data set, where :math:`n_n` is the average number of neighbors (which
  grows with ``eps``); that is at worst :math:`O(n^2)`. Depending on the input
  structure (dense or sparse points, or a similarity matrix) the additional
  memory demand varies. The clustering itself follows a breadth-first-search
  scheme, checking the density criterion at every node expansion. The time
  complexity is roughly linear in the number of data points :math:`n`, the
  total number of neighbors :math:`N`, and the value of ``min_samples``. For
  density-based clustering schemes with lower memory demand, also consider:

  * :class:`OPTICS <sklearn.cluster.OPTICS>` – density-based clustering
    related to DBSCAN, using an ``eps`` value range.
  * `cnnclustering <https://pypi.org/project/cnnclustering/>`_ – a
    different implementation of common-nearest-neighbors clustering.

.. topic:: Notes:

  * :class:`DBSCAN <sklearn.cluster.DBSCAN>` provides an option to
    specify data point weights with ``sample_weight``. For
    :class:`CommonNNClustering` this feature is currently experimental, as
    weights are not well defined for checking the common-nearest-neighbor
    density criterion. It should not yet be used in production.

.. topic:: References:

  * B. Keller, X. Daura, W. F. van Gunsteren "Comparing Geometric and
    Kinetic Cluster Algorithms for Molecular Simulation Data" J. Chem.
    Phys., 2010, 132, 074110.

  * O. Lemke, B.G. Keller "Density-based Cluster Algorithms for the
    Identification of Core Sets" J. Chem. Phys., 2016, 145, 164104.

  * O. Lemke, B.G. Keller "Common nearest neighbor clustering - a
    benchmark" Algorithms, 2018, 11, 19.

doc/user_guide.rst

Lines changed: 1 addition & 175 deletions
@@ -11,179 +11,5 @@ User guide
     :numbered:

     modules/eigenpro.rst
+    modules/cluster.rst
     modules/robust.rst

[175 lines removed: the former in-guide "K-Medoids" and "Common-nearest-neighbors
clustering" sections (an earlier revision of the text added to
doc/modules/cluster.rst above).]

examples/cluster/README.txt

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
.. _cluster_examples:

Cluster
=======

Examples concerning the :mod:`sklearn_extra.cluster` module.
