.. _cluster:

=====================================================
Clustering with KMedoids and Common-nearest-neighbors
=====================================================
.. _k_medoids:

K-Medoids
=========

:class:`KMedoids` is related to the :class:`KMeans` algorithm. While
:class:`KMeans` tries to minimize the within-cluster sum-of-squares,
:class:`KMedoids` tries to minimize the sum of distances between each point and
the medoid of its cluster. The medoid is a data point (unlike the centroid)
which has the least total distance to the other members of its cluster. Because
each cluster's center is an actual data point, any distance metric can be used
for clustering. This can also be a practical advantage: K-Medoids has, for
instance, been used for facial recognition, where the medoid is a typical photo
of the person to recognize, whereas K-Means would produce a blurry image
obtained by mixing several pictures of that person.
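
A minimal usage sketch is shown below. The toy data and the choice of the
Manhattan metric are illustrative only; see the example gallery for a complete
application.

.. code-block:: python

    import numpy as np
    from sklearn_extra.cluster import KMedoids

    # Six points forming two well-separated groups.
    X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
                  [8.0, 8.0], [8.3, 7.7], [7.9, 8.4]])

    # Any distance metric can be used; here the Manhattan distance.
    km = KMedoids(n_clusters=2, metric="manhattan", random_state=0).fit(X)

    print(km.labels_)           # cluster assignment of each point
    print(km.cluster_centers_)  # the medoids, i.e. actual points from X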

:class:`KMedoids` can be more robust to noise and outliers than :class:`KMeans`
because it chooses one of the cluster members as the medoid, while
:class:`KMeans` moves the center of the cluster towards the outlier, which
might in turn move other points away from the cluster center.
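
As a quick, illustrative comparison (the data below is made up for this
sketch), an outlier drags the :class:`KMeans` centroid away from the bulk of
the cluster, while the :class:`KMedoids` center remains one of the original
points:

.. code-block:: python

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn_extra.cluster import KMedoids

    # A tight group of points plus one far-away outlier.
    X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
                  [10.0, 10.0]])

    kmeans = KMeans(n_clusters=1, n_init=10, random_state=0).fit(X)
    kmedoids = KMedoids(n_clusters=1, random_state=0).fit(X)

    print(kmeans.cluster_centers_)    # centroid pulled towards the outlier
    print(kmedoids.cluster_centers_)  # medoid is one of the four tight points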

:class:`KMedoids` is also different from K-Medians, which is analogous to
:class:`KMeans` except that the Manhattan median is used for each cluster
center instead of the centroid. K-Medians is robust to outliers, but it is
limited to the Manhattan distance metric and, similar to :class:`KMeans`, it
does not guarantee that the center of each cluster will be a member of the
original dataset.

The complexity of K-Medoids is :math:`O(N^2 K T)` where :math:`N` is the number
of samples, :math:`T` is the number of iterations and :math:`K` is the number of
clusters. This makes it more suitable for smaller datasets in comparison to
:class:`KMeans`, which is :math:`O(N K T)`.
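
To make the difference concrete, here is a back-of-the-envelope comparison of
the two bounds; the numbers are arbitrary and chosen only for illustration:

.. code-block:: python

    # Rough operation-count estimates for the complexity bounds above.
    N, K, T = 10_000, 10, 100          # samples, clusters, iterations

    kmedoids_ops = N**2 * K * T        # O(N^2 K T)
    kmeans_ops = N * K * T             # O(N K T)

    print(kmedoids_ops // kmeans_ops)  # the ratio is N, i.e. 10000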

.. topic:: Examples:

  * :ref:`sphx_glr_auto_examples_plot_kmedoids_digits.py`: Applying K-Medoids on digits
    with various distance metrics.


**Algorithm description:**
There are several algorithms to compute K-Medoids, though :class:`KMedoids`
currently only supports two of them: a solver analogous to K-Means, called
``alternate``, and the PAM algorithm (Partitioning Around Medoids). The
alternate algorithm is typically used when speed is an issue; a usage sketch
for both solvers is given after the lists below.


* The alternate method works as follows:

  * Initialize: Select ``n_clusters`` points from the dataset as the initial
    medoids using a heuristic, random, or k-medoids++ approach (configurable
    using the ``init`` parameter).
  * Assignment step: assign each element from the dataset to the closest medoid.
  * Update step: identify the new medoid of each cluster.
  * Repeat the assignment and update steps until the medoids stop changing or
    the maximum number of iterations ``max_iter`` is reached.

* The PAM method works as follows:

  * Initialize: greedy initialization of ``n_clusters`` medoids. First, select
    the point in the dataset that minimizes the sum of distances to all other
    points. Then, repeatedly add the point that decreases the cost the most,
    until ``n_clusters`` points are selected. This corresponds to the ``init``
    parameter value ``build``.
  * Swap step: for each selected medoid, compute the cost of swapping it with
    every non-medoid point. Perform the swap that decreases the cost the most.
    Repeat until no swap decreases the cost anymore.
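
As referenced above, a short sketch of selecting either solver through the
``method`` and ``init`` parameters (the toy data is illustrative only):

.. code-block:: python

    import numpy as np
    from sklearn_extra.cluster import KMedoids

    X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]])

    # Fast, K-Means-like solver with k-medoids++ initialization.
    alt = KMedoids(n_clusters=2, method="alternate", init="k-medoids++",
                   max_iter=300, random_state=0).fit(X)

    # PAM solver with the greedy "build" initialization described above.
    pam = KMedoids(n_clusters=2, method="pam", init="build",
                   random_state=0).fit(X)

    print(alt.labels_, pam.labels_)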

.. topic:: References:

  * Maranzana, F.E., 1963. On the location of supply points to minimize
    transportation costs. IBM Systems Journal, 2(2), pp. 129-135.
  * Park, H.S. and Jun, C.H., 2009. A simple and fast algorithm for K-medoids
    clustering. Expert Systems with Applications, 36(2), pp. 3336-3341.
  * Kaufman, L. and Rousseeuw, P.J., 2008. Partitioning Around Medoids
    (Program PAM). In Finding Groups in Data (eds L. Kaufman and
    P.J. Rousseeuw). doi:10.1002/9780470316801.ch2
  * Bhat, Aruna, 2014. K-medoids clustering using partitioning around medoids
    for performing face recognition. International Journal of Soft Computing,
    Mathematics and Control, 3(3), pp. 1-12.

.. _commonnn:

Common-nearest-neighbors clustering
===================================

:class:`CommonNNClustering <sklearn_extra.cluster.CommonNNClustering>`
provides an interface to density-based
common-nearest-neighbors clustering. Density-based clustering identifies
clusters as dense regions of high point density, separated by sparse
regions of lower density. Common-nearest-neighbors clustering
approximates local density as the number of shared (common) neighbors
between two points with respect to a neighbor search radius. A density
threshold (the density criterion), defined by the cluster parameters
``min_samples`` (number of common neighbors) and ``eps`` (search radius),
is used to distinguish high from low density. A high value of
``min_samples`` and a low value of ``eps`` correspond to high density.
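
A minimal usage sketch (the toy data and parameter values are illustrative
only):

.. code-block:: python

    import numpy as np
    from sklearn_extra.cluster import CommonNNClustering

    # Two dense blobs and one isolated point.
    X = np.array([[0.0, 0.0], [0.1, 0.1], [0.0, 0.2], [0.2, 0.0],
                  [5.0, 5.0], [5.1, 5.1], [5.0, 5.2], [5.2, 5.0],
                  [20.0, 20.0]])

    # "Dense": at least min_samples common neighbors within radius eps.
    clustering = CommonNNClustering(eps=0.5, min_samples=1).fit(X)

    print(clustering.labels_)  # the isolated point is labeled -1 (noise)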

As such, the method is related to other density-based cluster algorithms
like :class:`DBSCAN <sklearn.cluster.DBSCAN>` or Jarvis-Patrick. DBSCAN
approximates local density as the number of points in the neighborhood
of a single point. The Jarvis-Patrick algorithm uses the number of
common neighbors shared by two points among the :math:`k` nearest neighbors.
As these approaches each provide a different notion of how density is
estimated from point samples, they can be used complementarily. Their
relative suitability for a clustering problem depends on the nature
of the clustered data. Common-nearest-neighbors clustering (as
density-based clustering in general) has the following advantages over
other clustering techniques:

  * The cluster result is deterministic. The same set of cluster
    parameters always leads to the same classification for a data set.
    A different ordering of the data set leads to a different ordering
    of the cluster assignment, but does not change the assignment
    qualitatively.
  * Little prior knowledge about the data is required, e.g. the number
    of resulting clusters does not need to be known beforehand (although
    cluster parameters need to be tuned to obtain a desired result).
  * Identified clusters are not restricted in their shape or size.
  * Points can be considered noise (outliers) if they do not fulfill
    the density criterion.

The common-nearest-neighbors algorithm tests the density criterion for
pairs of neighbors (do they have at least ``min_samples`` points in the
intersection of their neighborhoods at a radius ``eps``?). Two points that
fulfill this criterion are directly part of the same dense data region,
i.e. they are *density reachable*. A *density connected* network of
density reachable points (a connected component if density reachability
is viewed as a graph structure) constitutes a separated dense region and
therefore a cluster. Note that, in contrast for example to
:class:`DBSCAN <sklearn.cluster.DBSCAN>`, there is no differentiation into
*core* points (dense points) and *edge* points (points that are not dense
themselves but neighbors of dense points). The assignment of points on
the cluster rims to a cluster is possible, but can be ambiguous. The
cluster result is returned as a 1D container of labels, i.e. a sequence
of integers (zero-based) of length :math:`n` for a data set of :math:`n`
points, denoting the assignment of points to a specific cluster. Noise is
labeled with ``-1``. Valid clusters have at least two members. The
clusters are not sorted by cluster member count. In some cases the
algorithm tends to identify small clusters that can be filtered out
manually.
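
Building on the label conventions just described, a short sketch of
post-processing the result, e.g. filtering out small clusters by hand (the
size threshold and toy data are arbitrary and purely illustrative):

.. code-block:: python

    from collections import Counter

    import numpy as np
    from sklearn_extra.cluster import CommonNNClustering

    X = np.random.RandomState(0).uniform(size=(200, 2))
    labels = CommonNNClustering(eps=0.1, min_samples=5).fit(X).labels_

    # Cluster sizes, ignoring noise points (label -1).
    sizes = Counter(labels[labels != -1])

    # Relabel members of clusters with fewer than 10 points as noise.
    small = {label for label, size in sizes.items() if size < 10}
    labels = np.array([-1 if label in small else label for label in labels])

    print(Counter(labels))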

.. topic:: Examples:

  * :ref:`examples/cluster/plot_commonnn.py <sphx_glr_auto_examples_plot_commonnn.py>`
    Basic usage of the
    :class:`CommonNNClustering <sklearn_extra.cluster.CommonNNClustering>`
  * :ref:`examples/cluster/plot_commonnn_data_sets.py <sphx_glr_auto_examples_plot_commonnn_data_sets.py>`
    Common-nearest-neighbors clustering of toy data sets

.. topic:: Implementation:

  The present implementation of the common-nearest-neighbors algorithm in
  :class:`CommonNNClustering <sklearn_extra.cluster.CommonNNClustering>`
  shares some commonalities with the current scikit-learn implementation of
  :class:`DBSCAN <sklearn.cluster.DBSCAN>`. It computes the neighborhoods of
  all points in bulk with
  :class:`NearestNeighbors <sklearn.neighbors.NearestNeighbors>` before
  the actual clustering. Consequently, storing the neighborhoods requires
  memory on the order of :math:`O(n \cdot n_n)` for :math:`n` points in the
  data set, where :math:`n_n` is the average number of neighbors (which is
  proportional to ``eps``), i.e. at worst :math:`O(n^2)`. Depending on the
  input structure (dense or sparse points, or a similarity matrix) the
  additional memory demand varies. The clustering itself follows a
  breadth-first-search scheme, checking the density criterion at every
  node expansion. The time complexity scales roughly linearly with the
  number of data points :math:`n`, the total number of neighbors :math:`N`
  and the value of ``min_samples``. For density-based clustering
  schemes with lower memory demand, also consider:

  * :class:`OPTICS <sklearn.cluster.OPTICS>` – Density-based clustering
    related to DBSCAN using an ``eps`` value range.
  * `cnnclustering <https://pypi.org/project/cnnclustering/>`_ – A
    different implementation of common-nearest-neighbors clustering.
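
  Since the memory demand grows with the average neighborhood size, it can be
  useful to estimate :math:`n_n` for a candidate ``eps`` before clustering.
  The sketch below does this with
  :class:`NearestNeighbors <sklearn.neighbors.NearestNeighbors>`; the data
  and the radius are illustrative only:

  .. code-block:: python

      import numpy as np
      from sklearn.neighbors import NearestNeighbors

      X = np.random.RandomState(0).uniform(size=(1000, 2))
      eps = 0.05

      # Neighborhoods within radius eps, as computed before the clustering.
      neighborhoods = NearestNeighbors(radius=eps).fit(X).radius_neighbors(
          X, return_distance=False
      )

      # Average number of neighbors: a proxy for the O(n * n_n) memory demand.
      n_n = np.mean([len(hood) for hood in neighborhoods])
      print(n_n)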

.. topic:: Notes:

  * :class:`DBSCAN <sklearn.cluster.DBSCAN>` provides an option to
    specify data point weights with ``sample_weight``. This feature is
    currently experimental for :class:`CommonNNClustering`, as
    weights are not well defined for checking the common-nearest-neighbor
    density criterion. It should not yet be used in production.

.. topic:: References:

  * B. Keller, X. Daura, W. F. van Gunsteren "Comparing Geometric and
    Kinetic Cluster Algorithms for Molecular Simulation Data" J. Chem.
    Phys., 2010, 132, 074110.

  * O. Lemke, B. G. Keller "Density-based Cluster Algorithms for the
    Identification of Core Sets" J. Chem. Phys., 2016, 145, 164104.

  * O. Lemke, B. G. Keller "Common nearest neighbor clustering - a
    benchmark" Algorithms, 2018, 11, 19.