|
134 | 134 | "source": [ |
135 | 135 | "## Transform the space\n", |
136 | 136 | "\n", |
137 | | - "To find clusters we want to find the islands of higher density amid a sea of sparser noise -- and the assumption of noise is important: real data is messy and has outliers, corrupt data, and noise. The core of our clustering algorithm is single linkage clustering, and it can be quite sensitive to noise: a single noise data point in the wrong place can act as a bridge between islands, gluing them together. Obviously we want our algorithm to be robust against noise so we need to find a way to help 'lower the sea level' before running a single linkage algorithm.\n", |
| 137 | + "To find clusters we want to find the islands of higher density amid a sea of sparser noise -- and the assumption of noise is important: real data is messy and has outliers, corrupt data, and noise. The core of the clustering algorithm is single linkage clustering, and it can be quite sensitive to noise: a single noise data point in the wrong place can act as a bridge between islands, gluing them together. Obviously we want our algorithm to be robust against noise so we need to find a way to help 'lower the sea level' before running a single linkage algorithm.\n", |
138 | 138 | "\n", |
139 | 139 | "How can we characterize 'sea' and 'land' without doing a clustering? As long as we can get an estimate of density we can consider lower density points as the 'sea'. The goal here is not to perfectly distinguish 'sea' from 'land' -- this is an initial step in clustering, not the ouput -- just to make our clustering core a little more robust to noise. So given an identification of 'sea' we want to lower the sea level. For practical purposes that means making 'sea' points more distant from each other and from the 'land'.\n", |
140 | 140 | "\n", |
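A brief sketch to make the 'lower the sea level' step concrete. The formal definitions come later in the notebook; the code below assumes the standard HDBSCAN notions of a core distance (the distance from a point to its k-th nearest neighbor) and the mutual reachability distance (the maximum of the two core distances and the raw distance), uses plain Euclidean distance, and is a brute-force illustration for small data rather than anything the library actually runs. The helper name `mutual_reachability` is just for this sketch.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.neighbors import NearestNeighbors

def mutual_reachability(X, k=5):
    """Brute-force pairwise mutual reachability distances for a small data set.

    core_k(x) is the distance from x to its k-th nearest neighbor, and the
    mutual reachability distance of a and b is max(core_k(a), core_k(b), d(a, b)).
    """
    # Distance to the k-th nearest neighbor (k + 1 because the query point
    # comes back as its own nearest neighbor at distance zero).
    knn_dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    core = knn_dist[:, -1]

    pairwise = cdist(X, X)
    return np.maximum(pairwise, np.maximum(core[:, None], core[None, :]))
```

With k = 5, points deep inside a dense island keep essentially their raw distances, while an isolated noise point inflates every distance involving it -- which is exactly the robustness against noise we were after.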
|
175 | 175 | "source": [ |
176 | 176 | "## Build the minimum spanning tree\n", |
177 | 177 | "\n", |
178 | | - "Now that we have a new mutual reachability metric on the data we want start finding the islands on dense data. Of course dense areas are relative, and different islands may have different densities. Conceptually what we will do is the following: consider the data as a weighted graph with the data points as vertices and an edge between any two points with weight equal to the mutual reachability distance of those points; now consider a threshold value, starting high, and steadily being lowered where we drop any edges with weight above that threshold; as we drop edges we will start to disconnect the graph into connected components; eventually we will have a hierarchy of connected components (from completely connected to completely disconnected) at varying threshold levels. In practice this is very expensive: there are $n^2$ edges and we don't want to have to run a connected components that many times. The right thing to do is to find a minimal set of edges such that dropping any edge from the set causes a disconnection of components. But we need more, we need this set to be such that there is no lower weight edge that could connect the components. Fortunately graph theory furnishes us with just such a thing: the minimum spanning tree of the graph.\n", |
| 178 | + "Now that we have a new mutual reachability metric on the data we want start finding the islands on dense data. Of course dense areas are relative, and different islands may have different densities. Conceptually what we will do is the following: consider the data as a weighted graph with the data points as vertices and an edge between any two points with weight equal to the mutual reachability distance of those points.\n", |
179 | 179 | "\n", |
180 | | - "In practice we can build the minimum spanning tree very efficiently via [Prim's algorithm](https://en.wikipedia.org/wiki/Prim%27s_algorithm) -- we build the tree one edge at a time, always adding the lowest weight edge that connects the current tree to a vertex not yet in the tree. You can see the tree HDBSCAN constructed below; note that this is the minimum spanning tree for *mutual reachability distance* which is different from the pure distance in the graph. In this case we had a *k* value of 8." |
| 180 | + "Now consider a threshold value, starting high, and steadily being lowered. Drop any edges with weight above that threshold. As we drop edges we will start to disconnect the graph into connected components. Eventually we will have a hierarchy of connected components (from completely connected to completely disconnected) at varying threshold levels.\n", |
| 181 | + "\n", |
| 182 | + "In practice this is very expensive: there are $n^2$ edges and we don't want to have to run a connected components algorithm that many times. The right thing to do is to find a minimal set of edges such that dropping any edge from the set causes a disconnection of components. But we need more, we need this set to be such that there is no lower weight edge that could connect the components. Fortunately graph theory furnishes us with just such a thing: the minimum spanning tree of the graph.\n", |
| 183 | + "\n", |
| 184 | + "We can build the minimum spanning tree very efficiently via [Prim's algorithm](https://en.wikipedia.org/wiki/Prim%27s_algorithm) -- we build the tree one edge at a time, always adding the lowest weight edge that connects the current tree to a vertex not yet in the tree. You can see the tree HDBSCAN constructed below; note that this is the minimum spanning tree for *mutual reachability distance* which is different from the pure distance in the graph. In this case we had a *k* value of 5." |
181 | 185 | ] |
182 | 186 | }, |
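To make the construction concrete, here is a minimal O(n^2) Prim's algorithm over a dense mutual reachability matrix (such as the one from the sketch above). This is only an illustration for small data; the library is considerably more clever about building the tree at scale. The function name `prim_mst` is just for this sketch.

```python
import numpy as np

def prim_mst(dist):
    """Minimum spanning tree of a dense distance matrix via Prim's algorithm.

    Returns (n - 1) edges as (weight, i, j) tuples.
    """
    n = dist.shape[0]
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True                  # start the tree at vertex 0
    best = dist[0].copy()              # cheapest known edge into the tree, per vertex
    parent = np.zeros(n, dtype=int)    # which tree vertex that cheapest edge comes from

    edges = []
    for _ in range(n - 1):
        # Add the cheapest vertex not yet in the tree.
        v = int(np.argmin(np.where(in_tree, np.inf, best)))
        edges.append((float(best[v]), int(parent[v]), v))
        in_tree[v] = True
        # Relax: vertex v may offer cheaper connections for the remaining vertices.
        improved = (dist[v] < best) & ~in_tree
        best = np.where(improved, dist[v], best)
        parent = np.where(improved, v, parent)
    return edges
```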
183 | 187 | { |
|
218 | 222 | "source": [ |
219 | 223 | "## Build the cluster hierarchy\n", |
220 | 224 | "\n", |
221 | | - "Given the minimal spanning tree the next step is to convert that into the hierarchy of connected components. This is most easily done in the reverse order: sort the edges of the tree by distance (in increasing order) and then iterate through and creating a new merged cluster for each edge. The only difficult part here is to identify the two clusters each edge join together, but this is easy enough via a [union-find](https://en.wikipedia.org/wiki/Disjoint-set_data_structure) data structure. We can view the result as a dendrogram as we see below:" |
| 225 | + "Given the minimal spanning tree, the next step is to convert that into the hierarchy of connected components. This is most easily done in the reverse order: sort the edges of the tree by distance (in increasing order) and then iterate through, creating a new merged cluster for each edge. The only difficult part here is to identify the two clusters each edge will join together, but this is easy enough via a [union-find](https://en.wikipedia.org/wiki/Disjoint-set_data_structure) data structure. We can view the result as a dendrogram as we see below:" |
222 | 226 | ] |
223 | 227 | }, |
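That conversion can be sketched directly, assuming the MST is given as (weight, i, j) edge tuples as in the Prim's sketch above. The tiny union-find below hands out a fresh label for every merge so that the output lands in the familiar scipy-style (n - 1) x 4 linkage layout; the library's internal representation differs in detail, and the names here are just for this sketch.

```python
import numpy as np

class UnionFind(object):
    """Minimal union-find that assigns a new label to every merged cluster."""

    def __init__(self, n):
        self.parent = list(range(2 * n - 1))
        self.size = [1] * n + [0] * (n - 1)
        self.next_label = n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        new = self.next_label
        self.next_label += 1
        self.parent[a] = self.parent[b] = new
        self.size[new] = self.size[a] + self.size[b]
        return new

def single_linkage(mst_edges, n):
    """Turn MST edges into an (n - 1) x 4 linkage array [left, right, distance, size]."""
    uf = UnionFind(n)
    linkage_array = np.zeros((n - 1, 4))
    for row, (weight, i, j) in enumerate(sorted(mst_edges)):
        a, b = uf.find(int(i)), uf.find(int(j))       # which clusters does this edge join?
        linkage_array[row] = [a, b, weight, uf.size[a] + uf.size[b]]
        uf.union(a, b)
    return linkage_array
```

Feeding the resulting array to `scipy.cluster.hierarchy.dendrogram` draws a dendrogram of the kind shown below.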
224 | 228 | { |
|
258 | 262 | "cell_type": "markdown", |
259 | 263 | "metadata": {}, |
260 | 264 | "source": [ |
261 | | - "This brings us to the point where robust single linkage stops. We want more though; a cluster hierarchy is good, but we really want a set of flat clusters. We could do that by drawing a a horizontal line through the above diagram and selecting the clusters that it cuts through. This is in practice what [DBSCAN](http://scikit-learn.org/stable/modules/clustering.html#dbscan) does (declaring any singleton clusters at the cut level as noise). The question is, how do we know where to draw that line? DBSCAN simply leaves that as a (very unintuitive) parameter. Worse, we really want to deal with variable density clusters and any choice of cut line is a choice of mutual reachability distance to cut at, and hence an effective density level. Ideally we want to be able to cut the tree at different places to select our clusters. This is where the next steps of HDBSCAN begin and create the difference from robust single linkage." |
| 265 | + "This brings us to the point where robust single linkage stops. We want more though; a cluster hierarchy is good, but we really want a set of flat clusters. We could do that by drawing a a horizontal line through the above diagram and selecting the clusters that it cuts through. This is in practice what [DBSCAN](http://scikit-learn.org/stable/modules/clustering.html#dbscan) effectively does (declaring any singleton clusters at the cut level as noise). The question is, how do we know where to draw that line? DBSCAN simply leaves that as a (very unintuitive) parameter. Worse, we really want to deal with variable density clusters and any choice of cut line is a choice of mutual reachability distance to cut at, and hence a single fixed density level. Ideally we want to be able to cut the tree at different places to select our clusters. This is where the next steps of HDBSCAN begin and create the difference from robust single linkage." |
262 | 266 | ] |
263 | 267 | }, |
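As an aside, if you did want the fixed horizontal cut described above, scipy can already do it given a single linkage tree in its standard format. The snippet below builds a stand-in tree on random data purely to show the cut; in the notebook the tree would of course be the mutual-reachability-based one from the previous section.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# A stand-in single linkage tree on random demo data, just to show the cut.
demo = np.random.RandomState(0).randn(50, 2)
slt = linkage(demo, method='single')

cut_distance = 0.5   # the awkward global parameter DBSCAN makes you choose
flat_labels = fcluster(slt, t=cut_distance, criterion='distance')
```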
264 | 268 | { |
|
267 | 271 | "source": [ |
268 | 272 | "## Condense the cluster tree\n", |
269 | 273 | "\n", |
270 | | - "The first step in cluster extraction is condensing down the large and complicated cluster hierarchy into a smaller tree with a little more data attached to each node. As you can see in the hierarchy above it is often the case that a cluster split is one or two points splitting off from a cluster; and that is the key point -- rather than seeing it as a cluster splitting into two new clusters we want to view it as a single persistent cluster that is 'losing points'. To make this concrete we need a notion of **minimum cluster size** which we take as a parameter to HDBSCAN. Once we have a value for minimum cluster size we can now walk through the hierarchy and at each split ask if one of the new clusters created by the split has fewer points than the minimum cluster size. If it is the case that we have fewer points than the minimum cluster size we declare it to be 'points falling out of a cluster' and have the larger cluster retain the cluster identity of the parent, marking down which points 'fell out of the cluster' and at what distance value that happened. If on the other hand the split is into two clusters each at least as large as the minimum cluster size then we consider that a true cluster split and let that split persist in the tree. After walking through the whole hierarchy and doing this we end up with a much smaller tree with a small number of nodes, each of which has data about how the size of the cluster at that node descreases over varying distance. We can visualize this as a dendrogram similar to the one above -- again we can have the width of the line represent the number of points in the cluster. This time, however, that width varies over the length of the line as points fall our of the cluster. For our data using a minimum cluster size of 10 the result looks like this:" |
| 274 | + "The first step in cluster extraction is condensing down the large and complicated cluster hierarchy into a smaller tree with a little more data attached to each node. As you can see in the hierarchy above it is often the case that a cluster split is one or two points splitting off from a cluster; and that is the key point -- rather than seeing it as a cluster splitting into two new clusters we want to view it as a single persistent cluster that is 'losing points'. To make this concrete we need a notion of **minimum cluster size** which we take as a parameter to HDBSCAN. Once we have a value for minimum cluster size we can now walk through the hierarchy and at each split ask if one of the new clusters created by the split has fewer points than the minimum cluster size. If it is the case that we have fewer points than the minimum cluster size we declare it to be 'points falling out of a cluster' and have the larger cluster retain the cluster identity of the parent, marking down which points 'fell out of the cluster' and at what distance value that happened. If on the other hand the split is into two clusters each at least as large as the minimum cluster size then we consider that a true cluster split and let that split persist in the tree. After walking through the whole hierarchy and doing this we end up with a much smaller tree with a small number of nodes, each of which has data about how the size of the cluster at that node descreases over varying distance. We can visualize this as a dendrogram similar to the one above -- again we can have the width of the line represent the number of points in the cluster. This time, however, that width varies over the length of the line as points fall our of the cluster. For our data using a minimum cluster size of 5 the result looks like this:" |
271 | 275 | ] |
272 | 276 | }, |
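The walk described above can be sketched directly. The code below assumes the scipy-style (n - 1) x 4 linkage array from the earlier sketch, records lambda = 1/distance when points fall out (lambda is introduced formally in the next section), and is written for readability rather than speed; the library's condensing routine is equivalent in spirit but heavily optimized. The names `condense_tree` and its helpers are just for this sketch.

```python
import numpy as np

def condense_tree(linkage_array, min_cluster_size=5):
    """Condense a single linkage tree into rows (parent, child, lambda, child_size).

    A 'child' is either a surviving condensed cluster or a single point that
    fell out of its parent cluster at the given lambda value.
    """
    n = linkage_array.shape[0] + 1
    root = 2 * n - 2

    def children(node):
        return int(linkage_array[node - n, 0]), int(linkage_array[node - n, 1])

    def node_size(node):
        return 1 if node < n else int(linkage_array[node - n, 3])

    def leaves(node):
        if node < n:
            return [node]
        left, right = children(node)
        return leaves(left) + leaves(right)

    rows = []
    relabel = {root: root + 1}      # condensed clusters get fresh labels
    next_label = [root + 2]

    def walk(node):
        if node < n:
            return
        left, right = children(node)
        dist = linkage_array[node - n, 2]
        lam = 1.0 / dist if dist > 0 else np.inf
        big = [node_size(c) >= min_cluster_size for c in (left, right)]
        if all(big):
            # A true split: both sides become new condensed clusters.
            for child in (left, right):
                relabel[child] = next_label[0]
                next_label[0] += 1
                rows.append((relabel[node], relabel[child], lam, node_size(child)))
                walk(child)
        else:
            # Any 'too small' side just sheds its points from the parent cluster.
            for child in (left, right):
                if node_size(child) < min_cluster_size:
                    for point in leaves(child):
                        rows.append((relabel[node], point, lam, 1))
                else:
                    relabel[child] = relabel[node]   # parent identity persists
                    walk(child)

    walk(root)
    return rows
```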
273 | 277 | { |
|
316 | 320 | "source": [ |
317 | 321 | "## Extract the clusters\n", |
318 | 322 | "\n", |
319 | | - "Intuitively we want the choose clusters that persist and have a longer lifetime; short lived clusters are ultimately probably merely artifcacts of the single linkage approach. Looking at the previous plot we could say that we want to choose those clusters that have the greatest area of ink in the plot. To make a flat clustering we will need to add a further requirement that if you select a cluster than you cannot select any cluster that is a descendant of it. And in fact that intuitive notion of what should be done is exactly what HDBSCAN does. Of course we need to formalise things to make it a concrete algorithm.\n", |
| 323 | + "Intuitively we want the choose clusters that persist and have a longer lifetime; short lived clusters are ultimately probably merely artifcacts of the single linkage approach. Looking at the previous plot we could say that we want to choose those clusters that have the greatest area of ink in the plot. To make a flat clustering we will need to add a further requirement that, if you select a cluster, then you cannot select any cluster that is a descendant of it. And in fact that intuitive notion of what should be done is exactly what HDBSCAN does. Of course we need to formalise things to make it a concrete algorithm.\n", |
320 | 324 | "\n", |
321 | 325 | "First we need a different measure than distance to consider the persistence of clusters; instead we will use $\\lambda = \\frac{1}{\\mathrm{distance}}$. For a given cluster we can then define values $\\lambda_{\\mathrm{birth}}$ and $\\lambda_{\\mathrm{death}}$ to be the lambda value when the cluster split off and became it's own cluster, and the lambda value (if any) when the cluster split into smaller clusters respectively. In turn, for a given cluster, for each point *p* in that cluster we can define the value $\\lambda_p$ as the lambda value at which that point 'fell out of the cluster' which is a value somewhere between $\\lambda_{\\mathrm{birth}}$ and $\\lambda_{\\mathrm{death}}$ since the point either falls out of the cluster at some point in the cluster's lifetime, or leaves the cluster when the cluster splits into two smaller clusters. Now, for each cluster compute the **stability** to as\n", |
322 | 326 | "\n", |
323 | 327 | "$\\sum_{p \\in \\mathrm{cluster}} (\\lambda_p - \\lambda_{\\mathrm{birth}})$.\n", |
324 | 328 | "\n", |
325 | | - "Declare all leaf nodes to be selected clusters. Now work up through the tree (the reverse topological sort order). If the sum of the stabilities of the child clusters is greater than the stability of the cluster we set the cluster stability to be the sum of the child stabilities, otherwise we declare the cluster to be a selected cluster, and unselect all its descendants. Once we reach the root node we call the current set of selected clusters our flat clsutering and return that.\n", |
| 329 | + "Declare all leaf nodes to be selected clusters. Now work up through the tree (the reverse topological sort order). If the sum of the stabilities of the child clusters is greater than the stability of the cluster then we set the cluster stability to be the sum of the child stabilities. If, on the other hand, the cluster's stability is greater than the su of it's children then we declare the cluster to be a selected cluster, and unselect all its descendants. Once we reach the root node we call the current set of selected clusters our flat clsutering and return that.\n", |
326 | 330 | "\n", |
327 | | - "Okay, that was wordy and complicated, but it really is simply performing our 'select the clusters in the plot with the largest ink area' subject to descendast constraints that we explained earlier. We can select the clusters in the condensed tree dendrogram via this algorithm, and you get what you expect:" |
| 331 | + "Okay, that was wordy and complicated, but it really is simply performing our 'select the clusters in the plot with the largest total ink area' subject to descendant constraints that we explained earlier. We can select the clusters in the condensed tree dendrogram via this algorithm, and you get what you expect:" |
328 | 332 | ] |
329 | 333 | }, |
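A readable sketch of the stability computation and the bottom-up selection, operating on the (parent, child, lambda, size) rows produced by the condensing sketch above -- so the labelling convention (children have larger labels than their parents, the root has the smallest cluster label) is an assumption of these examples, not of the library. The function name `extract_clusters` is illustrative only.

```python
from collections import defaultdict

def extract_clusters(condensed):
    """Select the flat clustering from condensed tree rows (parent, child, lambda, size)."""
    # lambda_birth of each condensed cluster = the lambda at which it split off.
    births = {child: lam for _, child, lam, size in condensed if size > 1}
    all_clusters = set(parent for parent, _, _, _ in condensed)
    root = min(all_clusters)
    births.setdefault(root, 0.0)

    # Stability: every child (point or cluster) contributes
    # size * (lambda at which it left - lambda at which the cluster was born).
    stability = defaultdict(float)
    for parent, child, lam, size in condensed:
        stability[parent] += size * (lam - births[parent])

    children_of = defaultdict(list)
    for parent, child, _, size in condensed:
        if size > 1:
            children_of[parent].append(child)

    # Leaf clusters start out selected.
    selected = set(c for c in all_clusters if not children_of[c])

    # Work up the tree: larger labels are deeper, so reverse label order
    # is a reverse topological order under the convention above.
    for cluster in sorted(all_clusters - {root}, reverse=True):
        if not children_of[cluster]:
            continue
        child_total = sum(stability[c] for c in children_of[cluster])
        if child_total > stability[cluster]:
            stability[cluster] = child_total      # children win; pass their total upward
        else:
            # This cluster wins: select it and unselect all of its descendants.
            stack = list(children_of[cluster])
            while stack:
                c = stack.pop()
                selected.discard(c)
                stack.extend(children_of[c])
            selected.add(cluster)
    return selected
```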
330 | 334 | { |
|
363 | 367 | "cell_type": "markdown", |
364 | 368 | "metadata": {}, |
365 | 369 | "source": [ |
366 | | - "Now that we have the clusters it is a simple enough matter to turn that into cluster labelling as per the sklearn API. Any point not in a selected cluster is simply a noise point (and assigned the label -1). We can do a little more though: for each cluster we have the $\\lambda_p$ for each point *p* in that cluster; If we simply normalize those values (so they range from zero to one) then we have a measure of the strength of cluster membership for each point in the cluster. The hdbscan library returns this as a `probabilities_` attribute of the clusterer object. Thus, with labels and membership strengths in had we can make the standard plot, choosing a color for points based on cluster label, and desaturating that color according the strength of membership (and make unclustered points pure gray)." |
| 370 | + "Now that we have the clusters it is a simple enough matter to turn that into cluster labelling as per the sklearn API. Any point not in a selected cluster is simply a noise point (and assigned the label -1). We can do a little more though: for each cluster we have the $\\lambda_p$ for each point *p* in that cluster; If we simply normalize those values (so they range from zero to one) then we have a measure of the strength of cluster membership for each point in the cluster. The hdbscan library returns this as a `probabilities_` attribute of the clusterer object. Thus, with labels and membership strengths in hand we can make the standard plot, choosing a color for points based on cluster label, and desaturating that color according the strength of membership (and make unclustered points pure gray)." |
367 | 371 | ] |
368 | 372 | }, |
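For reference, here is roughly what that plot looks like with the hdbscan library itself; `data` stands in for whatever two-dimensional array the notebook has been clustering, and the particular palette and desaturation choices are just one reasonable way to render membership strength.

```python
import hdbscan
import seaborn as sns
import matplotlib.pyplot as plt

# 'data' is assumed to be the (n_samples, 2) array used throughout the notebook.
clusterer = hdbscan.HDBSCAN(min_cluster_size=5).fit(data)

palette = sns.color_palette('deep', 8)
colors = [sns.desaturate(palette[label % len(palette)], prob)
          if label >= 0 else (0.5, 0.5, 0.5)          # noise points: pure gray
          for label, prob in zip(clusterer.labels_, clusterer.probabilities_)]
plt.scatter(data[:, 0], data[:, 1], c=colors)
plt.show()
```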
369 | 373 | { |
|
405 | 409 | "cell_type": "markdown", |
406 | 410 | "metadata": {}, |
407 | 411 | "source": [ |
408 | | - "And that is how HDBSCAN works. It may seem somewhat complicated -- there are a fair number of moving parts to the algorithm -- but ultimately each part is actually very straightforward and can be optimized well. Hopefully with a better understanding both of the intuitions and some of the implementation details of HDBSCAN you will feel motivated to [try it out](https://github.com/lmcinnes/hdbscan). In turn I expect to see the hdbscan library continue to develop; various improvements, such as getting the asymptotic complexity down to $O(n \\log n)$, support for streaming clustering, and continued development of visualization and analysis tools." |
| 412 | + "And that is how HDBSCAN works. It may seem somewhat complicated -- there are a fair number of moving parts to the algorithm -- but ultimately each part is actually very straightforward and can be optimized well. Hopefully with a better understanding both of the intuitions and some of the implementation details of HDBSCAN you will feel motivated to [try it out](https://github.com/lmcinnes/hdbscan). In turn I expect to see the hdbscan library continue to develop; various improvements, such as getting the asymptotic complexity down to $O(n \\log n)$, support for streaming clustering, and continued development of visualization and analysis tools are all planned and hopefully coming soon." |
409 | 413 | ] |
410 | 414 | }, |
411 | 415 | { |
|
420 | 424 | ], |
421 | 425 | "metadata": { |
422 | 426 | "kernelspec": { |
423 | | - "display_name": "Python 3", |
| 427 | + "display_name": "Python 2", |
424 | 428 | "language": "python", |
425 | | - "name": "python3" |
| 429 | + "name": "python2" |
426 | 430 | }, |
427 | 431 | "language_info": { |
428 | 432 | "codemirror_mode": { |
429 | 433 | "name": "ipython", |
430 | | - "version": 3 |
| 434 | + "version": 2 |
431 | 435 | }, |
432 | 436 | "file_extension": ".py", |
433 | 437 | "mimetype": "text/x-python", |
434 | 438 | "name": "python", |
435 | 439 | "nbconvert_exporter": "python", |
436 | | - "pygments_lexer": "ipython3", |
437 | | - "version": "3.4.3" |
| 440 | + "pygments_lexer": "ipython2", |
| 441 | + "version": "2.7.10" |
438 | 442 | } |
439 | 443 | }, |
440 | 444 | "nbformat": 4, |
|