Extracting DBSCAN* clustering from HDBSCAN*
===========================================

There are a number of reasons that one might prefer `DBSCAN <https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html>`__'s
clustering over that of HDBSCAN*. The biggest difficulty many folks have with
DBSCAN is that the epsilon distance parameter can be hard to determine and often
requires a great deal of trial and error to tune. If your data lived in a more
interpretable space and you had a good notion of distance in that space, this problem
would certainly be mitigated, and a user might want to set a very specific epsilon distance
for their use case. Another viable use case might be that a user is interested in a
constant density clustering.
HDBSCAN* does variable density clustering by default, looking for the clusters that persist
over a wide range of epsilon distance parameters to find a 'natural' clustering. This might
not be the right result for your application. A DBSCAN clustering at a particular
epsilon value might work better for your particular task.

HDBSCAN returns a very natural clustering of your data which is often very useful in exploring
a new data set. That doesn't necessarily make it the right clustering algorithm for every
task.

HDBSCAN* can best be thought of as a DBSCAN* implementation which varies across
all epsilon values and extracts the clusters that persist over the widest range
of these parameter choices. It is therefore able to ignore the epsilon parameter and
only needs the minimum cluster size as a single input parameter.
The 'eom' (Excess of Mass) cluster selection method then returns clusters with the
best stability over epsilon.

There are a number of alternative ways of extracting a flat clustering from
the HDBSCAN* hierarchical tree. If one is interested in finer resolution
clusters while still maintaining variable density one could set
``cluster_selection_method='leaf'`` to extract the leaves of the condensed
tree instead of the most persistent clusters. For more details on these
cluster selection methods see :ref:`leaf_clustering_label`.
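
As a minimal sketch (``X`` is assumed to be your data array and the
``min_cluster_size`` value is just a placeholder), leaf clustering can be
requested like so:

.. code:: python

    import hdbscan

    # Select the leaves of the condensed tree rather than the default
    # 'eom' (Excess of Mass) clusters; X is a hypothetical data array.
    leaf_clusterer = hdbscan.HDBSCAN(min_cluster_size=15,
                                     cluster_selection_method='leaf').fit(X)
    leaf_labels = leaf_clusterer.labels_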

If one isn't interested in the variable density clustering that is the hallmark of
HDBSCAN* it is relatively easy to extract any DBSCAN* clustering from a
single run of HDBSCAN*. This has the advantage of allowing you to perform
a single computationally efficient HDBSCAN* run and then quickly search over
the DBSCAN* parameter space by extracting clustering results from our
pre-constructed tree. This can save significant computational time when
searching across multiple cluster parameter settings on large amounts of data.

Alternatively, one could make use of the ``cluster_selection_epsilon`` parameter as a
post processing step with any ``cluster_selection_method`` in order to
return a hybrid clustering of DBSCAN* and HDBSCAN*. For more details on
this see :doc:`how_to_use_epsilon`.
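
A minimal sketch of such a hybrid clustering, again with ``X`` as a
hypothetical data array and purely illustrative parameter values, might look
like the following:

.. code:: python

    import hdbscan

    # Clusters are not split below the 0.5 distance threshold, giving
    # DBSCAN*-like behaviour below that epsilon and HDBSCAN* behaviour above it.
    hybrid_clusterer = hdbscan.HDBSCAN(min_cluster_size=15,
                                       cluster_selection_method='eom',
                                       cluster_selection_epsilon=0.5).fit(X)
    hybrid_labels = hybrid_clusterer.labels_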

In order to extract a DBSCAN* clustering from an HDBSCAN run we must first train
an HDBSCAN model on our data.

.. code:: python

    import hdbscan

    # X is assumed to be your (n_samples, n_features) data array.
    h_cluster = hdbscan.HDBSCAN(min_samples=5, match_reference_implementation=True).fit(X)

The ``min_cluster_size`` parameter is unimportant in this case, as it is
only used in the creation of our condensed tree, which we won't be using here.
Now we choose a ``cut_distance``, which is just another name for the epsilon
threshold in DBSCAN, to pass to our
:py:meth:`~hdbscan.hdbscan_.dbscan_clustering` method.

.. code:: python

    import seaborn as sns

    eps = 0.2
    labels = h_cluster.dbscan_clustering(cut_distance=eps, min_cluster_size=5)
    sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=labels.astype(str))

.. image:: images/dbscan_from_hdbscan_clustering.png
    :align: center
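
Since the tree is already built, sweeping over several epsilon values only
requires re-cutting it. A rough sketch of such a sweep (the epsilon grid here
is purely illustrative) could look like:

.. code:: python

    import numpy as np

    # Re-use the already fitted model; each call only cuts the existing tree.
    for eps in np.linspace(0.05, 0.5, 10):
        labels = h_cluster.dbscan_clustering(cut_distance=eps, min_cluster_size=5)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        print(f"eps={eps:.2f}: {n_clusters} clusters")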

It should be noted that a DBSCAN* clustering extracted from our HDBSCAN* tree will
not precisely match the clustering results from sklearn's DBSCAN implementation.
Our clustering results should better match DBSCAN* (which can be thought of as
DBSCAN without the border points). As such, when comparing the two results one
should expect them to mostly differ in the points that DBSCAN considers border
points. We'll deal with this by restricting the comparison of our clustering results to the points identified
by DBSCAN as core points. We can see below that the differences between these two
clusterings mostly occur in the boundaries of the clusters. This matches our
intuition of stability within the core points.

.. image:: images/dbscan_from_hdbscan_comparision.png
    :align: center

For a slightly more empirical comparison we make use of the `adjusted rand score <https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html>`__
to compare the clustering of the core points between a DBSCAN clustering from sklearn and
a DBSCAN* clustering extracted from our HDBSCAN* object.
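
A rough sketch of that comparison, restricted to the points sklearn marks as
core points (the parameter values simply mirror the illustrative ones above),
might look like:

.. code:: python

    from sklearn.cluster import DBSCAN
    from sklearn.metrics import adjusted_rand_score

    eps = 0.2
    db = DBSCAN(eps=eps, min_samples=5).fit(X)
    hdb_labels = h_cluster.dbscan_clustering(cut_distance=eps, min_cluster_size=5)

    # Compare only the points sklearn's DBSCAN marks as core points.
    core = db.core_sample_indices_
    score = adjusted_rand_score(db.labels_[core], hdb_labels[core])
    print(f"Adjusted Rand score on core points: {score:.3f}")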

.. image:: images/dbscan_from_hdbscan_percentage_core.png
    :align: center

.. image:: images/dbscan_from_hdbscan_number_of_clusters.png
    :align: center

We see that for very small epsilon values the number of clusters found by the two
algorithms tends to be quite far apart, largely due to a large number of the points being considered boundary points
instead of core points. As the epsilon value increases, more and more points are
considered core and the numbers of clusters generated by the two algorithms converge.

Additionally, the adjusted rand score between the core points of the two algorithms
stays consistently high (mostly 1.0) for our entire range of epsilon. There may
be some minor discrepancies between core point results, largely due to implementation
details and optimizations within the code base.

Why might one just extract the DBSCAN* clustering results from a single HDBSCAN* run
instead of making use of sklearn's DBSCAN code? The short answer is efficiency.
If you aren't sure what epsilon parameter to select for DBSCAN then you may have to
run the algorithm many times on your data set. While those runs can be inexpensive for
very small epsilon values they can get quite expensive for large parameter values.
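
As an illustrative sketch of this trade-off (not the benchmark behind the
figure below; the epsilon grid and timing harness are placeholders), one could
compare repeated DBSCAN runs against a single HDBSCAN* fit followed by tree cuts:

.. code:: python

    import time

    import hdbscan
    import numpy as np
    from sklearn.cluster import DBSCAN

    eps_grid = np.linspace(0.05, 0.5, 10)

    # Repeated sklearn DBSCAN runs: each epsilon is a full pass over the data.
    start = time.perf_counter()
    for eps in eps_grid:
        DBSCAN(eps=eps, min_samples=5).fit(X)
    dbscan_time = time.perf_counter() - start

    # One HDBSCAN* fit, then cheap DBSCAN* extractions from the stored tree.
    start = time.perf_counter()
    model = hdbscan.HDBSCAN(min_samples=5).fit(X)
    for eps in eps_grid:
        model.dbscan_clustering(cut_distance=eps, min_cluster_size=5)
    hdbscan_time = time.perf_counter() - start

    print(f"repeated DBSCAN: {dbscan_time:.2f}s, HDBSCAN* + cuts: {hdbscan_time:.2f}s")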

In this small benchmark case of 50,000 two dimensional data points we have broken even
after trying only two epsilon parameters with DBSCAN, or only a single
run with a large parameter selected. This trend is only exacerbated for larger
data sets in higher dimensional spaces. For more detailed scaling experiments see
`Accelerated Hierarchical Density Clustering <https://arxiv.org/abs/1705.07321>`__
by McInnes and Healy.

.. image:: images/dbscan_from_hdbscan_timing.png
    :align: center