Combining HDBSCAN\* with DBSCAN
===============================

While DBSCAN needs a minimum cluster size *and* a distance threshold epsilon as user-defined input parameters,
HDBSCAN\* is basically a DBSCAN implementation for varying epsilon values and therefore only needs the minimum cluster size as a single input parameter.
The ``'eom'`` (Excess of Mass) cluster selection method then returns clusters with the best stability over epsilon.

Unlike DBSCAN, this allows it to find clusters of variable densities without having to choose a suitable distance threshold first.
However, there are cases where we could still benefit from the use of an epsilon threshold.

For illustration, see this map with GPS locations, representing recorded pick-up and drop-off locations for customers of a ride pooling provider.
The largest (visual) data cluster can be found around the train station. Smaller clusters are placed along the streets, depending on the requested location
in the form of a postal address or point of interest. Since we are considering a door-to-door system where customers are not bound to collective pick-up or
drop-off locations, we are interested in both large clusters and small clusters with a minimum size of 4.

.. image:: images/epsilon_parameter_dataset.png
    :align: center

Clustering the given data set with `DBSCAN <https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html>`__ and an epsilon threshold of 5 meters gives us good results,
but neglects clusters with points that are more than 5 meters apart from each other.
However, increasing epsilon would result in cluster chains along the streets, especially when working with a larger data set.

.. image:: images/epsilon_parameter_dbscan.png
    :align: center

Unfortunately, HDBSCAN\* does not produce any better results in this case: while it discovers the clusters that DBSCAN missed, it also returns a very high number of micro-clusters around the train station,
even though we would prefer one or only a few clusters representing this location. We could achieve this by increasing ``min_cluster_size`` or
the smoothing parameter ``min_samples``, but with the trade-off of losing small clusters in less dense areas or merging them into other clusters
separated by a relatively large distance.

.. image:: images/epsilon_parameter_HDBSCAN_eom.png
    :align: center

This is where the parameter ``cluster_selection_epsilon`` comes into play. The cluster extraction method using this parameter, as described in detail
by `Malzer and Baum <https://arxiv.org/abs/1911.02282>`__, acts like a hybrid between DBSCAN
(or, to be precise, DBSCAN\*, i.e. DBSCAN without the border points) and HDBSCAN\*: it extracts DBSCAN results for data partitions
affected by the given parameter value, and HDBSCAN\* results for all others.
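
The merging idea can be illustrated with a small toy sketch in plain Python. The tree, its birth distances, and the helper names ``merge_up`` and ``select_with_epsilon`` are hypothetical and heavily simplified; this is not the library's implementation. The sketch assumes that every candidate cluster born at a distance scale below the threshold is replaced by its first ancestor that was born above the threshold, while all other candidates are kept as they are:

```python
# Hypothetical condensed cluster tree: child -> (parent, birth_epsilon_m).
# birth_epsilon_m is the distance scale (in meters) at which the child
# split off from its parent. Cluster 0 is the root.
TREE = {
    1: (0, 40.0),
    2: (0, 40.0),
    3: (1, 4.0),
    4: (1, 4.0),
    5: (3, 2.0),
    6: (3, 2.0),
}


def merge_up(cluster, epsilon, tree=TREE, root=0):
    """Climb from `cluster` to the first ancestor born above `epsilon`."""
    parent, _ = tree[cluster]
    if parent == root:
        # In this sketch the root itself is never returned.
        return cluster
    _, parent_birth = tree[parent]
    if parent_birth > epsilon:
        return parent
    return merge_up(parent, epsilon, tree, root)


def select_with_epsilon(candidates, epsilon, tree=TREE):
    """Replace candidates born below `epsilon` by a merged ancestor."""
    selected = []
    for cluster in candidates:
        _, birth = tree[cluster]
        chosen = merge_up(cluster, epsilon, tree) if birth < epsilon else cluster
        if chosen not in selected:
            selected.append(chosen)
    return selected


leaves = [5, 6, 4, 2]
merged_5m = select_with_epsilon(leaves, 5.0)  # micro-clusters 4, 5, 6 collapse into 1
merged_3m = select_with_epsilon(leaves, 3.0)  # only 5 and 6 collapse, into 3
unchanged = select_with_epsilon(leaves, 0.0)  # a threshold of 0 merges nothing
```

In this toy tree, cluster 2 is never touched because it already splits off at a scale well above the threshold, which corresponds to the "HDBSCAN\* results for all others" part of the description.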

In our example, we choose to merge nested clusters below 5 meters (0.005 kilometers) and therefore set the parameter ``cluster_selection_epsilon`` accordingly:

.. code:: python

    import hdbscan
    import numpy as np

    X = np.radians(coordinates)  # convert the list of lat/lon coordinates to radians
    earth_radius_km = 6371
    epsilon = 0.005 / earth_radius_km  # 5 meter epsilon threshold, expressed in radians

    clusterer = hdbscan.HDBSCAN(min_cluster_size=4, metric='haversine',
                                cluster_selection_epsilon=epsilon, cluster_selection_method='eom')
    clusterer.fit(X)
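
As a sanity check on the unit conversion, the following standard-library sketch computes the same threshold and compares it with the haversine distance between two points. The coordinates and the helper names ``meters_to_radians`` and ``haversine_radians`` are made up for illustration, and the Earth radius of 6371 km is an approximation. The sketch assumes points are given as (latitude, longitude) pairs in radians, so that distances and the threshold are both angles in radians:

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def meters_to_radians(meters):
    """Convert a ground distance in meters to an angle in radians."""
    return (meters / 1000.0) / EARTH_RADIUS_KM


def haversine_radians(p, q):
    """Great-circle distance in radians between two (lat, lon) points in radians."""
    (lat1, lon1), (lat2, lon2) = p, q
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * math.asin(math.sqrt(h))


epsilon = meters_to_radians(5)  # same value as 0.005 / 6371

# Two made-up points on the same meridian, between 4 and 5 meters apart.
p1 = (math.radians(52.520000), math.radians(13.405000))
p2 = (math.radians(52.520040), math.radians(13.405000))

distance = haversine_radians(p1, p2)
distance_m = distance * EARTH_RADIUS_KM * 1000  # back to meters for readability
within_threshold = distance <= epsilon
```

Multiplying a haversine distance by the Earth radius converts it back to a ground distance, which makes it easier to check that a chosen threshold has the intended scale.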

And indeed, the result looks like a mix between DBSCAN and HDBSCAN(eom). We no longer lose clusters of variable densities beyond the given epsilon, but at the
same time avoid the abundance of micro-clusters in the original HDBSCAN\* clustering, which was an undesired side-effect of having to choose a low ``min_cluster_size`` value.

.. image:: images/epsilon_parameter_HDBSCAN_eps.png
    :align: center

Note that for the given parameter setting, running HDBSCAN\* based on ``cluster_selection_method = 'eom'`` or ``cluster_selection_method = 'leaf'`` does not make
any difference: the ``cluster_selection_epsilon`` threshold neutralizes the effect of HDBSCAN(eom)'s stability calculations.
When using a lower threshold, some minor differences can be noticed. For example, an epsilon value of 3 meters with ``'eom'`` produces the same results as
the 5 meter value on the given data set, but 3 meters in combination with ``'leaf'`` achieves a slightly different result:

.. image:: images/epsilon_parameter_HDBSCAN_e3_leaf.png
    :align: center

A ``cluster_selection_epsilon`` value of 0 (the default value) always returns the original HDBSCAN\* results, either according to ``'eom'`` or ``'leaf'``.