Commit b729cfe

Merge pull request #330 from cmalzer/master
Documentation of epsilon parameter
2 parents 6c1a6d4 + ea37abf commit b729cfe

8 files changed

+85
-2
lines changed

docs/how_to_use_epsilon.rst

Lines changed: 67 additions & 0 deletions
@@ -0,0 +1,67 @@
1+
2+
Combining HDBSCAN\* with DBSCAN
3+
=============================
4+
5+
While DBSCAN needs a minimum cluster size *and* a distance threshold epsilon as user-defined input parameters,
6+
HDBSCAN\* essentially performs DBSCAN over varying epsilon values and therefore needs only the minimum cluster size as a single input parameter.
7+
The ``'eom'`` (Excess of Mass) cluster selection method then returns clusters with the best stability over epsilon.
8+
9+
Unlike DBSCAN, this allows it to find clusters of variable densities without having to choose a suitable distance threshold first.
10+
However, there are cases where we could still benefit from the use of an epsilon threshold.
11+
12+
For illustration, see this map with GPS locations, representing recorded pick-up and drop-off locations for customers of a ride pooling provider.
13+
The largest (visual) data cluster can be found around the train station. Smaller clusters are placed along the streets, depending on the requested location
14+
in the form of a postal address or point of interest. Since we are considering a door-to-door system where customers are not bound to collective pick-up or
15+
drop-off locations, we are interested in both large clusters and small clusters with a minimum size of 4.
16+
17+
.. image:: images/epsilon_parameter_dataset.png
18+
:align: center
19+
20+
Clustering the given data set with `DBSCAN <https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html>`__ and an epsilon threshold of 5 meters gives us good results,
21+
but neglects clusters with points that are more than 5 meters apart from each other.
22+
However, increasing epsilon would result in cluster chains along the streets, especially when working with a larger data set.
23+
24+
.. image:: images/epsilon_parameter_dbscan.png
25+
:align: center
26+
27+
Unfortunately, HDBSCAN\* does not produce any better results in this case: while it discovers the clusters that DBSCAN missed, it also returns a very high number of micro-clusters around the train station,
28+
even though we would prefer one or only a few clusters representing this location. We could achieve this by increasing ``min_cluster_size`` or
29+
the smoothing parameter ``min_samples``, but with the trade-off of losing small clusters in less dense areas or merging them into other clusters
30+
separated by a relatively large distance.
31+
32+
.. image:: images/epsilon_parameter_HDBSCAN_eom.png
33+
:align: center
34+
35+
This is where the parameter ``cluster_selection_epsilon`` comes into play. The cluster extraction method using this parameter, as described in detail
36+
by `Malzer and Baum <https://arxiv.org/abs/1911.02282>`__, acts like a hybrid between DBSCAN
37+
(or, to be precise, DBSCAN\*, i.e. DBSCAN without the border points) and HDBSCAN\*: it extracts DBSCAN\* results for data partitions
38+
affected by the given parameter value, and HDBSCAN\* results for all others.
39+
40+
In our example, we choose to merge nested clusters below 5 meters (0.005 kilometers) and therefore set the parameter ``cluster_selection_epsilon`` accordingly:
41+
42+
.. code:: python
43+
44+
import numpy as np
import hdbscan

X = np.radians(coordinates)  # convert the list of lat/lon coordinates to radians
45+
earth_radius_km = 6371
46+
epsilon = 0.005 / earth_radius_km  # 5 meter epsilon threshold, expressed in radians
47+
48+
clusterer = hdbscan.HDBSCAN(min_cluster_size=4, metric='haversine',
49+
cluster_selection_epsilon=epsilon, cluster_selection_method='eom')
50+
clusterer.fit(X)
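
The conversion from meters to the radian units used by the ``haversine`` metric can be checked without any clustering library.
The following is a minimal sketch using only the Python standard library. The helper names ``km_to_radians`` and
``haversine_radians`` and the two sample points are assumptions made for illustration; they are not part of ``hdbscan``.

.. code:: python

    import math

    EARTH_RADIUS_KM = 6371.0  # mean Earth radius, same value as in the example above


    def km_to_radians(distance_km):
        """Convert a surface distance in kilometers to an angle in radians."""
        return distance_km / EARTH_RADIUS_KM


    def haversine_radians(lat1, lon1, lat2, lon2):
        """Great-circle distance in radians between two points given in radians."""
        dlat = lat2 - lat1
        dlon = lon2 - lon1
        a = (math.sin(dlat / 2) ** 2
             + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
        return 2 * math.asin(math.sqrt(a))


    epsilon = km_to_radians(0.005)  # 5 meter threshold in radians

    # Two hypothetical points on the same meridian, 4 meters apart.
    lat_a, lon_a = math.radians(52.0), math.radians(13.0)
    lat_b = lat_a + km_to_radians(0.004)
    distance = haversine_radians(lat_a, lon_a, lat_b, lon_a)

    print(distance < epsilon)

Two points that are closer than 5 meters yield a haversine distance below ``epsilon``, which is the scale on which
``cluster_selection_epsilon`` is interpreted when ``metric='haversine'`` is used.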
51+
52+
And indeed, the result looks like a mix between DBSCAN and HDBSCAN(eom). We no longer lose clusters of variable densities beyond the given epsilon, but at the
53+
same time avoid the abundance of micro-clusters in the original HDBSCAN\* clustering, which was an undesired side-effect of having to choose a low ``min_cluster_size`` value.
54+
55+
.. image:: images/epsilon_parameter_HDBSCAN_eps.png
56+
:align: center
57+
58+
Note that for the given parameter setting, running HDBSCAN\* based on ``cluster_selection_method = 'eom'`` or ``cluster_selection_method = 'leaf'`` does not make
59+
any difference: the ``cluster_selection_epsilon`` threshold neutralizes the effect of HDBSCAN(eom)'s stability calculations.
60+
When using a lower threshold, some minor differences can be noticed. For example, an epsilon value of 3 meters with ``'eom'`` produces the same results as
61+
the 5 meter value on the given data set, but 3 meters in combination with ``'leaf'`` achieves a slightly different result:
62+
63+
.. image:: images/epsilon_parameter_HDBSCAN_e3_leaf.png
64+
:align: center
65+
66+
A ``cluster_selection_epsilon`` value of 0 (the default value) always returns the original HDBSCAN\* results, either according to ``'eom'`` or ``'leaf'``.
67+

docs/index.rst

Lines changed: 1 addition & 0 deletions
@@ -24,6 +24,7 @@ User Guide / Tutorial
2424
outlier_detection
2525
prediction_tutorial
2626
soft_clustering
27+
how_to_use_epsilon
2728
faq
2829

2930
Background on Clustering with HDBSCAN

docs/parameter_selection.rst

Lines changed: 17 additions & 2 deletions
@@ -168,6 +168,21 @@ above, make the clustering progressively more conservative, culminating
168168
in the example above where ``min_samples`` was set to 60 and we had only
169169
two clusters with most points declared as noise.
170170

171+
.. _epsilon_label:
172+
173+
Selecting ``cluster_selection_epsilon``
174+
---------------------------------------
175+
176+
In some cases, we want to choose a small ``min_cluster_size`` because even groups of few points might be of interest to us.
177+
However, if our data set also contains partitions with high concentrations of objects, this parameter setting can result in
178+
a large number of micro-clusters. Selecting a value for ``cluster_selection_epsilon`` helps us to merge clusters in these regions.
179+
In other words, it ensures that clusters below the given threshold are not split up any further.
180+
181+
The choice of ``cluster_selection_epsilon`` depends on the given distances between your data points. For example, set the value to 0.5 if you don't want to
182+
separate clusters that are less than 0.5 units apart. This will basically extract DBSCAN\* clusters for epsilon = 0.5 from the condensed cluster tree, but leave
183+
HDBSCAN\* clusters that emerged at distances greater than 0.5 untouched. See :doc:`how_to_use_epsilon` for a more detailed demonstration of the effect this parameter
184+
has on the resulting clustering.
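
The merging behaviour described above can be illustrated on a toy tree. The following is a simplified sketch under stated
assumptions, not the library's implementation: the tree, its split distances, and the helper ``merge_below_epsilon`` are
hypothetical, and split distances are assumed to shrink from the root towards the leaves.

.. code:: python

    # Hypothetical condensed tree, not produced by hdbscan:
    # child -> (parent, distance at which the child split off from its parent).
    ROOT = 0
    TREE = {
        1: (ROOT, 10.0),
        2: (ROOT, 10.0),
        3: (1, 0.4),   # micro-clusters inside the dense cluster 1
        4: (1, 0.4),
        5: (2, 2.0),   # clusters that emerged well above the threshold
        6: (2, 2.0),
    }


    def merge_below_epsilon(selected, tree, epsilon, root=ROOT):
        """Replace clusters that split off below ``epsilon`` by their
        closest ancestor that split off at or above it."""
        merged = set()
        for node in selected:
            # Climb while the cluster split off below the threshold,
            # but never return the root itself.
            while tree[node][1] < epsilon and tree[node][0] != root:
                node = tree[node][0]
            merged.add(node)
        return merged


    leaves = {3, 4, 5, 6}
    merged_half = merge_below_epsilon(leaves, TREE, 0.5)  # 3 and 4 collapse into 1
    merged_zero = merge_below_epsilon(leaves, TREE, 0.0)  # nothing changes
    print(sorted(merged_half), sorted(merged_zero))

With a threshold of 0.5, the two micro-clusters that split off at a distance of 0.4 are replaced by their parent, while the
clusters that emerged at a distance of 2.0 are left untouched. With a threshold of 0, the selection is unchanged.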
185+
171186
.. _alpha_label:
172187

173188
Selecting ``alpha``
@@ -176,8 +191,8 @@ Selecting ``alpha``
176191
A further parameter that affects the resulting clustering is ``alpha``.
177192
In practice it is best not to mess with this parameter -- ultimately it
178193
is part of the ``RobustSingleLinkage`` code, but flows naturally into
179-
HDBSCAN\*. If, for some reason, ``min_samples`` is not providing you
180-
what you need, stop, rethink things, and try again with ``min_samples``.
194+
HDBSCAN\*. If, for some reason, ``min_samples`` or ``cluster_selection_epsilon`` is not providing you
195+
what you need, stop, rethink things, and try again with ``min_samples`` or ``cluster_selection_epsilon``.
181196
If you still need to play with another parameter (and you shouldn't),
182197
then you can try setting ``alpha``. The ``alpha`` parameter provides a
183198
slightly different approach to determining how conservative the
