scikit-learn-contrib
diff --git a/docs/how_to_use_epsilon.rst b/docs/how_to_use_epsilon.rst
Lines changed: 67 additions & 0 deletions
diff --git a/docs/images/epsilon_parameter_dataset.PNG b/docs/images/epsilon_parameter_dataset.PNG
123 KB
diff --git a/docs/images/epsilon_parameter_dbscan.PNG b/docs/images/epsilon_parameter_dbscan.PNG
126 KB
diff --git a/docs/images/epsilon_parameter_hdbscan_e3_leaf.png b/docs/images/epsilon_parameter_hdbscan_e3_leaf.png
124 KB
diff --git a/docs/images/epsilon_parameter_hdbscan_eom.PNG b/docs/images/epsilon_parameter_hdbscan_eom.PNG
136 KB
diff --git a/docs/images/epsilon_parameter_hdbscan_eps.PNG b/docs/images/epsilon_parameter_hdbscan_eps.PNG
154 KB
diff --git a/docs/index.rst b/docs/index.rst
Lines changed: 1 addition & 0 deletions
diff --git a/docs/parameter_selection.rst b/docs/parameter_selection.rst
Lines changed: 17 additions & 2 deletions
@@ -0,0 +1,67 @@
+
+Combining HDBSCAN\* with DBSCAN
+===============================
+
+While DBSCAN needs a minimum cluster size *and* a distance threshold epsilon as user-defined input parameters, 
+HDBSCAN\* is basically a DBSCAN implementation for varying epsilon values and therefore only needs the minimum cluster size as a single input parameter.
+The ``'eom'`` (Excess of Mass) cluster selection method then returns clusters with the best stability over epsilon.
+
+Unlike DBSCAN, this allows it to find clusters of variable densities without having to choose a suitable distance threshold first.
+However, there are cases where we could still benefit from the use of an epsilon threshold.
+
+For illustration, see this map with GPS locations, representing recorded pick-up and drop-off locations for customers of a ride pooling provider.
+The largest (visual) data cluster can be found around the train station. Smaller clusters are placed along the streets, depending on the requested location
+in the form of a postal address or point of interest. Since we are considering a door-to-door system where customers are not bound to collective pick-up or
+drop-off locations, we are interested in both large clusters and small clusters with a minimum size of 4.  
+
+.. image:: images/epsilon_parameter_dataset.PNG
+	:align: center
+	
+Clustering the given data set with `DBSCAN <https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html>`__ and an epsilon threshold of 5 meters gives us good results, 
+but neglects clusters with points that are more than 5 meters apart from each other. 
+However, increasing epsilon would result in cluster chains along the streets, especially when working with a larger data set. 
+
+.. image:: images/epsilon_parameter_dbscan.PNG
+	:align: center
+
+Unfortunately, HDBSCAN\* does not produce any better results in this case: while it discovers the clusters that DBSCAN missed, it also returns a very high number of micro-clusters around the train station, 
+even though we would prefer one or only a few clusters representing this location. We could achieve this by increasing ``min_cluster_size`` or
+the smoothing parameter ``min_samples``, but with the trade-off of losing small clusters in less dense areas or merging them into other clusters 
+separated by a relatively large distance.
+
+.. image:: images/epsilon_parameter_hdbscan_eom.PNG
+	:align: center
+	
+This is where the parameter ``cluster_selection_epsilon`` comes into play. The cluster extraction method using this parameter, as described in detail
+by `Malzer and Baum <https://arxiv.org/abs/1911.02282>`__, acts like a hybrid between DBSCAN
+(or, to be precise, DBSCAN\*, i.e. DBSCAN without the border points) and HDBSCAN\*: it extracts DBSCAN\* results for data partitions
+affected by the given parameter value, and HDBSCAN\* results for all others.
+
+In our example, we choose to merge nested clusters below 5 meters (0.005 kilometers) and therefore set the parameter ``cluster_selection_epsilon`` accordingly:
+
+.. code:: python
+
+	import numpy as np
+	import hdbscan
+
+	# coordinates: the list of [lat, lon] pairs in degrees; haversine expects radians
+	X = np.radians(coordinates)
+	earth_radius_km = 6371
+	epsilon = 0.005 / earth_radius_km  # 5 meter threshold as an angle in radians
+
+	clusterer = hdbscan.HDBSCAN(min_cluster_size=4, metric='haversine',
+	                            cluster_selection_epsilon=epsilon,
+	                            cluster_selection_method='eom')
+	clusterer.fit(X)
+	
+And indeed, the result looks like a mix between DBSCAN and HDBSCAN(eom). We no longer lose clusters of variable densities beyond the given epsilon, but at the
+same time avoid the abundance of micro-clusters in the original HDBSCAN\* clustering, which was an undesired side-effect of having to choose a low ``min_cluster_size`` value.
+
+.. image:: images/epsilon_parameter_hdbscan_eps.PNG
+	:align: center
+	
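+The conversion from meters to the angular distance expected by the ``'haversine'`` metric can be sanity-checked without any clustering.
+The following is a minimal sketch using only the standard library; ``meters_to_radians`` and ``haversine_radians`` are hypothetical helper
+names (not part of hdbscan), and the Earth radius of 6371 km is the same approximation as in the example above.
+
+.. code:: python
+
+	import math
+
+	EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation
+
+	def meters_to_radians(meters, earth_radius_km=EARTH_RADIUS_KM):
+	    """Convert a ground distance in meters to an angle in radians."""
+	    return (meters / 1000.0) / earth_radius_km
+
+	def haversine_radians(p, q):
+	    """Great-circle angle between two (lat, lon) points given in radians."""
+	    dlat = q[0] - p[0]
+	    dlon = q[1] - p[1]
+	    a = (math.sin(dlat / 2) ** 2
+	         + math.cos(p[0]) * math.cos(q[0]) * math.sin(dlon / 2) ** 2)
+	    return 2 * math.asin(min(1.0, math.sqrt(a)))
+
+	epsilon = meters_to_radians(5)
+
+	# Two points on the same meridian whose latitudes differ by exactly epsilon
+	p = (math.radians(52.0), math.radians(13.0))
+	q = (p[0] + epsilon, p[1])
+
+	# Multiplying the angle by the Earth radius gives the distance in km again
+	print(haversine_radians(p, q) * EARTH_RADIUS_KM * 1000.0, "meters")
+
+Points that are closer together than this angle fall below the ``cluster_selection_epsilon`` threshold used above.
+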
+Note that for the given parameter setting, running HDBSCAN\* based on ``cluster_selection_method = 'eom'`` or ``cluster_selection_method = 'leaf'`` does not make
+any difference: the ``cluster_selection_epsilon`` threshold neutralizes the effect of HDBSCAN(eom)'s stability calculations.
+When using a lower threshold, some minor differences can be noticed. For example, an epsilon value of 3 meters with ``'eom'`` produces the same results as
+the 5 meter value on the given data set, but 3 meters in combination with ``'leaf'`` achieves a slightly different result:
+	
+.. image:: images/epsilon_parameter_hdbscan_e3_leaf.png
+	:align: center
+
+A ``cluster_selection_epsilon`` value of 0 (the default value) always returns the original HDBSCAN\* results, either according to ``'eom'`` or ``'leaf'``.
+	
@@ -24,6 +24,7 @@ User Guide / Tutorial
    outlier_detection
    prediction_tutorial
    soft_clustering
+   how_to_use_epsilon
    faq
 
 Background on Clustering with HDBSCAN
 
@@ -168,6 +168,21 @@ above, make the clustering progressively more conservative, culminating
 in the example above where ``min_samples`` was set to 60 and we had only
 two clusters with most points declared as noise.
 
+.. _epsilon_label:
+
+Selecting ``cluster_selection_epsilon``
+---------------------------------------
+
+In some cases, we want to choose a small ``min_cluster_size`` because even groups of few points might be of interest to us.
+However, if our data set also contains partitions with high concentrations of objects, this parameter setting can result in
+a large number of micro-clusters. Selecting a value for ``cluster_selection_epsilon`` helps us to merge clusters in these regions.
+In other words, it ensures that clusters below the given threshold are not split up any further.
+
+The choice of ``cluster_selection_epsilon`` depends on the given distances between your data points. For example, set the value to 0.5 if you don't want to
+separate clusters that are less than 0.5 units apart. This will basically extract DBSCAN\* clusters for epsilon = 0.5 from the condensed cluster tree, but leave
+HDBSCAN\* clusters that emerged at distances greater than 0.5 untouched. See :doc:`how_to_use_epsilon` for a more detailed demonstration of the effect this parameter
+has on the resulting clustering.
+
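+As a rough illustration, the following sketch compares a plain HDBSCAN\* run with one that sets ``cluster_selection_epsilon``.
+The synthetic ``data`` array (one tight blob and one loose blob) and the threshold of 0.5 are assumptions made for this example only;
+the resulting cluster counts depend on the data, so none are quoted here.
+
+.. code:: python
+
+	import numpy as np
+	import hdbscan
+
+	# Synthetic data (an assumption for illustration): a dense and a sparse blob
+	rng = np.random.default_rng(42)
+	dense = rng.normal(loc=0.0, scale=0.3, size=(300, 2))
+	sparse = rng.normal(loc=8.0, scale=1.5, size=(30, 2))
+	data = np.vstack([dense, sparse])
+
+	# Small min_cluster_size so that small groups can still form clusters
+	baseline = hdbscan.HDBSCAN(min_cluster_size=4).fit(data)
+
+	# Same settings, but clusters that would split below a distance of 0.5 stay merged
+	merged = hdbscan.HDBSCAN(min_cluster_size=4,
+	                         cluster_selection_epsilon=0.5).fit(data)
+
+	# labels_ marks noise as -1, so the number of clusters is the largest label + 1
+	n_baseline = baseline.labels_.max() + 1
+	n_merged = merged.labels_.max() + 1
+	print(n_baseline, n_merged)
+
+Because the epsilon threshold only merges selected clusters into their ancestors, the second count should not exceed the first.
+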
 .. _alpha_label:
 
 Selecting ``alpha``
@@ -176,8 +191,8 @@ Selecting ``alpha``
 A further parameter that effects the resulting clustering is ``alpha``.
 In practice it is best not to mess with this parameter -- ultimately it
 is part of the ``RobustSingleLinkage`` code, but flows naturally into
-HDBSCAN\*. If, for some reason, ``min_samples`` is not providing you
-what you need, stop, rethink things, and try again with ``min_samples``.
+HDBSCAN\*. If, for some reason, ``min_samples`` or ``cluster_selection_epsilon`` is not providing you
+what you need, stop, rethink things, and try again with ``min_samples`` or ``cluster_selection_epsilon``.
 If you still need to play with another parameter (and you shouldn't),
 then you can try setting ``alpha``. The ``alpha`` parameter provides a
 slightly different approach to determining how conservative the