Commit ec3b205

Merge remote-tracking branch 'origin/master'
2 parents c13be8a + 2179c24

13 files changed: +1189 −530 lines

azure-pipelines.yml

Lines changed: 151 additions & 0 deletions
@@ -0,0 +1,151 @@
# Python package
# Create and test a Python package on multiple Python versions.
# Add steps that analyze code, save the dist with the build record, publish to a PyPI-compatible index, and more:
# https://docs.microsoft.com/azure/devops/pipelines/languages/python

trigger:
- master

jobs:
- job: Linux
  pool:
    vmImage: ubuntu-latest
  strategy:
    matrix:
      Python37:
        python.version: '3.7'
      Python38:
        python.version: '3.8'
      Python39:
        python.version: '3.9'

  steps:
  - task: UsePythonVersion@0
    inputs:
      versionSpec: '$(python.version)'
    displayName: 'Use Python $(python.version)'

  - script: |
      python -m pip install --upgrade pip
      pip install -r requirements.txt
    displayName: 'Install dependencies'

  - script: |
      pip install cython
      python setup.py develop

  - script: |
      pip install pytest pytest-azurepipelines
      pytest
    displayName: 'pytest'

  - task: PublishTestResults@2
    inputs:
      testResultsFiles: 'pytest.xml'
      testRunTitle: '$(Agent.OS) - $(Build.BuildNumber)[$(Agent.JobName)] - Python $(python.version)'
    condition: succeededOrFailed()

- job: Windows
  pool:
    vmImage: 'windows-latest'
  strategy:
    matrix:
      Python36:
        python.version: '3.6'
      Python37:
        python.version: '3.7'
      Python38:
        python.version: '3.8'
      Python39:
        python.version: '3.9'

  steps:
  - task: UsePythonVersion@0
    inputs:
      versionSpec: '$(python.version)'
    displayName: 'Use Python $(python.version)'

  - script: |
      python -m pip install --upgrade pip
      pip install -r requirements.txt
    displayName: 'Install dependencies'

  - script: |
      pip install cython
      python setup.py develop

  - script: |
      pip install pytest pytest-azurepipelines
      pytest
    displayName: 'pytest'

- job: MacOS
  pool:
    vmImage: 'macos-latest'
  strategy:
    matrix:
      Python37:
        python.version: '3.7'
      Python38:
        python.version: '3.8'
      Python39:
        python.version: '3.9'

  steps:
  - task: UsePythonVersion@0
    inputs:
      versionSpec: '$(python.version)'
    displayName: 'Use Python $(python.version)'

  - script: |
      python -m pip install --upgrade pip
      pip install -r requirements.txt
    displayName: 'Install dependencies'

  - script: |
      pip install cython
      python setup.py develop

  - script: |
      pip install pytest pytest-azurepipelines
      pytest
    displayName: 'pytest'

- job: Coverage
  pool:
    vmImage: ubuntu-latest
  strategy:
    matrix:
      Python39:
        python.version: '3.9'

  steps:
  - task: UsePythonVersion@0
    inputs:
      versionSpec: '$(python.version)'
    displayName: 'Use Python $(python.version)'

  - script: |
      python -m pip install --upgrade pip
      pip install -r requirements.txt
    displayName: 'Install dependencies'

  - script: |
      pip install cython
      pip install pytest
      pip install pytest-cov
      pip install coveralls
      pip install codecov
      python setup.py develop

  - script: |
      pip install pytest pytest-azurepipelines
      pytest hdbscan/tests --show-capture=no -v --disable-warnings --junitxml=pytest.xml --cov=hdbscan/ --cov-report=xml --cov-report=html
      codecov
    displayName: 'pytest'

  - task: PublishTestResults@2
    inputs:
      testResultsFiles: 'pytest.xml'
      testRunTitle: '$(Agent.OS) - $(Build.BuildNumber)[$(Agent.JobName)] - Python $(python.version)'
    condition: succeededOrFailed()

docs/dbscan_from_hdbscan.rst

Lines changed: 126 additions & 0 deletions
@@ -0,0 +1,126 @@

Extracting DBSCAN* clustering from HDBSCAN*
===========================================

There are a number of reasons that one might prefer `DBSCAN <https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html>`__'s
clustering over that of HDBSCAN*. The biggest difficulty many folks have with
DBSCAN is that the epsilon distance parameter can be hard to determine and often
requires a great deal of trial and error to tune. If your data lives in a more
interpretable space and you have a good notion of distance in that space, this problem
is certainly mitigated, and a user might want to set a very specific epsilon distance
for their use case. Another viable use case might be that a user is interested in
constant density clustering.
HDBSCAN* does variable density clustering by default, looking for the clusters that persist
over a wide range of epsilon distance parameters to find a 'natural' clustering. This might
not be the right result for your application. A DBSCAN clustering at a particular
epsilon value might work better for your particular task.

HDBSCAN returns a very natural clustering of your data which is often very useful in exploring
a new data set. That doesn't necessarily make it the right clustering algorithm for every
task.

HDBSCAN* can best be thought of as a DBSCAN* implementation which varies across
all epsilon values and extracts the clusters that persist over the widest range
of these parameter choices. It is therefore able to ignore the epsilon parameter and
needs only the minimum cluster size as a single input parameter.
The 'eom' (Excess of Mass) cluster selection method then returns clusters with the
best stability over epsilon.

There are a number of alternative ways of extracting a flat clustering from
the HDBSCAN* hierarchical tree. If one is interested in finer resolution
clusters while still maintaining variable density one could set
``cluster_selection_method='leaf'`` to extract the leaves of the condensed
tree instead of the most persistent clusters, as in the sketch below. For more details on these
cluster selection methods see :ref:`leaf_clustering_label`.
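
As a minimal sketch (assuming the same data ``X`` used throughout this
document, with a purely illustrative ``min_cluster_size``), leaf extraction
only requires changing the selection method:

.. code:: python

    import hdbscan

    # 'leaf' takes the leaves of the condensed tree rather than the most
    # persistent ('eom') clusters, yielding finer-grained clusters.
    leaf_clusterer = hdbscan.HDBSCAN(min_cluster_size=5,
                                     cluster_selection_method='leaf').fit(X)
    leaf_labels = leaf_clusterer.labels_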

If one wasn't interested in the variable density clustering that is the hallmark of
HDBSCAN*, it is relatively easy to extract any DBSCAN* clustering from a
single run of HDBSCAN*. This has the advantage of allowing you to perform
a single computationally efficient HDBSCAN* run and then quickly search over
the DBSCAN* parameter space by extracting clustering results from our
pre-constructed tree. This can save significant computational time when
searching across multiple cluster parameter settings on large amounts of data.

Alternatively, one could make use of ``cluster_selection_epsilon`` as a
post-processing step with any ``cluster_selection_method`` in order to
return a hybrid clustering of DBSCAN* and HDBSCAN*, as sketched below. For more details on
this see :doc:`how_to_use_epsilon`.
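
A minimal sketch of this hybrid approach (``X`` as before; the ``0.5``
threshold is purely illustrative):

.. code:: python

    import hdbscan

    # Below the epsilon threshold clusters are extracted as DBSCAN* would,
    # while HDBSCAN*'s variable-density selection applies above it.
    hybrid_clusterer = hdbscan.HDBSCAN(min_cluster_size=5,
                                       cluster_selection_epsilon=0.5).fit(X)
    hybrid_labels = hybrid_clusterer.labels_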

In order to extract a DBSCAN* clustering from an HDBSCAN* run we must first train
an HDBSCAN model on our data.

.. code:: python

    import hdbscan
    h_cluster = hdbscan.HDBSCAN(min_samples=5, match_reference_implementation=True).fit(X)

The ``min_cluster_size`` parameter is unimportant in this case in that it is
only used in the creation of our condensed tree, which we won't be using here.
Now we choose a ``cut_distance``, which is just another name for the epsilon
threshold in DBSCAN, and pass it to our
:py:meth:`~hdbscan.hdbscan_.dbscan_clustering` method.

.. code:: python

    import seaborn as sns  # needed for the scatterplot below

    eps = 0.2
    labels = h_cluster.dbscan_clustering(cut_distance=eps, min_cluster_size=5)
    sns.scatterplot(x=X[:,0], y=X[:,1], hue=labels.astype(str));

.. image:: images/dbscan_from_hdbscan_clustering.png
    :align: center

It should be noted that a DBSCAN* clustering extracted from our HDBSCAN* tree will
not precisely match the clustering results from sklearn's DBSCAN implementation.
Our clustering results should better match DBSCAN* (which can be thought of as
DBSCAN without the border points). As such, when comparing the two results one
should expect them to mostly differ in the points that DBSCAN considers border
points. We'll deal with
this by only comparing the clustering results on the points identified
by DBSCAN as core points. We can see below that the differences between these two
clusterings mostly occur in the boundaries of the clusters. This matches our
intuition of stability within the core points.

.. image:: images/dbscan_from_hdbscan_comparision.png
    :align: center

For a slightly more empirical comparison we make use of the `adjusted rand score <https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html>`__
to compare the clustering of the core points between a DBSCAN clustering from sklearn and
a DBSCAN* clustering extracted from our HDBSCAN* object, as in the sketch below.
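
A sketch of how such a comparison might be computed (``X``, ``eps``, and
``h_cluster`` as above; the sweep and plotting code behind the figures below
is not shown):

.. code:: python

    from sklearn.cluster import DBSCAN
    from sklearn.metrics import adjusted_rand_score

    sk_model = DBSCAN(eps=eps, min_samples=5).fit(X)

    # Restrict the comparison to sklearn's core points, where DBSCAN and
    # DBSCAN* should agree.
    core = sk_model.core_sample_indices_
    hdb_labels = h_cluster.dbscan_clustering(cut_distance=eps, min_cluster_size=5)
    score = adjusted_rand_score(sk_model.labels_[core], hdb_labels[core])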

.. image:: images/dbscan_from_hdbscan_percentage_core.png
    :align: center

.. image:: images/dbscan_from_hdbscan_number_of_clusters.png
    :align: center

We see that for very small epsilon values the numbers of clusters found by the two
algorithms are quite far apart, largely due to a large number of the points being
considered boundary points instead of core points. As the epsilon value increases,
more and more points are considered core and the numbers of clusters generated by
each algorithm converge.

Additionally, the adjusted rand score between the core points of both algorithms
stays consistently high (mostly 1.0) for our entire range of epsilon. There may
be some minor discrepancies between core point results, largely due to implementation
details and optimizations in the code base.

Why might one extract the DBSCAN* clustering results from a single HDBSCAN* run
instead of making use of sklearn's DBSCAN code? The short answer is efficiency.
If you aren't sure what epsilon parameter to select for DBSCAN then you may have to
run the algorithm many times on your data set. While those runs can be inexpensive for
very small epsilon values, they can get quite expensive for large parameter values. With
a pre-built HDBSCAN* model, each such run reduces to a quick tree query, as sketched below.
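
A minimal sketch of such a sweep, reusing the ``h_cluster`` model trained
above (the epsilon grid is illustrative):

.. code:: python

    import numpy as np

    # Each extraction is a query against the already-built HDBSCAN* tree,
    # so sweeping epsilon avoids re-running the full clustering.
    for eps in np.linspace(0.1, 1.0, 10):
        labels = h_cluster.dbscan_clustering(cut_distance=eps, min_cluster_size=5)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        print(f"eps={eps:.1f}: {n_clusters} clusters")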

In this small benchmark case of 50,000 two-dimensional data points we have broken even
after trying only two epsilon parameters with DBSCAN, or only a single
run with a large parameter selected. This trend is only exacerbated for larger
data sets in higher dimensional spaces. For more detailed scaling experiments see
`Accelerated Hierarchical Density Clustering <https://arxiv.org/abs/1705.07321>`__
by McInnes and Healy.

.. image:: images/dbscan_from_hdbscan_timing.png
    :align: center

docs/index.rst

Lines changed: 1 addition & 0 deletions
@@ -25,6 +25,7 @@ User Guide / Tutorial
     prediction_tutorial
     soft_clustering
     how_to_use_epsilon
+    dbscan_from_hdbscan
     faq
 
 Background on Clustering with HDBSCAN

hdbscan/_hdbscan_boruvka.pyx

Lines changed: 14 additions & 9 deletions
@@ -423,7 +423,7 @@ cdef class KDTreeBoruvkaAlgorithm (object):
                 else:
                     datasets.append(np.asarray(self.tree.data[i*split_cnt:(i+1)*split_cnt]))
 
-            knn_data = Parallel(n_jobs=self.n_jobs)(
+            knn_data = Parallel(n_jobs=self.n_jobs, max_nbytes=None)(
                 delayed(_core_dist_query)
                 (self.core_dist_tree, points,
                  self.min_samples + 1)
@@ -454,8 +454,10 @@ cdef class KDTreeBoruvkaAlgorithm (object):
         # issues, but we'll get quite a few, and they are the hard ones to
         # get, so fill in any we can and then run update components.
         for n in range(self.num_points):
-            for i in range(1, self.min_samples + 1):
+            for i in range(0, self.min_samples + 1):
                 m = knn_indices[n, i]
+                if n == m:
+                    continue
                 if self.core_distance[m] <= self.core_distance[n]:
                     self.candidate_point[n] = n
                     self.candidate_neighbor[n] = m
@@ -745,7 +747,7 @@ cdef class KDTreeBoruvkaAlgorithm (object):
             # then propagate the results of that computation
             # up the tree.
             new_bound = min(new_upper_bound,
-                            new_lower_bound + 2 * node1_info.radius)
+                            new_lower_bound + 2 * self.dist._dist_to_rdist(node1_info.radius))
             # new_bound = new_upper_bound
             if new_bound < self.bounds_ptr[node1]:
                 self.bounds_ptr[node1] = new_bound
@@ -1025,36 +1027,39 @@ cdef class BallTreeBoruvkaAlgorithm (object):
                 else:
                     datasets.append(np.asarray(self.tree.data[i*split_cnt:(i+1)*split_cnt]))
 
-            knn_data = Parallel(n_jobs=self.n_jobs)(
+            knn_data = Parallel(n_jobs=self.n_jobs, max_nbytes=None)(
                 delayed(_core_dist_query)
                 (self.core_dist_tree, points,
-                 self.min_samples)
+                 self.min_samples + 1)
                 for points in datasets)
             knn_dist = np.vstack([x[0] for x in knn_data])
             knn_indices = np.vstack([x[1] for x in knn_data])
         else:
             knn_dist, knn_indices = self.core_dist_tree.query(
                 self.tree.data,
-                k=self.min_samples,
+                k=self.min_samples + 1,
                 dualtree=True,
                 breadth_first=True)
 
-        self.core_distance_arr = knn_dist[:, self.min_samples - 1].copy()
+        self.core_distance_arr = knn_dist[:, self.min_samples].copy()
         self.core_distance = (<np.double_t[:self.num_points:1]> (
             <np.double_t *> self.core_distance_arr.data))
 
         # Since we already computed NN distances for the min_samples closest
         # points we can use this to do the first round of boruvka -- we won't
         # get every point due to core_distance/mutual reachability distance
         # issues, but we'll get quite a few, and they are the hard ones to get,
-        # so fill in any we ca and then run update components.
+        # so fill in any we can and then run update components.
         for n in range(self.num_points):
-            for i in range(self.min_samples - 1, 0):
+            for i in range(0, self.min_samples + 1):
                 m = knn_indices[n, i]
+                if n == m:
+                    continue
                 if self.core_distance[m] <= self.core_distance[n]:
                     self.candidate_point[n] = n
                     self.candidate_neighbor[n] = m
                     self.candidate_distance[n] = self.core_distance[n]
+                    break
 
         self.update_components()
