
Commit 1abf7c8
Merge remote-tracking branch 'origin/master'
2 parents: 4b52a3a + 9f7770e

File tree: 7 files changed (+143 additions, -1076 deletions)

README.rst

Lines changed: 61 additions & 5 deletions
@@ -14,8 +14,7 @@ Based on the paper:
     In: Advances in Knowledge Discovery and Data Mining, Springer, pp 160-172.
     2013
 
-Notebooks `comparing HDBSCAN to other clustering algorithms <http://nbviewer.jupyter.org/github/lmcinnes/hdbscan/blob/master/notebooks/Comparing%20Clustering%20Algorithms.ipynb>`_,
-and explaining `how HDBSCAN works <http://nbviewer.jupyter.org/github/lmcinnes/hdbscan/blob/master/notebooks/How%20HDBSCAN%20Works.ipynb>`_ are available.
+Notebooks `comparing HDBSCAN to other clustering algorithms <http://nbviewer.jupyter.org/github/lmcinnes/hdbscan/blob/master/notebooks/Comparing%20Clustering%20Algorithms.ipynb>`_, explaining `how HDBSCAN works <http://nbviewer.jupyter.org/github/lmcinnes/hdbscan/blob/master/notebooks/How%20HDBSCAN%20Works.ipynb>`_ and `comparing performance with other python clustering implementations <http://nbviewer.jupyter.org/github/lmcinnes/hdbscan/blob/master/notebooks/Benchmarking%20scalability%20of%20clustering%20implementations.ipynb>`_ are available.
 
 ------------------
 How to use HDBSCAN
@@ -34,9 +33,66 @@ giving a distance matrix between samples.
     clusterer = hdbscan.HDBSCAN(min_cluster_size=10)
     cluster_labels = clusterer.fit_predict(data)
 
-Note that clustering larger datasets will require significant memory
-(as with any algorithm that needs all pairwise distances). Support for
-low memory/better scaling is planned but not yet implemented.
+-----------
+Performance
+-----------
+
+Significant effort has been put into making the hdbscan implementation as fast as
+possible. It is more than twice as fast as the reference implementation in Java
+and is competitive with highly optimized single linkage implementations in C and C++.
+`Current performance can be seen in this notebook <http://nbviewer.jupyter.org/github/lmcinnes/hdbscan/blob/master/notebooks/Benchmarking%20scalability%20of%20clustering%20implementations.ipynb>`_,
+and further performance improvements should be forthcoming in the next few releases.
+
+------------------------
+Additional functionality
+------------------------
+
+The hdbscan package comes equipped with visualization tools to help you
+understand your clustering results. After fitting data, the clusterer
+object has attributes for:
+
+* The condensed cluster hierarchy
+* The robust single linkage cluster hierarchy
+* The reachability distance minimal spanning tree
+
+All of these come equipped with methods for plotting and for converting
+to Pandas or NetworkX for further analysis. See the notebook on
+`how HDBSCAN works <http://nbviewer.jupyter.org/github/lmcinnes/hdbscan/blob/master/notebooks/How%20HDBSCAN%20Works.ipynb>`_ for examples and further details.
+
+The clusterer objects also have an attribute providing cluster membership
+strengths, enabling optional soft clustering (at no further computational
+expense).
+
+---------------------
+Robust single linkage
+---------------------
+
+The hdbscan package also provides support for the *robust single linkage*
+clustering algorithm of Chaudhuri and Dasgupta. As with the HDBSCAN
+implementation, this is a high performance version of the algorithm,
+outperforming scipy's standard single linkage implementation. The
+robust single linkage hierarchy is available as an attribute of
+the robust single linkage clusterer, again with the ability to plot
+or export the hierarchy, and to extract flat clusterings at a given
+cut level and gamma value.
+
+Example usage:
+
+.. code:: python
+
+    import hdbscan
+
+    clusterer = hdbscan.RobustSingleLinkage(cut=0.125, k=7)
+    cluster_labels = clusterer.fit_predict(data)
+    hierarchy = clusterer.cluster_hierarchy_
+    alt_labels = hierarchy.get_clusters(0.100, 5)
+    hierarchy.plot()
+
+Based on the paper:
+    K. Chaudhuri and S. Dasgupta.
+    *"Rates of convergence for the cluster tree."*
+    In Advances in Neural Information Processing Systems, 2010.
 
 ----------
 Installing
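The README's list of attributes (condensed tree, single linkage hierarchy, minimal spanning tree) reflects how those objects relate: sorting the minimal spanning tree's edges by weight yields exactly the single linkage merge order. A minimal pure-Python sketch of that relationship, using plain Euclidean distances rather than hdbscan's mutual reachability distances (`mst_edges` and the sample points are illustrative, not part of the library's API):

```python
import math

def mst_edges(points):
    # Prim's algorithm over Euclidean distances; returns (i, j, weight) edges.
    n = len(points)
    in_tree = {0}
    edges = []
    # best[v] = (distance to closest tree vertex, that vertex's index)
    best = {v: (math.dist(points[0], points[v]), 0) for v in range(1, n)}
    while len(in_tree) < n:
        j = min(best, key=lambda v: best[v][0])
        w, i = best.pop(j)
        in_tree.add(j)
        edges.append((i, j, w))
        for v in best:
            d = math.dist(points[j], points[v])
            if d < best[v][0]:
                best[v] = (d, j)
    # Sorting the MST edges by weight gives the single-linkage merge order.
    return sorted(edges, key=lambda e: e[2])
```

For instance, with three collinear points and one outlier, the two short edges merge first and the outlier joins last, which is exactly the dendrogram single linkage would build.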

hdbscan/hdbscan_.py

Lines changed: 1 addition & 11 deletions
@@ -142,17 +142,7 @@ def _hdbscan_large_kdtree(X, min_cluster_size=5, min_samples=None, alpha=1.0,
     mutual_reachability_ = kdtree_pdist_mutual_reachability(X, metric, p, min_samples, alpha)
 
     min_spanning_tree = mst_linkage_core_pdist(mutual_reachability_)
-
-    if gen_min_span_tree:
-        result_min_span_tree = min_spanning_tree.copy()
-        for index, row in enumerate(result_min_span_tree[1:], 1):
-            candidates = np.where(np.isclose(mutual_reachability_[row[1]], row[2]))[0]
-            candidates = np.intersect1d(candidates, min_spanning_tree[:index, :2].astype(int))
-            candidates = candidates[candidates != row[1]]
-            assert(len(candidates) > 0)
-            row[0] = candidates[0]
-    else:
-        result_min_span_tree = None
+    result_min_span_tree = None
 
     min_spanning_tree = min_spanning_tree[np.argsort(min_spanning_tree.T[2]), :]
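The removed block tried to recover MST edge endpoints by searching the mutual reachability matrix. For readers unfamiliar with the term, the mutual reachability distance from the HDBSCAN paper is the maximum of the two points' core distances (distance to the k-th nearest neighbour) and their direct distance. A small pure-Python sketch of that definition; `core_distance` and `mutual_reachability` are illustrative names, not the package's internal API:

```python
import math

def core_distance(points, i, k):
    # Distance from points[i] to its k-th nearest neighbour (excluding itself).
    dists = sorted(math.dist(points[i], q)
                   for j, q in enumerate(points) if j != i)
    return dists[k - 1]

def mutual_reachability(points, i, j, k):
    # Max of the two core distances and the direct distance, so pairs in
    # sparse regions are pushed at least core-distance apart.
    return max(core_distance(points, i, k),
               core_distance(points, j, k),
               math.dist(points[i], points[j]))
```

Building the minimum spanning tree over these distances (rather than raw distances) is what makes the resulting hierarchy robust to noise points.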

hdbscan/robust_single_linkage_.py

Lines changed: 2 additions & 1 deletion
@@ -291,8 +291,9 @@ class RobustSingleLinkage(BaseEstimator, ClusterMixin):
 
     """
 
-    def __init__(self, k=5, alpha=1.4142135623730951, gamma=5, metric='euclidean', p=None):
+    def __init__(self, cut=0.25, k=5, alpha=1.4142135623730951, gamma=5, metric='euclidean', p=None):
 
+        self.cut = cut
         self.k = k
         self.alpha = alpha
         self.gamma = gamma
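The new `cut` parameter (default 0.25) sets the level at which a flat clustering is extracted from the hierarchy. Conceptually, cutting a single linkage dendrogram at distance `cut` is equivalent to taking connected components of the graph linking points closer than `cut`. A minimal union-find sketch under that interpretation (hypothetical helper, not the library's implementation, and ignoring robust single linkage's k and alpha adjustments):

```python
import math

def flat_clusters(points, cut):
    # Single-linkage flat clustering at level `cut`: connected components
    # of the graph joining points whose distance is at most `cut`.
    parent = list(range(len(points)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= cut:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[ri] = rj
    return [find(i) for i in range(len(points))]
```

Two tight groups separated by a gap wider than `cut` come back with two distinct labels, mirroring what extracting a flat clustering at that cut level does.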

notebooks/Benchmarking scalability of clustering implementations.ipynb

Lines changed: 15 additions & 6 deletions
@@ -745,7 +745,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Now we run that for each of our pre-existing datasets to extrapolate out predicted performance on the relevant dataset sizes. A little pandas wrangling later and we've produced a table of roughly how large a dataset you can tackle in each time frame with each implementation. I had to leave out the scipy KMeans timings because the noise in timing results caused the model to be unrealistic at larger data sizes. Note how the $O(n\\log n)$ algorithms utterly dominate here. In the meantime, for medium sizes data sets you can still get quite a lot done with HDBSCAN."
+    "Now we run that for each of our pre-existing datasets to extrapolate out predicted performance on the relevant dataset sizes. A little pandas wrangling later and we've produced a table of roughly how large a dataset you can tackle in each time frame with each implementation."
    ]
   },
   {
@@ -888,6 +888,15 @@
     "datasize_table"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "I had to leave out the scipy KMeans timings because the noise in the timing results caused the model to be unrealistic at larger data sizes. It is also worth keeping in mind that some of the results of the models for larger numbers are simply false -- you'll recall that Fastcluster and Scipy's single linkage both didn't scale at all well past 40000 points on my laptop, so I'm certainly not going to manage 50000 or 100000 over lunch. The same applies to DeBaCl and the slower Sklearn implementations, as they also produce the full pairwise distance matrix during computations.\n",
+    "\n",
+    "The main thing to note is how the $O(n\\log n)$ algorithms utterly dominate here. In the meantime, for medium sized data sets you can still get quite a lot done with HDBSCAN."
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -911,21 +920,21 @@
   ],
  "metadata": {
   "kernelspec": {
-   "display_name": "Python 2",
+   "display_name": "Python 3",
    "language": "python",
-   "name": "python2"
+   "name": "python3"
   },
   "language_info": {
    "codemirror_mode": {
     "name": "ipython",
-    "version": 2
+    "version": 3
    },
    "file_extension": ".py",
    "mimetype": "text/x-python",
    "name": "python",
    "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython2",
-   "version": "2.7.10"
+   "pygments_lexer": "ipython3",
+   "version": "3.4.3"
   }
  },
 "nbformat": 4,

notebooks/Comparing Clustering Algorithms.ipynb

Lines changed: 63 additions & 14 deletions
Large diffs are not rendered by default.

notebooks/Digits clustering comparisons.ipynb

Lines changed: 0 additions & 1038 deletions
This file was deleted.

setup.py

Lines changed: 1 addition & 1 deletion
@@ -28,7 +28,7 @@ def readme():
 
 configuration = {
     'name' : 'hdbscan',
-    'version' : '0.2',
+    'version' : '0.3.1',
     'description' : 'Clustering based on density with variable density clusters',
     'long_description' : readme(),
     'classifiers' : [
