README.rst: 61 additions & 5 deletions
@@ -14,8 +14,7 @@ Based on the paper:
 In: Advances in Knowledge Discovery and Data Mining, Springer, pp 160-172.
 2013
 
-Notebooks `comparing HDBSCAN to other clustering algorithms <http://nbviewer.jupyter.org/github/lmcinnes/hdbscan/blob/master/notebooks/Comparing%20Clustering%20Algorithms.ipynb>`_,
-and explaining `how HDBSCAN works <http://nbviewer.jupyter.org/github/lmcinnes/hdbscan/blob/master/notebooks/How%20HDBSCAN%20Works.ipynb>`_ are available.
+Notebooks `comparing HDBSCAN to other clustering algorithms <http://nbviewer.jupyter.org/github/lmcinnes/hdbscan/blob/master/notebooks/Comparing%20Clustering%20Algorithms.ipynb>`_, explaining `how HDBSCAN works <http://nbviewer.jupyter.org/github/lmcinnes/hdbscan/blob/master/notebooks/How%20HDBSCAN%20Works.ipynb>`_ and `comparing performance with other python clustering implementations <http://nbviewer.jupyter.org/github/lmcinnes/hdbscan/blob/master/notebooks/Benchmarking%20scalability%20of%20clustering%20implementations.ipynb>`_ are available.
 
 ------------------
 How to use HDBSCAN
@@ -34,9 +33,66 @@ giving a distance matrix between samples.
 clusterer = hdbscan.HDBSCAN(min_cluster_size=10)
 cluster_labels = clusterer.fit_predict(data)
 
-Note that clustering larger datasets will require significant memory
-(as with any algorithm that needs all pairwise distances). Support for
-low memory/better scaling is planned but not yet implemented.
+-----------
+Performance
+-----------
+
+Significant effort has been put into making the hdbscan implementation as fast as
+possible. It is more than twice as fast as the reference implementation in Java
+and is competitive with highly optimized single linkage implementations in C and C++.
+`Current performance can be seen in this notebook <http://nbviewer.jupyter.org/github/lmcinnes/hdbscan/blob/master/notebooks/Benchmarking%20scalability%20of%20clustering%20implementations.ipynb>`_
+and further performance improvements should be forthcoming in the next few releases.
+
+------------------------
+Additional functionality
+------------------------
+
+The hdbscan package comes equipped with visualization tools to help you
+understand your clustering results. After fitting data the clusterer
+object has attributes for:
+
+* The condensed cluster hierarchy
+* The robust single linkage cluster hierarchy
+* The reachability distance minimal spanning tree
+
+All of which come equipped with methods for plotting and converting
+to Pandas or NetworkX for further analysis. See the notebook on
+`how HDBSCAN works <http://nbviewer.jupyter.org/github/lmcinnes/hdbscan/blob/master/notebooks/How%20HDBSCAN%20Works.ipynb>`_ for examples and further details.
+
+The clusterer objects also have an attribute providing cluster membership
+strengths, resulting in optional soft clustering (and no further compute
+expense).
+
+---------------------
+Robust single linkage
+---------------------
+
+The hdbscan package also provides support for the *robust single linkage*
+clustering algorithm of Chaudhuri and Dasgupta. As with the HDBSCAN
+implementation this is a high performance version of the algorithm
+outperforming scipy's standard single linkage implementation. The
+robust single linkage hierarchy is available as an attribute of
+the robust single linkage clusterer, again with the ability to plot
+or export the hierarchy, and to extract flat clusterings at a given
notebooks/Benchmarking scalability of clustering implementations.ipynb: 15 additions & 6 deletions
@@ -745,7 +745,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Now we run that for each of our pre-existing datasets to extrapolate out predicted performance on the relevant dataset sizes. A little pandas wrangling later and we've produced a table of roughly how large a dataset you can tackle in each time frame with each implementation. I had to leave out the scipy KMeans timings because the noise in timing results caused the model to be unrealistic at larger data sizes. Note how the $O(n\\log n)$ algorithms utterly dominate here. In the meantime, for medium sizes data sets you can still get quite a lot done with HDBSCAN."
+"Now we run that for each of our pre-existing datasets to extrapolate out predicted performance on the relevant dataset sizes. A little pandas wrangling later and we've produced a table of roughly how large a dataset you can tackle in each time frame with each implementation."
 ]
 },
 {
@@ -888,6 +888,15 @@
 "datasize_table"
 ]
 },
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"I had to leave out the scipy KMeans timings because the noise in timing results caused the model to be unrealistic at larger data sizes. It is also worth keeping in mind that some of the results of the models for larger numbers are simply false -- you'll recall that Fastcluster and Scipy's single linkage both didn't scale at all well past 40000 points on my laptop, so I'm certainly not going to manage 50000 or 100000 over lunch. The same applies to DeBaCl and the slower Sklearn implementations as they also produce the full pairwise distance matrix during computations.\n",
+"\n",
+"The main thing to note is how the $O(n\\log n)$ algorithms utterly dominate here. In the meantime, for medium-sized data sets you can still get quite a lot done with HDBSCAN."
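The new notebook cell makes two points: implementations that build the full pairwise distance matrix stop scaling at a few tens of thousands of points, and $O(n\log n)$ algorithms dominate at larger sizes. The sketch below illustrates that arithmetic under stated assumptions. It is not the notebook's model-fitting code: it assumes a dense float64 matrix, pure $O(n^2)$ versus $O(n\log n)$ cost models, and a hypothetical reference timing of 10 seconds at 10,000 points that is not taken from the benchmark. The helper names (`pairwise_matrix_gib`, `extrapolate_seconds`) are invented for illustration.

```python
import math

BYTES_PER_FLOAT64 = 8  # assumption: distances are stored as 64-bit floats


def pairwise_matrix_gib(n_points):
    """Memory (GiB) for a dense n x n float64 distance matrix."""
    return n_points * n_points * BYTES_PER_FLOAT64 / 2**30


def quadratic(n):
    """Cost model for algorithms that touch every pair of points."""
    return float(n) * n


def n_log_n(n):
    """Cost model for O(n log n) algorithms."""
    return n * math.log(n)


def extrapolate_seconds(n_ref, t_ref, n_target, cost):
    """Scale a time t_ref observed at n_ref points to n_target points,
    assuming runtime is proportional to cost(n)."""
    return t_ref * cost(n_target) / cost(n_ref)


# Hypothetical reference timing (NOT a benchmark result): 10 s at 10,000 points.
N_REF, T_REF = 10_000, 10.0

for n in (40_000, 100_000):
    print(
        f"n={n:>7,}: matrix {pairwise_matrix_gib(n):6.1f} GiB, "
        f"O(n^2) ~{extrapolate_seconds(N_REF, T_REF, n, quadratic):7.1f} s, "
        f"O(n log n) ~{extrapolate_seconds(N_REF, T_REF, n, n_log_n):6.1f} s"
    )
```

Under these assumptions a dense matrix for 40,000 points already needs about 11.9 GiB and one for 100,000 points about 74.5 GiB, which is consistent with the note that the pairwise-matrix implementations cannot reach those sizes on a laptop regardless of what a fitted timing model predicts.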