 Documentation, including tutorials, are available on ReadTheDocs at http://hdbscan.readthedocs.io/en/latest/ .

-Notebooks `comparing HDBSCAN to other clustering algorithms <http://nbviewer.jupyter.org/github/lmcinnes/hdbscan/blob/master/notebooks/Comparing%20Clustering%20Algorithms.ipynb>`_, explaining `how HDBSCAN works <http://nbviewer.jupyter.org/github/lmcinnes/hdbscan/blob/master/notebooks/How%20HDBSCAN%20Works.ipynb>`_ and `comparing performance with other python clustering implementations <http://nbviewer.jupyter.org/github/lmcinnes/hdbscan/blob/master/notebooks/Benchmarking%20scalability%20of%20clustering%20implementations-v0.7.ipynb>`_ are available.
+Notebooks `comparing HDBSCAN to other clustering algorithms <http://nbviewer.jupyter.org/github/scikit-learn-contrib/hdbscan/blob/master/notebooks/Comparing%20Clustering%20Algorithms.ipynb>`_, explaining `how HDBSCAN works <http://nbviewer.jupyter.org/github/scikit-learn-contrib/hdbscan/blob/master/notebooks/How%20HDBSCAN%20Works.ipynb>`_ and `comparing performance with other python clustering implementations <http://nbviewer.jupyter.org/github/scikit-learn-contrib/hdbscan/blob/master/notebooks/Benchmarking%20scalability%20of%20clustering%20implementations-v0.7.ipynb>`_ are available.

 ------------------
 How to use HDBSCAN
@@ -69,10 +69,10 @@ Performance
 -----------

 Significant effort has been put into making the hdbscan implementation as fast as
-possible. It is `orders of magnitude faster than the reference implementation <http://nbviewer.jupyter.org/github/lmcinnes/hdbscan/blob/master/notebooks/Python%20vs%20Java.ipynb>`_ in Java,
+possible. It is `orders of magnitude faster than the reference implementation <http://nbviewer.jupyter.org/github/scikit-learn-contrib/hdbscan/blob/master/notebooks/Python%20vs%20Java.ipynb>`_ in Java,
 and is currently faster than highly optimized single linkage implementations in C and C++.
-`version 0.7 performance can be seen in this notebook <http://nbviewer.jupyter.org/github/lmcinnes/hdbscan/blob/master/notebooks/Benchmarking%20scalability%20of%20clustering%20implementations-v0.7.ipynb>`_ .
-In particular `performance on low dimensional data is better than sklearn's DBSCAN <http://nbviewer.jupyter.org/github/lmcinnes/hdbscan/blob/master/notebooks/Benchmarking%20scalability%20of%20clustering%20implementations%202D%20v0.7.ipynb>`_ ,
+`version 0.7 performance can be seen in this notebook <http://nbviewer.jupyter.org/github/scikit-learn-contrib/hdbscan/blob/master/notebooks/Benchmarking%20scalability%20of%20clustering%20implementations-v0.7.ipynb>`_ .
+In particular `performance on low dimensional data is better than sklearn's DBSCAN <http://nbviewer.jupyter.org/github/scikit-learn-contrib/hdbscan/blob/master/notebooks/Benchmarking%20scalability%20of%20clustering%20implementations%202D%20v0.7.ipynb>`_ ,
 and via support for caching with joblib, re-clustering with different parameters
 can be almost free.
@@ -90,7 +90,7 @@ object has attributes for:

 All of which come equipped with methods for plotting and converting
 to Pandas or NetworkX for further analysis. See the notebook on
-`how HDBSCAN works <http://nbviewer.jupyter.org/github/lmcinnes/hdbscan/blob/master/notebooks/How%20HDBSCAN%20Works.ipynb>`_ for examples and further details.
+`how HDBSCAN works <http://nbviewer.jupyter.org/github/scikit-learn-contrib/hdbscan/blob/master/notebooks/How%20HDBSCAN%20Works.ipynb>`_ for examples and further details.

 The clusterer objects also have an attribute providing cluster membership
 strengths, resulting in optional soft clustering (and no further compute
@@ -173,7 +173,7 @@ For a manual install get this package:
notebooks/Benchmarking scalability of clustering implementations-v0.7.ipynb: 5 additions, 5 deletions
@@ -23,7 +23,7 @@
 " * Agglomerative clustering\n",
 "* [Fastcluster](http://danifold.net/fastcluster.html) (which provides very fast agglomerative clustering in C++)\n",
 "* [DeBaCl](https://github.com/CoAxLab/DeBaCl) (Density Based Clustering; similar to a mix of DBSCAN and Agglomerative)\n",
-"* [HDBSCAN](https://github.com/lmcinnes/hdbscan) (A robust hierarchical version of DBSCAN)\n",
+"* [HDBSCAN](https://github.com/scikit-learn-contrib/hdbscan) (A robust hierarchical version of DBSCAN)\n",
 "\n",
 "Obviously a major factor in performance will be the algorithm itself. Some algorithms are simply slower -- often, but not always, because they are doing more work to provide a better clustering."
 ]
@@ -568,7 +568,7 @@
 "source": [
 "If we're looking for scaling we can write off the scipy single linkage implementation -- if even we didn't hit the RAM limit the $O(n^2)$ scaling is going to quickly catch up with us. Fastcluster has the same asymptotic scaling, but is heavily optimized to being the constant down much lower -- at this point it is still keeping close to the faster algorithms. It's asymtotics will still catch up with it eventually however.\n",
 "\n",
-"In practice this is going to mean that for larger datasets you are going to be very constrained in what algorithms you can apply: if you get enough datapoints only K-Means, DBSCAN, and HDBSCAN will be left. This is somewhat disappointing, paritcularly as [K-Means is not a particularly good clustering algorithm](http://nbviewer.jupyter.org/github/lmcinnes/hdbscan/blob/master/notebooks/Comparing%20Clustering%20Algorithms.ipynb), paricularly for exploratory data analysis.\n",
+"In practice this is going to mean that for larger datasets you are going to be very constrained in what algorithms you can apply: if you get enough datapoints only K-Means, DBSCAN, and HDBSCAN will be left. This is somewhat disappointing, paritcularly as [K-Means is not a particularly good clustering algorithm](http://nbviewer.jupyter.org/github/scikit-learn-contrib/hdbscan/blob/master/notebooks/Comparing%20Clustering%20Algorithms.ipynb), paricularly for exploratory data analysis.\n",
 "\n",
 "With this in mind it is worth looking at how these last several implementations perform at much larger sizes, to see, for example, when fastscluster starts to have its asymptotic complexity start to pull it away."
 ]
@@ -863,7 +863,7 @@
 "source": [
 "## Conclusions\n",
 "\n",
-"Performance obviously depends on the algorithm chosen, but can also vary significantly upon the specific implementation (HDBSCAN is far better hierarchical density based clustering than DeBaCl, and sklearn has by far the best K-Means implementation). For anything beyond toy datasets, however, your algorithm options are greatly constrained. In my (obviously biased) opinion [HDBSCAN is the best algorithm for clustering](http://nbviewer.jupyter.org/github/lmcinnes/hdbscan/blob/master/notebooks/Comparing%20Clustering%20Algorithms.ipynb). If you need to cluster data beyond the scope that HDBSCAN can reasonably handle then the only algorithm options on the table are DBSCAN and K-Means; DBSCAN is the slower of the two, especially for very large data, but K-Means clustering can be remarkably poor -- it's a tough choice."
+"Performance obviously depends on the algorithm chosen, but can also vary significantly upon the specific implementation (HDBSCAN is far better hierarchical density based clustering than DeBaCl, and sklearn has by far the best K-Means implementation). For anything beyond toy datasets, however, your algorithm options are greatly constrained. In my (obviously biased) opinion [HDBSCAN is the best algorithm for clustering](http://nbviewer.jupyter.org/github/scikit-learn-contrib/hdbscan/blob/master/notebooks/Comparing%20Clustering%20Algorithms.ipynb). If you need to cluster data beyond the scope that HDBSCAN can reasonably handle then the only algorithm options on the table are DBSCAN and K-Means; DBSCAN is the slower of the two, especially for very large data, but K-Means clustering can be remarkably poor -- it's a tough choice."
notebooks/Comparing Clustering Algorithms.ipynb: 6 additions, 4 deletions
@@ -450,7 +450,7 @@
 "* **Stability**: HDBSCAN is stable over runs and subsampling (since the variable density clustering will still cluster sparser subsampled clusters with the same parameter choices), and has good stability over parameter choices.\n",
 "* **Performance**: When implemented well HDBSCAN can be very efficient. The current implementation has similar performance to `fastcluster`'s agglomerative clustering (and will use `fastcluster` if it is available), but we expect future implementations that take advantage of newer data structure such as cover trees to scale significantly better.\n",
 "\n",
-"How does HDBSCAN perform on our test dataset? Unfortunately HDBSCAN is not part of `sklearn`. Fortunately we can just import the [hdbscan library](https://github.com/lmcinnes/hdbscan) and use it as if it were part of `sklearn`."
+"How does HDBSCAN perform on our test dataset? Unfortunately HDBSCAN is not part of `sklearn`. Fortunately we can just import the [hdbscan library](https://github.com/scikit-learn-contrib/hdbscan) and use it as if it were part of `sklearn`."
notebooks/Python vs Java.ipynb: 6 additions, 4 deletions
@@ -21,7 +21,7 @@
 "\n",
 "This is the story of how our codebase evolved and was optimized, and how it compares with the Java version at different stages of that journey.\n",
 "\n",
-"To make the comparisons we'll need data on runtimes of both algorithms, ranging over dataset size, and dataset dimension. To save time and space I've done that work in [another notebook](http://nbviewer.jupyter.org/github/lmcinnes/hdbscan/blob/master/notebooks/Performance%20data%20generation%20.ipynb) and will just load the data in here."
+"To make the comparisons we'll need data on runtimes of both algorithms, ranging over dataset size, and dataset dimension. To save time and space I've done that work in [another notebook](http://nbviewer.jupyter.org/github/scikit-learn-contrib/hdbscan/blob/master/notebooks/Performance%20data%20generation%20.ipynb) and will just load the data in here."