Commit e360a58

More work on the README, and reference comparison notebook.
1 parent b3fd343 commit e360a58

2 files changed: 11 additions and 3 deletions.

README.rst

Lines changed: 10 additions & 2 deletions
@@ -8,6 +8,14 @@ the result to find a clustering that gives the best stability over epsilon.
 This allows HDBSCAN to find clusters of varying densities (unlike DBSCAN),
 and be more robust to parameter selection.

+In practice this means that HDBSCAN returns a good clustering straight
+away with little or no parameter tuning -- and the primary parameter,
+minimum cluster size, is intuitive and easy to select.
+
+HDBSCAN is ideal for exploratory data analysis; it's a fast and robust
+algorithm that you can trust to return meaningful clusters (if there
+are any).
+
 Based on the paper:
 R. Campello, D. Moulavi, and J. Sander, *Density-Based Clustering Based on
 Hierarchical Density Estimates*
@@ -23,7 +31,7 @@ How to use HDBSCAN
 The hdbscan package inherits from sklearn classes, and thus drops in neatly
 next to other sklearn clusterers with an identical calling API. Similarly it
 supports input in a variety of formats: an array (or pandas dataframe, or
-sparse matrix) of shape `(num_samples x num_features)`; an array (or sparse matrix)
+sparse matrix) of shape ``(num_samples x num_features)``; an array (or sparse matrix)
 giving a distance matrix between samples.

 .. code:: python
@@ -107,7 +115,7 @@ Fast install, presuming you have sklearn and all its requirements installed:
 pip install hdbscan

 If pip is having difficulties pulling the dependencies then we'd suggest installing
-the dependencies manually using anaconda followed by pulling hdscan from pip:
+the dependencies manually using anaconda followed by pulling hdbscan from pip:

 .. code:: bash

notebooks/Python vs Java.ipynb

Lines changed: 1 addition & 1 deletion
@@ -17,7 +17,7 @@
 "source": [
 "## Some quick background\n",
 "\n",
-"In 2013 Campello, Moulavi and Sander published a paper on a new clustering algorithm that they called HDBSCAN. In mid-2014 I was doing some general research on the current state of clustering, particularly with regard to exploratory data analysis. At the time DBSCAN or OPTICS appeared to be the most promising algorithm available. A colleague ran across the HDBSCAN paper in her literature survey, and suggested we look into how well it performed. We spent an afternoon learning the algorithm and coding it up and found that it gave remarkably good results for the range of test data we had. Things stayed in that state for some time, with the intention being to use a good HDBSCAN implementation when one became available. By early 2015 our needs for clustering grew and, having no good implementation of HDBSCAN to hand I set about writing our own. Since the first version, coded up in an afternoon, had been in python I stuck with that choice -- but obviously performance might be an issue. In July 2015, after our implementation was well underway Campello, Moulavi and Sander published a new HDBSCAN paper, and released Java code to peform HDBSCAN clustering. Since one of our goals had been to get good scaling it became necessary to see how our python version compared to the high performance reference implementation in Java. \n",
+"In 2013 Campello, Moulavi and Sander published a paper on a new clustering algorithm that they called HDBSCAN. In mid-2014 I was doing some general research on the current state of clustering, particularly with regard to exploratory data analysis. At the time DBSCAN or OPTICS appeared to be the most promising algorithm available. A colleague ran across the HDBSCAN paper in her literature survey, and suggested we look into how well it performed. We spent an afternoon learning the algorithm and coding it up and found that it gave remarkably good results for the range of test data we had. Things stayed in that state for some time, with the intention being to use a good HDBSCAN implementation when one became available. By early 2015 our needs for clustering grew and, having no good implementation of HDBSCAN to hand, I set about writing our own. Since the first version, coded up in an afternoon, had been in python I stuck with that choice -- but obviously performance might be an issue. In July 2015, after our implementation was well underway Campello, Moulavi and Sander published a new HDBSCAN paper, and released Java code to peform HDBSCAN clustering. Since one of our goals had been to get good scaling it became necessary to see how our python version compared to the high performance reference implementation in Java. \n",
 "\n",
 "This is the story of how our codebase evolved and was optimized, and how it compares with the Java version at different stages of that journey.\n",
 "\n",
