More work on the README, and reference comparison notebook.

lmcinnes · lmcinnes · commit e360a583481e · 2015-11-28T16:44:34.000-05:00
diff --git a/README.rst b/README.rst
@@ -8,6 +8,14 @@ the result to find a clustering that gives the best stability over epsilon.
 This allows HDBSCAN to find clusters of varying densities (unlike DBSCAN),
 and be more robust to parameter selection.
 
+In practice this means that HDBSCAN returns a good clustering straight
+away with little or no parameter tuning -- and the primary parameter,
+minimum cluster size, is intuitive and easy to select.
+
+HDBSCAN is ideal for exploratory data analysis; it's a fast and robust
+algorithm that you can trust to return meaningful clusters (if there
+are any).
+
 Based on the paper:
     R. Campello, D. Moulavi, and J. Sander, *Density-Based Clustering Based on
     Hierarchical Density Estimates*
@@ -23,7 +31,7 @@ How to use HDBSCAN
 The hdbscan package inherits from sklearn classes, and thus drops in neatly
 next to other sklearn clusterers with an identical calling API. Similarly it
 supports input in a variety of formats: an array (or pandas dataframe, or
-sparse matrix) of shape `(num_samples x num_features)`; an array (or sparse matrix)
+sparse matrix) of shape ``(num_samples x num_features)``; an array (or sparse matrix)
 giving a distance matrix between samples.
 
 .. code:: python
@@ -107,7 +115,7 @@ Fast install, presuming you have sklearn and all its requirements installed:
     pip install hdbscan
 
 If pip is having difficulties pulling the dependencies then we'd suggest installing
-the dependencies manually using anaconda followed by pulling hdscan from pip:
+the dependencies manually using anaconda followed by pulling hdbscan from pip:
 
 .. code:: bash
 
diff --git a/notebooks/Python vs Java.ipynb b/notebooks/Python vs Java.ipynb
@@ -17,7 +17,7 @@
    "source": [
     "## Some quick background\n",
     "\n",
-    "In 2013 Campello, Moulavi and Sander published a paper on a new clustering algorithm that they called HDBSCAN. In mid-2014 I was doing some general research on the current state of clustering, particularly with regard to exploratory data analysis. At the time DBSCAN or OPTICS appeared to be the most promising algorithm available. A colleague ran across the HDBSCAN paper in her literature survey, and suggested we look into how well it performed. We spent an afternoon learning the algorithm and coding it up and found that it gave remarkably good results for the range of test data we had. Things stayed in that state for some time, with the intention being to use a good HDBSCAN implementation when one became available. By early 2015 our needs for clustering grew and, having no good implementation of HDBSCAN to hand I set about writing our own. Since the first version, coded up in an afternoon, had been in python I stuck with that choice -- but obviously performance might be an issue. In July 2015, after our implementation was well underway Campello, Moulavi and Sander published a new HDBSCAN paper, and released Java code to peform HDBSCAN clustering. Since one of our goals had been to get good scaling it became necessary to see how our python version compared to the high performance reference implementation in Java. \n",
+    "In 2013 Campello, Moulavi and Sander published a paper on a new clustering algorithm that they called HDBSCAN. In mid-2014 I was doing some general research on the current state of clustering, particularly with regard to exploratory data analysis. At the time DBSCAN or OPTICS appeared to be the most promising algorithm available. A colleague ran across the HDBSCAN paper in her literature survey, and suggested we look into how well it performed. We spent an afternoon learning the algorithm and coding it up and found that it gave remarkably good results for the range of test data we had. Things stayed in that state for some time, with the intention being to use a good HDBSCAN implementation when one became available. By early 2015 our needs for clustering grew and, having no good implementation of HDBSCAN to hand, I set about writing our own. Since the first version, coded up in an afternoon, had been in python I stuck with that choice -- but obviously performance might be an issue. In July 2015, after our implementation was well underway, Campello, Moulavi and Sander published a new HDBSCAN paper, and released Java code to perform HDBSCAN clustering. Since one of our goals had been to get good scaling it became necessary to see how our python version compared to the high performance reference implementation in Java. \n",
     "\n",
     "This is the story of how our codebase evolved and was optimized, and how it compares with the Java version at different stages of that journey.\n",
     "\n",