
Commit 1abf7c8
Merge remote-tracking branch 'origin/master'
2 parents: 4b52a3a + 9f7770e

File tree: 7 files changed (+143 additions, -1076 deletions)

README.rst

Lines changed: 61 additions & 5 deletions
@@ -14,8 +14,7 @@ Based on the paper:
     In: Advances in Knowledge Discovery and Data Mining, Springer, pp 160-172.
     2013
 
-Notebooks `comparing HDBSCAN to other clustering algorithms <http://nbviewer.jupyter.org/github/lmcinnes/hdbscan/blob/master/notebooks/Comparing%20Clustering%20Algorithms.ipynb>`_,
-and explaining `how HDBSCAN works <http://nbviewer.jupyter.org/github/lmcinnes/hdbscan/blob/master/notebooks/How%20HDBSCAN%20Works.ipynb>`_ are available.
+Notebooks `comparing HDBSCAN to other clustering algorithms <http://nbviewer.jupyter.org/github/lmcinnes/hdbscan/blob/master/notebooks/Comparing%20Clustering%20Algorithms.ipynb>`_, explaining `how HDBSCAN works <http://nbviewer.jupyter.org/github/lmcinnes/hdbscan/blob/master/notebooks/How%20HDBSCAN%20Works.ipynb>`_ and `comparing performance with other python clustering implementations <http://nbviewer.jupyter.org/github/lmcinnes/hdbscan/blob/master/notebooks/Benchmarking%20scalability%20of%20clustering%20implementations.ipynb>`_ are available.
 
 ------------------
 How to use HDBSCAN
@@ -34,9 +33,66 @@ giving a distance matrix between samples.
     clusterer = hdbscan.HDBSCAN(min_cluster_size=10)
     cluster_labels = clusterer.fit_predict(data)
 
-Note that clustering larger datasets will require significant memory
-(as with any algorithm that needs all pairwise distances). Support for
-low memory/better scaling is planned but not yet implemented.
+-----------
+Performance
+-----------
+
+Significant effort has been put into making the hdbscan implementation as fast as
+possible. It is more than twice as fast as the reference implementation in Java
+and is competitive with highly optimized single linkage implementations in C and C++.
+`Current performance can be seen in this notebook <http://nbviewer.jupyter.org/github/lmcinnes/hdbscan/blob/master/notebooks/Benchmarking%20scalability%20of%20clustering%20implementations.ipynb>`_,
+and further performance improvements should be forthcoming in the next few releases.
+
+------------------------
+Additional functionality
+------------------------
+
+The hdbscan package comes equipped with visualization tools to help you
+understand your clustering results. After fitting data, the clusterer
+object has attributes for:
+
+* The condensed cluster hierarchy
+* The robust single linkage cluster hierarchy
+* The reachability distance minimal spanning tree
+
+All of these come equipped with methods for plotting and for converting
+to Pandas or NetworkX for further analysis. See the notebook on
+`how HDBSCAN works <http://nbviewer.jupyter.org/github/lmcinnes/hdbscan/blob/master/notebooks/How%20HDBSCAN%20Works.ipynb>`_ for examples and further details.
+
+The clusterer objects also have an attribute providing cluster membership
+strengths, enabling optional soft clustering (at no further computational
+expense).
+
+---------------------
+Robust single linkage
+---------------------
+
+The hdbscan package also provides support for the *robust single linkage*
+clustering algorithm of Chaudhuri and Dasgupta. As with the HDBSCAN
+implementation, this is a high performance version of the algorithm,
+outperforming scipy's standard single linkage implementation. The
+robust single linkage hierarchy is available as an attribute of
+the robust single linkage clusterer, again with the ability to plot
+or export the hierarchy, and to extract flat clusterings at a given
+cut level and gamma value.
+
+Example usage:
+
+.. code:: python
+
+    import hdbscan
+
+    clusterer = hdbscan.RobustSingleLinkage(cut=0.125, k=7)
+    cluster_labels = clusterer.fit_predict(data)
+    hierarchy = clusterer.cluster_hierarchy_
+    alt_labels = hierarchy.get_clusters(0.100, 5)
+    hierarchy.plot()
+
+Based on the paper:
+    K. Chaudhuri and S. Dasgupta.
+    *"Rates of convergence for the cluster tree."*
+    In Advances in Neural Information Processing Systems, 2010.
 
 ----------
 Installing
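The README's list of attributes (condensed tree, single linkage hierarchy, minimal spanning tree) reflects how those objects relate: sorting the minimal spanning tree's edges by weight yields exactly the single linkage merge order. A minimal pure-Python sketch of that relationship, using plain Euclidean distances rather than hdbscan's mutual reachability distances (`mst_edges` and the sample points are illustrative, not part of the library's API):

```python
import math

def mst_edges(points):
    # Prim's algorithm over Euclidean distances; returns (i, j, weight) edges.
    n = len(points)
    in_tree = {0}
    edges = []
    # best[v] = (distance to closest tree vertex, that vertex's index)
    best = {v: (math.dist(points[0], points[v]), 0) for v in range(1, n)}
    while len(in_tree) < n:
        j = min(best, key=lambda v: best[v][0])
        w, i = best.pop(j)
        in_tree.add(j)
        edges.append((i, j, w))
        for v in best:
            d = math.dist(points[j], points[v])
            if d < best[v][0]:
                best[v] = (d, j)
    # Sorting the MST edges by weight gives the single-linkage merge order.
    return sorted(edges, key=lambda e: e[2])
```

For instance, with three collinear points and one outlier, the two short edges merge first and the outlier joins last, which is exactly the dendrogram single linkage would build.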

hdbscan/hdbscan_.py

Lines changed: 1 addition & 11 deletions
@@ -142,17 +142,7 @@ def _hdbscan_large_kdtree(X, min_cluster_size=5, min_samples=None, alpha=1.0,
     mutual_reachability_ = kdtree_pdist_mutual_reachability(X, metric, p, min_samples, alpha)
 
     min_spanning_tree = mst_linkage_core_pdist(mutual_reachability_)
-
-    if gen_min_span_tree:
-        result_min_span_tree = min_spanning_tree.copy()
-        for index, row in enumerate(result_min_span_tree[1:], 1):
-            candidates = np.where(np.isclose(mutual_reachability_[row[1]], row[2]))[0]
-            candidates = np.intersect1d(candidates, min_spanning_tree[:index, :2].astype(int))
-            candidates = candidates[candidates != row[1]]
-            assert(len(candidates) > 0)
-            row[0] = candidates[0]
-    else:
-        result_min_span_tree = None
+    result_min_span_tree = None
 
     min_spanning_tree = min_spanning_tree[np.argsort(min_spanning_tree.T[2]), :]
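The removed block tried to recover MST edge endpoints by searching the mutual reachability matrix. For readers unfamiliar with the term, the mutual reachability distance from the HDBSCAN paper is the maximum of the two points' core distances (distance to the k-th nearest neighbour) and their direct distance. A small pure-Python sketch of that definition; `core_distance` and `mutual_reachability` are illustrative names, not the package's internal API:

```python
import math

def core_distance(points, i, k):
    # Distance from points[i] to its k-th nearest neighbour (excluding itself).
    dists = sorted(math.dist(points[i], q)
                   for j, q in enumerate(points) if j != i)
    return dists[k - 1]

def mutual_reachability(points, i, j, k):
    # Max of the two core distances and the direct distance, so pairs in
    # sparse regions are pushed at least core-distance apart.
    return max(core_distance(points, i, k),
               core_distance(points, j, k),
               math.dist(points[i], points[j]))
```

Building the minimum spanning tree over these distances (rather than raw distances) is what makes the resulting hierarchy robust to noise points.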

hdbscan/robust_single_linkage_.py

Lines changed: 2 additions & 1 deletion
@@ -291,8 +291,9 @@ class RobustSingleLinkage(BaseEstimator, ClusterMixin):
 
     """
 
-    def __init__(self, k=5, alpha=1.4142135623730951, gamma=5, metric='euclidean', p=None):
+    def __init__(self, cut=0.25, k=5, alpha=1.4142135623730951, gamma=5, metric='euclidean', p=None):
 
+        self.cut = cut
         self.k = k
         self.alpha = alpha
         self.gamma = gamma
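The new `cut` parameter (default 0.25) sets the level at which a flat clustering is extracted from the hierarchy. Conceptually, cutting a single linkage dendrogram at distance `cut` is equivalent to taking connected components of the graph linking points closer than `cut`. A minimal union-find sketch under that interpretation (hypothetical helper, not the library's implementation, and ignoring robust single linkage's k and alpha adjustments):

```python
import math

def flat_clusters(points, cut):
    # Single-linkage flat clustering at level `cut`: connected components
    # of the graph joining points whose distance is at most `cut`.
    parent = list(range(len(points)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= cut:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[ri] = rj
    return [find(i) for i in range(len(points))]
```

Two tight groups separated by a gap wider than `cut` come back with two distinct labels, mirroring what extracting a flat clustering at that cut level does.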

notebooks/Benchmarking scalability of clustering implementations.ipynb

Lines changed: 15 additions & 6 deletions
@@ -745,7 +745,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Now we run that for each of our pre-existing datasets to extrapolate out predicted performance on the relevant dataset sizes. A little pandas wrangling later and we've produced a table of roughly how large a dataset you can tackle in each time frame with each implementation. I had to leave out the scipy KMeans timings because the noise in timing results caused the model to be unrealistic at larger data sizes. Note how the $O(n\\log n)$ algorithms utterly dominate here. In the meantime, for medium sizes data sets you can still get quite a lot done with HDBSCAN."
+    "Now we run that for each of our pre-existing datasets to extrapolate out predicted performance on the relevant dataset sizes. A little pandas wrangling later and we've produced a table of roughly how large a dataset you can tackle in each time frame with each implementation."
    ]
   },
   {
@@ -888,6 +888,15 @@
     "datasize_table"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "I had to leave out the scipy KMeans timings because the noise in the timing results caused the model to be unrealistic at larger data sizes. It is also worth keeping in mind that some of the results of the models for larger numbers are simply false -- you'll recall that Fastcluster and Scipy's single linkage both didn't scale at all well past 40000 points on my laptop, so I'm certainly not going to manage 50000 or 100000 over lunch. The same applies to DeBaCl and the slower Sklearn implementations, as they also produce the full pairwise distance matrix during computations.\n",
+    "\n",
+    "The main thing to note is how the $O(n\\log n)$ algorithms utterly dominate here. In the meantime, for medium sized data sets you can still get quite a lot done with HDBSCAN."
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -911,21 +920,21 @@
   ],
  "metadata": {
   "kernelspec": {
-   "display_name": "Python 2",
+   "display_name": "Python 3",
    "language": "python",
-   "name": "python2"
+   "name": "python3"
   },
   "language_info": {
    "codemirror_mode": {
     "name": "ipython",
-    "version": 2
+    "version": 3
    },
    "file_extension": ".py",
    "mimetype": "text/x-python",
    "name": "python",
    "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython2",
-   "version": "2.7.10"
+   "pygments_lexer": "ipython3",
+   "version": "3.4.3"
   }
  },
 "nbformat": 4,

notebooks/Comparing Clustering Algorithms.ipynb

Lines changed: 63 additions & 14 deletions
Large diffs are not rendered by default.

notebooks/Digits clustering comparisons.ipynb

Lines changed: 0 additions & 1038 deletions
This file was deleted.

setup.py

Lines changed: 1 addition & 1 deletion
@@ -28,7 +28,7 @@ def readme():
 
 configuration = {
     'name' : 'hdbscan',
-    'version' : '0.2',
+    'version' : '0.3.1',
     'description' : 'Clustering based on density with variable density clusters',
     'long_description' : readme(),
     'classifiers' : [
