
Commit 2212af5

Merge remote-tracking branch 'origin/master'

# Conflicts:
#	setup.py

2 parents 9ed91e1 + 528238e

File tree: 9 files changed, +135 −56 lines

README.rst

Lines changed: 19 additions & 1 deletion
@@ -218,7 +218,25 @@ Install the package
 .. code:: bash

     python setup.py install
-
+
+-----------------
+Running the Tests
+-----------------
+
+The package tests can be run after installation using the command:
+
+.. code:: bash
+
+    nosetests -s hdbscan
+
+or, if ``nose`` is installed but ``nosetests`` is not in your ``PATH`` variable:
+
+.. code:: bash
+
+    python -m nose -s hdbscan
+
+If one or more of the tests fail, please report a bug at https://github.com/scikit-learn-contrib/hdbscan/issues/new
+
 --------------
 Python Version
 --------------

docs/advanced_hdbscan.rst

Lines changed: 20 additions & 21 deletions
@@ -3,14 +3,13 @@ Getting More Information About a Clustering
 ===========================================

 Once you have the basics of clustering sorted you may want to dig a
-little deeper than just the cluster labels returned to you. Fortunately
-the hdbscan library provides you with the facilities to do this. During
+little deeper than just the cluster labels returned to you. Fortunately, the hdbscan library provides you with the facilities to do this. During
 processing HDBSCAN\* builds a hierarchy of potential clusters, from
 which it extracts the flat clustering returned. It can be informative to
 look at that hierarchy, and potentially make use of the extra
 information contained therein.

-Suppose we have a dataset for clustering. It is a binary file in nunpy format and it can be found at https://github.com/lmcinnes/hdbscan/blob/master/notebooks/clusterable_data.npy.
+Suppose we have a dataset for clustering. It is a binary file in NumPy format and it can be found at https://github.com/lmcinnes/hdbscan/blob/master/notebooks/clusterable_data.npy.

 .. code:: python

@@ -116,7 +115,7 @@ each branch representing the number of points in the cluster at that
 level. If we wish to know which branches were selected by the HDBSCAN\*
 algorithm we can pass ``select_clusters=True``. You can even pass a
 selection palette to color the selections according to the cluster
-labelling.
+labeling.

 .. code:: python

@@ -127,8 +126,8 @@ labelling.
 .. image:: images/advanced_hdbscan_11_1.png


-From this we can see, for example, that the yellow cluster, at the
-center of the plot, forms early (breaking off from the pale blue and
+From this, we can see, for example, that the yellow cluster at the
+center of the plot forms early (breaking off from the pale blue and
 purple clusters) and persists for a long time. By comparison the green
 cluster, which also forms early, quickly breaks apart and then
 vanishes altogether (shattering into clusters all smaller than the
@@ -141,7 +140,7 @@ for example, in the dark blue cluster.

 If this was a simple visual analysis of the condensed tree can tell you
 a lot more about the structure of your data. This is not all we can do
-with condensed trees however. For larger and more complex datasets the
+with condensed trees, however. For larger and more complex datasets the
 tree itself may be very complex, and it may be desirable to run more
 interesting analytics over the tree itself. This can be achieved via
 several converter methods: :py:meth:`~hdbscan.plots.CondensedTree.to_networkx`, :py:meth:`~hdbscan.plots.CondensedTree.to_pandas`, and
@@ -162,9 +161,9 @@ First we'll consider :py:meth:`~hdbscan.plots.CondensedTree.to_networkx`



-As you can see we get a networkx directed graph, which we can then use
-all the regular networkx tools and analytics on. The graph is richer
-than the visual plot above may lead you to believe however:
+As you can see we get a NetworkX directed graph, which we can then use
+all the regular NetworkX tools and analytics on. The graph is richer
+than the visual plot above may lead you to believe, however:

 .. code:: python

@@ -182,12 +181,12 @@ than the visual plot above may lead you to believe however:

 The graph actually contains nodes for all the points falling out of
 clusters as well as the clusters themselves. Each node has an associated
-``size`` attribute, and each edge has a ``weight`` of the lambda value
+``size`` attribute and each edge has a ``weight`` of the lambda value
 at which that edge forms. This allows for much more interesting
 analyses.

-Next we have the :py:meth:`~hdbscan.plots.CondensedTree.to_pandas` method, which returns a panda dataframe
-where each row corresponds to an edge of the networkx graph:
+Next, we have the :py:meth:`~hdbscan.plots.CondensedTree.to_pandas` method, which returns a panda DataFrame
+where each row corresponds to an edge of the NetworkX graph:

 .. code:: python

@@ -258,11 +257,11 @@ the id of the child cluster (or, if the child is a single data point
 rather than a cluster, the index in the dataset of that point), the
 ``lambda_val`` provides the lambda value at which the edge forms, and
 the ``child_size`` provides the number of points in the child cluster.
-As you can see the start of the dataframe has singleton points falling
+As you can see the start of the DataFrame has singleton points falling
 out of the root cluster, with each ``child_size`` equal to 1.

 If you want just the clusters, rather than all the individual points
-as well, simply select the rows of the dataframe with ``child_size``
+as well, simply select the rows of the DataFrame with ``child_size``
 greater than 1.

 .. code:: python
@@ -293,13 +292,13 @@ array:



-This is equivalent to the pandas dataframe, but is in pure numpy and
+This is equivalent to the pandas DataFrame but is in pure NumPy and
 hence has no pandas dependencies if you do not wish to use pandas.

 Single Linkage Trees
 --------------------

-We have still more data at our disposal however. As noted in the How
+We have still more data at our disposal, however. As noted in the How
 HDBSCAN Works section, prior to providing a condensed tree the algorithm
 builds a complete dendrogram. We have access to this too via the
 :py:attr:`~hdbscan.HDBSCAN.single_linkage_tree_` attribute of the clusterer.
@@ -333,13 +332,13 @@ As you can see we gain a lot from condensing the tree in terms of better
 presenting and summarising the data. There is a lot less to be gained
 from visual inspection of a plot like this (and it only gets worse for
 larger datasets). The plot function support most of the same
-fucntionality as the dendrogram plotting from
+functionality as the dendrogram plotting from
 ``scipy.cluster.hierarchy``, so you can view various truncations of the
 tree if necessary. In practice, however, you are more likely to be
 interested in access the raw data for further analysis. Again we have
 :py:meth:`~hdbscan.plots.SingleLinkageTree.to_networkx`, :py:meth:`~hdbscan.plots.SingleLinkageTree.to_pandas` and :py:meth:`~hdbscan.plots.SingleLinkageTree.to_numpy`. This time the
-:py:meth:`~hdbscan.plots.SingleLinkageTree.to_networkx` provides a direct networkx version of what you see
-above. The numpy and pandas results conform to the single linkage
+:py:meth:`~hdbscan.plots.SingleLinkageTree.to_networkx` provides a direct NetworkX version of what you see
+above. The NumPy and pandas results conform to the single linkage
 hierarchy format of ``scipy.cluster.hierarchy``, and can be passed to
 routines there if necessary.

@@ -360,6 +359,6 @@ noise points (any cluster smaller than the ``minimum_cluster_size``).
 array([ 0, -1, 0, ..., -1, -1, 0])


-In this way it is possible to extract the DBSCAN clustering that would result
+In this way, it is possible to extract the DBSCAN clustering that would result
 for any given epsilon value, all from one run of hdbscan.
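For orientation, here is a hedged usage sketch (not part of this commit) pulling together the converters the edited docs describe; the local file path, ``min_cluster_size``, and the epsilon value are assumptions:

.. code:: python

    import numpy as np
    import hdbscan

    # Assumes a local copy of the clusterable_data.npy file linked above.
    data = np.load('clusterable_data.npy')
    clusterer = hdbscan.HDBSCAN(min_cluster_size=15).fit(data)

    # Condensed tree as a pandas DataFrame; keep only rows describing
    # clusters (child_size > 1), dropping singleton points.
    tree_df = clusterer.condensed_tree_.to_pandas()
    clusters_only = tree_df[tree_df['child_size'] > 1]

    # The same hierarchy as a NetworkX directed graph.
    graph = clusterer.condensed_tree_.to_networkx()

    # DBSCAN-style flat clustering at a chosen epsilon, cut from the
    # single linkage dendrogram.
    labels = clusterer.single_linkage_tree_.get_clusters(0.023, min_cluster_size=5)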

docs/basic_hdbscan.rst

Lines changed: 5 additions & 1 deletion
@@ -280,7 +280,11 @@ distance, etc. Again, this is all fine as ``hdbscan`` supports a special
 metric called ``precomputed``. If you create the clusterer with the
 metric set to ``precomputed`` then the clusterer will assume that,
 rather than being handed a vector of points in a vector space, it is
-recieving an all pairs distance matrix.
+recieving an all pairs distance matrix. Missing distances can be
+indicated by ``numpy.inf``, which leads HDBSCAN to ignore these pairwise
+relationships as long as there exists a path between two points that
+contains defined distances (i.e. if there are too many distances
+missing, the clustering is going to fail).

 .. code:: python
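As a hedged sketch of the behavior this hunk documents (the random data and all parameter values are assumptions, not part of the commit):

.. code:: python

    import numpy as np
    import hdbscan
    from sklearn.metrics import pairwise_distances

    data = np.random.rand(100, 3)
    distance_matrix = pairwise_distances(data)

    # Mark one pairwise distance as unknown; HDBSCAN ignores it because
    # the remaining finite distances still connect the two points.
    distance_matrix[0, 1] = np.inf
    distance_matrix[1, 0] = np.inf

    clusterer = hdbscan.HDBSCAN(metric='precomputed', min_cluster_size=10)
    labels = clusterer.fit_predict(distance_matrix)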

docs/faq.rst

Lines changed: 11 additions & 0 deletions
@@ -47,6 +47,17 @@ Despite the generate model having clearly different "clusters", without more
 data we simply cannot differentiate between these models, and hence no
 density based clustering will manage cluster these according to the model.

+Q: I am not getting the claimed performance. Why not?
+-----------------------------------------------------
+
+The most likely explanation is to do with the dimensionality of your input data.
+While HDBSCAN can perform well on low to medium dimensional data the performance
+tends to decrease significantly as dimension increases. In general HDBSCAN can do
+well on up to around 50 or 100 dimensional data, but performance can see
+significant decreases beyond that. Of course a lot is also dataset dependent, so
+you can still get good performance even on high dimensional data, but it
+is no longer guaranteed.
+
 Q: I want to predict the cluster of a new unseen point. How do I do this?
 -------------------------------------------------------------------------

docs/soft_clustering_explanation.rst

Lines changed: 2 additions & 2 deletions
@@ -20,7 +20,7 @@ first cluster" as a point solidly in the center that is very distant
 from the second cluster. Equally, if the clustering algorithm supports
 noise assignments, then points are simply assigned as "noise". We are
 left with no idea as to which, if any cluster, they might have just
-missed the cur on being in.
+missed the cut on being in.

 The remedy for this is 'soft clustering' or 'fuzzy clustering'. In this
 approach points are not assigned cluster labels, but are instead assigned
@@ -116,7 +116,7 @@ than ideal. The second way of looking at things is to consider how much
 of an outlier the point is relative to each cluster -- using something
 akin to the outlier scores from GLOSH. The advantage of this approach is
 that it handles odd shaped clusters (even toroidal clusters) far better
-since it will explciitly follow the manifolds of the clusters. The down
+since it will explicitly follow the manifolds of the clusters. The down
 side of the outlier approach is that many points will all be equally
 "outlying", particularly noise points. Our goal is to fuse these two
 ideas.
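For context on the soft clustering being described, a hedged sketch of how per-point membership vectors are typically obtained with hdbscan (the toy data and parameters are assumptions):

.. code:: python

    import hdbscan
    from sklearn.datasets import make_blobs

    data, _ = make_blobs(n_samples=200, centers=3, random_state=42)
    clusterer = hdbscan.HDBSCAN(min_cluster_size=15,
                                prediction_data=True).fit(data)

    # One row per point; each column is that point's membership strength
    # for one cluster, rather than a single hard label.
    memberships = hdbscan.all_points_membership_vectors(clusterer)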

hdbscan/_hdbscan_tree.pyx

Lines changed: 6 additions & 0 deletions
@@ -604,6 +604,8 @@ cpdef list recurse_leaf_dfs(np.ndarray cluster_tree, np.intp_t current_node):


 cpdef list get_cluster_tree_leaves(np.ndarray cluster_tree):
+    if cluster_tree.shape[0] == 0:
+        return []
     root = cluster_tree['parent'].min()
     return recurse_leaf_dfs(cluster_tree, root)

@@ -689,6 +691,10 @@ cpdef tuple get_clusters(np.ndarray tree, dict stability,
                     is_cluster[sub_node] = False
     elif cluster_selection_method == 'leaf':
         leaves = set(get_cluster_tree_leaves(cluster_tree))
+        if len(leaves) == 0:
+            for c in is_cluster:
+                is_cluster[c] = False
+            is_cluster[tree['parent'].min()] = True
         for c in is_cluster:
             if c in leaves:
                 is_cluster[c] = True
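For context (an assumed usage sketch, not part of the diff): the ``leaf`` branch above is exercised when a clusterer is built with ``cluster_selection_method='leaf'``; the guard added here appears to make an empty cluster tree fall back to a single root cluster rather than calling ``min()`` on an empty array. The toy data and parameters are assumptions:

.. code:: python

    import numpy as np
    import hdbscan

    data = np.random.rand(50, 2)
    clusterer = hdbscan.HDBSCAN(min_cluster_size=5,
                                cluster_selection_method='leaf')
    labels = clusterer.fit_predict(data)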

hdbscan/hdbscan_.py

Lines changed: 37 additions & 2 deletions
@@ -71,6 +71,13 @@ def _hdbscan_generic(X, min_samples=5, alpha=1.0, metric='minkowski', p=2,
         distance_matrix = pairwise_distances(X, metric=metric, p=p)
     elif metric == 'arccos':
         distance_matrix = pairwise_distances(X, metric='cosine', **kwargs)
+    elif metric == 'precomputed':
+        # Treating this case explicitly, instead of letting
+        # sklearn.metrics.pairwise_distances handle it,
+        # enables the usage of numpy.inf in the distance
+        # matrix to indicate missing distance information.
+        # TODO: Check if copying is necessary
+        distance_matrix = X.copy()
     else:
         distance_matrix = pairwise_distances(X, metric=metric, **kwargs)

@@ -86,6 +93,13 @@ def _hdbscan_generic(X, min_samples=5, alpha=1.0, metric='minkowski', p=2,

     min_spanning_tree = mst_linkage_core(mutual_reachability_)

+    # Warn if the MST couldn't be constructed around the missing distances
+    if np.isinf(min_spanning_tree.T[2]).any():
+        warn('The minimum spanning tree contains edge weights with value '
+             'infinity. Potentially, you are missing too many distances '
+             'in the initial distance matrix for the given neighborhood '
+             'size.', UserWarning)
+
     # mst_linkage_core does not generate a full minimal spanning tree
     # If a tree is required then we must build the edges from the information
     # returned by mst_linkage_core (i.e. just the order of points to be merged)
@@ -282,6 +296,14 @@ def _hdbscan_boruvka_balltree(X, min_samples=5, alpha=1.0,
     return single_linkage_tree, None


+def check_precomputed_distance_matrix(X):
+    """Perform check_array(X) after removing infinite values (numpy.inf) from the given distance matrix.
+    """
+    tmp = X.copy()
+    tmp[np.isinf(tmp)] = 1
+    check_array(tmp)
+
+
 def hdbscan(X, min_cluster_size=5, min_samples=None, alpha=1.0,
             metric='minkowski', p=2, leaf_size=40,
             algorithm='best', memory=Memory(cachedir=None, verbose=0),
@@ -464,7 +486,13 @@ def hdbscan(X, min_cluster_size=5, min_samples=None, alpha=1.0,
                          'Should be one of: "eom", "leaf"\n')

     # Checks input and converts to an nd-array where possible
-    X = check_array(X, accept_sparse='csr')
+    if metric != 'precomputed' or issparse(X):
+        X = check_array(X, accept_sparse='csr')
+    else:
+        # Only non-sparse, precomputed distance matrices are handled here
+        # and thereby allowed to contain numpy.inf for missing distances
+        check_precomputed_distance_matrix(X)
+
     # Python 2 and 3 compliant string_type checking
     if isinstance(memory, six.string_types):
         memory = Memory(cachedir=memory, verbose=0)
@@ -798,9 +826,16 @@ def fit(self, X, y=None):
         self : object
             Returns self
         """
-        X = check_array(X, accept_sparse='csr')
         if self.metric != 'precomputed':
+            X = check_array(X, accept_sparse='csr')
             self._raw_data = X
+        elif issparse(X):
+            # Handle sparse precomputed distance matrices separately
+            X = check_array(X, accept_sparse='csr')
+        else:
+            # Only non-sparse, precomputed distance matrices are allowed
+            # to have numpy.inf values indicating missing distances
+            check_precomputed_distance_matrix(X)

         kwargs = self.get_params()
         # prediction data only applies to the persistent model, so remove
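A hedged sketch of the warning path added above (not part of the diff): when too many ``numpy.inf`` entries leave the finite-distance graph disconnected, the minimum spanning tree keeps an infinite edge and the new ``UserWarning`` fires. The toy data and parameters are assumptions:

.. code:: python

    import warnings
    import numpy as np
    import hdbscan
    from sklearn.metrics import pairwise_distances

    # Toy distance matrix in which every distance to point 0 is unknown,
    # so no finite path reaches that point.
    data = np.random.rand(10, 2)
    D = pairwise_distances(data)
    D[0, 1:] = np.inf
    D[1:, 0] = np.inf

    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter('always')
        hdbscan.HDBSCAN(metric='precomputed', min_cluster_size=3).fit(D)

    print([str(w.message) for w in caught])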
