
Commit 2212af5

Merge remote-tracking branch 'origin/master'

# Conflicts:
#	setup.py

2 parents 9ed91e1 + 528238e

File tree: 9 files changed, +135 −56 lines

README.rst

Lines changed: 19 additions & 1 deletion
@@ -218,7 +218,25 @@ Install the package
 .. code:: bash

     python setup.py install
-
+
+-----------------
+Running the Tests
+-----------------
+
+The package tests can be run after installation using the command:
+
+.. code:: bash
+
+    nosetests -s hdbscan
+
+or, if ``nose`` is installed but ``nosetests`` is not in your ``PATH`` variable:
+
+.. code:: bash
+
+    python -m nose -s hdbscan
+
+If one or more of the tests fail, please report a bug at https://github.com/scikit-learn-contrib/hdbscan/issues/new
+
 --------------
 Python Version
 --------------

docs/advanced_hdbscan.rst

Lines changed: 20 additions & 21 deletions
@@ -3,14 +3,13 @@ Getting More Information About a Clustering
 ===========================================

 Once you have the basics of clustering sorted you may want to dig a
-little deeper than just the cluster labels returned to you. Fortunately
-the hdbscan library provides you with the facilities to do this. During
+little deeper than just the cluster labels returned to you. Fortunately, the hdbscan library provides you with the facilities to do this. During
 processing HDBSCAN\* builds a hierarchy of potential clusters, from
 which it extracts the flat clustering returned. It can be informative to
 look at that hierarchy, and potentially make use of the extra
 information contained therein.

-Suppose we have a dataset for clustering. It is a binary file in nunpy format and it can be found at https://github.com/lmcinnes/hdbscan/blob/master/notebooks/clusterable_data.npy.
+Suppose we have a dataset for clustering. It is a binary file in NumPy format and it can be found at https://github.com/lmcinnes/hdbscan/blob/master/notebooks/clusterable_data.npy.

 .. code:: python

@@ -116,7 +115,7 @@ each branch representing the number of points in the cluster at that
 level. If we wish to know which branches were selected by the HDBSCAN\*
 algorithm we can pass ``select_clusters=True``. You can even pass a
 selection palette to color the selections according to the cluster
-labelling.
+labeling.

 .. code:: python

@@ -127,8 +126,8 @@ labelling.
 .. image:: images/advanced_hdbscan_11_1.png


-From this we can see, for example, that the yellow cluster, at the
-center of the plot, forms early (breaking off from the pale blue and
+From this, we can see, for example, that the yellow cluster at the
+center of the plot forms early (breaking off from the pale blue and
 purple clusters) and persists for a long time. By comparison the green
 cluster, which also forms early, quickly breaks apart and then
 vanishes altogether (shattering into clusters all smaller than the
@@ -141,7 +140,7 @@ for example, in the dark blue cluster.

 If this was a simple visual analysis of the condensed tree can tell you
 a lot more about the structure of your data. This is not all we can do
-with condensed trees however. For larger and more complex datasets the
+with condensed trees, however. For larger and more complex datasets the
 tree itself may be very complex, and it may be desirable to run more
 interesting analytics over the tree itself. This can be achieved via
 several converter methods: :py:meth:`~hdbscan.plots.CondensedTree.to_networkx`, :py:meth:`~hdbscan.plots.CondensedTree.to_pandas`, and
@@ -162,9 +161,9 @@ First we'll consider :py:meth:`~hdbscan.plots.CondensedTree.to_networkx`



-As you can see we get a networkx directed graph, which we can then use
-all the regular networkx tools and analytics on. The graph is richer
-than the visual plot above may lead you to believe however:
+As you can see we get a NetworkX directed graph, which we can then use
+all the regular NetworkX tools and analytics on. The graph is richer
+than the visual plot above may lead you to believe, however:

 .. code:: python

@@ -182,12 +181,12 @@ than the visual plot above may lead you to believe however:

 The graph actually contains nodes for all the points falling out of
 clusters as well as the clusters themselves. Each node has an associated
-``size`` attribute, and each edge has a ``weight`` of the lambda value
+``size`` attribute and each edge has a ``weight`` of the lambda value
 at which that edge forms. This allows for much more interesting
 analyses.

-Next we have the :py:meth:`~hdbscan.plots.CondensedTree.to_pandas` method, which returns a panda dataframe
-where each row corresponds to an edge of the networkx graph:
+Next, we have the :py:meth:`~hdbscan.plots.CondensedTree.to_pandas` method, which returns a panda DataFrame
+where each row corresponds to an edge of the NetworkX graph:

 .. code:: python

@@ -258,11 +257,11 @@ the id of the child cluster (or, if the child is a single data point
 rather than a cluster, the index in the dataset of that point), the
 ``lambda_val`` provides the lambda value at which the edge forms, and
 the ``child_size`` provides the number of points in the child cluster.
-As you can see the start of the dataframe has singleton points falling
+As you can see the start of the DataFrame has singleton points falling
 out of the root cluster, with each ``child_size`` equal to 1.

 If you want just the clusters, rather than all the individual points
-as well, simply select the rows of the dataframe with ``child_size``
+as well, simply select the rows of the DataFrame with ``child_size``
 greater than 1.

 .. code:: python
@@ -293,13 +292,13 @@ array:



-This is equivalent to the pandas dataframe, but is in pure numpy and
+This is equivalent to the pandas DataFrame but is in pure NumPy and
 hence has no pandas dependencies if you do not wish to use pandas.

 Single Linkage Trees
 --------------------

-We have still more data at our disposal however. As noted in the How
+We have still more data at our disposal, however. As noted in the How
 HDBSCAN Works section, prior to providing a condensed tree the algorithm
 builds a complete dendrogram. We have access to this too via the
 :py:attr:`~hdbscan.HDBSCAN.single_linkage_tree_` attribute of the clusterer.
@@ -333,13 +332,13 @@ As you can see we gain a lot from condensing the tree in terms of better
 presenting and summarising the data. There is a lot less to be gained
 from visual inspection of a plot like this (and it only gets worse for
 larger datasets). The plot function support most of the same
-fucntionality as the dendrogram plotting from
+functionality as the dendrogram plotting from
 ``scipy.cluster.hierarchy``, so you can view various truncations of the
 tree if necessary. In practice, however, you are more likely to be
 interested in access the raw data for further analysis. Again we have
 :py:meth:`~hdbscan.plots.SingleLinkageTree.to_networkx`, :py:meth:`~hdbscan.plots.SingleLinkageTree.to_pandas` and :py:meth:`~hdbscan.plots.SingleLinkageTree.to_numpy`. This time the
-:py:meth:`~hdbscan.plots.SingleLinkageTree.to_networkx` provides a direct networkx version of what you see
-above. The numpy and pandas results conform to the single linkage
+:py:meth:`~hdbscan.plots.SingleLinkageTree.to_networkx` provides a direct NetworkX version of what you see
+above. The NumPy and pandas results conform to the single linkage
 hierarchy format of ``scipy.cluster.hierarchy``, and can be passed to
 routines there if necessary.

@@ -360,6 +359,6 @@ noise points (any cluster smaller than the ``minimum_cluster_size``).
 array([ 0, -1, 0, ..., -1, -1, 0])


-In this way it is possible to extract the DBSCAN clustering that would result
+In this way, it is possible to extract the DBSCAN clustering that would result
 for any given epsilon value, all from one run of hdbscan.
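For orientation, here is a hedged usage sketch (not part of this commit) pulling together the converters the edited docs describe; the local file path, ``min_cluster_size``, and the epsilon value are assumptions:

.. code:: python

    import numpy as np
    import hdbscan

    # Assumes a local copy of the clusterable_data.npy file linked above.
    data = np.load('clusterable_data.npy')
    clusterer = hdbscan.HDBSCAN(min_cluster_size=15).fit(data)

    # Condensed tree as a pandas DataFrame; keep only rows describing
    # clusters (child_size > 1), dropping singleton points.
    tree_df = clusterer.condensed_tree_.to_pandas()
    clusters_only = tree_df[tree_df['child_size'] > 1]

    # The same hierarchy as a NetworkX directed graph.
    graph = clusterer.condensed_tree_.to_networkx()

    # DBSCAN-style flat clustering at a chosen epsilon, cut from the
    # single linkage dendrogram.
    labels = clusterer.single_linkage_tree_.get_clusters(0.023, min_cluster_size=5)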

docs/basic_hdbscan.rst

Lines changed: 5 additions & 1 deletion
@@ -280,7 +280,11 @@ distance, etc. Again, this is all fine as ``hdbscan`` supports a special
 metric called ``precomputed``. If you create the clusterer with the
 metric set to ``precomputed`` then the clusterer will assume that,
 rather than being handed a vector of points in a vector space, it is
-recieving an all pairs distance matrix.
+recieving an all pairs distance matrix. Missing distances can be
+indicated by ``numpy.inf``, which leads HDBSCAN to ignore these pairwise
+relationships as long as there exists a path between two points that
+contains defined distances (i.e. if there are too many distances
+missing, the clustering is going to fail).

 .. code:: python
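As a hedged sketch of the behavior this hunk documents (the random data and all parameter values are assumptions, not part of the commit):

.. code:: python

    import numpy as np
    import hdbscan
    from sklearn.metrics import pairwise_distances

    data = np.random.rand(100, 3)
    distance_matrix = pairwise_distances(data)

    # Mark one pairwise distance as unknown; HDBSCAN ignores it because
    # the remaining finite distances still connect the two points.
    distance_matrix[0, 1] = np.inf
    distance_matrix[1, 0] = np.inf

    clusterer = hdbscan.HDBSCAN(metric='precomputed', min_cluster_size=10)
    labels = clusterer.fit_predict(distance_matrix)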

docs/faq.rst

Lines changed: 11 additions & 0 deletions
@@ -47,6 +47,17 @@ Despite the generate model having clearly different "clusters", without more
 data we simply cannot differentiate between these models, and hence no
 density based clustering will manage cluster these according to the model.

+Q: I am not getting the claimed performance. Why not?
+-----------------------------------------------------
+
+The most likely explanation is to do with the dimensionality of your input data.
+While HDBSCAN can perform well on low to medium dimensional data the performance
+tends to decrease significantly as dimension increases. In general HDBSCAN can do
+well on up to around 50 or 100 dimensional data, but performance can see
+significant decreases beyond that. Of course a lot is also dataset dependent, so
+you can still get good performance even on high dimensional data, but it
+is no longer guaranteed.
+
 Q: I want to predict the cluster of a new unseen point. How do I do this?
 -------------------------------------------------------------------------

docs/soft_clustering_explanation.rst

Lines changed: 2 additions & 2 deletions
@@ -20,7 +20,7 @@ first cluster" as a point solidly in the center that is very distant
 from the second cluster. Equally, if the clustering algorithm supports
 noise assignments, then points are simply assigned as "noise". We are
 left with no idea as to which, if any cluster, they might have just
-missed the cur on being in.
+missed the cut on being in.

 The remedy for this is 'soft clustering' or 'fuzzy clustering'. In this
 approach points are not assigned cluster labels, but are instead assigned
@@ -116,7 +116,7 @@ than ideal. The second way of looking at things is to consider how much
 of an outlier the point is relative to each cluster -- using something
 akin to the outlier scores from GLOSH. The advantage of this approach is
 that it handles odd shaped clusters (even toroidal clusters) far better
-since it will explciitly follow the manifolds of the clusters. The down
+since it will explicitly follow the manifolds of the clusters. The down
 side of the outlier approach is that many points will all be equally
 "outlying", particularly noise points. Our goal is to fuse these two
 ideas.
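For context on the soft clustering being described, a hedged sketch of how per-point membership vectors are typically obtained with hdbscan (the toy data and parameters are assumptions):

.. code:: python

    import hdbscan
    from sklearn.datasets import make_blobs

    data, _ = make_blobs(n_samples=200, centers=3, random_state=42)
    clusterer = hdbscan.HDBSCAN(min_cluster_size=15,
                                prediction_data=True).fit(data)

    # One row per point; each column is that point's membership strength
    # for one cluster, rather than a single hard label.
    memberships = hdbscan.all_points_membership_vectors(clusterer)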

hdbscan/_hdbscan_tree.pyx

Lines changed: 6 additions & 0 deletions
@@ -604,6 +604,8 @@ cpdef list recurse_leaf_dfs(np.ndarray cluster_tree, np.intp_t current_node):


 cpdef list get_cluster_tree_leaves(np.ndarray cluster_tree):
+    if cluster_tree.shape[0] == 0:
+        return []
     root = cluster_tree['parent'].min()
     return recurse_leaf_dfs(cluster_tree, root)

@@ -689,6 +691,10 @@ cpdef tuple get_clusters(np.ndarray tree, dict stability,
                     is_cluster[sub_node] = False
     elif cluster_selection_method == 'leaf':
         leaves = set(get_cluster_tree_leaves(cluster_tree))
+        if len(leaves) == 0:
+            for c in is_cluster:
+                is_cluster[c] = False
+            is_cluster[tree['parent'].min()] = True
         for c in is_cluster:
             if c in leaves:
                 is_cluster[c] = True
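For context (an assumed usage sketch, not part of the diff): the ``leaf`` branch above is exercised when a clusterer is built with ``cluster_selection_method='leaf'``; the guard added here appears to make an empty cluster tree fall back to a single root cluster rather than calling ``min()`` on an empty array. The toy data and parameters are assumptions:

.. code:: python

    import numpy as np
    import hdbscan

    data = np.random.rand(50, 2)
    clusterer = hdbscan.HDBSCAN(min_cluster_size=5,
                                cluster_selection_method='leaf')
    labels = clusterer.fit_predict(data)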

hdbscan/hdbscan_.py

Lines changed: 37 additions & 2 deletions
@@ -71,6 +71,13 @@ def _hdbscan_generic(X, min_samples=5, alpha=1.0, metric='minkowski', p=2,
         distance_matrix = pairwise_distances(X, metric=metric, p=p)
     elif metric == 'arccos':
         distance_matrix = pairwise_distances(X, metric='cosine', **kwargs)
+    elif metric == 'precomputed':
+        # Treating this case explicitly, instead of letting
+        # sklearn.metrics.pairwise_distances handle it,
+        # enables the usage of numpy.inf in the distance
+        # matrix to indicate missing distance information.
+        # TODO: Check if copying is necessary
+        distance_matrix = X.copy()
     else:
         distance_matrix = pairwise_distances(X, metric=metric, **kwargs)

@@ -86,6 +93,13 @@ def _hdbscan_generic(X, min_samples=5, alpha=1.0, metric='minkowski', p=2,

     min_spanning_tree = mst_linkage_core(mutual_reachability_)

+    # Warn if the MST couldn't be constructed around the missing distances
+    if np.isinf(min_spanning_tree.T[2]).any():
+        warn('The minimum spanning tree contains edge weights with value '
+             'infinity. Potentially, you are missing too many distances '
+             'in the initial distance matrix for the given neighborhood '
+             'size.', UserWarning)
+
     # mst_linkage_core does not generate a full minimal spanning tree
     # If a tree is required then we must build the edges from the information
     # returned by mst_linkage_core (i.e. just the order of points to be merged)
@@ -282,6 +296,14 @@ def _hdbscan_boruvka_balltree(X, min_samples=5, alpha=1.0,
     return single_linkage_tree, None


+def check_precomputed_distance_matrix(X):
+    """Perform check_array(X) after removing infinite values (numpy.inf) from the given distance matrix.
+    """
+    tmp = X.copy()
+    tmp[np.isinf(tmp)] = 1
+    check_array(tmp)
+
+
 def hdbscan(X, min_cluster_size=5, min_samples=None, alpha=1.0,
             metric='minkowski', p=2, leaf_size=40,
             algorithm='best', memory=Memory(cachedir=None, verbose=0),
@@ -464,7 +486,13 @@ def hdbscan(X, min_cluster_size=5, min_samples=None, alpha=1.0,
                          'Should be one of: "eom", "leaf"\n')

     # Checks input and converts to an nd-array where possible
-    X = check_array(X, accept_sparse='csr')
+    if metric != 'precomputed' or issparse(X):
+        X = check_array(X, accept_sparse='csr')
+    else:
+        # Only non-sparse, precomputed distance matrices are handled here
+        # and thereby allowed to contain numpy.inf for missing distances
+        check_precomputed_distance_matrix(X)
+
     # Python 2 and 3 compliant string_type checking
     if isinstance(memory, six.string_types):
         memory = Memory(cachedir=memory, verbose=0)
@@ -798,9 +826,16 @@ def fit(self, X, y=None):
         self : object
             Returns self
         """
-        X = check_array(X, accept_sparse='csr')
         if self.metric != 'precomputed':
+            X = check_array(X, accept_sparse='csr')
             self._raw_data = X
+        elif issparse(X):
+            # Handle sparse precomputed distance matrices separately
+            X = check_array(X, accept_sparse='csr')
+        else:
+            # Only non-sparse, precomputed distance matrices are allowed
+            # to have numpy.inf values indicating missing distances
+            check_precomputed_distance_matrix(X)

         kwargs = self.get_params()
         # prediction data only applies to the persistent model, so remove
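A hedged sketch of the warning path added above (not part of the diff): when too many ``numpy.inf`` entries leave the finite-distance graph disconnected, the minimum spanning tree keeps an infinite edge and the new ``UserWarning`` fires. The toy data and parameters are assumptions:

.. code:: python

    import warnings
    import numpy as np
    import hdbscan
    from sklearn.metrics import pairwise_distances

    # Toy distance matrix in which every distance to point 0 is unknown,
    # so no finite path reaches that point.
    data = np.random.rand(10, 2)
    D = pairwise_distances(data)
    D[0, 1:] = np.inf
    D[1:, 0] = np.inf

    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter('always')
        hdbscan.HDBSCAN(metric='precomputed', min_cluster_size=3).fit(D)

    print([str(w.message) for w in caught])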
