
Commit 7a7a0de

Merge remote-tracking branch 'origin/master'
2 parents 7bcc9ab + 16b02de

5 files changed: +43 additions, -28 deletions


README.rst

Lines changed: 1 addition & 1 deletion
@@ -268,7 +268,7 @@ If you have used this codebase in a scientific publication and wish to cite it,
 In: Journal of Open Source Software, The Open Journal, volume 2, number 11.
 2017
 
-To refernece the high performance algorithm developed in this library please cite our paper in ICDMW 2017 proceedings.
+To reference the high performance algorithm developed in this library please cite our paper in ICDMW 2017 proceedings.
 
 McInnes L, Healy J. *Accelerated Hierarchical Density Based Clustering*
 In: 2017 IEEE International Conference on Data Mining Workshops (ICDMW), IEEE, pp 33-42.

docs/comparing_clustering_algorithms.rst

Lines changed: 8 additions & 6 deletions
@@ -171,14 +171,16 @@ multiple different clusterings. This does not engender much confidence
 in any individual clustering that may result.
 
 So, in summary, here's how K-Means seems to stack up against out
-desiderata: \* **Don't be wrong!**: K-means is going to throw points
+desiderata:
+- **Don't be wrong!**: K-means is going to throw points
 into clusters whether they belong or not; it also assumes you clusters
-are globular. K-Means scores very poorly on this point. \* **Intuitive
-parameters**: If you have a good intuition for how many clusters the
+are globular. K-Means scores very poorly on this point.
+- **Intuitive parameters**: If you have a good intuition for how many clusters the
 dataset your exploring has then great, otherwise you might have a
-problem. \* **Stability**: Hopefully the clustering is stable for your
-data. Best to have many runs and check though. \* **Performance**: This
-is K-Means big win. It's a simple algorithm and with the right tricks
+problem.
+- **Stability**: Hopefully the clustering is stable for your
+data. Best to have many runs and check though.
+- **Performance**: This is K-Means big win. It's a simple algorithm and with the right tricks
 and optimizations can be made exceptionally efficient. There are few
 algorithms that can compete with K-Means for performance. If you have
 truly huge data then K-Means might be your only option.
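
The rewritten "Stability" bullet advises running K-Means several times and checking agreement between runs. Not part of the commit, but a minimal sketch of that check, assuming scikit-learn and an arbitrary toy blob dataset:

```python
# Rough sketch of the "run it many times and check" stability advice above.
# Dataset and cluster count are arbitrary illustrative choices.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

data, _ = make_blobs(n_samples=500, centers=6, cluster_std=2.0, random_state=0)

# Several runs with different random initialisations (n_init=1 so each run
# really is a single initialisation).
labelings = [
    KMeans(n_clusters=6, n_init=1, random_state=seed).fit_predict(data)
    for seed in range(5)
]

# Pairwise agreement between runs; values well below 1.0 indicate an
# unstable clustering for this choice of n_clusters.
for i in range(len(labelings)):
    for j in range(i + 1, len(labelings)):
        print(i, j, round(adjusted_rand_score(labelings[i], labelings[j]), 3))
```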

docs/parameter_selection.rst

Lines changed: 13 additions & 6 deletions
@@ -112,12 +112,13 @@ of 15.
 As you can see this results in us recovering something much closer to
 our original clustering, only now with some of the smaller clusters
 pruned out. Thus ``min_cluster_size`` does behave more closely to our
-intuitions, but only if we fix ``min_samples``. If you wish to explore
-different ``min_cluster_size`` settings with a fixed ``min_samples``
-value, especially for larger dataset sizes, you can cache the hard
-computation, and recompute only the relatively cheap flat cluster
-extraction using the ``memory`` parameter, which makes use of
-`joblib <https://pythonhosted.org/joblib/>`_
+intuitions, but only if we fix ``min_samples``.
+
+If you wish to explore different ``min_cluster_size`` settings with
+a fixed ``min_samples`` value, especially for larger dataset sizes,
+you can cache the hard computation, and recompute only the relatively
+cheap flat cluster extraction using the ``memory`` parameter, which
+makes use of `joblib <https://pythonhosted.org/joblib/>`_
 
 .. _min_samples_label:
 
@@ -134,6 +135,9 @@ to progressively more dense areas. We can see this in practice by
 leaving the ``min_cluster_size`` at 60, but reducing ``min_samples`` to
 1.
 
+Note: adjusting ``min_samples`` will result in recomputing the **hard
+comptuation** of the single linkage tree.
+
 .. code:: python
 
     clusterer = hdbscan.HDBSCAN(min_cluster_size=60, min_samples=1).fit(data)

@@ -181,6 +185,9 @@ clustering is. By default ``alpha`` is set to 1.0. Increasing ``alpha``
 will make the clustering more conservative, but on a much tighter scale,
 as we can see by setting ``alpha`` to 1.3.
 
+Note: adjusting ``alpha`` will result in recomputing the **hard
+comptuation** of the single linkage tree.
+
 .. code:: python
 
     clusterer = hdbscan.HDBSCAN(min_cluster_size=60, min_samples=15, alpha=1.3).fit(data)
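
The notes added above point out that changing ``min_samples`` or ``alpha`` forces the hard computation (the single linkage tree) to be redone, whereas varying only ``min_cluster_size`` can reuse a cached tree via the ``memory`` parameter. Not part of the commit, but a minimal sketch of that caching workflow; the cache directory and dataset are arbitrary illustrative choices:

```python
# Sketch of caching the expensive single linkage computation with joblib via
# HDBSCAN's ``memory`` parameter, so that only the cheap flat cluster
# extraction is recomputed when min_cluster_size changes.
import hdbscan
from sklearn.datasets import make_blobs

data, _ = make_blobs(n_samples=2000, centers=10, random_state=42)

for size in (30, 60, 120):
    # Same min_samples each time: the cached tree is reused; only the flat
    # cluster extraction is redone for each min_cluster_size.
    clusterer = hdbscan.HDBSCAN(min_cluster_size=size, min_samples=15,
                                memory='./hdbscan_cache').fit(data)
    print(size, clusterer.labels_.max() + 1)
```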

hdbscan/dist_metrics.pyx

Lines changed: 18 additions & 15 deletions
@@ -1108,36 +1108,39 @@ cdef class ArccosDistance(DistanceMetric):
 #
 cdef class PyFuncDistance(DistanceMetric):
     """PyFunc Distance
-
     A user-defined distance
-
     Parameters
     ----------
     func : function
         func should take two numpy arrays as input, and return a distance.
     """
     def __init__(self, func, **kwargs):
         self.func = func
-        x = np.random.random(10)
-        try:
-            d = self.func(x, x, **kwargs)
-        except TypeError:
-            raise ValueError("func must be a callable taking two arrays")
-
-        try:
-            d = float(d)
-        except TypeError:
-            raise ValueError("func must return a float")
-
         self.kwargs = kwargs
 
+    # in cython < 0.26, GIL was required to be acquired during definition of
+    # the function and inside the body of the function. This behaviour is not
+    # allowed in cython >= 0.26 since it is a redundant GIL acquisition. The
+    # only way to be back compatible is to inherit `dist` from the base class
+    # without GIL and called an inline `_dist` which acquire GIL.
     cdef inline DTYPE_t dist(self, DTYPE_t* x1, DTYPE_t* x2,
-                             ITYPE_t size) except -1 with gil:
+                             ITYPE_t size) nogil except -1:
+        return self._dist(x1, x2, size)
+
+    cdef inline DTYPE_t _dist(self, DTYPE_t* x1, DTYPE_t* x2,
+                              ITYPE_t size) except -1 with gil:
         cdef np.ndarray x1arr
         cdef np.ndarray x2arr
         x1arr = _buffer_to_ndarray(x1, size)
         x2arr = _buffer_to_ndarray(x2, size)
-        return self.func(x1arr, x2arr, **self.kwargs)
+        d = self.func(x1arr, x2arr, **self.kwargs)
+        try:
+            # Cython generates code here that results in a TypeError
+            # if d is the wrong type.
+            return d
+        except TypeError:
+            raise TypeError("Custom distance function must accept two "
+                            "vectors and return a float.")
 
 
 cdef inline double fmax(double a, double b) nogil:
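
``PyFuncDistance`` is the wrapper behind user-supplied callable metrics, and the change above moves the GIL acquisition into an inner ``_dist`` for compatibility with newer Cython. Not part of the commit, but a small sketch of the kind of user code that exercises this path; whether a callable ``metric`` is routed through ``PyFuncDistance`` for a given algorithm choice is an assumption here:

```python
# Sketch of a user-defined distance of the sort PyFuncDistance wraps: a
# callable taking two 1-d numpy arrays and returning a float.
import numpy as np
import hdbscan

def taxicab(a, b):
    # Manhattan / L1 distance, returned as a plain Python float.
    return float(np.abs(a - b).sum())

data = np.random.RandomState(0).rand(200, 3)
clusterer = hdbscan.HDBSCAN(min_cluster_size=10, metric=taxicab).fit(data)
print(np.unique(clusterer.labels_))
```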

hdbscan/hdbscan_.py

Lines changed: 3 additions & 0 deletions
@@ -503,6 +503,9 @@ def hdbscan(X, min_cluster_size=5, min_samples=None, alpha=1.0,
         min_samples = 1
 
     if algorithm != 'best':
+        if metric != 'precomputed' and issparse(X) and metric != 'generic':
+            raise ValueError("Sparse data matrices only support algorithm 'generic'.")
+
         if algorithm == 'generic':
             (single_linkage_tree,
              result_min_span_tree) = memory.cache(
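
Not part of the commit, but a sketch of the constraint the new check enforces: sparse feature matrices are only accepted on the ``generic`` code path, and other algorithm choices now fail fast with the ``ValueError`` above. Whether this particular metric and algorithm combination runs end to end on sparse input is an assumption here:

```python
# Sketch: sparse feature matrices must be paired with algorithm='generic'.
import numpy as np
import hdbscan
from scipy.sparse import csr_matrix

rng = np.random.RandomState(0)
dense = rng.rand(200, 50)
dense[dense < 0.9] = 0.0          # mostly zeros, so sparse storage is sensible
sparse = csr_matrix(dense)

clusterer = hdbscan.HDBSCAN(min_cluster_size=10, algorithm='generic').fit(sparse)
print(np.unique(clusterer.labels_))

# Something like hdbscan.HDBSCAN(algorithm='prims_kdtree').fit(sparse) would
# now raise: "Sparse data matrices only support algorithm 'generic'."
```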
