Commit 5559983

Merge pull request #654 from JelmerBot/dev/flasc-fixes
Fix typo's and avoid internal numpy API in branch detection code.
2 parents 2e7112d + 7037c60 commit 5559983

3 files changed

+67
-48
lines changed

docs/how_to_detect_branches.rst

Lines changed: 20 additions & 12 deletions
@@ -1,4 +1,4 @@
-How to detect banches in clusters
+How to detect branches in clusters
 =================================
 
 HDBSCAN\* is often used to find subpopulations in exploratory data
@@ -14,20 +14,21 @@ does not inform us of the branching structure:
 .. image:: images/how_to_detect_branches_3_0.png
 
 Alternatively, HDBSCAN\*’s leaf clusters provide more detail. They
-segment the points of different branches into distint clusters. However,
+segment the points of different branches into distinct clusters. However,
 the partitioning and cluster hierarchy does not (necessarily) tell us how
 those clusters combine into a larger shape.
 
 .. image:: images/how_to_detect_branches_5_0.png
 
 This is where the branch detection post-processing step comes into play.
 The functionality is described in detail by `Bot et
-al <https://arxiv.org/abs/2311.15887>`__. It operates on the detected
-clusters and extracts a branch-hierarchy analogous to HDBSCAN\*’s
-condensed cluster hierarchy. The process is very similar to HDBSCAN\*
-clustering, except that it operates on an in-cluster eccentricity rather
-than a density measure. Where peaks in a density profile correspond to
-clusters, the peaks in an eccentricity profile correspond to branches:
+al <https://arxiv.org/abs/2311.15887>`__ (please reference this paper when using
+this functionality). It operates on the detected clusters and extracts a
+branch-hierarchy analogous to HDBSCAN\*'s condensed cluster hierarchy. The
+process is very similar to HDBSCAN\* clustering, except that it operates on an
+in-cluster eccentricity rather than a density measure. Where peaks in a density
+profile correspond to clusters, the peaks in an eccentricity profile correspond
+to branches:
 
 .. image:: images/how_to_detect_branches_7_0.png
 
@@ -41,11 +42,18 @@ The resulting partitioning reflects subgroups for clusters and their
 branches:
 
 .. code:: python
+
     from hdbscan import HDBSCAN, BranchDetector
 
     clusterer = HDBSCAN(min_cluster_size=15, branch_detection_data=True).fit(data)
     branch_detector = BranchDetector(min_branch_size=15).fit(clusterer)
-    plot(branch_detector.labels_)
+
+    # Plot labels
+    plt.scatter(data[:, 0], data[:, 1], 1, color=[
+        "silver" if l < 0 else f"C{l % 10}" for l in branch_detector.labels_
+    ])
+    plt.axis("off")
+    plt.show()
 
 .. image:: images/how_to_detect_branches_9_0.png
 
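The new plotting lines in the hunk above replace the undefined `plot(...)` helper with an explicit label-to-colour mapping. The sketch below isolates that mapping so it can be checked without Matplotlib or hdbscan installed. The `labels` array is a made-up stand-in for `branch_detector.labels_`, and the doc snippet itself presumably relies on `plt` (`matplotlib.pyplot`) and `data` being defined earlier in the page.

```python
import numpy as np

# Hypothetical branch labels; -1 marks noise points, as in HDBSCAN* output.
labels = np.array([-1, 0, 1, 9, 10, 23])

# Noise points become silver; every other label cycles through Matplotlib's
# ten default colour aliases "C0" .. "C9", so labels 0 and 10 share a colour.
colors = ["silver" if l < 0 else f"C{l % 10}" for l in labels]

print(colors)
```

Because the mapping wraps at ten, subgroups whose labels differ by a multiple of ten are drawn in the same colour; this is a property of the snippet, not of the branch detector.
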
@@ -75,7 +83,7 @@ Most guidelines for tuning HDBSCAN\* also apply for the branch detector:
 ``allow_single_cluster`` and mostly affects the EOM selection
 strategy. When enabled, clusters with bifurcations will be given a
 single label if the root segment contains most eccentricity mass
-(i.e., branches already merge far from the center and most poinst are
+(i.e., branches already merge far from the center and most points are
 central).
 - ``max_branch_size`` behaves like HDBSCAN\*’s ``max_cluster_size`` and
 mostly affects the EOM selection strategy. Branches with more than
@@ -99,7 +107,7 @@ Two parameters are unique to the ``BranchDetector`` class:
 all ``min_samples``-nearest neighbours.
 - The ``"full"`` method connects all points with a mutual
 reachability lower than the maximum distance in the cluster’s MST.
-It represents all connectity at the moment the last point joins
+It represents all connectivity at the moment the last point joins
 the cluster.
 
 These methods differ in their sensitivity, noise robustness, and
@@ -143,7 +151,7 @@ cluster.
 
 The length of the branches also says something about the compactness /
 elongatedness of clusters. For example, the branch hierarchy for the
-orange ~-shaped cluster is quite different from the same hierarcy for
+orange ~-shaped cluster is quite different from the same hierarchy for
 the central o-shaped cluster.
 
 .. code:: python

hdbscan/plots.py

Lines changed: 35 additions & 25 deletions
@@ -948,36 +948,46 @@ def __init__(
         branch_probabilities,
         raw_data=None,
     ):
-        self._edges = np.core.records.fromarrays(
-            np.hstack(
-                (
-                    np.concatenate(approximation_graphs),
-                    np.repeat(
-                        np.arange(len(approximation_graphs)),
-                        [g.shape[0] for g in approximation_graphs],
-                    )[None].T,
-                )
-            ).transpose(),
-            names="parent, child, centrality, mutual_reachability, cluster",
-            formats="intp, intp, double, double, intp",
+        self._edges = np.array(
+            [
+                (edge[0], edge[1], edge[2], edge[3], cluster)
+                for cluster, edges in enumerate(approximation_graphs)
+                for edge in edges
+            ],
+            dtype=[
+                ("parent", np.intp),
+                ("child", np.intp),
+                ("centrality", np.float64),
+                ("mutual_reachability", np.float64),
+                ("cluster", np.intp),
+            ],
         )
         self.point_mask = cluster_labels >= 0
         self._raw_data = raw_data[self.point_mask, :] if raw_data is not None else None
-        self._points = np.core.records.fromarrays(
-            np.vstack(
+        self._points = np.array(
+            [
                 (
-                    np.where(self.point_mask)[0],
-                    labels[self.point_mask],
-                    probabilities[self.point_mask],
-                    cluster_labels[self.point_mask],
-                    cluster_probabilities[self.point_mask],
-                    cluster_centralities[self.point_mask],
-                    branch_labels[self.point_mask],
-                    branch_probabilities[self.point_mask],
+                    i,
+                    labels[i],
+                    probabilities[i],
+                    cluster_labels[i],
+                    cluster_probabilities[i],
+                    cluster_centralities[i],
+                    branch_labels[i],
+                    branch_probabilities[i],
                 )
-            ),
-            names="id, label, probability, cluster_label, cluster_probability, cluster_centrality, branch_label, branch_probability",
-            formats="intp, intp, double, intp, double, double, intp, double",
+                for i in np.where(self.point_mask)[0]
+            ],
+            dtype=[
+                ("id", np.intp),
+                ("label", np.intp),
+                ("probability", np.float64),
+                ("cluster_label", np.intp),
+                ("cluster_probability", np.float64),
+                ("cluster_centrality", np.float64),
+                ("branch_label", np.intp),
+                ("branch_probability", np.float64),
+            ],
         )
         self._pos = None
 
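The hunk above swaps `np.core.records.fromarrays` for a plain `np.array` call with a structured dtype. The likely motivation is that `np.core` is a private namespace in NumPy 2.0 and later. The sketch below reproduces the new edge-table pattern on made-up data. The two small `approximation_graphs` arrays are hypothetical stand-ins; only the field names and dtypes are taken from the diff.

```python
import numpy as np

# Hypothetical stand-ins for the per-cluster approximation graphs. Each row is
# (parent, child, centrality, mutual_reachability), stored as floats.
approximation_graphs = [
    np.array([[0.0, 1.0, 0.5, 1.2], [1.0, 2.0, 0.7, 0.9]]),
    np.array([[3.0, 4.0, 0.1, 2.0]]),
]

# Field names and types as they appear in the diff.
edge_dtype = [
    ("parent", np.intp),
    ("child", np.intp),
    ("centrality", np.float64),
    ("mutual_reachability", np.float64),
    ("cluster", np.intp),
]

# One tuple per edge, tagged with the index of the graph it came from.
# NumPy casts each tuple element to the dtype of its field.
edges = np.array(
    [
        (edge[0], edge[1], edge[2], edge[3], cluster)
        for cluster, graph in enumerate(approximation_graphs)
        for edge in graph
    ],
    dtype=edge_dtype,
)

print(edges["cluster"])  # graph index of every edge
print(edges[edges["cluster"] == 0]["child"])  # children in the first graph
```

The list comprehension builds the table row by row in Python, so it is probably slower than the stacked-array version it replaces for very large graphs. The trade-off is that it only touches public NumPy API and keeps the field order visible next to the dtype.
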
notebooks/How to detect branches.ipynb

Lines changed: 12 additions & 11 deletions
@@ -4,7 +4,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"# How to detect banches in clusters\n",
+"# How to detect branches in clusters\n",
 "\n",
 "HDBSCAN\\* is often used to find subpopulations in exploratory data analysis\n",
 "workflows. Not only clusters themselves, but also their shape can represent\n",
@@ -107,7 +107,7 @@
 "metadata": {},
 "source": [
 "Alternatively, HDBSCAN\\*'s leaf clusters provide more detail. They segment the\n",
-"points of different branches into distint clusters. However, the partitioning\n",
+"points of different branches into distinct clusters. However, the partitioning\n",
 "and cluster hierarchy does not (necessarily) tell us how those clusters combine\n",
 "into a larger shape."
 ]
@@ -143,12 +143,13 @@
 "source": [
 "This is where the branch detection post-processing step comes into play. The\n",
 "functionality is described in detail by [Bot et\n",
-"al](https://arxiv.org/abs/2311.15887). It operates on the detected clusters and\n",
-"extracts a branch-hierarchy analogous to HDBSCAN*'s condensed cluster hierarchy.\n",
-"The process is very similar to HDBSCAN* clustering, except that it operates on\n",
-"an in-cluster eccentricity rather than a density measure. Where peaks in a\n",
-"density profile correspond to clusters, the peaks in an eccentricity profile\n",
-"correspond to branches:"
+"al](https://arxiv.org/abs/2311.15887) (please reference this paper when using\n",
+"this functionality). It operates on the detected clusters and extracts a\n",
+"branch-hierarchy analogous to HDBSCAN\\*'s condensed cluster hierarchy. The\n",
+"process is very similar to HDBSCAN\\* clustering, except that it operates on an\n",
+"in-cluster eccentricity rather than a density measure. Where peaks in a density\n",
+"profile correspond to clusters, the peaks in an eccentricity profile correspond\n",
+"to branches:"
 ]
 },
 {
@@ -269,7 +270,7 @@
 " mostly affects the EOM selection strategy. When enabled, clusters with\n",
 " bifurcations will be given a single label if the root segment contains most\n",
 " eccentricity mass (i.e., branches already merge far from the center and most\n",
-" poinst are central).\n",
+" points are central).\n",
 "- `max_branch_size` behaves like HDBSCAN\\*'s `max_cluster_size` and mostly\n",
 " affects the EOM selection strategy. Branches with more than the specified\n",
 " number of points are skipped, selecting their descendants in the hierarchy\n",
@@ -288,7 +289,7 @@
 " minimum spanning tree under HDBSCAN\\*'s mutual reachability distance. This\n",
 " graph contains the detected MST and all `min_samples`-nearest neighbours. \n",
 " - The `\"full\"` method connects all points with a mutual reachability lower\n",
-" than the maximum distance in the cluster's MST. It represents all connectity\n",
+" than the maximum distance in the cluster's MST. It represents all connectivity\n",
 " at the moment the last point joins the cluster. These methods differ in\n",
 " their sensitivity, noise robustness, and computational cost. The `\"core\"`\n",
 " method usually needs slightly higher `min_branch_size` values to suppress\n",
@@ -348,7 +349,7 @@
 "source": [
 "The length of the branches also says something about the compactness /\n",
 "elongatedness of clusters. For example, the branch hierarchy for the orange\n",
-"~-shaped cluster is quite different from the same hierarcy for the central\n",
+"~-shaped cluster is quite different from the same hierarchy for the central\n",
 "o-shaped cluster."
 ]
 },
