Commit a9f74b2

Merge pull request #648 from JelmerBot/dev/flasc
Add branch detection functionality
2 parents 94ce7a8 + 04c77fe commit a9f74b2

23 files changed: 3271 additions and 24 deletions

README.rst

Lines changed: 52 additions & 0 deletions
@@ -163,6 +163,40 @@ Based on the paper:
163163
*"Rates of convergence for the cluster tree."*
164164
In Advances in Neural Information Processing Systems, 2010.
165165

166+
----------------
167+
Branch detection
168+
----------------
169+
170+
The hdbscan package supports a branch-detection post-processing step
171+
by `Bot et al. <https://arxiv.org/abs/2311.15887>`_. Cluster shapes,
172+
such as branching structures, can reveal interesting patterns
173+
that are not expressed in density-based cluster hierarchies. The
174+
BranchDetector class mimics the HDBSCAN API and can be used to
175+
detect branching hierarchies in clusters. It provides condensed
176+
branch hierarchies, branch persistences, and branch memberships and
177+
supports joblib's caching functionality. A notebook
178+
`demonstrating the BranchDetector is available <http://nbviewer.jupyter.org/github/scikit-learn-contrib/hdbscan/blob/master/notebooks/How%20to%20detect%20branches.ipynb>`_.
179+
180+
Example usage:
181+
182+
.. code:: python
183+
184+
import hdbscan
185+
from sklearn.datasets import make_blobs
186+
187+
data, _ = make_blobs(1000)
188+
189+
clusterer = hdbscan.HDBSCAN(branch_detection_data=True).fit(data)
190+
branch_detector = hdbscan.BranchDetector().fit(clusterer)
191+
branch_detector.cluster_approximation_graph_.plot(edge_width=0.1)
192+
193+
194+
Based on the paper:
195+
D. M. Bot, J. Peeters, J. Liesenborgs and J. Aerts
196+
*"FLASC: A Flare-Sensitive Clustering Algorithm: Extending HDBSCAN\* for Detecting Branches in Clusters"*
197+
Arxiv 2311.15887, 2023.
198+
199+
166200
----------
167201
Installing
168202
----------
@@ -300,6 +334,24 @@ To reference the high performance algorithm developed in this library please cit
300334
organization={IEEE}
301335
}
302336
337+
If you used the branch-detection functionality in this codebase in a scientific publication and wish to cite it, please use the `arXiv preprint <https://arxiv.org/abs/2311.15887>`_:
338+
339+
D. M. Bot, J. Peeters, J. Liesenborgs and J. Aerts
340+
*"FLASC: A Flare-Sensitive Clustering Algorithm: Extending HDBSCAN\* for Detecting Branches in Clusters"*
341+
Arxiv 2311.15887, 2023.
342+
343+
.. code:: bibtex
344+
345+
@misc{bot2023flasc,
346+
title={FLASC: A Flare-Sensitive Clustering Algorithm: Extending HDBSCAN* for Detecting Branches in Clusters},
347+
author={D. M. Bot and J. Peeters and J. Liesenborgs and J. Aerts},
348+
year={2023},
349+
eprint={2311.15887},
350+
archivePrefix={arXiv},
351+
primaryClass={cs.LG},
352+
url={https://arxiv.org/abs/2311.15887},
353+
}
354+
303355
---------
304356
Licensing
305357
---------

docs/api.rst

Lines changed: 12 additions & 0 deletions
@@ -36,3 +36,15 @@ and the prediction module.
3636

3737
.. automodule:: hdbscan.prediction
3838
:members:
39+
40+
41+
Branch detection
42+
----------------
43+
44+
The branches module contains classes for detecting branches within clusters.
45+
46+
.. automodule:: hdbscan.branches
47+
:members:
48+
49+
.. autoclass:: hdbscan.plots.ApproximationGraph
50+
:members:

docs/how_to_detect_branches.rst

Lines changed: 242 additions & 0 deletions
@@ -0,0 +1,242 @@
1+
How to detect branches in clusters
2+
==================================
3+
4+
HDBSCAN\* is often used to find subpopulations in exploratory data
5+
analysis workflows. Not only clusters themselves, but also their shape
6+
can represent meaningful subpopulations. For example, a Y-shaped cluster
7+
may represent an evolving process with two distinct end-states.
8+
Detecting these branches can reveal interesting patterns that are not
9+
captured by density-based clustering.
10+
11+
For example, HDBSCAN\* finds 4 clusters in the dataset below, which
12+
does not inform us of the branching structure:
13+
14+
.. image:: images/how_to_detect_branches_3_0.png
15+
16+
Alternatively, HDBSCAN\*’s leaf clusters provide more detail. They
17+
segment the points of different branches into distinct clusters. However,
18+
the partitioning and cluster hierarchy do not (necessarily) tell us how
19+
those clusters combine into a larger shape.
20+
21+
.. image:: images/how_to_detect_branches_5_0.png
22+
23+
This is where the branch detection post-processing step comes into play.
24+
The functionality is described in detail by `Bot et
25+
al <https://arxiv.org/abs/2311.15887>`__. It operates on the detected
26+
clusters and extracts a branch-hierarchy analogous to HDBSCAN\*’s
27+
condensed cluster hierarchy. The process is very similar to HDBSCAN\*
28+
clustering, except that it operates on an in-cluster eccentricity rather
29+
than a density measure. Where peaks in a density profile correspond to
30+
clusters, the peaks in an eccentricity profile correspond to branches:
31+
32+
.. image:: images/how_to_detect_branches_7_0.png
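
As a rough illustration of the idea, the sketch below computes an
eccentricity value for every point of a toy Y-shaped cluster. It assumes
that eccentricity can be approximated by the Euclidean distance to the
cluster centroid. The ``eccentricity`` helper and the toy data are
hypothetical and not part of the hdbscan API.

.. code:: python

    import math


    def eccentricity(points):
        """Distance of every point to the cluster centroid (assumed definition)."""
        n = len(points)
        dim = len(points[0])
        centroid = [sum(p[d] for p in points) / n for d in range(dim)]
        return [math.dist(p, centroid) for p in points]


    # Toy Y-shaped cluster: a junction at the origin and three arms of four points.
    arm_directions = [(0.0, 1.0), (-1.0, -1.0), (1.0, -1.0)]
    points = [(0.0, 0.0)]
    for dx, dy in arm_directions:
        points += [(dx * t / 4, dy * t / 4) for t in range(1, 5)]

    ecc = eccentricity(points)
    junction = ecc.index(min(ecc))  # index of the most central point
    tips = [4, 8, 12]  # index of the last point of each arm
    print(junction, [round(ecc[i], 2) for i in tips])

In this toy cluster the junction has the lowest eccentricity and the
values grow along each arm, so every arm tip forms its own peak in the
eccentricity profile.
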
33+
34+
Using the branch detection functionality is fairly straightforward.
35+
First, run hdbscan with parameter ``branch_detection_data=True``. This
36+
tells hdbscan to cache the internal data structures needed for the
37+
branch detection process. Then, configure the ``BranchDetector`` class
38+
and fit it with the HDBSCAN object.
39+
40+
The resulting partitioning reflects subgroups for clusters and their
41+
branches:
42+
43+
.. code:: python
44+
from hdbscan import HDBSCAN, BranchDetector
45+
46+
clusterer = HDBSCAN(min_cluster_size=15, branch_detection_data=True).fit(data)
47+
branch_detector = BranchDetector(min_branch_size=15).fit(clusterer)
48+
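# NOTE: `plot` is a user-defined helper (not part of hdbscan), assumed to
# scatter-plot the data coloured by the given labels.
plot(branch_detector.labels_)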
49+
50+
.. image:: images/how_to_detect_branches_9_0.png
51+
52+
53+
Parameter selection
54+
-------------------
55+
56+
The ``BranchDetector``’s main parameters are very similar to those of HDBSCAN\*.
57+
Most guidelines for tuning HDBSCAN\* also apply to the branch detector:
58+
59+
- ``min_branch_size`` behaves like HDBSCAN\*’s ``min_cluster_size``. It
60+
configures how many points branches need to contain. Values around 10
61+
to 25 points tend to work well. Lower values are useful when looking
62+
for smaller structures. Higher values can be used to suppress noise
63+
if present.
64+
- ``branch_selection_method`` behaves like HDBSCAN\*’s
65+
``cluster_selection_method``. The leaf and Excess of Mass (EOM)
66+
strategies are used to select branches from the condensed
67+
hierarchies. By default, branches are only reflected in the final
68+
labelling for clusters that have 3 or more branches (at least one
69+
bifurcation).
70+
- ``branch_selection_persistence`` replaces HDBSCAN\*’s
71+
``cluster_selection_epsilon``. This parameter can be used to suppress
72+
branches with a short eccentricity range (y-range in the condensed
73+
hierarchy plot).
74+
- ``allow_single_branch`` behaves like HDBSCAN\*’s
75+
``allow_single_cluster`` and mostly affects the EOM selection
76+
strategy. When enabled, clusters with bifurcations will be given a
77+
single label if the root segment contains most eccentricity mass
78+
(i.e., branches already merge far from the center and most points are
79+
central).
80+
- ``max_branch_size`` behaves like HDBSCAN\*’s ``max_cluster_size`` and
81+
mostly affects the EOM selection strategy. Branches with more than
82+
the specified number of points are skipped, selecting their
83+
descendants in the hierarchy instead.
84+
85+
Two parameters are unique to the ``BranchDetector`` class:
86+
87+
- ``branch_detection_method`` determines which points are connected
88+
within a cluster. Both density-based clustering and the branch detection
89+
process need to determine which points are part of the same
90+
density/eccentricity peak. HDBSCAN\* defines density in terms of the distance
91+
between points, providing a natural way to define which points are connected at
92+
some density value. Eccentricity does not have such a connection. So, we use
93+
information from the clusters to determine which points should be connected
94+
instead.
95+
96+
- The ``"core"`` method selects all edges that could be part of the
97+
cluster’s minimum spanning tree under HDBSCAN\*’s mutual
98+
reachability distance. This graph contains the detected MST and
99+
all ``min_samples``-nearest neighbours.
100+
- The ``"full"`` method connects all points with a mutual
101+
reachability lower than the maximum distance in the cluster’s MST.
102+
It represents all connectivity at the moment the last point joins
103+
the cluster.
104+
105+
These methods differ in their sensitivity, noise robustness, and
106+
computational cost. The ``"core"`` method usually needs slightly
107+
higher ``min_branch_size`` values to suppress noisy branches than the
108+
``"full"`` method. It is a good choice when branches span large
109+
density ranges.
110+
111+
- ``label_sides_as_branches`` determines whether the sides of an
112+
elongated cluster without bifurcations (l-shape) are represented as
113+
distinct subgroups. By default a cluster needs to have one
114+
bifurcation (Y-shape) before the detected branches are represented in
115+
the final labelling.
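
The two ``branch_detection_method`` strategies described above can be
sketched in plain Python for a single small cluster. This is a simplified
illustration under stated assumptions, not the library's implementation:
the core distance is taken as the distance to the ``min_samples``-th
nearest other point, the MST is built with a naive version of Prim's
algorithm, "lower than" is read as "lower than or equal to" so that the
longest MST edge is retained, and all points are assumed to be distinct.
All function names are hypothetical.

.. code:: python

    import math
    from itertools import combinations


    def mutual_reachability(points, min_samples):
        """Mutual reachability distance for every pair (i, j) with i < j."""
        core = [
            sorted(math.dist(p, q) for q in points)[min_samples]  # index 0 is p itself
            for p in points
        ]
        return {
            (i, j): max(core[i], core[j], math.dist(points[i], points[j]))
            for i, j in combinations(range(len(points)), 2)
        }


    def mst_edges(n_points, weights):
        """Naive Prim's algorithm over the complete weighted graph."""
        in_tree, edges = {0}, []
        while len(in_tree) < n_points:
            crossing = [e for e in weights if (e[0] in in_tree) != (e[1] in in_tree)]
            edge = min(crossing, key=weights.get)
            edges.append(edge)
            in_tree.update(edge)
        return edges


    def core_graph(points, min_samples):
        """MST edges plus each point's ``min_samples`` nearest neighbours."""
        weights = mutual_reachability(points, min_samples)
        edges = set(mst_edges(len(points), weights))
        for i, p in enumerate(points):
            by_distance = sorted(
                range(len(points)), key=lambda j: math.dist(p, points[j])
            )
            for j in by_distance[1 : min_samples + 1]:  # skip the point itself
                edges.add((min(i, j), max(i, j)))
        return sorted(edges)


    def full_graph(points, min_samples):
        """All pairs whose mutual reachability does not exceed the longest MST edge."""
        weights = mutual_reachability(points, min_samples)
        threshold = max(weights[e] for e in mst_edges(len(points), weights))
        return sorted(e for e, w in weights.items() if w <= threshold)


    points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
    core = core_graph(points, min_samples=1)
    full = full_graph(points, min_samples=1)
    print(len(core), len(full))

On this toy input the ``"core"`` graph coincides with the MST, whereas
the ``"full"`` graph adds further short-range edges, which hints at why
the two methods differ in computational cost.
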
116+
117+
118+
Useful attributes
119+
-----------------
120+
121+
Like the HDBSCAN class, the BranchDetector class contains several useful
122+
attributes for exploring datasets.
123+
124+
Branch hierarchy
125+
~~~~~~~~~~~~~~~~
126+
127+
Branch hierarchies reflect the tree-shape of clusters. Like the cluster
128+
hierarchy, branch hierarchies can be used to interpret which branches
129+
exist. In addition, they reflect how far apart branches merge into the
130+
cluster.
131+
132+
.. code:: python
133+
134+
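import matplotlib.pyplot as plt
import numpy as np
# Select the cluster with the most detected branches.
idx = np.argmax([len(x) for x in branch_detector.branch_persistences_])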
135+
branch_detector.cluster_condensed_trees_[idx].plot(
136+
select_clusters=True, selection_palette=["C3", "C4", "C5"]
137+
)
138+
plt.ylabel("Eccentricity")
139+
plt.title(f"Branches in cluster {idx}")
140+
plt.show()
141+
142+
.. image:: images/how_to_detect_branches_13_0.png
143+
144+
The length of the branches also says something about the compactness /
145+
elongatedness of clusters. For example, the branch hierarchy for the
146+
orange ~-shaped cluster is quite different from the hierarchy for
147+
the central o-shaped cluster.
148+
149+
.. code:: python
150+
151+
plt.figure(figsize=(6, 3))
152+
plt.subplot(1, 2, 1)
153+
idx = np.argmin([min(x) for x in branch_detector.branch_persistences_])
154+
branch_detector.cluster_condensed_trees_[idx].plot(colorbar=False)
155+
plt.ylim([0.3, 0])
156+
plt.ylabel("Eccentricity")
157+
plt.title(f"Cluster {idx} (spherical)")
158+
159+
plt.subplot(1, 2, 2)
160+
idx = np.argmax([max(x) for x in branch_detector.branch_persistences_])
161+
branch_detector.cluster_condensed_trees_[idx].plot(colorbar=False)
162+
plt.ylim([0.3, 0])
163+
plt.ylabel("Eccentricity")
164+
plt.title(f"Cluster {idx} (elongated)")
165+
plt.show()
166+
167+
.. image:: images/how_to_detect_branches_15_0.png
168+
169+
Cluster approximation graphs
170+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
171+
172+
Branches are detected using a graph that approximates the connectivity
173+
within a cluster. These graphs are available in the
174+
``cluster_approximation_graph_`` property and can be used to visualise
175+
data and the branch-detection process. The plotting function is based on
176+
the networkx API and uses networkx functionality to compute a layout if
177+
positions are not provided. Using UMAP to compute positions can be
178+
faster and more expressive. Several helper functions for exporting to
179+
numpy, pandas, and networkx are available.
180+
181+
For example, a figure with points coloured by the final labelling:
182+
183+
.. code:: python
184+
185+
g = branch_detector.cluster_approximation_graph_
186+
g.plot(positions=data, node_size=5, edge_width=0.2, edge_alpha=0.2)
187+
plt.show()
188+
189+
.. image:: images/how_to_detect_branches_17_0.png
190+
191+
Or, a figure with the edges coloured by centrality:
192+
193+
.. code:: python
194+
195+
g.plot(
196+
positions=data,
197+
node_alpha=0,
198+
edge_color="centrality",
199+
edge_cmap="turbo",
200+
edge_width=0.2,
201+
edge_alpha=0.2,
202+
edge_vmax=100,
203+
)
204+
plt.show()
205+
206+
.. image:: images/how_to_detect_branches_19_0.png
207+
208+
209+
Approximate predict
210+
-------------------
211+
212+
A branch-aware ``approximate_predict_branch`` function is available to
213+
predict branch labels for new points. This function uses a fitted
214+
BranchDetector object to first predict cluster labels and then the
215+
branch labels.
216+
217+
.. code:: python
218+
219+
from hdbscan import approximate_predict_branch
220+
221+
new_points = np.asarray([[0.4, 0.25], [0.23, 0.2], [-0.14, -0.2]])
222+
clusterer.generate_prediction_data()
223+
labels, probs, cluster_labels, cluster_probs, branch_labels, branch_probs = (
224+
approximate_predict_branch(branch_detector, new_points)
225+
)
226+
227+
plt.scatter(
228+
new_points.T[0],
229+
new_points.T[1],
230+
140,
231+
labels % 10,
232+
marker="p",
233+
zorder=5,
234+
cmap="tab10",
235+
vmin=0,
236+
vmax=9,
237+
edgecolor="k",
238+
)
239+
plot(branch_detector.labels_)
240+
plt.show()
241+
242+
.. image:: images/how_to_detect_branches_21_0.png