-How to detect banches in clusters
+How to detect branches in clusters
 =================================
 
 HDBSCAN\* is often used to find subpopulations in exploratory data
@@ -14,20 +14,21 @@ does not inform us of the branching structure:
 .. image:: images/how_to_detect_branches_3_0.png
 
 Alternatively, HDBSCAN\*’s leaf clusters provide more detail. They
-segment the points of different branches into distint clusters. However,
+segment the points of different branches into distinct clusters. However,
 the partitioning and cluster hierarchy does not (necessarily) tell us how
 those clusters combine into a larger shape.
 
 .. image:: images/how_to_detect_branches_5_0.png
 
 This is where the branch detection post-processing step comes into play.
 The functionality is described in detail by `Bot et
-al <https://arxiv.org/abs/2311.15887>`__. It operates on the detected
-clusters and extracts a branch-hierarchy analogous to HDBSCAN\*’s
-condensed cluster hierarchy. The process is very similar to HDBSCAN\*
-clustering, except that it operates on an in-cluster eccentricity rather
-than a density measure. Where peaks in a density profile correspond to
-clusters, the peaks in an eccentricity profile correspond to branches:
+al <https://arxiv.org/abs/2311.15887>`__ (please reference this paper when using
+this functionality). It operates on the detected clusters and extracts a
+branch-hierarchy analogous to HDBSCAN\*'s condensed cluster hierarchy. The
+process is very similar to HDBSCAN\* clustering, except that it operates on an
+in-cluster eccentricity rather than a density measure. Where peaks in a density
+profile correspond to clusters, the peaks in an eccentricity profile correspond
+to branches:
 
 .. image:: images/how_to_detect_branches_7_0.png
 
@@ -41,11 +42,18 @@ The resulting partitioning reflects subgroups for clusters and their
 branches:
 
 .. code:: python
+
     from hdbscan import HDBSCAN, BranchDetector
 
     clusterer = HDBSCAN(min_cluster_size=15, branch_detection_data=True).fit(data)
     branch_detector = BranchDetector(min_branch_size=15).fit(clusterer)
-    plot(branch_detector.labels_)
+
+    # Plot labels
+    plt.scatter(data[:, 0], data[:, 1], 1, color=[
+        "silver" if l < 0 else f"C{l % 10}" for l in branch_detector.labels_
+    ])
+    plt.axis("off")
+    plt.show()
 
 .. image:: images/how_to_detect_branches_9_0.png
 
@@ -75,7 +83,7 @@ Most guidelines for tuning HDBSCAN\* also apply for the branch detector:
   ``allow_single_cluster`` and mostly affects the EOM selection
   strategy. When enabled, clusters with bifurcations will be given a
   single label if the root segment contains most eccentricity mass
-  (i.e., branches already merge far from the center and most poinst are
+  (i.e., branches already merge far from the center and most points are
   central).
 - ``max_branch_size`` behaves like HDBSCAN\*’s ``max_cluster_size`` and
   mostly affects the EOM selection strategy. Branches with more than
@@ -99,7 +107,7 @@ Two parameters are unique to the ``BranchDetector`` class:
     all ``min_samples``-nearest neighbours.
   - The ``"full"`` method connects all points with a mutual
     reachability lower than the maximum distance in the cluster’s MST.
-    It represents all connectity at the moment the last point joins
+    It represents all connectivity at the moment the last point joins
     the cluster.
 
   These methods differ in their sensitivity, noise robustness, and
@@ -143,7 +151,7 @@ cluster.
 
 The length of the branches also says something about the compactness /
 elongatedness of clusters. For example, the branch hierarchy for the
-orange ~-shaped cluster is quite different from the same hierarcy for
+orange ~-shaped cluster is quite different from the same hierarchy for
 the central o-shaped cluster.
 
 .. code:: python
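
Note on the added plotting lines: the list comprehension passed to ``plt.scatter`` colours noise points (negative labels) silver and cycles every other label through Matplotlib's ``C0`` to ``C9`` colour names. A minimal sketch of that mapping in isolation may make it easier to check. The helper name ``label_colors`` and its ``n_colors`` parameter are hypothetical and not part of hdbscan; the sketch only assumes that labels are integers with negative values marking noise, as in the diff above.

```python
from typing import Iterable, List


def label_colors(labels: Iterable[int], n_colors: int = 10) -> List[str]:
    """Map cluster/branch labels to Matplotlib colour names.

    Hypothetical helper for illustration only; not part of hdbscan.
    Negative labels (noise) map to "silver"; all other labels cycle
    through the colour-cycle names "C0" .. "C{n_colors - 1}".
    """
    return [
        "silver" if label < 0 else f"C{label % n_colors}"
        for label in labels
    ]


if __name__ == "__main__":
    # One noise point and three labelled points; label 12 wraps to "C2".
    print(label_colors([-1, 0, 3, 12]))  # ['silver', 'C0', 'C3', 'C2']
```

The returned list can be passed as the ``color`` argument of ``plt.scatter``, which is what the added lines in the diff do inline. Labels that differ by a multiple of ten share a colour, so the plot stays readable but is not a unique encoding when more than ten subgroups exist.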