|
| 1 | +How to detect branches in clusters |
| 2 | +================================== |
| 3 | + |
| 4 | +HDBSCAN\* is often used to find subpopulations in exploratory data |
| 5 | +analysis workflows. Not only clusters themselves, but also their shape |
| 6 | +can represent meaningful subpopulations. For example, a Y-shaped cluster |
| 7 | +may represent an evolving process with two distinct end-states. |
| 8 | +Detecting these branches can reveal interesting patterns that are not |
| 9 | +captured by density-based clustering. |
| 10 | + |
| 11 | +For example, HDBSCAN\* finds 4 clusters in the dataset below, which |
| 12 | +does not inform us of the branching structure: |
| 13 | + |
| 14 | +.. image:: images/how_to_detect_branches_3_0.png |
| 15 | + |
| 16 | +Alternatively, HDBSCAN\*’s leaf clusters provide more detail. They |
| 17 | +segment the points of different branches into distinct clusters. However, |
| 18 | +the partitioning and cluster hierarchy do not (necessarily) tell us how |
| 19 | +those clusters combine into a larger shape. |
| 20 | + |
| 21 | +.. image:: images/how_to_detect_branches_5_0.png |
| 22 | + |
| 23 | +This is where the branch detection post-processing step comes into play. |
| 24 | +The functionality is described in detail by `Bot et |
| 25 | +al <https://arxiv.org/abs/2311.15887>`__. It operates on the detected |
| 26 | +clusters and extracts a branch-hierarchy analogous to HDBSCAN\*’s |
| 27 | +condensed cluster hierarchy. The process is very similar to HDBSCAN\* |
| 28 | +clustering, except that it operates on an in-cluster eccentricity rather |
| 29 | +than a density measure. Where peaks in a density profile correspond to |
| 30 | +clusters, the peaks in an eccentricity profile correspond to branches: |
| 31 | + |
| 32 | +.. image:: images/how_to_detect_branches_7_0.png |
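To make the eccentricity idea concrete, here is a minimal sketch that does not use the hdbscan API. It assumes eccentricity is simply the distance of each point to its cluster's centroid, which is an illustrative assumption; the library's exact definition may differ. The helper names `y_shaped_cluster` and `eccentricity` are hypothetical.

```python
import math


def y_shaped_cluster(n_per_arm=10):
    """Toy cluster: three straight arms that meet at the origin."""
    points = [(0.0, 0.0)]
    for angle_deg in (90, 210, 330):
        angle = math.radians(angle_deg)
        for step in range(1, n_per_arm + 1):
            radius = step / n_per_arm
            points.append((radius * math.cos(angle), radius * math.sin(angle)))
    return points


def eccentricity(points):
    """Distance of every point to the cluster centroid (assumed definition)."""
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    return [math.hypot(x - cx, y - cy) for x, y in points]


points = y_shaped_cluster()
ecc = eccentricity(points)

# The last point of each arm sits at index 10, 20 and 30.
tip_eccentricities = [ecc[10], ecc[20], ecc[30]]
center_eccentricity = ecc[0]
```

In this toy cluster, eccentricity grows along each arm and peaks at the three arm tips, while the junction has the lowest value. Those three peaks are the kind of structure a branch hierarchy separates.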
| 33 | + |
| 34 | +Using the branch detection functionality is fairly straightforward. |
| 35 | +First, run hdbscan with parameter ``branch_detection_data=True``. This |
| 36 | +tells hdbscan to cache the internal data structures needed for the |
| 37 | +branch detection process. Then, configure the ``BranchDetector`` class |
| 38 | +and fit it with the HDBSCAN object. |
| 39 | + |
| 40 | +The resulting partitioning reflects subgroups for clusters and their |
| 41 | +branches: |
| 42 | + |
| 43 | +.. code:: python |
| 44 | + from hdbscan import HDBSCAN, BranchDetector |
| 45 | +
|
| 46 | + clusterer = HDBSCAN(min_cluster_size=15, branch_detection_data=True).fit(data) |
| 47 | + branch_detector = BranchDetector(min_branch_size=15).fit(clusterer) |
| 48 | + plot(branch_detector.labels_) |
| 49 | +
|
| 50 | +.. image:: images/how_to_detect_branches_9_0.png |
| 51 | + |
| 52 | + |
| 53 | +Parameter selection |
| 54 | +------------------- |
| 55 | + |
| 56 | +The ``BranchDetector``’s main parameters are very similar to those of HDBSCAN\*. |
| 57 | +Most guidelines for tuning HDBSCAN\* also apply to the branch detector: |
| 58 | + |
| 59 | +- ``min_branch_size`` behaves like HDBSCAN\*’s ``min_cluster_size``. It |
| 60 | + sets the minimum number of points a branch must contain. Values around 10 |
| 61 | + to 25 points tend to work well. Lower values are useful when looking |
| 62 | + for smaller structures. Higher values can be used to suppress noise |
| 63 | + if present. |
| 64 | +- ``branch_selection_method`` behaves like HDBSCAN\*’s |
| 65 | + ``cluster_selection_method``. The leaf and Excess of Mass (EOM) |
| 66 | + strategies are used to select branches from the condensed |
| 67 | + hierarchies. By default, branches are only reflected in the final |
| 68 | + labelling for clusters that have 3 or more branches (at least one |
| 69 | + bifurcation). |
| 70 | +- ``branch_selection_persistence`` replaces HDBSCAN\*’s |
| 71 | + ``cluster_selection_epsilon``. This parameter can be used to suppress |
| 72 | + branches with a short eccentricity range (y-range in the condensed |
| 73 | + hierarchy plot). |
| 74 | +- ``allow_single_branch`` behaves like HDBSCAN\*’s |
| 75 | + ``allow_single_cluster`` and mostly affects the EOM selection |
| 76 | + strategy. When enabled, clusters with bifurcations will be given a |
| 77 | + single label if the root segment contains most of the eccentricity mass |
| 78 | + (i.e., branches already merge far from the center and most points are |
| 79 | + central). |
| 80 | +- ``max_branch_size`` behaves like HDBSCAN\*’s ``max_cluster_size`` and |
| 81 | + mostly affects the EOM selection strategy. Branches with more than |
| 82 | + the specified number of points are skipped, selecting their |
| 83 | + descendants in the hierarchy instead. |
| 84 | + |
| 85 | +Two parameters are unique to the ``BranchDetector`` class: |
| 86 | + |
| 87 | +- ``branch_detection_method`` determines which points are connected |
| 88 | + within a cluster. Both density-based clustering and the branch detection |
| 89 | + process need to determine which points are part of the same |
| 90 | + density/eccentricity peak. HDBSCAN\* defines density in terms of the distance |
| 91 | + between points, providing a natural way to define which points are connected at |
| 92 | + some density value. Eccentricity does not have such a connection, so we use |
| 93 | + information from the clusters to determine which points should be connected |
| 94 | + instead. |
| 95 | + |
| 96 | + - The ``"core"`` method selects all edges that could be part of the |
| 97 | + cluster’s minimum spanning tree under HDBSCAN\*’s mutual |
| 98 | + reachability distance. This graph contains the detected MST and |
| 99 | + all ``min_samples``-nearest neighbours. |
| 100 | + - The ``"full"`` method connects all points with a mutual |
| 101 | + reachability distance lower than the maximum distance in the cluster’s MST. |
| 102 | + It represents all connectivity at the moment the last point joins |
| 103 | + the cluster. |
| 104 | + |
| 105 | + These methods differ in their sensitivity, noise robustness, and |
| 106 | + computational cost. The ``"core"`` method usually needs slightly |
| 107 | + higher ``min_branch_size`` values to suppress noisy branches than the |
| 108 | + ``"full"`` method. It is a good choice when branches span large |
| 109 | + density ranges. |
| 110 | + |
| 111 | +- ``label_sides_as_branches`` determines whether the sides of an |
| 112 | + elongated cluster without bifurcations (l-shape) are represented as |
| 113 | + distinct subgroups. By default, a cluster needs to have at least one |
| 114 | + bifurcation (Y-shape) before the detected branches are represented in |
| 115 | + the final labelling. |
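The ``"full"`` connectivity rule described above can be sketched in plain Python, independent of the hdbscan implementation. This is an illustration under stated assumptions, not the library's code: the core distance is taken as the distance to the k-th nearest other point, the MST is built with a simple Prim's algorithm, and the threshold is inclusive so that the MST's own edges are kept. All function names are hypothetical.

```python
import math
from itertools import combinations


def mutual_reachability(points, k):
    """Dense matrix of max(core(a), core(b), d(a, b)) for all pairs."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Assumed core distance: distance to the k-th nearest other point
    # (index 0 of the sorted row is the point itself).
    core = [sorted(row)[k] for row in dist]
    weights = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        weights[i][j] = weights[j][i] = max(core[i], core[j], dist[i][j])
    return weights


def mst_edges(weights):
    """Prim's algorithm on a dense weight matrix; returns (i, j, w) edges."""
    n = len(weights)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        w, i, j = min(
            (weights[i][j], i, j)
            for i in in_tree
            for j in range(n)
            if j not in in_tree
        )
        edges.append((i, j, w))
        in_tree.add(j)
    return edges


def full_graph(weights, mst):
    """All pairs whose mutual reachability is within the largest MST edge."""
    threshold = max(w for _, _, w in mst)
    n = len(weights)
    return [(i, j) for i, j in combinations(range(n), 2) if weights[i][j] <= threshold]


# A small L-shaped point set standing in for one cluster.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (3.0, 1.0), (3.0, 2.0)]
weights = mutual_reachability(points, k=2)
mst = mst_edges(weights)
edges = full_graph(weights, mst)
```

Because every MST edge is at most as long as the threshold, the resulting graph always contains the MST and is therefore connected. The extra edges are what make this graph denser than a tree.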
| 116 | + |
| 117 | + |
| 118 | +Useful attributes |
| 119 | +----------------- |
| 120 | + |
| 121 | +Like the HDBSCAN class, the BranchDetector class contains several useful |
| 122 | +attributes for exploring datasets. |
| 123 | + |
| 124 | +Branch hierarchy |
| 125 | +~~~~~~~~~~~~~~~~ |
| 126 | + |
| 127 | +Branch hierarchies reflect the tree-shape of clusters. Like the cluster |
| 128 | +hierarchy, branch hierarchies can be used to interpret which branches |
| 129 | +exist. In addition, they reflect how far from the cluster’s center the |
| 130 | +branches merge. |
| 131 | + |
| 132 | +.. code:: python |
| 133 | +
|
| 134 | + idx = np.argmax([len(x) for x in branch_detector.branch_persistences_]) |
| 135 | + branch_detector.cluster_condensed_trees_[idx].plot( |
| 136 | + select_clusters=True, selection_palette=["C3", "C4", "C5"] |
| 137 | + ) |
| 138 | + plt.ylabel("Eccentricity") |
| 139 | + plt.title(f"Branches in cluster {idx}") |
| 140 | + plt.show() |
| 141 | +
|
| 142 | +.. image:: images/how_to_detect_branches_13_0.png |
| 143 | + |
| 144 | +The length of the branches also says something about the compactness or |
| 145 | +elongation of clusters. For example, the branch hierarchy for the |
| 146 | +orange ~-shaped cluster is quite different from the hierarchy for |
| 147 | +the central o-shaped cluster. |
| 148 | + |
| 149 | +.. code:: python |
| 150 | +
|
| 151 | + plt.figure(figsize=(6, 3)) |
| 152 | + plt.subplot(1, 2, 1) |
| 153 | + idx = np.argmin([min(x) for x in branch_detector.branch_persistences_]) |
| 154 | + branch_detector.cluster_condensed_trees_[idx].plot(colorbar=False) |
| 155 | + plt.ylim([0.3, 0]) |
| 156 | + plt.ylabel("Eccentricity") |
| 157 | + plt.title(f"Cluster {idx} (spherical)") |
| 158 | + |
| 159 | + plt.subplot(1, 2, 2) |
| 160 | + idx = np.argmax([max(x) for x in branch_detector.branch_persistences_]) |
| 161 | + branch_detector.cluster_condensed_trees_[idx].plot(colorbar=False) |
| 162 | + plt.ylim([0.3, 0]) |
| 163 | + plt.ylabel("Eccentricity") |
| 164 | + plt.title(f"Cluster {idx} (elongated)") |
| 165 | + plt.show() |
| 166 | +
|
| 167 | +.. image:: images/how_to_detect_branches_15_0.png |
| 168 | + |
| 169 | +Cluster approximation graphs |
| 170 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 171 | + |
| 172 | +Branches are detected using a graph that approximates the connectivity |
| 173 | +within a cluster. These graphs are available in the |
| 174 | +``cluster_approximation_graph_`` property and can be used to visualise |
| 175 | +data and the branch-detection process. The plotting function is based on |
| 176 | +the networkx API and uses networkx functionality to compute a layout if |
| 177 | +positions are not provided. Using UMAP to compute positions can be |
| 178 | +faster and more expressive. Several helper functions for exporting to |
| 179 | +numpy, pandas, and networkx are available. |
| 180 | + |
| 181 | +For example, a figure with points coloured by the final labelling: |
| 182 | + |
| 183 | +.. code:: python |
| 184 | +
|
| 185 | + g = branch_detector.cluster_approximation_graph_ |
| 186 | + g.plot(positions=data, node_size=5, edge_width=0.2, edge_alpha=0.2) |
| 187 | + plt.show() |
| 188 | +
|
| 189 | +.. image:: images/how_to_detect_branches_17_0.png |
| 190 | + |
| 191 | +Or, a figure with the edges coloured by centrality: |
| 192 | + |
| 193 | +.. code:: python |
| 194 | +
|
| 195 | + g.plot( |
| 196 | + positions=data, |
| 197 | + node_alpha=0, |
| 198 | + edge_color="centrality", |
| 199 | + edge_cmap="turbo", |
| 200 | + edge_width=0.2, |
| 201 | + edge_alpha=0.2, |
| 202 | + edge_vmax=100, |
| 203 | + ) |
| 204 | + plt.show() |
| 205 | +
|
| 206 | +.. image:: images/how_to_detect_branches_19_0.png |
| 207 | + |
| 208 | + |
| 209 | +Approximate predict |
| 210 | +------------------- |
| 211 | + |
| 212 | +A branch-aware ``approximate_predict_branch`` function is available to |
| 213 | +predict branch labels for new points. This function uses a fitted |
| 214 | +BranchDetector object to first predict cluster labels and then the |
| 215 | +branch labels. |
| 216 | + |
| 217 | +.. code:: python |
| 218 | +
|
| 219 | + from hdbscan import approximate_predict_branch |
| 220 | + |
| 221 | + new_points = np.asarray([[0.4, 0.25], [0.23, 0.2], [-0.14, -0.2]]) |
| 222 | + clusterer.generate_prediction_data() |
| 223 | + labels, probs, cluster_labels, cluster_probs, branch_labels, branch_probs = ( |
| 224 | + approximate_predict_branch(branch_detector, new_points) |
| 225 | + ) |
| 226 | + |
| 227 | + plt.scatter( |
| 228 | + new_points.T[0], |
| 229 | + new_points.T[1], |
| 230 | + 140, |
| 231 | + labels % 10, |
| 232 | + marker="p", |
| 233 | + zorder=5, |
| 234 | + cmap="tab10", |
| 235 | + vmin=0, |
| 236 | + vmax=9, |
| 237 | + edgecolor="k", |
| 238 | + ) |
| 239 | + plot(branch_detector.labels_) |
| 240 | + plt.show() |
| 241 | +
|
| 242 | +.. image:: images/how_to_detect_branches_21_0.png |