Merge pull request #654 from JelmerBot/dev/flasc-fixes

lmcinnes · web-flow · commit 5559983365e3 · 2024-08-15T17:14:21.000-04:00
Fix typos and avoid the internal numpy API in the branch detection code.
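
The old `hdbscan/plots.py` code built its record arrays through `np.core.records.fromarrays`. NumPy treats `np.core` as internal, and accessing it is deprecated as of NumPy 2.0. The sketch below shows the replacement pattern on a toy input: build one tuple per record and pass a structured dtype to `np.array`. The `approximation_graphs` values are made up for illustration; only the field names and types are taken from the diff.

```python
import numpy as np

# Hypothetical stand-in for `approximation_graphs`: one list of
# (parent, child, centrality, mutual_reachability) edges per cluster.
approximation_graphs = [
    [(0, 1, 0.5, 1.2), (1, 2, 0.7, 0.9)],  # cluster 0
    [(3, 4, 0.1, 2.0)],                    # cluster 1
]

# Field names and types as listed in the diff.
edge_dtype = [
    ("parent", np.intp),
    ("child", np.intp),
    ("centrality", np.float64),
    ("mutual_reachability", np.float64),
    ("cluster", np.intp),
]

# One tuple per edge, tagged with the index of the cluster it came from.
edges = np.array(
    [
        (edge[0], edge[1], edge[2], edge[3], cluster)
        for cluster, graph in enumerate(approximation_graphs)
        for edge in graph
    ],
    dtype=edge_dtype,
)

# Fields are selected by name and keep their declared types.
parents = edges["parent"]
clusters = edges["cluster"]
```

The result is a plain structured array, so fields are read with `edges["parent"]`. The attribute-style access of a record array (`edges.parent`) is not available on it.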
diff --git a/docs/how_to_detect_branches.rst b/docs/how_to_detect_branches.rst
@@ -1,4 +1,4 @@
-How to detect banches in clusters
+How to detect branches in clusters
 =================================
 
 HDBSCAN\* is often used to find subpopulations in exploratory data
@@ -14,20 +14,21 @@ does not inform us of the branching structure:
 .. image:: images/how_to_detect_branches_3_0.png
 
 Alternatively, HDBSCAN\*’s leaf clusters provide more detail. They
-segment the points of different branches into distint clusters. However,
+segment the points of different branches into distinct clusters. However,
 the partitioning and cluster hierarchy does not (necessarily) tell us how
 those clusters combine into a larger shape.
 
 .. image:: images/how_to_detect_branches_5_0.png
 
 This is where the branch detection post-processing step comes into play.
 The functionality is described in detail by `Bot et
-al <https://arxiv.org/abs/2311.15887>`__. It operates on the detected
-clusters and extracts a branch-hierarchy analogous to HDBSCAN\*’s
-condensed cluster hierarchy. The process is very similar to HDBSCAN\*
-clustering, except that it operates on an in-cluster eccentricity rather
-than a density measure. Where peaks in a density profile correspond to
-clusters, the peaks in an eccentricity profile correspond to branches:
+al <https://arxiv.org/abs/2311.15887>`__ (please reference this paper when using
+this functionality). It operates on the detected clusters and extracts a
+branch-hierarchy analogous to HDBSCAN\*'s condensed cluster hierarchy. The
+process is very similar to HDBSCAN\* clustering, except that it operates on an
+in-cluster eccentricity rather than a density measure. Where peaks in a density
+profile correspond to clusters, the peaks in an eccentricity profile correspond
+to branches:
 
 .. image:: images/how_to_detect_branches_7_0.png
 
@@ -41,11 +42,18 @@ The resulting partitioning reflects subgroups for clusters and their
 branches:
 
 .. code:: python
+
     from hdbscan import HDBSCAN, BranchDetector
 
     clusterer = HDBSCAN(min_cluster_size=15, branch_detection_data=True).fit(data)
     branch_detector = BranchDetector(min_branch_size=15).fit(clusterer)
-    plot(branch_detector.labels_)
+
+    # Plot labels (assumes matplotlib.pyplot is imported as plt); noise is silver
+    plt.scatter(data[:, 0], data[:, 1], 1, color=[
+        "silver" if label < 0 else f"C{label % 10}" for label in branch_detector.labels_
+    ])
+    plt.axis("off")
+    plt.show()
 
 .. image:: images/how_to_detect_branches_9_0.png
 
@@ -75,7 +83,7 @@ Most guidelines for tuning HDBSCAN\* also apply for the branch detector:
    ``allow_single_cluster`` and mostly affects the EOM selection
    strategy. When enabled, clusters with bifurcations will be given a
    single label if the root segment contains most eccentricity mass
-   (i.e., branches already merge far from the center and most poinst are
+   (i.e., branches already merge far from the center and most points are
    central).
 -  ``max_branch_size`` behaves like HDBSCAN\*’s ``max_cluster_size`` and
    mostly affects the EOM selection strategy. Branches with more than
@@ -99,7 +107,7 @@ Two parameters are unique to the ``BranchDetector`` class:
       all ``min_samples``-nearest neighbours.
    -  The ``"full"`` method connects all points with a mutual
       reachability lower than the maximum distance in the cluster’s MST.
-      It represents all connectity at the moment the last point joins
+      It represents all connectivity at the moment the last point joins
       the cluster. 
     
    These methods differ in their sensitivity, noise robustness, and 
@@ -143,7 +151,7 @@ cluster.
 
 The length of the branches also says something about the compactness /
 elongatedness of clusters. For example, the branch hierarchy for the
-orange ~-shaped cluster is quite different from the same hierarcy for
+orange ~-shaped cluster is quite different from the same hierarchy for
 the central o-shaped cluster.
 
 .. code:: python
diff --git a/hdbscan/plots.py b/hdbscan/plots.py
@@ -948,36 +948,46 @@ def __init__(
         branch_probabilities,
         raw_data=None,
     ):
-        self._edges = np.core.records.fromarrays(
-            np.hstack(
-                (
-                    np.concatenate(approximation_graphs),
-                    np.repeat(
-                        np.arange(len(approximation_graphs)),
-                        [g.shape[0] for g in approximation_graphs],
-                    )[None].T,
-                )
-            ).transpose(),
-            names="parent, child, centrality, mutual_reachability, cluster",
-            formats="intp, intp, double, double, intp",
+        self._edges = np.array(
+            [
+                (edge[0], edge[1], edge[2], edge[3], cluster)
+                for cluster, edges in enumerate(approximation_graphs)
+                for edge in edges
+            ],
+            dtype=[
+                ("parent", np.intp),
+                ("child", np.intp),
+                ("centrality", np.float64),
+                ("mutual_reachability", np.float64),
+                ("cluster", np.intp),
+            ],
         )
         self.point_mask = cluster_labels >= 0
         self._raw_data = raw_data[self.point_mask, :] if raw_data is not None else None
-        self._points = np.core.records.fromarrays(
-            np.vstack(
+        self._points = np.array(
+            [
                 (
-                    np.where(self.point_mask)[0],
-                    labels[self.point_mask],
-                    probabilities[self.point_mask],
-                    cluster_labels[self.point_mask],
-                    cluster_probabilities[self.point_mask],
-                    cluster_centralities[self.point_mask],
-                    branch_labels[self.point_mask],
-                    branch_probabilities[self.point_mask],
+                    i,
+                    labels[i],
+                    probabilities[i],
+                    cluster_labels[i],
+                    cluster_probabilities[i],
+                    cluster_centralities[i],
+                    branch_labels[i],
+                    branch_probabilities[i],
                 )
-            ),
-            names="id, label, probability, cluster_label, cluster_probability, cluster_centrality, branch_label, branch_probability",
-            formats="intp, intp, double, intp, double, double, intp, double",
+                for i in np.where(self.point_mask)[0]
+            ],
+            dtype=[
+                ("id", np.intp),
+                ("label", np.intp),
+                ("probability", np.float64),
+                ("cluster_label", np.intp),
+                ("cluster_probability", np.float64),
+                ("cluster_centrality", np.float64),
+                ("branch_label", np.intp),
+                ("branch_probability", np.float64),
+            ],
         )
         self._pos = None
 
diff --git a/notebooks/How to detect branches.ipynb b/notebooks/How to detect branches.ipynb
@@ -4,7 +4,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# How to detect banches in clusters\n",
+    "# How to detect branches in clusters\n",
     "\n",
     "HDBSCAN\\* is often used to find subpopulations in exploratory data analysis\n",
     "workflows. Not only clusters themselves, but also their shape can represent\n",
@@ -107,7 +107,7 @@
    "metadata": {},
    "source": [
     "Alternatively, HDBSCAN\\*'s leaf clusters provide more detail. They segment the\n",
-    "points of different branches into distint clusters. However, the partitioning\n",
+    "points of different branches into distinct clusters. However, the partitioning\n",
     "and cluster hierarchy does not (necessarily) tell us how those clusters combine\n",
     "into a larger shape."
    ]
@@ -143,12 +143,13 @@
    "source": [
     "This is where the branch detection post-processing step comes into play. The\n",
     "functionality is described in detail by [Bot et\n",
-    "al](https://arxiv.org/abs/2311.15887). It operates on the detected clusters and\n",
-    "extracts a branch-hierarchy analogous to HDBSCAN*'s condensed cluster hierarchy.\n",
-    "The process is very similar to HDBSCAN* clustering, except that it operates on\n",
-    "an in-cluster eccentricity rather than a density measure. Where peaks in a\n",
-    "density profile correspond to clusters, the peaks in an eccentricity profile\n",
-    "correspond to branches:"
+    "al](https://arxiv.org/abs/2311.15887) (please reference this paper when using\n",
+    "this functionality). It operates on the detected clusters and extracts a\n",
+    "branch-hierarchy analogous to HDBSCAN\\*'s condensed cluster hierarchy. The\n",
+    "process is very similar to HDBSCAN\\* clustering, except that it operates on an\n",
+    "in-cluster eccentricity rather than a density measure. Where peaks in a density\n",
+    "profile correspond to clusters, the peaks in an eccentricity profile correspond\n",
+    "to branches:"
    ]
   },
   {
@@ -269,7 +270,7 @@
     "  mostly affects the EOM selection strategy. When enabled, clusters with\n",
     "  bifurcations will be given a single label if the root segment contains most\n",
     "  eccentricity mass (i.e., branches already merge far from the center and most\n",
-    "  poinst are central).\n",
+    "  points are central).\n",
     "- `max_branch_size` behaves like HDBSCAN\\*'s `max_cluster_size` and mostly\n",
     "  affects the EOM selection strategy. Branches with more than the specified\n",
     "  number of points are skipped, selecting their descendants in the hierarchy\n",
@@ -288,7 +289,7 @@
     "    minimum spanning tree under HDBSCAN\\*'s mutual reachability distance. This\n",
     "    graph contains the detected MST and all `min_samples`-nearest neighbours. \n",
     "  - The `\"full\"` method connects all points with a mutual reachability lower\n",
-    "    than the maximum distance in the cluster's MST. It represents all connectity\n",
+    "    than the maximum distance in the cluster's MST. It represents all connectivity\n",
     "    at the moment the last point joins the cluster. These methods differ in\n",
     "  their sensitivity, noise robustness, and computational cost. The `\"core\"`\n",
     "  method usually needs slightly higher `min_branch_size` values to suppress\n",
@@ -348,7 +349,7 @@
    "source": [
     "The length of the branches also says something about the compactness /\n",
     "elongatedness of clusters. For example, the branch hierarchy for the orange\n",
-    "~-shaped cluster is quite different from the same hierarcy for the central\n",
+    "~-shaped cluster is quite different from the same hierarchy for the central\n",
     "o-shaped cluster."
    ]
   },