Add Additional Cluster Stats and leaf cluster selection strategy #45

Open

inhesrom wants to merge 2 commits into tom-whitehead:master from inhesrom:master

Conversation

inhesrom commented Mar 27, 2026

This pull request introduces a major enhancement to the Rust HDBSCAN implementation by adding detailed clustering diagnostics and soft cluster membership vectors, as well as support for alternative cluster selection methods. The changes provide a new HdbscanResult struct, new APIs for detailed clustering, and the ability to select clusters using either the "Excess of Mass" (EOM, default) or "Leaf" methods. Additionally, the internal codebase is refactored for clarity and extensibility.

Full disclosure: these commits are heavily agentic (Claude Opus 4.6, thinking on high mode), guided carefully by me.

The main value proposition of this PR is that it exposes the internals of each cluster, which can be useful for a variety of reasons. For example, I use it to prune points in a cluster that have low probability.

Major new features and diagnostics:

  • Introduced the HdbscanResult struct in src/cluster_result.rs, which exposes cluster labels, membership probabilities, the condensed tree, outlier scores (GLOSH), and soft cluster membership vectors via the all_points_membership_vectors method, closely matching Python's HDBSCAN diagnostics.
  • Added cluster_detailed and cluster_detailed_par methods to the Hdbscan struct, providing detailed clustering results and diagnostics in serial and parallel modes (see the sketch after this list).
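
To make the new surface concrete, here is a minimal consumer-side sketch. The cluster_detailed call and the result field names (labels, probabilities) are taken from the description above, so treat the exact signatures as assumptions to be checked against the diff:

    use hdbscan::Hdbscan;

    // Sketch only: `cluster_detailed` and the field names below are inferred
    // from the PR description, not copied from the diff.
    fn inspect(data: Vec<Vec<f64>>) {
        let clusterer = Hdbscan::default_hyper_params(&data);
        if let Ok(result) = clusterer.cluster_detailed() {
            for (i, (&label, &prob)) in
                result.labels.iter().zip(&result.probabilities).enumerate()
            {
                println!("point {i}: cluster {label}, probability {prob:.3}");
            }
            // Soft membership: one vector per point, one entry per cluster.
            let _memberships = result.all_points_membership_vectors();
        }
    }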

Cluster selection improvements:

  • Added the ClusterSelectionMethod enum to src/hyper_parameters.rs and extended HdbscanHyperParams to allow choosing between "EOM" and "Leaf" cluster selection strategies, with "EOM" as the default.
  • Refactored the cluster selection logic in src/hdbscan.rs to support both EOM and Leaf methods, and ensured epsilon filtering is applied consistently (a configuration sketch follows this list).
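
For example, opting into Leaf selection would presumably look like the following. ClusterSelectionMethod::Leaf is confirmed by the commit message below, but the builder method name cluster_selection_method is a guess to verify against the diff:

    use hdbscan::{ClusterSelectionMethod, Hdbscan, HdbscanHyperParams};

    // Sketch only: `cluster_selection_method` is an assumed builder method.
    fn leaf_clustering(data: &[Vec<f64>]) {
        let params = HdbscanHyperParams::builder()
            .min_cluster_size(10)
            .cluster_selection_method(ClusterSelectionMethod::Leaf) // EOM is the default
            .build();
        let clusterer = Hdbscan::new(data, params);
        let labels = clusterer.cluster().expect("clustering failed");
        println!("{labels:?}");
    }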

Internal refactoring and API improvements:

  • Refactored the clustering pipeline into reusable functions (run_pipeline, build_detailed_result) and added new methods for computing probabilities, outlier scores, and cluster death lambdas, improving code clarity and maintainability (the outlier-score formula is sketched after this list).
  • Improved the documentation and structure of CondensedNode in src/data_wrappers.rs to clarify its correspondence with the Python HDBSCAN condensed tree DataFrame.
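
For orientation, GLOSH scores in the standard formulation compare the density level at which a point leaves the tree against the level at which its cluster dies. A minimal sketch of that formula, with illustrative names rather than the PR's internals:

    // lambda_p     = density level at which point p falls out of its cluster
    // lambda_death = level at which that cluster itself disappears
    // score        = 1 - lambda_p / lambda_death, clamped to [0, 1]
    fn glosh_score(lambda_p: f64, lambda_death: f64) -> f64 {
        if !lambda_death.is_finite() {
            // Duplicate points yield infinite lambdas; treat as maximally inlying.
            return 0.0;
        }
        (1.0 - lambda_p / lambda_death).clamp(0.0, 1.0)
    }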

Versioning:

  • Bumped the crate version in Cargo.toml from 0.12.0 to 0.13.0 to reflect these significant new features and API changes.

inhesrom and others added 2 commits March 20, 2026 14:27
Add Leaf cluster selection method (ClusterSelectionMethod::Leaf) alongside
the existing EOM method. Includes 11 new tests covering allow_single_cluster
interaction, detailed mode (probabilities/outlier scores/membership vectors),
non-Euclidean metrics, precalculated distances, epsilon filtering, and
default-is-EOM verification. Bump version to 0.13.0.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

When points share identical coordinates, core distances become 0 and
lambda = 1/0 = infinity. Dividing infinity by infinity produces NaN in
compute_probabilities, compute_outlier_scores, and membership vectors.

Guard against infinite lambdas: duplicate points get probability 1.0
(maximally connected) and outlier score 0.0 (not outliers).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
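
The fix described in the commit message above reduces to a small guard before the division. A minimal sketch with hypothetical names (the PR's actual helpers may differ):

    // Lambda values come from 1 / core distance, so identical points produce
    // infinity, and inf / inf would be NaN without this guard.
    fn safe_probability(point_lambda: f64, cluster_max_lambda: f64) -> f64 {
        if cluster_max_lambda.is_infinite() {
            1.0 // duplicates are maximally connected, never NaN
        } else if cluster_max_lambda > 0.0 {
            point_lambda.min(cluster_max_lambda) / cluster_max_lambda
        } else {
            0.0
        }
    }
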
tom-whitehead (Owner) commented

Hey @inhesrom, sorry, I completely missed this in a mountain of work and GitHub notifications.

Thanks for raising this, there's a lot to go through here but I will try to go through it soon. It might be that I break it down into smaller changes, as some parts will be easier to merge quickly (leaf cluster selection) than others that make wholesale changes to the API and require me to think a little more (i.e. the detailed clustering methods).

What is the motivation for the cluster_detailed method and parallel equivalent? Is it purely to be able to expose the necessary data to do the soft clustering?
I guess you also expose the probabilities this way, which is something I had been meaning to add when I did a major version bump, as I wanted to return a new clustering result object from the cluster and cluster_par methods with labels, probabilities, centroids, etc. on.

inhesrom (Author) commented Apr 6, 2026

> Hey @inhesrom, sorry, I completely missed this in a mountain of work and GitHub notifications.
>
> Thanks for raising this, there's a lot to go through here but I will try to go through it soon. It might be that I break it down into smaller changes, as some parts will be easier to merge quickly (leaf cluster selection) than others that make wholesale changes to the API and require me to think a little more (i.e. the detailed clustering methods).
>
> What is the motivation for the cluster_detailed method and parallel equivalent? Is it purely to be able to expose the necessary data to do the soft clustering? I guess you also expose the probabilities this way, which is something I had been meaning to add when I did a major version bump, as I wanted to return a new clustering result object from the cluster and cluster_par methods with labels, probabilities, centroids, etc. on.

Thanks for taking a look! The main motivation was getting the probabilities and other metrics out, similar to how sklearn does it (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.HDBSCAN.html), to allow manual selection and manipulation/pruning of cluster points. For example, you might want to prune off all points in a cluster that are below 5% probability (a sketch of this follows below).
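
As a concrete illustration of that pruning use case (a hypothetical helper, not part of the PR):

    /// Demote points below a probability threshold (e.g. 0.05) to noise (-1).
    fn prune_low_probability(labels: &[i32], probabilities: &[f64], min_prob: f64) -> Vec<i32> {
        labels
            .iter()
            .zip(probabilities)
            .map(|(&label, &p)| if label != -1 && p < min_prob { -1 } else { label })
            .collect()
    }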

tom-whitehead (Owner) left a comment

Hey @inhesrom, so far I've only really had time to review the probabilities calculation, but I will keep dropping in more comments slowly as I work my way through this. Sorry, this will take a bit of time. Feel free to ignore them all until I'm done or to address them as they come.

Overall, I like a lot of the changes and am happy to have them in the crate (probabilities – I had tried to add these myself once – and outlier scores). However, I'm still unsure about whether I want the soft clustering in the crate. It looks like it was never incorporated into the scikit-learn HDBSCAN, and in the original hdbscan package it was always buggy and is no longer maintained.

On a separate note, if I ever have time/motivation, I will completely change the API when I do a major version bump and will make wholesale changes to what you have here. I want to remove HdbscanHyperParams and just have a builder pattern on the main Hdbscan object which will configure a) hyper parameters and b) other non-core calculations such as probabilities. Then we will only need the cluster and cluster_par methods and these will return HdbscanResult with optional fields such as probabilities, present only if they were requested at construction.
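
Purely to illustrate that direction, a sketch of what the builder could look like; every method name below is hypothetical and nothing here exists in the crate today:

    use hdbscan::Hdbscan;

    fn future_api_sketch(data: &[Vec<f64>]) {
        let result = Hdbscan::builder(data)
            .min_cluster_size(10)     // a) hyper parameters
            .with_probabilities(true) // b) opt-in non-core calculations
            .cluster()                // single entry point (or cluster_par)
            .expect("clustering failed");
        // Optional fields are present only when requested at construction.
        assert!(result.probabilities.is_some());
    }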

        continue;
    }
    let (parent_cluster, point_lambda) = point_info[p];
    let cluster_max = max_lambda.get(&parent_cluster).copied().unwrap_or(T::one());

I don't think this probability calculation is quite right: it deviates from sklearn's.

Essentially you're normalising the point's lambda by its immediate parent cluster's lambda. This works when the immediate parent cluster is the winning cluster; however, sometimes the parent of the parent cluster (and so on) is the one selected (i.e. some child clusters were merged into a larger cluster). In this latter case the probabilities are not correct, as we need to normalise by the ultimate winning cluster.

Fortunately this mapping essentially already exists in the winning_clusters vector, where the index is the cluster label id.

Example code: use this if you want; I actually coded this up when I was trying to figure out why it deviated from sklearn. Inspiration comes from sklearn's [cython routine](https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/cluster/_hdbscan/_tree.pyx#L515).
    fn compute_probabilities(
        &self,
        labels: &[i32],
        winning_clusters: &[usize],
        condensed_tree: &[CondensedNode<T>],
    ) -> Vec<T> {
        // Max lambda per cluster: max lambda_birth among all direct children
        let mut max_lambda: HashMap<usize, T> = HashMap::new();
        for node in condensed_tree {
            let entry = max_lambda.entry(node.parent_node_id).or_insert(T::zero());
            if node.lambda_birth > *entry {
                *entry = node.lambda_birth;
            }
        }

        let mut probabilities = vec![T::zero(); self.n_samples];
        for node in condensed_tree {
            // Only individual points have node_id < n_samples; skip cluster nodes.
            let point_id = node.node_id;
            if point_id >= self.n_samples {
                continue;
            }

            // Noise points keep probability zero.
            let label = labels[point_id];
            if label == -1 {
                continue;
            }
            let point_lambda = node.lambda_birth;
            // Normalise by the ultimate winning cluster (not the immediate
            // parent), so points whose child cluster was merged upwards are
            // handled correctly.
            let winning_cluster_id = winning_clusters[label as usize];
            let cluster_max = max_lambda
                .get(&winning_cluster_id)
                .copied()
                .unwrap_or(T::one());

            if cluster_max.is_infinite() {
                // All points merged at distance 0 (duplicates) — maximally connected.
                probabilities[point_id] = T::one();
            } else if cluster_max > T::zero() {
                let capped = if point_lambda < cluster_max { point_lambda } else { cluster_max };
                probabilities[point_id] = capped / cluster_max;
            }
        }
        probabilities
    }
