Add Additional Cluster Stats and leaf cluster selection strategy #45

Open

inhesrom wants to merge 2 commits into tom-whitehead:master from inhesrom:master

Conversation

inhesrom commented Mar 27, 2026

This pull request introduces a major enhancement to the Rust HDBSCAN implementation by adding detailed clustering diagnostics and soft cluster membership vectors, as well as support for alternative cluster selection methods. The changes provide a new HdbscanResult struct, new APIs for detailed clustering, and the ability to select clusters using either the "Excess of Mass" (EOM, default) or "Leaf" methods. Additionally, the internal codebase is refactored for clarity and extensibility.

Full disclosure: these commits are heavily agentic (Claude Opus 4.6, thinking on high mode), guided carefully by me.

The main value proposition of this PR is that it exposes the internals of each cluster, which can be useful for a variety of reasons. For example, I use it to prune points in a cluster that have low probability.

Major new features and diagnostics:

  • Introduced the HdbscanResult struct in src/cluster_result.rs, which exposes cluster labels, membership probabilities, the condensed tree, outlier scores (GLOSH), and soft cluster membership vectors via the all_points_membership_vectors method, closely matching Python's HDBSCAN diagnostics.
  • Added cluster_detailed and cluster_detailed_par methods to the Hdbscan struct, providing detailed clustering results and diagnostics in serial and parallel modes (see the sketch after this list).
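
To make the new surface concrete, here is a minimal consumer-side sketch. The cluster_detailed call and the result field names (labels, probabilities) are taken from the description above, so treat the exact signatures as assumptions to be checked against the diff:

    use hdbscan::Hdbscan;

    // Sketch only: `cluster_detailed` and the field names below are inferred
    // from the PR description, not copied from the diff.
    fn inspect(data: Vec<Vec<f64>>) {
        let clusterer = Hdbscan::default_hyper_params(&data);
        if let Ok(result) = clusterer.cluster_detailed() {
            for (i, (&label, &prob)) in
                result.labels.iter().zip(&result.probabilities).enumerate()
            {
                println!("point {i}: cluster {label}, probability {prob:.3}");
            }
            // Soft membership: one vector per point, one entry per cluster.
            let _memberships = result.all_points_membership_vectors();
        }
    }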

Cluster selection improvements:

  • Added the ClusterSelectionMethod enum to src/hyper_parameters.rs and extended HdbscanHyperParams to allow choosing between "EOM" and "Leaf" cluster selection strategies, with "EOM" as the default.
  • Refactored the cluster selection logic in src/hdbscan.rs to support both EOM and Leaf methods, and ensured epsilon filtering is applied consistently (a configuration sketch follows this list).
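
For example, opting into Leaf selection would presumably look like the following. ClusterSelectionMethod::Leaf is confirmed by the commit message below, but the builder method name cluster_selection_method is a guess to verify against the diff:

    use hdbscan::{ClusterSelectionMethod, Hdbscan, HdbscanHyperParams};

    // Sketch only: `cluster_selection_method` is an assumed builder method.
    fn leaf_clustering(data: &[Vec<f64>]) {
        let params = HdbscanHyperParams::builder()
            .min_cluster_size(10)
            .cluster_selection_method(ClusterSelectionMethod::Leaf) // EOM is the default
            .build();
        let clusterer = Hdbscan::new(data, params);
        let labels = clusterer.cluster().expect("clustering failed");
        println!("{labels:?}");
    }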

Internal refactoring and API improvements:

  • Refactored the clustering pipeline into reusable functions (run_pipeline, build_detailed_result) and added new methods for computing probabilities, outlier scores, and cluster death lambdas, improving code clarity and maintainability (the outlier-score formula is sketched after this list).
  • Improved the documentation and structure of CondensedNode in src/data_wrappers.rs to clarify its correspondence with the Python HDBSCAN condensed tree DataFrame.
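
For orientation, GLOSH scores in the standard formulation compare the density level at which a point leaves the tree against the level at which its cluster dies. A minimal sketch of that formula, with illustrative names rather than the PR's internals:

    // lambda_p     = density level at which point p falls out of its cluster
    // lambda_death = level at which that cluster itself disappears
    // score        = 1 - lambda_p / lambda_death, clamped to [0, 1]
    fn glosh_score(lambda_p: f64, lambda_death: f64) -> f64 {
        if !lambda_death.is_finite() {
            // Duplicate points yield infinite lambdas; treat as maximally inlying.
            return 0.0;
        }
        (1.0 - lambda_p / lambda_death).clamp(0.0, 1.0)
    }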

Versioning:

  • Bumped the crate version in Cargo.toml from 0.12.0 to 0.13.0 to reflect these significant new features and API changes.

inhesrom and others added 2 commits March 20, 2026 14:27
Add Leaf cluster selection method (ClusterSelectionMethod::Leaf) alongside
the existing EOM method. Includes 11 new tests covering allow_single_cluster
interaction, detailed mode (probabilities/outlier scores/membership vectors),
non-Euclidean metrics, precalculated distances, epsilon filtering, and
default-is-EOM verification. Bump version to 0.13.0.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

When points share identical coordinates, core distances become 0 and
lambda = 1/0 = infinity. Dividing infinity by infinity produces NaN in
compute_probabilities, compute_outlier_scores, and membership vectors.

Guard against infinite lambdas: duplicate points get probability 1.0
(maximally connected) and outlier score 0.0 (not outliers).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
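
The fix described in the commit message above reduces to a small guard before the division. A minimal sketch with hypothetical names (the PR's actual helpers may differ):

    // Lambda values come from 1 / core distance, so identical points produce
    // infinity, and inf / inf would be NaN without this guard.
    fn safe_probability(point_lambda: f64, cluster_max_lambda: f64) -> f64 {
        if cluster_max_lambda.is_infinite() {
            1.0 // duplicates are maximally connected, never NaN
        } else if cluster_max_lambda > 0.0 {
            point_lambda.min(cluster_max_lambda) / cluster_max_lambda
        } else {
            0.0
        }
    }
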
tom-whitehead (Owner) commented

Hey @inhesrom, sorry, I completely missed this in a mountain of work and GitHub notifications.

Thanks for raising this, there's a lot to go through here but I will try to go through it soon. It might be that I break it down into smaller changes, as some parts will be easier to merge quickly (leaf cluster selection) than others that make wholesale changes to the API and require me to think a little more (i.e. the detailed clustering methods).

What is the motivation for the cluster_detailed method and parallel equivalent? Is it purely to be able to expose the necessary data to do the soft clustering?
I guess you also expose the probabilities this way, which is something I had been meaning to add when I did a major version bump, as I wanted to return a new clustering result object from the cluster and cluster_par methods with labels, probabilities, centroids, etc. on.

inhesrom (Author) commented Apr 6, 2026

> Hey @inhesrom, sorry, I completely missed this in a mountain of work and GitHub notifications.
>
> Thanks for raising this, there's a lot to go through here but I will try to go through it soon. It might be that I break it down into smaller changes, as some parts will be easier to merge quickly (leaf cluster selection) than others that make wholesale changes to the API and require me to think a little more (i.e. the detailed clustering methods).
>
> What is the motivation for the cluster_detailed method and parallel equivalent? Is it purely to be able to expose the necessary data to do the soft clustering? I guess you also expose the probabilities this way, which is something I had been meaning to add when I did a major version bump, as I wanted to return a new clustering result object from the cluster and cluster_par methods with labels, probabilities, centroids, etc. on.

Thanks for taking a look! The main motivation was getting the probabilities and other metrics out, similar to how sklearn does it (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.HDBSCAN.html), to allow manual selection and manipulation/pruning of cluster points. For example, you might want to prune off all points in a cluster that are below 5% probability (a sketch of this follows below).
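
As a concrete illustration of that pruning use case (a hypothetical helper, not part of the PR):

    /// Demote points below a probability threshold (e.g. 0.05) to noise (-1).
    fn prune_low_probability(labels: &[i32], probabilities: &[f64], min_prob: f64) -> Vec<i32> {
        labels
            .iter()
            .zip(probabilities)
            .map(|(&label, &p)| if label != -1 && p < min_prob { -1 } else { label })
            .collect()
    }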

tom-whitehead (Owner) left a comment

Hey @inhesrom, so far I've only really had time to review the probabilities calculation, but I will keep dropping in more comments slowly as I work my way through this. Sorry, this will take a bit of time. Feel free to ignore them all until I'm done or to address them as they come.

Overall, I like a lot of the changes and am happy to have them in the crate (probabilities – I had tried to add these myself once – and outlier scores). However, I'm still unsure about whether I want the soft clustering in the crate. It looks like it was never incorporated into the scikit-learn HDBSCAN, and in the original hdbscan package it was always buggy and is no longer maintained.

On a separate note, if I ever have time/motivation, I will completely change the API when I do a major version bump and will make wholesale changes to what you have here. I want to remove HdbscanHyperParams and just have a builder pattern on the main Hdbscan object which will configure a) hyper parameters and b) other non-core calculations such as probabilities. Then we will only need the cluster and cluster_par methods and these will return HdbscanResult with optional fields such as probabilities, present only if they were requested at construction.
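
Purely to illustrate that direction, a sketch of what the builder could look like; every method name below is hypothetical and nothing here exists in the crate today:

    use hdbscan::Hdbscan;

    fn future_api_sketch(data: &[Vec<f64>]) {
        let result = Hdbscan::builder(data)
            .min_cluster_size(10)     // a) hyper parameters
            .with_probabilities(true) // b) opt-in non-core calculations
            .cluster()                // single entry point (or cluster_par)
            .expect("clustering failed");
        // Optional fields are present only when requested at construction.
        assert!(result.probabilities.is_some());
    }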

        continue;
    }
    let (parent_cluster, point_lambda) = point_info[p];
    let cluster_max = max_lambda.get(&parent_cluster).copied().unwrap_or(T::one());

I don't think this probability calculation is quite right: it deviates from sklearn's.

Essentially you're normalising the point's lambda by its immediate parent cluster's lambda. This works when the immediate parent cluster is the winning cluster; however, sometimes the parent of the parent cluster (and so on) is the one selected (i.e. some child clusters were merged into a larger cluster). In this latter case the probabilities are not correct, as we need to normalise by the ultimate winning cluster.

Fortunately this mapping essentially already exists in the winning_clusters vector, where the index is the cluster label id.

Example code: use this if you want; I actually coded this up when I was trying to figure out why it deviated from sklearn. Inspiration comes from sklearn's [cython routine](https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/cluster/_hdbscan/_tree.pyx#L515).
    fn compute_probabilities(
        &self,
        labels: &[i32],
        winning_clusters: &[usize],
        condensed_tree: &[CondensedNode<T>],
    ) -> Vec<T> {
        // Max lambda per cluster: max lambda_birth among all direct children
        let mut max_lambda: HashMap<usize, T> = HashMap::new();
        for node in condensed_tree {
            let entry = max_lambda.entry(node.parent_node_id).or_insert(T::zero());
            if node.lambda_birth > *entry {
                *entry = node.lambda_birth;
            }
        }

        let mut probabilities = vec![T::zero(); self.n_samples];
        for node in condensed_tree {
            // Only individual points have node_id < n_samples; skip cluster nodes.
            let point_id = node.node_id;
            if point_id >= self.n_samples {
                continue;
            }

            // Noise points keep probability zero.
            let label = labels[point_id];
            if label == -1 {
                continue;
            }
            let point_lambda = node.lambda_birth;
            // Normalise by the ultimate winning cluster (not the immediate
            // parent), so points whose child cluster was merged upwards are
            // handled correctly.
            let winning_cluster_id = winning_clusters[label as usize];
            let cluster_max = max_lambda
                .get(&winning_cluster_id)
                .copied()
                .unwrap_or(T::one());

            if cluster_max.is_infinite() {
                // All points merged at distance 0 (duplicates) — maximally connected.
                probabilities[point_id] = T::one();
            } else if cluster_max > T::zero() {
                let capped = if point_lambda < cluster_max { point_lambda } else { cluster_max };
                probabilities[point_id] = capped / cluster_max;
            }
        }
        probabilities
    }
