Add Additional Cluster Stats and leaf cluster selection strategy#45
inhesrom wants to merge 2 commits into tom-whitehead:master
Add Leaf cluster selection method (ClusterSelectionMethod::Leaf) alongside the existing EOM method. Includes 11 new tests covering allow_single_cluster interaction, detailed mode (probabilities/outlier scores/membership vectors), non-Euclidean metrics, precalculated distances, epsilon filtering, and default-is-EOM verification. Bump version to 0.13.0. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
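For context, the Leaf strategy selects the leaves of the condensed cluster tree (clusters that never split into further clusters) rather than EOM's stability-maximizing cut, which tends to yield more fine-grained clusters. A minimal sketch of the idea, not the PR's actual code (the function name and the edge representation are illustrative):

```rust
use std::collections::HashSet;

/// Toy sketch of leaf cluster selection: a leaf cluster is a condensed-tree
/// cluster that never appears as the parent of another cluster.
/// `cluster_edges` holds (parent_cluster, child_cluster) pairs.
fn select_leaf_clusters(cluster_edges: &[(usize, usize)]) -> Vec<usize> {
    let parents: HashSet<usize> = cluster_edges.iter().map(|&(p, _)| p).collect();
    let mut leaves: Vec<usize> = cluster_edges
        .iter()
        .map(|&(_, c)| c)
        .filter(|c| !parents.contains(c))
        .collect();
    leaves.sort_unstable();
    leaves.dedup();
    leaves
}

fn main() {
    // Root 0 splits into 1 and 2; cluster 1 splits into 3 and 4.
    // The leaves (and thus the selected clusters) are 2, 3 and 4.
    let edges = [(0, 1), (0, 2), (1, 3), (1, 4)];
    println!("{:?}", select_leaf_clusters(&edges)); // [2, 3, 4]
}
```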
When points share identical coordinates, core distances become 0 and lambda = 1/0 = infinity. Dividing infinity by infinity produces NaN in compute_probabilities, compute_outlier_scores, and membership vectors. Guard against infinite lambdas: duplicate points get probability 1.0 (maximally connected) and outlier score 0.0 (not outliers). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
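A minimal sketch of the failure mode and the guard described in this commit message, using illustrative names rather than the PR's actual functions:

```rust
/// Sketch of the infinite-lambda guard. With duplicate points the core
/// distance is 0, so lambda = 1/0 = +inf, and inf / inf would produce NaN
/// without the explicit check.
fn probability(point_lambda: f64, cluster_max_lambda: f64) -> f64 {
    if cluster_max_lambda.is_infinite() {
        // Duplicate points merged at distance 0: maximally connected.
        1.0
    } else if cluster_max_lambda > 0.0 {
        point_lambda.min(cluster_max_lambda) / cluster_max_lambda
    } else {
        0.0
    }
}

fn main() {
    let inf = f64::INFINITY;
    assert!((inf / inf).is_nan()); // the failure mode being guarded against
    assert_eq!(probability(inf, inf), 1.0); // duplicates get probability 1.0
    assert_eq!(probability(0.5, 1.0), 0.5); // normal case is unaffected
    println!("ok");
}
```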
Hey @inhesrom, sorry, I completely missed this in a mountain of work and GitHub notifications. Thanks for raising this. There's a lot to go through here, but I will try to go through it soon. It might be that I break it down into smaller changes, as some parts will be easier to merge quickly (leaf cluster selection) than others that make wholesale changes to the API and require me to think a little more. What is the motivation for the
Thanks for taking a look! The main motivation was getting the probabilities and other metrics out, similar to how sklearn does it (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.HDBSCAN.html), to allow manual selection and manipulation/pruning of cluster points. For example, if you want to prune off all points in a cluster that are below 5% probability.
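Such a pruning pass might look like this sketch; the `labels`/`probabilities` field names follow the PR's `HdbscanResult`, but the helper itself is hypothetical and not part of the crate:

```rust
/// Illustrative pruning: relabel any clustered point whose membership
/// probability falls below `min_prob` as noise (-1).
fn prune_low_probability(labels: &mut [i32], probabilities: &[f64], min_prob: f64) {
    for (label, &p) in labels.iter_mut().zip(probabilities) {
        if *label != -1 && p < min_prob {
            *label = -1; // demote low-confidence point to noise
        }
    }
}

fn main() {
    // e.g. labels/probabilities taken from a detailed clustering result
    let mut labels = vec![0, 0, 1, -1];
    let probabilities = vec![0.9, 0.03, 0.5, 0.0];
    prune_low_probability(&mut labels, &probabilities, 0.05);
    println!("{:?}", labels); // [0, -1, 1, -1]
}
```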
tom-whitehead left a comment
Hey @inhesrom, so far I've only really had time to review the probabilities calculation, but I will keep dropping in more comments slowly as I work my way through this. Sorry, this will take a bit of time. Feel free to ignore them all until I'm done or to address them as they come.
Overall, I like a lot of the changes and am happy to have them in the crate (probabilities – I had tried to add these myself once – and outlier scores). However, I'm still unsure whether I want the soft clustering in the crate. It looks like it was never incorporated into scikit-learn's HDBSCAN, and in the original hdbscan package it's no longer maintained and was always buggy.
On a separate note, if I ever have time/motivation, I will completely change the API when I do a major version bump and will make wholesale changes to what you have here. I want to remove `HdbscanHyperParams` and just have a builder pattern on the main `Hdbscan` object which will configure a) hyper parameters and b) other non-core calculations such as probabilities. Then we will only need the `cluster` and `cluster_par` methods and these will return `HdbscanResult` with optional fields such as probabilities, present only if they were requested at construction.
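A rough sketch of what that builder-based API could look like; every name here beyond `Hdbscan`/`HdbscanResult` is hypothetical, and the clustering itself is stubbed out:

```rust
/// Hypothetical builder configuring both hyper parameters and optional
/// non-core calculations in one place (a sketch, not the crate's API).
#[derive(Default)]
struct HdbscanBuilder {
    min_cluster_size: usize,
    probabilities_requested: bool,
}

#[derive(Debug)]
struct HdbscanResult {
    labels: Vec<i32>,
    probabilities: Option<Vec<f64>>, // present only if requested
}

impl HdbscanBuilder {
    fn min_cluster_size(mut self, n: usize) -> Self {
        self.min_cluster_size = n;
        self
    }
    fn with_probabilities(mut self) -> Self {
        self.probabilities_requested = true;
        self
    }
    fn cluster(self, _data: &[Vec<f64>]) -> HdbscanResult {
        // Stub: a real implementation would run HDBSCAN here.
        HdbscanResult {
            labels: vec![],
            probabilities: self.probabilities_requested.then(|| Vec::new()),
        }
    }
}

fn main() {
    let result = HdbscanBuilder::default()
        .min_cluster_size(5)
        .with_probabilities()
        .cluster(&[]);
    assert!(result.probabilities.is_some());
    println!("ok");
}
```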
```rust
        continue;
    }
    let (parent_cluster, point_lambda) = point_info[p];
    let cluster_max = max_lambda.get(&parent_cluster).copied().unwrap_or(T::one());
```
I don't think this probability calculation is quite right: it deviates from sklearn's.
Essentially you're normalising the point's lambda by its immediate parent cluster's lambda. This works when the immediate parent cluster is the winning cluster; however, sometimes the parent of the parent cluster (and so on) is the one selected (i.e. some child clusters were merged into a larger cluster). In this latter case the probabilities are not correct, as we need to normalise by the ultimate winning cluster.
Fortunately, this mapping essentially already exists in the `winning_clusters` vector, where the index is the cluster label id.
Example code

Use this if you want, but I actually coded this up when I was trying to figure out why it deviated from sklearn. Inspiration comes from sklearn's [cython routine](https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/cluster/_hdbscan/_tree.pyx#L515).

```rust
fn compute_probabilities(
    &self,
    labels: &[i32],
    winning_clusters: &[usize],
    condensed_tree: &[CondensedNode<T>],
) -> Vec<T> {
    // Max lambda per cluster: max lambda_birth among all direct children
    let mut max_lambda: HashMap<usize, T> = HashMap::new();
    for node in condensed_tree {
        let entry = max_lambda.entry(node.parent_node_id).or_insert(T::zero());
        if node.lambda_birth > *entry {
            *entry = node.lambda_birth;
        }
    }
    let mut probabilities = vec![T::zero(); self.n_samples];
    for node in condensed_tree {
        let point_id = node.node_id;
        if point_id >= self.n_samples {
            continue;
        }
        let label = labels[point_id];
        if label == -1 {
            continue;
        }
        let point_lambda = node.lambda_birth;
        let winning_cluster_id = winning_clusters[label as usize];
        let cluster_max = max_lambda
            .get(&winning_cluster_id)
            .copied()
            .unwrap_or(T::one());
        if cluster_max.is_infinite() {
            // All points merged at distance 0 (duplicates) — maximally connected.
            probabilities[point_id] = T::one();
        } else if cluster_max > T::zero() {
            let capped = if point_lambda < cluster_max { point_lambda } else { cluster_max };
            probabilities[point_id] = capped / cluster_max;
        }
    }
    probabilities
}
```
This pull request introduces a major enhancement to the Rust HDBSCAN implementation by adding detailed clustering diagnostics and soft cluster membership vectors, as well as support for alternative cluster selection methods. The changes provide a new `HdbscanResult` struct, new APIs for detailed clustering, and the ability to select clusters using either the "Excess of Mass" (EOM, default) or "Leaf" method. Additionally, the internal codebase is refactored for clarity and extensibility.

Full disclosure: these commits are heavily agentic (Claude Opus 4.6, thinking on high mode), guided carefully by me.
The main value proposition of this PR is that it exposes internals of each cluster, which can be useful for a variety of reasons. I use it, for example, to prune off low-probability points in a cluster.
Major new features and diagnostics:

- New `HdbscanResult` struct in `src/cluster_result.rs`, which exposes cluster labels, membership probabilities, the condensed tree, outlier scores (GLOSH), and soft cluster membership vectors via the `all_points_membership_vectors` method, closely matching Python's HDBSCAN diagnostics.
- Added `cluster_detailed` and `cluster_detailed_par` methods to the `Hdbscan` struct, providing detailed clustering results and diagnostics in both serial and parallel modes. [1] [2]

Cluster selection improvements:

- Added a `ClusterSelectionMethod` enum to `src/hyper_parameters.rs` and extended `HdbscanHyperParams` to allow choosing between "EOM" and "Leaf" cluster selection strategies, with "EOM" as the default. [1] [2]
- Updated `src/hdbscan.rs` to support both EOM and Leaf methods, and ensured epsilon filtering is applied consistently. [1] [2]

Internal refactoring and API improvements:

- Extracted helper methods (`run_pipeline`, `build_detailed_result`) and added new methods for computing probabilities, outlier scores, and cluster death lambdas, improving code clarity and maintainability.
- Documented `CondensedNode` in `src/data_wrappers.rs` to clarify its correspondence with the Python HDBSCAN condensed tree DataFrame.

Versioning:

- Bumped the version in `Cargo.toml` from 0.12.0 to 0.13.0 to reflect these significant new features and API changes.
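For reference, the selection-method hyper parameter described above could take a shape like this sketch; the derive-based default is my assumption about implementation, and the PR only states that EOM remains the default:

```rust
/// Hypothetical shape of the new hyper parameter: EOM stays the default,
/// so existing callers see unchanged behaviour.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
enum ClusterSelectionMethod {
    #[default]
    Eom,
    Leaf,
}

fn main() {
    // Default-is-EOM, mirroring the PR's "default-is-EOM verification" test.
    assert_eq!(ClusterSelectionMethod::default(), ClusterSelectionMethod::Eom);
    println!("{:?}", ClusterSelectionMethod::default()); // Eom
}
```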