Description
I'm currently using HDBSCAN for clustering with a fixed min_samples parameter, and I need to test several values of min_cluster_size. To speed up the process, I want to compute the single linkage tree only once (since min_samples is fixed), and then reuse this tree to efficiently obtain clustering results for the various min_cluster_size values.
However, when I reuse the single_linkage_tree_ (converted via .to_numpy()) with the _tree_to_labels() function, the resulting labels are not always exactly the same as those obtained by running HDBSCAN().fit() directly with the same parameters. It seems that minor numerical differences, or differences in tie-breaking between the internal Cython implementation and the Python interface, cause this discrepancy.
Is there an officially recommended (or at least precise) way to reuse the single linkage tree so that I get exactly the same clustering labels as direct .fit() calls would produce across multiple min_cluster_size values?
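For completeness, one workaround I'm aware of is joblib-based caching via HDBSCAN's memory parameter which, if I understand it correctly, caches the expensive spanning-tree computation so that refitting with a different min_cluster_size only redoes the cheap labelling step. A minimal sketch of that route (the cache directory path and the min_cluster_size grid are arbitrary):

from joblib import Memory
import numpy as np
import hdbscan

X = np.random.rand(500, 2)

# Hypothetical cache location; any writable path should work.
cache = Memory(location='./hdbscan_cache', verbose=0)

labels_by_mcs = {}
for mcs in (5, 10, 25, 50):
    clusterer = hdbscan.HDBSCAN(
        min_cluster_size=mcs,
        min_samples=10,
        metric='euclidean',
        memory=cache,            # reuse the cached tree computation across fits
        match_reference_implementation=True,
        approx_min_span_tree=False,
        gen_min_span_tree=True,
    ).fit(X)
    labels_by_mcs[mcs] = clusterer.labels_

That avoids the private _tree_to_labels call entirely, but it still goes through .fit() each time, so I'd prefer a direct tree-reuse route if one is officially supported.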
Thank you very much in advance!
My code:
import numpy as np
import hdbscan
from hdbscan.hdbscan_ import label, _tree_to_labels
from sklearn.metrics import adjusted_rand_score

X = np.random.rand(500, 2)
min_samples = 10
min_cluster_size = 5

# Reference clustering: run the full HDBSCAN pipeline once.
base_clusterer = hdbscan.HDBSCAN(
    min_cluster_size=min_cluster_size,
    min_samples=min_samples,
    metric='euclidean',
    match_reference_implementation=True,
    approx_min_span_tree=False,
    gen_min_span_tree=True
).fit(X)
labels_base = base_clusterer.labels_

# Rebuild the single linkage tree from the stored minimum spanning tree;
# edges are sorted by weight (column 2) before label() is applied.
mst = base_clusterer.minimum_spanning_tree_.to_numpy()
mst_sorted = mst[np.argsort(mst[:, 2]), :]
single_linkage_tree = label(mst_sorted)

# Extract flat labels from the reused tree with the same parameters.
labels_reused, probs_reused, *_ = _tree_to_labels(
    X,
    single_linkage_tree,
    min_cluster_size=min_cluster_size,
    cluster_selection_method=base_clusterer.cluster_selection_method,
    allow_single_cluster=base_clusterer.allow_single_cluster,
    match_reference_implementation=base_clusterer.match_reference_implementation,
    cluster_selection_epsilon=base_clusterer.cluster_selection_epsilon
)

# Compare the two labelings.
identical = np.array_equal(labels_base, labels_reused)
diff_count = np.sum(labels_base != labels_reused)
ari_score = adjusted_rand_score(labels_base, labels_reused)
print(identical, diff_count, ari_score)
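For reference, this is roughly how I intend to sweep min_cluster_size with the reused tree, checking each result against a fresh fit. It builds on the snippet above; the value grid is arbitrary:

# Sweep several min_cluster_size values, reusing the single linkage tree
# computed above, and compare each result against a fresh HDBSCAN fit.
for mcs in (5, 10, 25, 50):
    labels_reused, *_ = _tree_to_labels(
        X,
        single_linkage_tree,
        min_cluster_size=mcs,
        cluster_selection_method=base_clusterer.cluster_selection_method,
        allow_single_cluster=base_clusterer.allow_single_cluster,
        match_reference_implementation=base_clusterer.match_reference_implementation,
        cluster_selection_epsilon=base_clusterer.cluster_selection_epsilon,
    )
    labels_direct = hdbscan.HDBSCAN(
        min_cluster_size=mcs,
        min_samples=min_samples,
        metric='euclidean',
        match_reference_implementation=True,
        approx_min_span_tree=False,
        gen_min_span_tree=True,
    ).fit(X).labels_
    print(mcs,
          np.array_equal(labels_direct, labels_reused),
          adjusted_rand_score(labels_direct, labels_reused))

In my runs the ARI is high but the labels are not always bit-identical, which is the discrepancy described above.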