How to obtain consistent clustering results by reusing single linkage tree with multiple min_cluster_size values? #682

@yun881201

I'm currently using HDBSCAN for clustering with a fixed min_samples parameter, and I need to test multiple different values of min_cluster_size. To speed up the process, I want to compute the single linkage tree only once (since min_samples is fixed), and then reuse this tree to efficiently obtain clustering results for various min_cluster_size values.

However, when I rebuild the single linkage tree (in the code below, from minimum_spanning_tree_.to_numpy() via label()) and pass it to the _tree_to_labels() function, the resulting labels are not always identical to the labels obtained by running HDBSCAN().fit() directly with the same parameters. I suspect that minor numerical differences, or different tie-breaking between the internal Cython implementation and the Python interface, cause this discrepancy.
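To make the tie-breaking concern concrete: np.argsort uses a sort that is not guaranteed to be stable unless asked, so re-sorting MST edges that share a weight could in principle reorder them relative to the order used inside fit(). The sketch below uses a made-up edge list (not real HDBSCAN output) and only shows how kind="stable" keeps tied edges in their original order. Whether this is the actual cause of the mismatch is an assumption on my part.

```python
import numpy as np

# Hypothetical MST edge list: columns are (source, target, weight).
# Rows 0 and 1 share a weight, and so do rows 2 and 3.
edges = np.array([
    [0.0, 1.0, 0.5],
    [2.0, 3.0, 0.5],
    [1.0, 2.0, 0.2],
    [3.0, 4.0, 0.2],
])

# A stable sort keeps tied edges in their original relative order.
order = np.argsort(edges[:, 2], kind="stable")
edges_sorted = edges[order]

print(order)
print(edges_sorted)
```

If the stored minimum spanning tree is already sorted by weight, a stable re-sort leaves it unchanged, which rules out the extra argsort as a source of differences.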

Is there an officially recommended method, or a precise way, to reuse the single linkage tree to achieve exactly the same clustering labels as the direct .fit() calls across multiple min_cluster_size parameters?

Thank you very much in advance!

My code:

import numpy as np
import hdbscan
from hdbscan.hdbscan_ import label, _tree_to_labels
from sklearn.metrics import adjusted_rand_score

X = np.random.rand(500, 2)

min_samples = 10
min_cluster_size = 5

base_clusterer = hdbscan.HDBSCAN(
    min_cluster_size=min_cluster_size,
    min_samples=min_samples,
    metric='euclidean',
    match_reference_implementation=True,
    approx_min_span_tree=False,
    gen_min_span_tree=True
).fit(X)

labels_base = base_clusterer.labels_

mst = base_clusterer.minimum_spanning_tree_.to_numpy()

mst_sorted = mst[np.argsort(mst[:, 2]), :]

single_linkage_tree = label(mst_sorted)

labels_reused, probs_reused, *_ = _tree_to_labels(
    X,
    single_linkage_tree,
    min_cluster_size=min_cluster_size,
    cluster_selection_method=base_clusterer.cluster_selection_method,
    allow_single_cluster=base_clusterer.allow_single_cluster,
    match_reference_implementation=base_clusterer.match_reference_implementation,
    cluster_selection_epsilon=base_clusterer.cluster_selection_epsilon
)

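# Compare the reused-tree labels against the labels from the direct fit
identical = np.array_equal(labels_base, labels_reused)
diff_count = np.sum(labels_base != labels_reused)
ari_score = adjusted_rand_score(labels_base, labels_reused)

print(f"identical: {identical}, differing points: {diff_count}, ARI: {ari_score:.4f}")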
