Description
I'm currently using HDBSCAN for clustering with a fixed min_samples parameter, and I need to test several values of min_cluster_size. To speed up the process, I want to compute the single linkage tree only once (since min_samples is fixed), and then reuse this tree to efficiently obtain clustering results for the various min_cluster_size values.
However, when I reuse the single_linkage_tree_ (converted via .to_numpy()) with the _tree_to_labels() function, the resulting labels are not always exactly the same as those obtained by running HDBSCAN().fit() directly with the same parameters. It seems that minor numerical differences, or differences in tie-breaking between the internal Cython implementation and the Python interface, cause this discrepancy.
Is there an officially recommended (or at least precise) way to reuse the single linkage tree so that I get exactly the same clustering labels as direct .fit() calls would produce across multiple min_cluster_size values?
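For completeness, one workaround I'm aware of is joblib-based caching via HDBSCAN's memory parameter which, if I understand it correctly, caches the expensive spanning-tree computation so that refitting with a different min_cluster_size only redoes the cheap labelling step. A minimal sketch of that route (the cache directory path and the min_cluster_size grid are arbitrary):

from joblib import Memory
import numpy as np
import hdbscan

X = np.random.rand(500, 2)

# Hypothetical cache location; any writable path should work.
cache = Memory(location='./hdbscan_cache', verbose=0)

labels_by_mcs = {}
for mcs in (5, 10, 25, 50):
    clusterer = hdbscan.HDBSCAN(
        min_cluster_size=mcs,
        min_samples=10,
        metric='euclidean',
        memory=cache,            # reuse the cached tree computation across fits
        match_reference_implementation=True,
        approx_min_span_tree=False,
        gen_min_span_tree=True,
    ).fit(X)
    labels_by_mcs[mcs] = clusterer.labels_

That avoids the private _tree_to_labels call entirely, but it still goes through .fit() each time, so I'd prefer a direct tree-reuse route if one is officially supported.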
Thank you very much in advance!
My code:
import numpy as np
import hdbscan
from hdbscan.hdbscan_ import label, _tree_to_labels
from sklearn.metrics import adjusted_rand_score

X = np.random.rand(500, 2)
min_samples = 10
min_cluster_size = 5

# Reference clustering: run the full HDBSCAN pipeline once.
base_clusterer = hdbscan.HDBSCAN(
    min_cluster_size=min_cluster_size,
    min_samples=min_samples,
    metric='euclidean',
    match_reference_implementation=True,
    approx_min_span_tree=False,
    gen_min_span_tree=True
).fit(X)
labels_base = base_clusterer.labels_

# Rebuild the single linkage tree from the stored minimum spanning tree;
# edges are sorted by weight (column 2) before label() is applied.
mst = base_clusterer.minimum_spanning_tree_.to_numpy()
mst_sorted = mst[np.argsort(mst[:, 2]), :]
single_linkage_tree = label(mst_sorted)

# Extract flat labels from the reused tree with the same parameters.
labels_reused, probs_reused, *_ = _tree_to_labels(
    X,
    single_linkage_tree,
    min_cluster_size=min_cluster_size,
    cluster_selection_method=base_clusterer.cluster_selection_method,
    allow_single_cluster=base_clusterer.allow_single_cluster,
    match_reference_implementation=base_clusterer.match_reference_implementation,
    cluster_selection_epsilon=base_clusterer.cluster_selection_epsilon
)

# Compare the two labelings.
identical = np.array_equal(labels_base, labels_reused)
diff_count = np.sum(labels_base != labels_reused)
ari_score = adjusted_rand_score(labels_base, labels_reused)
print(identical, diff_count, ari_score)
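For reference, this is roughly how I intend to sweep min_cluster_size with the reused tree, checking each result against a fresh fit. It builds on the snippet above; the value grid is arbitrary:

# Sweep several min_cluster_size values, reusing the single linkage tree
# computed above, and compare each result against a fresh HDBSCAN fit.
for mcs in (5, 10, 25, 50):
    labels_reused, *_ = _tree_to_labels(
        X,
        single_linkage_tree,
        min_cluster_size=mcs,
        cluster_selection_method=base_clusterer.cluster_selection_method,
        allow_single_cluster=base_clusterer.allow_single_cluster,
        match_reference_implementation=base_clusterer.match_reference_implementation,
        cluster_selection_epsilon=base_clusterer.cluster_selection_epsilon,
    )
    labels_direct = hdbscan.HDBSCAN(
        min_cluster_size=mcs,
        min_samples=min_samples,
        metric='euclidean',
        match_reference_implementation=True,
        approx_min_span_tree=False,
        gen_min_span_tree=True,
    ).fit(X).labels_
    print(mcs,
          np.array_equal(labels_direct, labels_reused),
          adjusted_rand_score(labels_direct, labels_reused))

In my runs the ARI is high but the labels are not always bit-identical, which is the discrepancy described above.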