Hi, I have a large dataset (>100k samples) that contains a lot of duplicates.
MSPHATE does not converge during the "Calculating partitions..." step.
I can't share the dataset in question, but I believe I have reproduced the effect with randomly generated data. See the following code and output:
import numpy as np
from multiscale_phate import compress, diffuse, condense
np.random.seed(42)
# spoof data
data = np.random.uniform(size=(10001, 200))
data = np.vstack([data, data, data, data, data, data, data, data, data, data]) # highly redundant
# spoof MSPHATE compress step
N, features = data.shape
n_pca = 200
partitions = None
# Computing compression features
n_pca, partitions = compress.get_compression_features(
    N, features, n_pca, partitions, landmarks=2000
)
# subset_data was locally modified to print np.max(cluster_counts) and np.ceil(N / desired_num_clusters)
_ = compress.subset_data(
    data, desired_num_clusters=partitions, n_jobs=8, num_cluster=100, random_state=None
)
Output:
Calculating partitions...
np.max(cluster_counts): 3930
np.ceil(N / desired_num_clusters): 6.0
np.max(cluster_counts): 1120
np.ceil(N / desired_num_clusters): 6.0
np.max(cluster_counts): 70
np.ceil(N / desired_num_clusters): 6.0
np.max(cluster_counts): 10
np.ceil(N / desired_num_clusters): 6.0
np.max(cluster_counts): 10
np.ceil(N / desired_num_clusters): 6.0
np.max(cluster_counts): 10
np.ceil(N / desired_num_clusters): 6.0
np.max(cluster_counts): 10
The output is the same after many iterations.
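A possible explanation, sketched under an assumption I have not verified against the library source: if the partitioning loop keeps re-splitting until np.max(cluster_counts) drops to np.ceil(N / desired_num_clusters), it can never finish here. Each row appears 10 times, identical rows cannot be separated by a distance-based clustering, and the threshold in the log is 6. The helper names below (max_duplicate_count, deduplicate) are hypothetical and not part of multiscale_phate. The threshold is copied from the log rather than computed by the library.

```python
import numpy as np


def max_duplicate_count(data):
    """Return the size of the largest group of identical rows in `data`."""
    _, counts = np.unique(data, axis=0, return_counts=True)
    return int(counts.max())


def deduplicate(data):
    """Return the unique rows plus an index that maps each original row to its unique row."""
    unique_rows, inverse = np.unique(data, axis=0, return_inverse=True)
    # ravel() keeps the index 1-D regardless of the NumPy version in use
    return unique_rows, inverse.ravel()


rng = np.random.default_rng(42)
base = rng.uniform(size=(101, 20))  # small stand-in for the spoofed data above
data = np.vstack([base] * 10)       # every row appears 10 times

# Assumption: value taken from the log above (np.ceil(N / desired_num_clusters) == 6.0)
threshold = 6

largest_group = max_duplicate_count(data)
# Identical rows always land in the same cluster, so the largest cluster
# can never hold fewer samples than the largest group of duplicates.
print("largest duplicate group:", largest_group)
print("can reach threshold:", largest_group <= threshold)

# Possible workaround: run on the unique rows only, then map results back.
unique_rows, inverse = deduplicate(data)
print("unique rows:", unique_rows.shape[0], "of", data.shape[0])
```

If this is the cause, a result computed per unique row can be expanded back to the full dataset by indexing it with inverse. Whether deduplicating is acceptable depends on whether the duplicate counts are supposed to influence the embedding.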
Note: I am using Python 3.8 and installed the package with pip install git+https://github.com/KrishnaswamyLab/Multiscale_PHATE