subset_data does not converge #8

@MattScicluna

Hi, I have a large dataset (>100k samples) that contains a lot of duplicates.
MSPHATE does not converge during the "Calculating partitions..." step.

I can't share the dataset in question, but I think I replicated the effect with some randomly generated data. See the following code and output:

import numpy as np
from multiscale_phate import compress, diffuse, condense

np.random.seed(42)

# Synthetic data: 10,001 random rows, each repeated 10 times (highly redundant)
data = np.random.uniform(size=(10001, 200))
data = np.vstack([data] * 10)

# Mimic the MSPHATE compress step
N, features = data.shape
n_pca = 200
partitions = None

# Computing compression features
n_pca, partitions = compress.get_compression_features(
    N, features, n_pca, partitions, landmarks=2000
)

# subset_data was modified locally to print np.max(cluster_counts) and np.ceil(N / desired_num_clusters)
_ = compress.subset_data(data, desired_num_clusters=partitions, n_jobs=8, num_cluster=100, random_state=None)

output:

Calculating partitions...
np.max(cluster_counts):  3930
np.ceil(N / desired_num_clusters):  6.0
np.max(cluster_counts):  1120
np.ceil(N / desired_num_clusters):  6.0
np.max(cluster_counts):  70
np.ceil(N / desired_num_clusters):  6.0
np.max(cluster_counts):  10
np.ceil(N / desired_num_clusters):  6.0
np.max(cluster_counts):  10
np.ceil(N / desired_num_clusters):  6.0
np.max(cluster_counts):  10
np.ceil(N / desired_num_clusters):  6.0
np.max(cluster_counts):  10

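The output stays the same over many further iterations: np.max(cluster_counts) is stuck at 10, above the 6.0 printed for np.ceil(N / desired_num_clusters). Each row appears exactly 10 times in this data, so presumably identical rows always land in the same cluster and the largest cluster can never shrink below 10.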
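A possible workaround, sketched here under assumptions and not tested against multiscale_phate itself, is to collapse exact duplicates before partitioning and map the result back afterwards. The helper name deduplicate_rows is hypothetical, the data is a smaller stand-in for the example above, and the per-row labels are placeholders for whatever the partitioning step would return. It uses only NumPy:

```python
import numpy as np


def deduplicate_rows(data):
    """Return the unique rows of `data` and, for each original row,
    the index of its unique row (hypothetical helper, not part of multiscale_phate)."""
    unique_rows, inverse = np.unique(data, axis=0, return_inverse=True)
    # reshape(-1) keeps the inverse 1-D regardless of NumPy version
    return unique_rows, inverse.reshape(-1)


# Smaller stand-in for the data in the issue: every row repeated 10 times
rng = np.random.default_rng(42)
base = rng.uniform(size=(1001, 20))
data = np.vstack([base] * 10)

unique_rows, inverse = deduplicate_rows(data)

# Size of the largest group of identical rows. If identical rows always share
# a cluster, no partitioning can make the largest cluster smaller than this.
max_multiplicity = np.bincount(inverse).max()

# Placeholder labels for the unique rows; in practice these would come from
# partitioning `unique_rows` instead of `data`.
labels_unique = np.arange(len(unique_rows))

# Broadcast the labels back to every original (duplicated) row
labels_full = labels_unique[inverse]

print("unique rows:", unique_rows.shape[0])
print("largest duplicate group:", max_multiplicity)
```

If the largest duplicate group exceeds the np.ceil(N / desired_num_clusters) value printed above, that would be consistent with the stall. Whether partitioning only the unique rows is acceptable depends on how the downstream steps use cluster sizes, which this sketch does not address.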

Note: I am using Python 3.8 and installed the package with pip install git+https://github.com/KrishnaswamyLab/Multiscale_PHATE
