Loading data into MetricLearningDataModule #31

@os10-wsi

Hello,

I am trying to train scimilarity on my own data and have run into an issue loading the data into MetricLearningDataModule. I have set up the paths to my data and gene_file like this:

train_path = "/os10/epithelial_reference_atlas_objects/epi_adata_train.zarr"
val_path = "/os10/epithelial_reference_atlas_objects/epi_adata_val.zarr"
gene_order = "/os10/epithelial_reference_atlas_objects/gene_order.tsv"

When I run the following in a Jupyter notebook, the only output is the two progress-bar lines shown underneath:

datamodule = MetricLearningDataModule(
    train_path=train_path,
    val_path=val_path,
    gene_order=gene_order,
    obs_field="level_3_annot",
    batch_size=1000,
    num_workers=4,
)

0it [00:00, ?it/s]
0it [00:00, ?it/s]

datamodule.int2label is an empty dictionary, and datamodule.n_genes is 36390, which matches the number of genes in the dataset I am training on (from gene_order).

I have confirmed that the .zarr folders are intact and retain the proper zarr file structure.
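
For reference, the kind of structure check meant here can be sketched with the standard library alone. This is a minimal sketch, assuming the stores are plain on-disk directory stores, where a Zarr v2 node is marked by a .zgroup or .zarray file and a Zarr v3 node by zarr.json. The helper name find_zarr_stores is hypothetical and is not part of scimilarity.

```python
import os

# Marker files that identify a Zarr node on disk:
# ".zgroup" / ".zarray" for Zarr v2, "zarr.json" for Zarr v3.
ZARR_MARKERS = {".zgroup", ".zarray", "zarr.json"}


def find_zarr_stores(root):
    """Return the sorted top-most directories under root (root included)
    that contain a Zarr marker file. Hypothetical helper for illustration."""
    stores = []
    for dirpath, dirnames, filenames in os.walk(root):
        if ZARR_MARKERS.intersection(filenames):
            stores.append(dirpath)
            # Prune the walk so nested groups/arrays are not listed separately.
            dirnames.clear()
    return sorted(stores)


if __name__ == "__main__":
    # Paths taken from the issue text above.
    for path in (
        "/os10/epithelial_reference_atlas_objects/epi_adata_train.zarr",
        "/os10/epithelial_reference_atlas_objects/epi_adata_val.zarr",
    ):
        print(path, "->", find_zarr_stores(path))
```

An empty list for a path would mean no Zarr marker files were found at or below it. A non-empty list only shows that a store exists, not that it is laid out the way the data module expects.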

Training the model returns:

model = MetricLearning(n_genes=datamodule.n_genes)

trainer.fit(model, datamodule)

You are using a CUDA device ('NVIDIA A100-SXM4-80GB') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name            | Type        | Params | Mode 
--------------------------------------------------------
0 | encoder         | Encoder     | 38.4 M | train
1 | decoder         | Decoder     | 38.5 M | train
2 | triplet_loss_fn | TripletLoss | 0      | train
3 | mse_loss_fn     | MSELoss     | 0      | train
--------------------------------------------------------
76.9 M    Trainable params
0         Non-trainable params
76.9 M    Total params
307.739   Total estimated model params size (MB)
27        Modules in train mode
0         Modules in eval mode

ValueError: num_samples should be a positive integer value, but got num_samples=0

This means that training is initiated, but there is no data to train on.
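
The empty state could be caught before calling trainer.fit with a small guard, sketched here. It relies only on the int2label attribute described earlier. The helper assert_has_labels and the SimpleNamespace stand-in are hypothetical and used purely for illustration; they are not part of scimilarity.

```python
from types import SimpleNamespace


def assert_has_labels(datamodule):
    """Return the number of labels, or raise if the data module has none.
    Hypothetical helper; only assumes the object exposes `int2label`."""
    n_labels = len(datamodule.int2label)
    if n_labels == 0:
        raise RuntimeError(
            "int2label is empty: no labels were loaded. "
            "Check train_path and obs_field before training."
        )
    return n_labels


# Stand-in object that mimics the empty data module described in this issue.
empty_dm = SimpleNamespace(int2label={})
try:
    assert_has_labels(empty_dm)
except RuntimeError as err:
    print(err)
```

With the real data module, the call would be assert_has_labels(datamodule), placed just before trainer.fit(model, datamodule).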

Can you please advise on how to load .zarr files so that MetricLearningDataModule does not end up with an empty label dictionary and no training data?

Thank you so much for your help!
