Masked Omics Modeling for Multimodal Representation Learning across Histopathology and Molecular Profiles

Lucas-rbnt/MORPHEUS

We introduce the first multimodal pretraining strategy tailored for cancer biology, using a transformer with a masked modeling objective to reconstruct masked omics modalities (RNA, DNAm, CNV), supported by histopathology.

Supporting code for the corresponding paper

🌰 In a nutshell

MORPHEUS is a multimodal pre-training method based on masked omics modeling. It borrows the principles of masked modeling from NLP and computer vision, but applies them to portions of omics profiles. In concrete terms, once pre-trained, the encoder can be used on different tasks and with different combinations of input modalities. It can also reconstruct any omic profile from histopathology. Unfortunately, due to the UNIv2 licensing restrictions, we are unable to provide pre-trained weights for MORPHEUS. However, we provide step-by-step instructions to reproduce the results exactly.
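
To make the masked modeling idea concrete, here is a minimal, dependency-free sketch. It is not the MORPHEUS implementation: the token-level random masking, the `mask_ratio` value, the function names, and the toy values are all assumptions made for illustration, and the transformer and histopathology inputs are replaced by a trivial placeholder predictor.

```python
import random


def mask_omics_tokens(tokens, mask_ratio=0.5, seed=0):
    """Split omics tokens into a visible set and a masked set.

    tokens: dict mapping a modality name to a list of token values.
    Returns (visible, masked), each a list of (modality, index, value) triples.
    The masking scheme here is a hypothetical one, chosen for simplicity.
    """
    rng = random.Random(seed)
    flat = [(mod, i, v) for mod, values in tokens.items() for i, v in enumerate(values)]
    n_masked = int(round(mask_ratio * len(flat)))
    masked_ids = set(rng.sample(range(len(flat)), n_masked))
    visible = [t for j, t in enumerate(flat) if j not in masked_ids]
    masked = [t for j, t in enumerate(flat) if j in masked_ids]
    return visible, masked


def reconstruction_loss(predictions, masked):
    """Mean squared error computed on the masked positions only."""
    if not masked:
        return 0.0
    return sum((p - v) ** 2 for p, (_, _, v) in zip(predictions, masked)) / len(masked)


# Toy omics profiles (made-up numbers, four tokens per modality).
tokens = {
    "RNA": [0.1, 0.4, 0.3, 0.9],
    "DNAm": [0.5, 0.2, 0.7, 0.6],
    "CNV": [0.0, 1.0, -1.0, 0.0],
}
visible, masked = mask_omics_tokens(tokens, mask_ratio=0.5)

# Placeholder "model": predict the mean of the visible values everywhere.
# A real encoder/decoder would predict each masked token from the visible ones.
mean_visible = sum(v for _, _, v in visible) / len(visible)
loss = reconstruction_loss([mean_visible] * len(masked), masked)
print(len(visible), len(masked), loss)
```

The point of the sketch is only that the loss is evaluated on the hidden tokens, so the model has to infer them from whatever remains visible.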

📂 The Data

For data and preprocessing, follow the instructions given in the notebook data/preprocessing.ipynb. More specifically, for WSI, you'll need to download the pre-extracted UNIv2 patch embeddings from the official HuggingFace Dataset. For omics data:
- Gene expression (RNA): downloaded from the UCSC Xena database (explained in data/preprocessing.ipynb).
- DNA Methylation (DNAm) and Copy Number Variations (CNV): download the data from the GDC Data Portal (you can follow the instructions of MultiSurv or DRIM).
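
Because the modalities come from different sources, one common preparation step is to keep only the patients present in every modality. The sketch below illustrates this under the assumption that samples carry TCGA-style barcodes, whose first 12 characters identify the patient. It is not taken from data/preprocessing.ipynb; the helper names and the barcodes are made up for illustration.

```python
def patient_id(barcode):
    """Return the 12-character patient portion of a TCGA-style barcode."""
    return barcode[:12]


def common_patients(*modalities):
    """Intersect patient IDs across several lists of sample barcodes."""
    id_sets = [{patient_id(b) for b in barcodes} for barcodes in modalities]
    return sorted(set.intersection(*id_sets))


# Made-up barcodes, for illustration only.
rna = ["TCGA-AA-0001-01A-01R", "TCGA-AA-0003-01A-01R"]
dnam = ["TCGA-AA-0001-01A-01D", "TCGA-AA-0004-01A-01D"]
cnv = ["TCGA-AA-0001-01A-01D", "TCGA-AA-0003-01A-01D"]

shared = common_patients(rna, dnam, cnv)
print(shared)
```

Refer to the notebook for the matching logic actually used by the repository.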

🧠 Reconstructing omics modalities

MORPHEUS allows simple reconstruction of any combination of omics modalities. Concrete examples are given in the file: reconstructions.ipynb.

🔄 Pretraining MORPHEUS

Once all the data is prepared following the data instructions, run the following:

$ python pretrain_bmmae.py --project_id pancan --data_dir /PATH/TO/WSI

🎯 Fine-tuning

Fine-tuning MORPHEUS for few-shot subtyping or survival analysis is straightforward. For instance, on WSI only:

$ python wsi_surv.py --project_id gbmlgg --data_dir /PATH/TO/WSI

or combined with RNA:

$ python wsi_rna_surv.py --project_id gbmlgg --data_dir /PATH/TO/WSI

For 5-shot subtyping, use the following CLI:

$ python wsi_few_shot.py --task breast --k 5 --data_dir /PATH/TO/WSI
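
For readers unfamiliar with the term, "k-shot" means that only k labeled examples per class are available for adaptation. The sketch below shows one plausible way to draw such a support set. How wsi_few_shot.py actually samples its examples is not described here, so the function name, the seed handling, and the toy labels are assumptions.

```python
import random
from collections import defaultdict


def sample_k_shot(labels, k, seed=0):
    """Pick k support indices per class; the remaining indices form the query set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    support = []
    for label, indices in by_class.items():
        if len(indices) < k:
            raise ValueError(f"class {label!r} has fewer than {k} samples")
        support.extend(rng.sample(indices, k))

    support_set = set(support)
    query = [i for i in range(len(labels)) if i not in support_set]
    return sorted(support), query


# Hypothetical subtype labels for 15 slides (two classes, "A" and "B").
labels = ["A"] * 8 + ["B"] * 7
support, query = sample_k_shot(labels, k=5)
print(len(support), len(query))
```

With k = 5 and two classes, the support set holds ten slides and everything else is left for evaluation.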

⭐ Contribute & Support

💡 We are aware that biological data can be challenging to download, preprocess, and interpret. We have tried to make the process as straightforward as possible by providing clear instructions. However, if any part is unclear or lacks sufficient detail, feel free to open an issue or submit a pull request.

Cite

@misc{robinet2025maskedomicsmodelingmultimodal,
      title={Masked Omics Modeling for Multimodal Representation Learning across Histopathology and Molecular Profiles}, 
      author={Lucas Robinet and Ahmad Berjaoui and Elizabeth Cohen-Jonathan Moyal},
      year={2025},
      eprint={2508.00969},
      archivePrefix={arXiv},
}
