Masked Omics Modeling for Multimodal Representation Learning across Histopathology and Molecular Profiles
We introduce the first multimodal pretraining strategy tailored for cancer biology, using a transformer with a masked modeling objective to reconstruct masked omics modalities (RNA, DNAm, CNV), supported by histopathology.
Supporting code for the corresponding paper
MORPHEUS is a multimodal pre-training method based on masked omics modeling. It applies the masked modeling approach from NLP and computer vision to portions of the omics data. Once pre-trained, the encoder can be used on different tasks and with different combinations of input modalities. It can also reconstruct any omics profile from histopathology. Unfortunately, due to the UNIv2 licensing restrictions, we are unable to provide pre-trained weights for MORPHEUS. However, we provide step-by-step instructions to reproduce the results exactly.
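The masking idea can be illustrated with a minimal pure-Python sketch. It is not the MORPHEUS implementation: the function names, the two-value toy features, and the zero-predicting `toy_model` are all hypothetical stand-ins, and masking whole modalities (rather than individual tokens) is an assumption made for brevity. The sketch hides some omics modalities, keeps the WSI input visible, and scores the reconstruction only on what was hidden.

```python
import random

OMICS = ("rna", "dnam", "cnv")  # omics modalities named in this README
DIM = 2                         # toy feature size, chosen for illustration


def mask_omics(sample, n_masked, rng):
    """Hide n_masked randomly chosen omics modalities; WSI always stays visible."""
    masked = set(rng.sample(OMICS, n_masked))
    visible = {name: (None if name in masked else values)
               for name, values in sample.items()}
    return visible, masked


def masked_mse(prediction, target, masked):
    """Mean squared error computed over the masked modalities only."""
    errors = [(p - t) ** 2
              for name in sorted(masked)
              for p, t in zip(prediction[name], target[name])]
    return sum(errors) / len(errors)


def toy_model(visible):
    """Hypothetical placeholder for the transformer: predicts zeros everywhere."""
    return {name: [0.0] * DIM for name in OMICS}


rng = random.Random(0)
sample = {"wsi": [0.2, 0.7], "rna": [1.0, 2.0],
          "dnam": [0.5, 0.5], "cnv": [-1.0, 1.0]}
visible, masked = mask_omics(sample, n_masked=2, rng=rng)
loss = masked_mse(toy_model(visible), sample, masked)
```

Restricting the loss to the masked modalities is what makes the objective a reconstruction task: the model gets no credit for copying inputs it can already see.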
For data and preprocessing, follow the instructions given in the notebook data/preprocessing.ipynb.
More specifically, for WSI, you'll need to download the patch embeddings already extracted with UNIv2 from the official HuggingFace Dataset.
For omics data:
Gene expression (RNA): Downloaded from the UCSC Xena Database (explained in data/preprocessing.ipynb)
DNA Methylation (DNAm) and Copy Number Variations (CNV): You will need to download data from the GDC Data Portal (you can follow the instructions of MultiSurv or DRIM)
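Because the modalities come from different sources, they usually need to be matched by patient before training. The sketch below shows one way to do that, assuming the samples carry TCGA-style barcodes whose first 12 characters identify the patient; this README does not state the identifier format, so treat that as an assumption. The helper names and the barcodes are made up for illustration, and the actual matching logic lives in data/preprocessing.ipynb.

```python
def patient_id(barcode):
    """Return the first 12 characters of a TCGA-style barcode (patient-level part)."""
    return barcode[:12]


def common_patients(*barcode_lists):
    """Return the sorted patient IDs that appear in every modality."""
    id_sets = [{patient_id(b) for b in barcodes} for barcodes in barcode_lists]
    return sorted(set.intersection(*id_sets))


# Made-up barcodes, for illustration only.
rna = ["TCGA-AA-0001-01A", "TCGA-AA-0002-01A"]
dnam = ["TCGA-AA-0002-01A-11D", "TCGA-AA-0003-01A-11D"]
cnv = ["TCGA-AA-0002-01A-01D", "TCGA-AA-0001-01A-01D"]

shared = common_patients(rna, dnam, cnv)
print(shared)  # ['TCGA-AA-0002']
```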
MORPHEUS allows simple reconstruction of any combination of omics modalities.
Concrete examples are given in the file reconstructions.ipynb.
Once all the data is prepared following the data instructions, run the following:
$ python pretrain_bmmae.py --project_id pancan --data_dir /PATH/TO/WSI
Fine-tuning MORPHEUS for few-shot subtyping or survival analysis is straightforward. For instance, for survival analysis on WSI only:
$ python wsi_surv.py --project_id gbmlgg --data_dir /PATH/TO/WSI
or combined with RNA
$ python wsi_rna_surv.py --project_id gbmlgg --data_dir /PATH/TO/WSI
For 5-shot subtyping, use the following CLI:
$ python wsi_few_shot.py --task breast --k 5 --data_dir /PATH/TO/WSI
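For readers unfamiliar with the k-shot setting, the sketch below draws a support set with k examples per class. It is a generic illustration only: how wsi_few_shot.py selects its support set is not described here, and `sample_k_shot`, the seed, and the labels are hypothetical.

```python
import random
from collections import defaultdict


def sample_k_shot(labels, k, seed=0):
    """Return sorted indices containing k randomly chosen examples per class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    support = []
    for label in sorted(by_class):
        if len(by_class[label]) < k:
            raise ValueError(f"class {label!r} has fewer than {k} examples")
        support.extend(rng.sample(by_class[label], k))
    return sorted(support)


labels = ["A"] * 8 + ["B"] * 6  # hypothetical subtype labels
support = sample_k_shot(labels, k=5)
```

With two classes and k = 5, the support set holds ten samples; the remaining samples can serve as the evaluation set.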
💡 We are aware that biological data can be challenging to download, preprocess, and interpret. We have tried to make the process as straightforward as possible by providing clear instructions. However, if any part is unclear or lacks sufficient detail, feel free to open an issue or submit a pull request.
@misc{robinet2025maskedomicsmodelingmultimodal,
title={Masked Omics Modeling for Multimodal Representation Learning across Histopathology and Molecular Profiles},
author={Lucas Robinet and Ahmad Berjaoui and Elizabeth Cohen-Jonathan Moyal},
year={2025},
eprint={2508.00969},
archivePrefix={arXiv},
}