OncoNetMM is a multi-modal deep learning model designed to predict oncology outcomes by integrating imaging, genomics, and treatment data.
We use the Duke-Breast-Cancer-MRI dataset [1], which includes MRI images and clinical variables for 922 subjects.
The dataset is located at `/vol/miltank/projects/practical_sose25/adlm_oncology/outputs`, which has the following subfolders:
- Images
  - Raw and preprocessed images are stored in the `images` subfolder.
- Clinical Data
  - The `clinical_dataset` subfolder contains the preprocessed clinical data used for the clinical baseline model.
  - This subfolder also includes saved trained models with TensorBoard outputs.
- Multi-Modal Data
  - The `multi_modal` subdirectory contains preprocessed clinical data for the multi-modal model (`clinical_preprocessing` subfolder).
  - The actual preprocessed multi-modal dataset, ready for training, is stored in the `multi_modal_preprocessing` subfolder, along with the saved trained models and TensorBoard outputs of all multi-modal models.
The `src` folder includes:
1. `clinical/`
   - Data preprocessing scripts for clinical data.
   - The clinical MLP branch of the multi-modal model.
   - `run_datapreprocessing_clinical.sbatch` for submitting clinical preprocessing as a Slurm job.
   - Slurm logs are saved in the `slurm_output` directory within this subfolder.
2. `clinical_feature_model/`
   - Contains the clinical baseline model and its data preprocessing scripts.
   - This model is trained purely on the clinical data and does not use any imaging data.
   - This folder contains several `.sbatch` files for scheduling Slurm jobs:
     - `run_datapreprocessing_baseline.sbatch` for running the data preprocessing for the clinical baseline model.
     - `run_OncoNetMM_clinical.sbatch` for running the training of the clinical baseline model.
     - `run_wandb_sweep_OncoNetMM.sbatch` for running a hyperparameter search using the Weights & Biases (wandb) framework. This search can be configured by adapting `sweep_config.yaml` and following the official wandb documentation (https://docs.wandb.ai/).
3. `imaging/`
   - Contains the imaging model and preprocessing scripts.
   - Includes an `.sbatch` file for submitting image preprocessing as a Slurm job.
4. `treatment_recommendation_and_evaluation/`
   - Contains the treatment recommendation and evaluation scripts.
   - Contains two `.sbatch` files for scheduling Slurm jobs:
     - `run_recommend_treatment.sbatch` for performing a treatment recommendation for a single patient.
     - `run_rec_tre_batch_and_eval.sbatch` for performing batch recommendation and evaluating those recommendations using Kaplan-Meier curves and the log-rank p-value for all patients from one or multiple dataset splits.
Other files in `src`:
- `config.py` – Configuration for training and model architecture.
- `data_preprocessing_multi_modal.py` – Merges preprocessed clinical and imaging data into a single training dataset.
- `inference_timer.py` – Measures the inference time of a single sample for multi-modal models.
- `model.py` – Includes both the baseline multi-modal model and the attention-based multi-modal model.
- `data.py` – Includes the combined dataset class for the processed images and clinical data.
- `training.py` – Implements the training pipeline for multi-modal models.
- `environment_adlm-oncology_forLinux.yml` – YAML file exported from the conda environment `adlm-oncology`, which was used to run all code in this repository.
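The merge performed by `data_preprocessing_multi_modal.py` can be pictured as an inner join of the two preprocessed tables on the patient ID. The sketch below is illustrative only: the column names (`patient_id`, `age`, `image_path`) are hypothetical stand-ins for the script's actual schema.

```python
import pandas as pd

# Hypothetical toy tables; the real script loads the preprocessed files.
clinical = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "age": [52, 61, 47],
})
imaging = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "image_path": ["p1.nii.gz", "p2.nii.gz", "p3.nii.gz"],
})

# Inner join on the patient ID keeps only patients present in both modalities.
merged = pd.merge(clinical, imaging, on="patient_id", how="inner")
```

The merged table then serves as the single dataset consumed by the multi-modal training pipeline.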
Other files in `src/clinical_feature_model`:
- `config.py` – Configuration for training, model architecture, hyperparameters, and other details.
- `data_preprocessing_baseline.py` – Executes a slightly adapted version of `data_preprocessing_clinical.py` in `src/clinical`, specifically for the clinical baseline model.
- `model.py` – Defines the clinical baseline model.
- `training.py` – Implements the training pipeline for the clinical baseline model.
Other files in `src/treatment_recommendation_and_evaluation`:
- `config_treat_Rec_and_Eval.py` – Configuration file for both types of treatment recommendation (single vs. batch) and for the evaluation. Allows configuration of input file paths and evaluation details.
- `patient_data_recommendation_input.py` – Contains fixed inputs such as default values and a list of clinical columns. The file is required for the recommendations but, unlike the configuration file, is not meant to be modified.
- `rec_tre_batch_and_eval.py` – Performs batch recommendation and evaluates those recommendations using Kaplan-Meier curves and the log-rank p-value for all patients from one or multiple dataset splits.
- `recommend_treatment.py` – Performs a treatment recommendation for a single patient.
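The evaluation in `rec_tre_batch_and_eval.py` relies on Kaplan-Meier survival curves. As a quick illustration of the underlying estimate (a toy, pure-Python sketch that ignores tied event times and is not the repository's actual implementation, which would typically use a survival-analysis library):

```python
def kaplan_meier(times, events):
    """Minimal Kaplan-Meier estimator: returns (time, survival) pairs.

    times  - follow-up time per patient
    events - 1 if the event occurred, 0 if the patient was censored
    Implements the product-limit estimate S(t) = prod(1 - d_i / n_i),
    where n_i is the number of patients still at risk at event time i.
    """
    data = sorted(zip(times, events))
    n = len(data)          # patients still at risk
    s = 1.0                # running survival probability
    curve = []
    for t, e in data:
        if e:              # an observed event steps the curve down
            s *= 1 - 1 / n
            curve.append((t, s))
        n -= 1             # event or censoring, the patient leaves the risk set
    return curve

# Four patients: events at t=5, 12, 20; one censored at t=8.
curve = kaplan_meier([5, 8, 12, 20], [1, 0, 1, 1])
```

The log-rank p-value then tests whether two such curves (e.g. patients treated per recommendation vs. not) differ significantly.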
The folder `trained_models_and_clinical_input_for_recommender` contains:
- The weights and biases of three trained models and their respective config files from training, one model for each of the three architectures: clinical baseline, multi-modal baseline, and multi-modal with attention. Together with the architectural information in the config files, these models can be used for treatment recommendation and for evaluation of the models. For details and instructions, see point "4. treatment_recommendation_and_evaluation" in the `src` section above and the "How to Run" section below.
- Processed clinical data in `.csv` format. The file is formatted so that the clinical data for a single patient of choice can easily be copied and used for treatment recommendation. More about this can be read under "Settings for single patient recommendations" in the section "How to Run - Part 3".
The data preprocessing and training pipeline for OncoNetMM is structured as follows:
Run `sbatch run_datapreprocessing_clinical.sbatch`.
- This executes `src/clinical/data_preprocessing_clinical.py`.
- Flags:
  - `-p` – Specify the dataset to preprocess.
  - `-d` – Specify the directory where the preprocessed file will be saved.
Run `sbatch run_image_preprocessing.sbatch`.
- This executes `src/imaging/image_preprocessing.py`.
- Important flags:
  - `-d` – Specify the dataset to preprocess.
  - `-o` – Specify the directory where the preprocessed images will be saved.
  - `-s` – Specify the output shape of the preprocessed images.
Run `sbatch run_datapreprocessing_multi_modal.sbatch`.
- Flags:
  - `-d` – Path to the preprocessed clinical data input file.
  - `-i` – Path to the preprocessed imaging data input file.
  - `-o` – Directory where the merged dataset will be saved.
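Inside the Python scripts, flags like these are typically handled with `argparse`. The sketch below is a hypothetical illustration of the flag handling for the multi-modal merge step; the destination names and help texts are assumptions, not the script's actual definitions.

```python
import argparse

# Hypothetical sketch of the -d / -i / -o flags described above.
parser = argparse.ArgumentParser(description="Merge clinical and imaging data")
parser.add_argument("-d", dest="clinical_path", help="preprocessed clinical data file")
parser.add_argument("-i", dest="imaging_path", help="preprocessed imaging data file")
parser.add_argument("-o", dest="output_dir", help="directory for the merged dataset")

# In the sbatch script these values would come from the command line.
args = parser.parse_args(["-d", "clinical.parquet", "-i", "images.pt", "-o", "out/"])
```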
Baseline Multi-Modal Model: run `sbatch run_OncoNetMM_baseline.sbatch`
- Slurm output: `adlm_oncology_baseline-%A.out`
- Saved model: `multi_modal_model_baseline`
Attention-Based Multi-Modal Model: run `sbatch run_OncoNetMM_attention.sbatch`
- Slurm output: `adlm_oncology_attention-%A.out`
- Saved model: `multi_modal_model_attention`
Both commands run `training.py`, with the `-m` flag switching between the attention-based and baseline models.
Saved models and TensorBoard outputs are stored in a subdirectory within the input directory, with the name you can specify using the `-o` flag.
Attention Model:
- `run_OncoNetMM_attention_inference_time_clinical_resources.sbatch`
- `run_OncoNetMM_attention_inference_time_gpu.sbatch`
- The first runs with Clinical PC resources (4 CPU cores, 16 GB RAM); the second uses the same GPU resources as the training process.
Baseline Model:
- `run_OncoNetMM_baseline_inference_time_clinical_resources.sbatch`
- `run_OncoNetMM_baseline_inference_time_gpu.sbatch`
- The first runs with Clinical PC resources (4 CPU cores, 16 GB RAM); the second uses the same GPU resources as the training process.
The data preprocessing and training pipeline for the Clinical Baseline Model is structured as follows:
Run `sbatch run_datapreprocessing_baseline.sbatch`.
- This executes `src/clinical_feature_model/data_preprocessing_baseline.py`.
- Flags:
  - `-p` – Specify the path to the dataset to preprocess (in this case, the `Clinical_and_Other_Features.xlsx` file from our breast cancer dataset [1]).
  - `-d` – Specify the directory where the preprocessed file will be saved.
Run `sbatch run_OncoNetMM_clinical.sbatch`.
- This executes `src/clinical_feature_model/training.py`.
- Important flags:
  - `--no_wandb` – Deactivates logging with the Weights & Biases package. Remove this flag if you set up logging with wandb according to the official documentation (https://docs.wandb.ai/).
  - `--save` – Keep this flag to save training outputs such as the model weights (`.pt`), the config (`.txt`), and the training set (`.parquet`).
- Many other settings can be changed in the config file `src/clinical_feature_model/config.py`. These include the `input_folder`, which also serves as the output folder, and multiple hyperparameters. The `input_folder` must contain the output files from the baseline data preprocessing.
Run `sbatch run_wandb_sweep_OncoNetMM.sbatch`.
- This starts a hyperparameter search using the Weights & Biases (wandb) framework.
- The search can be configured before starting it by adapting `sweep_config.yaml` and following the official wandb documentation (https://docs.wandb.ai/).
- The contents of `run_wandb_sweep_OncoNetMM.sbatch` must be adapted accordingly, by replacing the existing lines below the comments `# 1)` and `# 2)` within the file.
Ensure that all configurations in `config_treat_Rec_and_Eval.py` are correctly set.
Config Section 1:
- The value of `rec_and_eval_mode` determines which type of model is being evaluated.
Config Section 2:
- General settings for recommendation and evaluation: The path to the weights of the model (`.pt` file) to be used for the treatment recommendation (and possibly evaluation) must be set here.
- For recommendations with the clinical baseline model: The path to the unprocessed `Clinical_and_Other_Features.xlsx` file from the breast cancer dataset [1] must be specified as `XLSX_PATH`. The required preprocessing is performed automatically.
- For recommendations with a multi-modal model: The path to the processed clinical data (filename usually ends in `_xc.csv`) must also be provided, since it contains the file paths to the MRI images for each patient. The MRI images are required as inputs for model inference.
- Evaluation-specific settings: You can choose which dataset splits (train/val/test) should be included in the evaluation. To stick to best practices, it is recommended to use only patients from the test set. However, including more patients leads to smaller confidence intervals and smoother curves, due to higher event counts.
Config Section 3 (advanced):
- Settings for single patient recommendations: Specify the required clinical values for a single patient of choice from the specifically formatted file `clinical_xc_single_recommendation_inputs.csv` in the folder `trained_models_and_clinical_input_for_recommender`. The data is already correctly formatted: copy the values from the first 76 named columns, excluding the unnamed leading patient ID column (integer format), and paste them into `config_treat_Rec_and_Eval.py` as the value of `CSV_CLINICAL_VALUES`. If the single recommendation is for a multi-modal model, `IMAGE_PATIENT_ID` must also be specified in `config_treat_Rec_and_Eval.py`; it is the value of the second-to-last column, `Patient Information - Patient ID`, in the same row.
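The column selection described above can be sketched with the standard `csv` module. The toy file below is a hypothetical stand-in (made-up column names and values, far fewer than 76 columns); only the unnamed leading ID column and the `Patient Information - Patient ID` column mirror the real layout.

```python
import csv
import io

# Tiny stand-in for clinical_xc_single_recommendation_inputs.csv:
# an unnamed leading patient ID column, then named clinical columns,
# with "Patient Information - Patient ID" as the second-to-last column.
raw = (
    ",age,tumor_grade,Patient Information - Patient ID,er_status\n"
    "0,52,2,Breast_MRI_001,1\n"
)
rows = list(csv.reader(io.StringIO(raw)))
header, row = rows[0], rows[1]

# Values of the named columns, excluding the unnamed leading ID column:
csv_clinical_values = row[1:]

# The image patient ID sits in the "Patient Information - Patient ID" column:
image_patient_id = row[header.index("Patient Information - Patient ID")]
```

`csv_clinical_values` corresponds to what gets pasted into `CSV_CLINICAL_VALUES`, and `image_patient_id` to `IMAGE_PATIENT_ID`.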
To perform a batch recommendation and evaluation, run `sbatch run_rec_tre_batch_and_eval.sbatch`. To perform a treatment regimen recommendation for a single patient, run `sbatch run_recommend_treatment.sbatch`.
The model ranks 10 different treatment regimens by ascending risk score. These are the 10 most frequently prescribed treatment regimens within the Duke-Breast-Cancer-MRI dataset [1]. The treatment regimen with the lowest risk score is the most recommendable, according to the model.
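The ranking step itself is simple once the model has produced one risk score per regimen. A sketch with three regimens and hypothetical scores (the real scores come from the trained survival model, and all 10 regimens are ranked):

```python
# Hypothetical risk scores, one per candidate treatment regimen.
risk_scores = {
    "Regimen 1": 0.42,
    "Regimen 2": 0.31,
    "Regimen 3": 0.57,
}

# Rank regimens by ascending risk; the lowest-risk regimen is recommended.
ranking = sorted(risk_scores, key=risk_scores.get)
recommended = ranking[0]
```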
In the text output of the code (printed to slurm.out or to the terminal), the regimens are referred to as "Regimen 1" through "Regimen 10". This is what those regimens refer to in a clinical context:
| | Surgery | Surgery Type | Neo-RT | Adj-RT | Neo-CT | Adj-CT | Neo-ET | Adj-ET | Ooph. | Neo-Her2 | Adj-Her2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Regimen 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| Regimen 2 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| Regimen 3 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| Regimen 4 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| Regimen 5 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| Regimen 6 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| Regimen 7 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| Regimen 8 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| Regimen 9 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| Regimen 10 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
Meaning of numbers:
- If not specified otherwise: No (0) vs. Yes (1)
- Surgery Type: Breast Conservation Therapy (0) vs. Mastectomy (1)
Meaning of abbreviations:
- Ooph. = Oophorectomy
- Neo = Neoadjuvant (before surgery)
- Adj = Adjuvant (after surgery)
- RT = Radiation Therapy
- CT = Chemotherapy
- ET = Endocrine Therapy
- Her2 = Her-2 Therapy (Antibodies)
[1] Saha, A., Harowicz, M. R., Grimm, L. J., Weng, J., Cain, E. H., Kim, C. E., Ghate, S. V., Walsh, R., & Mazurowski, M. A. (2021). Dynamic contrast-enhanced magnetic resonance images of breast cancer patients with tumor locations (Duke-Breast-Cancer-MRI) [Data set]. The Cancer Imaging Archive.