berraM/adlm_onconetmm

OncoNetMM: Multi-Modal Deep Learning for Oncology Outcome Prediction

OncoNetMM is a multi-modal deep learning model designed to predict oncology outcomes by integrating imaging, genomics, and treatment data.


Project Structure

Dataset

We use the Duke-Breast-Cancer-MRI dataset [1], which includes MRI images and clinical variables for 922 subjects.
The dataset is located at /vol/miltank/projects/practical_sose25/adlm_oncology/outputs, which contains the following subfolders:

  • Images

    • Raw and preprocessed images are stored in the images subfolder.
  • Clinical Data

    • The clinical_dataset subfolder contains the preprocessed clinical data used for the clinical baseline model.
    • This subfolder also includes saved trained models with TensorBoard outputs.
  • Multi-Modal Data

    • The multi_modal subdirectory contains preprocessed clinical data for the multi-modal model (clinical_preprocessing subfolder).
    • The actual preprocessed multi-modal dataset, ready for training, is stored in the multi_modal_preprocessing subfolder, along with the saved trained models and TensorBoard outputs of all multi-modal models.

Src

The src folder includes:

  1. clinical/

    • Data preprocessing scripts for clinical data.
    • The clinical MLP branch of the multi-modal model.
    • run_datapreprocessing_clinical.sbatch for submitting clinical preprocessing as a Slurm job.
    • Slurm logs are saved in the slurm_output directory within this subfolder.
  2. clinical_feature_model/

    • Contains the clinical baseline model and its data preprocessing scripts.
    • This model is purely trained on the clinical data and does not use any imaging data.
    • This folder contains several .sbatch files for scheduling Slurm jobs:
      • run_datapreprocessing_baseline.sbatch for running the data preprocessing for the clinical baseline model.
      • run_OncoNetMM_clinical.sbatch for training the clinical baseline model.
      • run_wandb_sweep_OncoNetMM.sbatch for running a hyperparameter search using the "Weights and Biases (wandb)" framework. This search can be configured by adapting the sweep_config.yaml and following the official documentation provided by wandb (https://docs.wandb.ai/).
  3. imaging/

    • Contains the imaging model and preprocessing scripts.
    • Includes an sbatch file for submitting image preprocessing as a Slurm job.
  4. treatment_recommendation_and_evaluation/

    • Contains the treatment recommendation and evaluation scripts.
    • Contains two different .sbatch files for scheduling Slurm jobs:
      • run_recommend_treatment.sbatch for performing a treatment recommendation for a single patient.
      • run_rec_tre_batch_and_eval.sbatch for performing batch recommendation and evaluation of those recommendations using Kaplan Meier curves and log-rank p-value for all patients from one or multiple dataset splits.

Other files in src:

  • config.py – Configuration for training and model architecture.
  • data_preprocessing_multi_modal.py – Merges preprocessed clinical and imaging data into a single training dataset.
  • inference_timer.py – Measures the inference time of a single sample for multi-modal models.
  • model.py – Includes both the baseline multi-modal model and the attention-based multi-modal model.
  • data.py – Includes the combined dataset class for the processed images and clinical data.
  • training.py – Implements the training pipeline for multi-modal models.
  • environment_adlm-oncology_forLinux.yml – YAML export of the conda environment adlm-oncology, which was used to run all the code in this repository.
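The merge performed by data_preprocessing_multi_modal.py amounts to an inner join of the two preprocessed modalities on a shared patient identifier. The following is a minimal sketch of that idea with made-up key and column names, not the actual script:

```python
# Sketch: inner-join preprocessed clinical and imaging records on a
# shared patient ID. Key and column names are illustrative only.

def merge_modalities(clinical_rows, imaging_rows, key="patient_id"):
    """Merge two lists of dicts on `key`; patients missing either
    modality are dropped."""
    imaging_by_id = {row[key]: row for row in imaging_rows}
    merged = []
    for row in clinical_rows:
        image_row = imaging_by_id.get(row[key])
        if image_row is not None:
            merged.append({**row, **image_row})
    return merged

clinical = [{"patient_id": "Breast_MRI_001", "age": 52},
            {"patient_id": "Breast_MRI_002", "age": 47}]
imaging = [{"patient_id": "Breast_MRI_001", "image_path": "img_001.nii.gz"}]

print(merge_modalities(clinical, imaging))
# Only Breast_MRI_001 appears in both modalities, so only it survives.
```

The inner-join behavior means the merged training set only contains subjects for which both clinical and imaging preprocessing succeeded.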

Other files in src/clinical_feature_model:

  • config.py – Configuration for training, model architecture, hyperparameters and other details.
  • data_preprocessing_baseline.py – Executes a slightly adapted version of data_preprocessing_clinical.py in src/clinical specifically for the clinical baseline model.
  • model.py – Defines the clinical baseline model.
  • training.py – Implements the training pipeline for the clinical baseline model.

Other files in src/treatment_recommendation_and_evaluation:

  • config_treat_Rec_and_Eval.py – Configuration file for both types of treatment recommendation (single vs. batch) and for the evaluation. Allows configuration of input file paths and evaluation details.
  • patient_data_recommendation_input.py – Contains fixed inputs such as default values and a list of clinical columns. The file is required for the recommendations but, unlike the configuration file, is not meant to be modified.
  • rec_tre_batch_and_eval.py – Performs batch recommendation and evaluation of those recommendations using Kaplan-Meier curves and a log-rank p-value for all patients from one or multiple dataset splits.
  • recommend_treatment.py – Performs a treatment recommendation for a single patient.

Trained Models and clinical input for recommender

The folder trained_models_and_clinical_input_for_recommender contains:

  1. The weights and biases of three trained models, along with the config files used during training, one model for each of the three architectures: clinical baseline, multi-modal baseline, and multi-modal with attention. Together with the architectural information in the config files, these models can be used for treatment recommendation and for evaluating the models. For details and instructions, see point 4 (treatment_recommendation_and_evaluation) in the src section above and the "How to Run" section below.
  2. Processed clinical data in .csv format. The file is formatted so that the clinical data for a single patient of choice can easily be copied and used for treatment recommendation. See "Settings for single patient recommendations" in the section How to Run - Part 3.

How to Run

How to Run - Part 1: Running Code related to the Multi Modal Models

The data preprocessing and training pipeline for OncoNetMM is structured as follows:

1. Preprocess Clinical Data

Run:

sbatch run_datapreprocessing_clinical.sbatch
  • This executes /src/clinical/data_preprocessing_clinical.py.
  • Flags:
    • -p – Specify the dataset to preprocess.
    • -d – Specify the directory where the preprocessed file will be saved.
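The -p/-d flag interface described above could be consumed along the following lines (a sketch only; the actual argument handling inside data_preprocessing_clinical.py may differ):

```python
import argparse

# Sketch of the flag interface described above; long option names,
# defaults, and help strings are illustrative, not copied from the
# actual script.
parser = argparse.ArgumentParser(description="Preprocess clinical data")
parser.add_argument("-p", "--path", required=True,
                    help="dataset to preprocess")
parser.add_argument("-d", "--dir", required=True,
                    help="directory for the preprocessed output file")

# Example invocation with hypothetical paths:
args = parser.parse_args(["-p", "Clinical_and_Other_Features.xlsx",
                          "-d", "outputs/clinical_dataset"])
print(args.path, args.dir)
```

In the Slurm setup, these flags would be forwarded from inside the .sbatch file to the Python script.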

2. Preprocess Imaging Data

Run:

sbatch run_image_preprocessing.sbatch
  • This executes /src/imaging/image_preprocessing.py.
  • Important Flags:
    • -d – Specify the dataset to preprocess.
    • -o – Specify the directory where the preprocessed images will be saved.
    • -s – Specify the output shape of the preprocessed images.

3. Merge Clinical and Imaging Data

Run:

sbatch run_datapreprocessing_multi_modal.sbatch
  • Flags:
    • -d – Path to preprocessed clinical data input file.
    • -i – Path to preprocessed imaging data input file.
    • -o – Directory where the merged dataset will be saved.

4. Train Models

Baseline Multi-Modal Model:

sbatch run_OncoNetMM_baseline.sbatch
  • Slurm output: adlm_oncology_baseline-%A.out
  • Saved model: multi_modal_model_baseline

Attention-Based Multi-Modal Model:

sbatch run_OncoNetMM_attention.sbatch
  • Slurm output: adlm_oncology_attention-%A.out
  • Saved model: multi_modal_model_attention

Both commands run training.py; the -m flag switches between the attention-based and baseline models.
Saved models and TensorBoard outputs are stored in a subdirectory of the input directory whose name can be specified with the -o flag.
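Switching architectures with a single flag typically reduces to a small lookup table mapping the flag value to a model class. A sketch with placeholder class names (the real classes live in src/model.py):

```python
# Sketch: select the model class from the value of the -m flag.
# Class names are placeholders for the real implementations.

class MultiModalBaseline:
    name = "baseline"

class MultiModalAttention:
    name = "attention"

MODELS = {"baseline": MultiModalBaseline, "attention": MultiModalAttention}

def build_model(mode):
    """Instantiate the model matching the -m flag value."""
    try:
        return MODELS[mode]()
    except KeyError:
        raise ValueError(f"unknown model type: {mode!r}")

print(build_model("attention").name)  # attention
```

This keeps the two training entry points identical except for the flag passed in each .sbatch file.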

5. Evaluate Inference Time

Attention Model:

run_OncoNetMM_attention_inference_time_clinical_resources.sbatch
run_OncoNetMM_attention_inference_time_gpu.sbatch
  • The first runs with Clinical PC resources (4 CPU cores, 16GB RAM).
  • The second uses the same GPU resources as the training process.

Baseline Model:

run_OncoNetMM_baseline_inference_time_clinical_resources.sbatch
run_OncoNetMM_baseline_inference_time_gpu.sbatch
  • The first runs with Clinical PC resources (4 CPU cores, 16GB RAM).
  • The second uses the same GPU resources as the training process.

How to Run - Part 2: Running Code related to the Clinical Baseline Model

The data preprocessing and training pipeline for the Clinical Baseline Model is structured as follows:

1. Preprocess Clinical Data

Run:

sbatch run_datapreprocessing_baseline.sbatch 	
  • This executes /src/clinical_feature_model/data_preprocessing_baseline.py.
  • Flags:
    • -p – Specify the path to the dataset to preprocess (In this case, the Clinical_and_Other_Features.xlsx file from our breast cancer dataset [1]).
    • -d – Specify the directory where the preprocessed file will be saved.

2. Train Clinical Baseline Model

Run:

sbatch run_OncoNetMM_clinical.sbatch 	
  • This executes /src/clinical_feature_model/training.py.
  • Important Flags:
    • --no_wandb - This flag deactivates logging with the Weights and Biases package. Remove this flag if you have set up logging with wandb according to the official documentation (https://docs.wandb.ai/).
    • --save - Keep this flag to save training outputs like the model weights (.pt), the config (.txt) and the training set (.parquet).
    • Many other configurations can be made within the config file: /src/clinical_feature_model/config.py. These settings include the input_folder, which also serves as output folder, and multiple hyperparameters. The input_folder must contain the output files from the baseline data preprocessing.

3. Optional: Run Hyperparameter Search

Run:

sbatch run_wandb_sweep_OncoNetMM.sbatch 	
  • This starts a hyperparameter search using the "Weights and Biases (wandb)" framework.
  • The search can be configured before starting it by adapting sweep_config.yaml and following the official documentation provided by wandb (https://docs.wandb.ai/).
  • The contents of run_wandb_sweep_OncoNetMM.sbatch must be adapted accordingly by replacing the existing lines below the comments # 1) and # 2) within the file.
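For orientation, a wandb sweep configuration file typically has the following shape (the metric name, parameter names, and ranges below are illustrative, not the repository's actual sweep_config.yaml settings):

```yaml
method: bayes            # or: grid, random
metric:
  name: val_loss         # must match a metric logged during training
  goal: minimize
parameters:
  learning_rate:
    distribution: log_uniform_values
    min: 0.0001
    max: 0.01
  batch_size:
    values: [16, 32, 64]
  dropout:
    values: [0.1, 0.3, 0.5]
```

See the wandb sweeps documentation for the full set of supported methods and parameter distributions.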

How to Run - Part 3: Running Code for the treatment recommendation and evaluation

1. Adapt Configuration File as necessary

Ensure that all configurations in config_treat_Rec_and_Eval.py are correctly set.

Config Section 1:

  • The value of rec_and_eval_mode determines which type of model is evaluated.

Config Section 2:

  • General Settings for Recommendation and Evaluation: The path to the weights (.pt file) of the model to be used for the treatment recommendation (and possibly evaluation) must be set here.
    • For recommendations with the clinical baseline model: The path to the unprocessed Clinical_and_Other_Features.xlsx file from the breast cancer dataset [1] must be specified as the XLSX_PATH. The required pre-processing is performed automatically.
    • For recommendations with a multi modal model: The path to the processed clinical data (filename usually ends in _xc.csv) must also be provided here, since it contains the filepaths to the MRI images for each patient. The MRI images are required as inputs for the model inference.
  • Evaluation-specific Settings: You can choose which dataset splits (train/val/test) are included in the evaluation. To follow best practices, it is recommended to use only patients from the test set. However, including more patients yields smaller confidence intervals and smoother curves due to higher event counts.
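The Kaplan-Meier estimate underlying the evaluation multiplies, at each distinct event time, the fraction of at-risk patients who survive it. A self-contained sketch of the estimator follows; the actual evaluation script also computes a log-rank p-value, presumably via a survival-analysis library such as lifelines:

```python
# Minimal Kaplan-Meier estimator. `times` are event/censoring times,
# `events` is 1 for an observed event, 0 for censoring.

def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct observed event time."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    at_risk = len(times)
    survival, curve = 1.0, []
    i = 0
    while i < len(order):
        t = times[order[i]]
        deaths = censored = 0
        # Group all subjects sharing this time point.
        while i < len(order) and times[order[i]] == t:
            if events[order[i]]:
                deaths += 1
            else:
                censored += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Subjects censored at t were still at risk at t.
        at_risk -= deaths + censored
    return curve

# Five toy subjects: events at t=1, 2, 4; censored at t=2, 3.
print(kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 0, 1]))
```

Comparing the curves of patients grouped by whether they received the model-recommended regimen, together with the log-rank test, is the standard way to quantify a survival difference between the groups.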

Config Section 3 (advanced):

  • Settings for single patient recommendations (advanced): Specify the required clinical values for a single patient of choice from the specifically formatted file clinical_xc_single_recommendation_inputs.csv in the folder trained_models_and_clinical_input_for_recommender. The data is already correctly formatted: copy the values from the first 76 named columns, excluding the unnamed leading patient ID column (integer format), and paste them into config_treat_Rec_and_Eval.py as the value of CSV_CLINICAL_VALUES. If the single recommendation is for a multi-modal model, the IMAGE_PATIENT_ID must also be specified in config_treat_Rec_and_Eval.py; it is the value of the second-to-last column, Patient Information - Patient ID, in the same row.

2. Run Batch Recommendation and Evaluation or Single Patient Recommendation

To perform a batch recommendation and evaluation, run:

sbatch run_rec_tre_batch_and_eval.sbatch 	

To perform a treatment regimen recommendation for a single patient, run:

sbatch run_recommend_treatment.sbatch 	

The model recommendation ranks 10 treatment regimens by ascending risk score. These are the 10 most frequently prescribed treatment regimens within the Duke-Breast-Cancer-MRI dataset [1]. The regimen with the lowest risk score is the most recommended, according to the model.
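The ranking step amounts to sorting the ten candidate regimens by their predicted risk. A sketch with made-up scores (the real scores come from model inference):

```python
# Hypothetical risk scores for the 10 candidate regimens; the regimen
# with the lowest predicted risk is recommended first.
risk_scores = {f"Regimen {i}": score for i, score in enumerate(
    [0.42, 0.31, 0.55, 0.28, 0.61, 0.47, 0.39, 0.52, 0.35, 0.44], start=1)}

ranking = sorted(risk_scores.items(), key=lambda item: item[1])
for rank, (regimen, score) in enumerate(ranking, start=1):
    print(f"{rank}. {regimen} (risk {score:.2f})")
# With these toy scores, Regimen 4 (risk 0.28) is recommended first.
```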

In the text output of the code (printed to slurm.out or to the terminal), the regimens are referred to as "Regimen 1" through "Regimen 10". This is what those regimens refer to in a clinical context:

             Surgery  Surgery Type  Neo-RT  Adj-RT  Neo-CT  Adj-CT  Neo-ET  Adj-ET  Ooph.  Neo-Her2  Adj-Her2
Regimen 1          1             0       0       1       0       0       0       1      0         0         0
Regimen 2          1             0       0       1       0       1       0       1      0         0         0
Regimen 3          1             1       0       0       0       0       0       1      0         0         0
Regimen 4          1             1       0       1       0       1       0       1      0         0         0
Regimen 5          1             0       0       1       1       0       0       1      0         0         0
Regimen 6          1             0       0       1       1       0       0       0      0         0         0
Regimen 7          1             0       0       1       0       1       0       0      0         0         0
Regimen 8          1             1       0       1       1       0       0       1      0         0         0
Regimen 9          1             1       0       0       0       1       0       1      0         0         0
Regimen 10         1             1       0       0       0       1       0       0      0         0         0

Meaning of numbers:

  • If not specified otherwise: No (0) vs. Yes (1)
  • Surgery Type: Breast Conservation Therapy (0) vs. Mastectomy (1)

Meaning of abbreviations:

  • Ooph. = Oophorectomy
  • Neo = Neoadjuvant (before surgery)
  • Adj = Adjuvant (after surgery)
  • RT = Radiation Therapy
  • CT = Chemotherapy
  • ET = Endocrine Therapy
  • Her2 = Her-2 Therapy (Antibodies)

References

[1] Saha, A., Harowicz, M. R., Grimm, L. J., Weng, J., Cain, E. H., Kim, C. E., Ghate, S. V., Walsh, R., & Mazurowski, M. A. (2021). Dynamic contrast-enhanced magnetic resonance images of breast cancer patients with tumor locations (Duke-Breast-Cancer-MRI) [Data set]. The Cancer Imaging Archive.
