
Machine learning toolkit for predicting disease associated genetic variants

Li Chen edited this page Mar 26, 2020 · 10 revisions

Background

Genetic variants play an important role in disease onset, so accurately identifying disease-associated variants is crucial for precision medicine. However, no R package currently provides this functionality. In this GSoC project, students will tackle this problem under my guidance by developing a machine learning toolkit that consists of several advanced machine learning algorithms for predicting disease-associated genetic variants.

Details of your coding project

The purpose of this work is to provide R users with a comprehensive machine learning toolkit for predicting disease-associated genetic variants. It mainly consists of two core novel machine learning algorithms:

1. A weighted ensemble learning framework for predicting disease-associated genetic variants: The model will combine multiple scoring systems for predicting disease-associated genetic variants in a unified framework by developing a constrained penalized optimization algorithm.

2. A transfer learning framework based on a convolutional neural network (CNN): This approach is powerful when the sample size for a specific disease is small. In this case, the CNN will be trained on a large-scale set of experimentally validated genetic variants from mixed diseases and then fine-tuned using disease-specific genetic variants.
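
As a rough illustration of the first component, the sketch below fits non-negative weights that sum to one over several scoring systems, using projected gradient descent on a squared-error loss with an L2 penalty. It is written in plain Python rather than R purely for illustration. The loss, the penalty, the simplex constraint, the synthetic scores, and the names `project_to_simplex` and `fit_weights` are all assumptions made here, not the project's actual algorithm.

```python
import random


def project_to_simplex(v):
    """Euclidean projection of v onto {w : w_i >= 0, sum(w) = 1}."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for i, u_i in enumerate(u, start=1):
        cumulative += u_i
        t = (cumulative - 1.0) / i
        if u_i - t > 0:
            theta = t  # keep the threshold of the last index that qualifies
    return [max(x - theta, 0.0) for x in v]


def fit_weights(scores, labels, penalty=0.01, step=0.1, iters=500):
    """Minimise mean squared error plus an L2 penalty over the simplex.

    scores: list of rows, one score per scoring system, assumed in [0, 1].
    labels: list of 0/1 labels (1 = disease-associated).
    """
    n = len(scores)
    k = len(scores[0])
    w = [1.0 / k] * k  # start from equal weights
    for _ in range(iters):
        grad = [2.0 * penalty * w_j for w_j in w]  # gradient of the penalty
        for x, y in zip(scores, labels):
            residual = sum(w_j * x_j for w_j, x_j in zip(w, x)) - y
            for j in range(k):
                grad[j] += 2.0 * residual * x[j] / n
        # Gradient step followed by projection back onto the constraint set.
        w = project_to_simplex([w_j - step * g_j for w_j, g_j in zip(w, grad)])
    return w


# Synthetic example (hypothetical data): three scoring systems of
# decreasing quality for 200 variants.
random.seed(0)
labels = [random.randint(0, 1) for _ in range(200)]
scores = [
    [
        0.8 * y + 0.2 * random.random(),  # informative system
        0.3 * y + 0.7 * random.random(),  # weakly informative system
        random.random(),                  # pure noise
    ]
    for y in labels
]
weights = fit_weights(scores, labels)
print("ensemble weights:", weights)
```

On this synthetic data the informative system should receive most of the weight, while the constraint keeps the combined score a convex combination of the individual scores.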

Details:

A weighted ensemble learning framework:

- Build the probability density functions from the precomputed scores of multiple scoring systems via kernel density estimation.
- Implement the constrained penalized optimization algorithm.
- Design simulation studies to test the model.
- Test the model on real datasets [1, 2, 3].

A transfer learning framework based on a convolutional neural network (CNN):

- Build a backbone convolutional neural network with the TensorFlow for R API. It contains an embedding layer with different embedding sizes, 1D, 2D, or dilated convolutional layers with different window sizes and strides, max-pooling layers, and fully connected layers.
- Train the backbone separately with different optimization methods, such as Adam, RMSprop, and SGD with momentum, and compare the results.
- Train the backbone CNN on a large-scale, experimentally validated dataset [1] and fine-tune it on the different disease-specific datasets [2, 3].
- Design different strategies to fine-tune the CNN (e.g., freeze some layers and re-train others).
- Add popular techniques such as (spatial) dropout, batch normalization, and regularization to the backbone CNN to avoid common problems in neural networks such as overfitting and non-convergence.
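
To make the convolutional building blocks concrete, here is a minimal sketch of a 1D convolution with configurable window size, stride, and dilation, followed by max-pooling. It is plain Python for illustration only and does not use the TensorFlow for R API that the project targets. The function names `conv1d` and `max_pool1d` and the toy input are hypothetical.

```python
def conv1d(signal, kernel, stride=1, dilation=1):
    """Valid (unpadded) 1D cross-correlation with stride and dilation."""
    # Number of input positions covered by one output value.
    span = dilation * (len(kernel) - 1) + 1
    out = []
    for start in range(0, len(signal) - span + 1, stride):
        out.append(
            sum(kernel[j] * signal[start + j * dilation]
                for j in range(len(kernel)))
        )
    return out


def max_pool1d(signal, size, stride=None):
    """Max-pooling over windows of `size`; stride defaults to the window size."""
    stride = size if stride is None else stride
    return [
        max(signal[i:i + size])
        for i in range(0, len(signal) - size + 1, stride)
    ]


signal = [0, 1, 3, 6, 10, 15, 21, 28]
kernel = [1, 0, -1]  # window size 3

dense = conv1d(signal, kernel)                # taps on adjacent inputs
dilated = conv1d(signal, kernel, dilation=2)  # taps two positions apart
strided = conv1d(signal, kernel, stride=2)    # every second window
pooled = max_pool1d(dense, size=2)

print(dense)    # [-3, -5, -7, -9, -11, -13]
print(dilated)  # [-10, -14, -18, -22]
print(strided)  # [-3, -7, -11]
print(pooled)   # [-3, -7, -11]
```

Dilation widens the receptive field of each output from 3 to 5 input positions without adding kernel weights, which is why the dilated output is shorter than the dense one for the same unpadded input.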

Mentors

Li Chen is a tenure-track Assistant Professor of Medicine and a member of the Center for Computational Biology and Bioinformatics at Indiana University School of Medicine (IUSM). He has previously served as a GSoC mentor.

Tests for potential students

Easy: Can you explain what a convolutional neural network is? Can you explain what transfer learning is? Can you install TensorFlow on your machine and implement a simple CNN on MNIST with the TF Estimator API, following the official documentation?

Medium: What is overfitting? What are L1 and L2 regularization? What should we do if the loss doesn’t converge? Can you implement a simple CNN without the Estimator API?

References

[1] Ritchie, G. R., Dunham, I., Zeggini, E., & Flicek, P. (2014). Functional annotation of noncoding sequence variants. Nature Methods, 11(3), 294.

[2] Wang, J., Dayem Ullah, A. Z., & Chelala, C. (2018). IW-Scoring: an Integrative Weighted Scoring framework for annotating and prioritizing genetic variations in the noncoding genome. Nucleic Acids Research, 46(8), e47.

[3] Chen, L., Jin, P., & Qin, Z. S. (2016). DIVAN: Accurate identification of non-coding disease-specific risk variants based on multi-omics profiles. Genome Biology, 17, 252.

Solutions of tests

Students, please post a link to your test results here.
