Technologies developed in the fields of bioinformatics and genomics now allow us to determine the nucleotide sequences of millions of DNA molecules in parallel. They let us not only take a closer look at DNA molecules and their composition, but also conduct studies in different fields, such as studying human genetic diseases, analysing the interaction between proteins and DNA, or even applying unsupervised machine learning to the data to generate new insights. But reading DNA sequences usually comes with a catch: we can't read the entire genome at once, so the data generated by these technologies usually comprises a huge number of very short DNA sequences, called 'reads'. These reads therefore need to be pieced together to recover the entire genome. One way to solve this complex puzzle is read mapping.
This technique consists in mapping the reads back to a reference genome to determine their positional origin. Many different algorithms exist to accomplish read mapping.
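To make the idea concrete, here is a minimal sketch of read mapping by exact string matching. It is only an illustration and not the algorithm implemented in this project: the function name `map_reads` and the toy sequences are hypothetical, and it assumes reads match the reference exactly. Practical read mappers typically rely on an index of the genome and tolerate mismatches.

```python
def map_reads(genome, reads):
    """Map each read to every 0-based position where it occurs exactly in the genome.

    Hypothetical helper for illustration only: a naive scan, no index, no mismatches.
    """
    positions = {}
    for read in reads:
        hits = []
        start = genome.find(read)
        while start != -1:
            hits.append(start)
            # Resume the search one base further to also catch overlapping hits.
            start = genome.find(read, start + 1)
        positions[read] = hits
    return positions


# Toy data (made up for this example, not taken from the project's dataset).
genome = "ACGTACGTTAGC"
reads = ["ACGT", "TTAG", "GGGG"]
print(map_reads(genome, reads))
```

A read that occurs several times in the reference gets several candidate positions, and a read with no exact match gets an empty list. These are the two cases a real mapper has to handle with more care.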
In this project, the task was to implement read mapping software that is able to align a large set of reads to a given genome. The reads used to test the algorithm were generated using ChIP-seq, and the genome is that of a Drosophila melanogaster “Oregon-R S2” strain. While testing the algorithm, I used a reduced dataset, since the original one is of non-negligible size and takes a long time to process.
You can find the code, the resources, and the report for this project in this repo.