This repository contains the implementation and optimization of the INDELseek algorithm, an open-source tool designed for detecting complex insertions and deletions (indels) in Next-Generation Sequencing (NGS) data, as described the "INDELseek: detection of complex insertions and deletions from next-generation sequencing data" by Au C.H. et. al. Originally implemented in Perl, INDELseek is a crucial tool for identifying genetic variations that may play significant roles in diseases, including cancer. Our project focuses on re-implementing the algorithm in Python and C++ to enhance performance and usability, as well as comparing it to existing tools like GATK.
This analysis was performed as part of the final project for the "Algorithms in Molecular Biology" graduate course of the MSc Data Science & Information Technologies Master's programme (Bioinformatics - Biomedical Data Science Specialization) of the Department of Informatics and Telecommunications department of the National and Kapodistrian University of Athens (NKUA), under the supervision of professors Theodore Dalamagas and George Georgakilas, in the academic year 2022-2023.
- Original INDELseek Algorithm: Perl script enhanced with execution time tracking and memory logging.
- Python Implementation: Uses dictionaries for caching and simplifies debugging with modular design.
- C++ Implementation: Incorporates hash maps for caching, custom structs for memory efficiency, and a
Makefilefor easy compilation. - Memory Usage Visualization:
graph.shgenerates graphs for RAM usage over time during execution.
- Datasets: NGS datasets including NA12878 (1.5 billion reads) and HG00148.chr11 (4 million reads) were used.
- Computing Environment: Testing was performed locally and on a Hypatia VM (7 cores, 60GB RAM).
- Dependencies: Scripts rely on
samtoolsfor sequence processing and alignment.
- Execution Time: The C++ implementation demonstrated significant speed improvements (40 minutes for ~4M reads) compared to Python (5 hours 42 minutes) and Perl (5 hours 49 minutes).
- Memory Usage: Resource footprint was consistently low (~600MB RAM for a 4M reads dataset).
- Insights: Compiled languages like C++ show clear performance advantages, albeit at increased coding complexity.
git clone https://github.com/GiatrasKon/INDELseek-Optimized-Implementations.gitsamtools- Processes BAM/SAM files for read alignments.
- Fetches reference sequences for specific regions using
faidx. - Computes depth of coverage using
depth. - Installation:
sudo apt install samtoolsgnuplot- Used in
graph.shto generate memory usage graphs for the running process. - Installation:
- Used in
sudo apt install gnuplotperl- The original INDELseek implementation is written in Perl. If you plan to compare outputs, ensure Perl is installed with the required modules like
Getopt::Long. - Installation:
- The original INDELseek implementation is written in Perl. If you plan to compare outputs, ensure Perl is installed with the required modules like
sudo apt install perlPython:
Ensure you have Python 3.x installed. Use the following commands to install required packages:
pip install argparse psutil subprocess datetime time collections pprintC++ and Perl:
- Ensure
gcc(C++ compiler) andmakeare installed. - Install
samtools(v1.3 or higher). - Verify Perl is installed with required modules like
Getopt::Long. - Compile the C++ Implementation: Navigate to the scripts directory and run:
make- NA12878 BAM File:
- This is a 113GB BAM file containing approximately 1.5 billion reads. It is aligned to the hg38 reference genome.
- Used to attempt reproduction of the original INDELseek results.
- Extracted subsets (chromosomes 20 and 15) for faster testing and analysis.
- How to Obtain:
- Visit Illumina BaseSpace (requires an account).
- Look for the NA12878 sample in publicly available datasets or the Genome in a Bottle (GIAB) project.
- HG00148.chr11 BAM File:
- A BAM file containing approximately 4 million reads focused on chromosome 11.
- Used for benchmarking the Python and C++ implementations.
- Provides a smaller, manageable dataset for performance comparisons.
- How to Obtain:
- Visit the 1000 Genomes Project website.
- Download the HG00148 dataset.
- Reference Genome (hg38):
- A FASTA file of the human genome (hg38 version) used for sequence alignment and comparison.
- Required for samtools operations like faidx and depth.
- Essential for running all INDELseek implementations.
- How to Obtain:
How to Download and Prepare the Datasets:
- NA12878 and HG00148: - BAM files can be downloaded from their respective sources using web interfaces or wget/curl commands (if direct URLs are provided).
- Reference Genome: - Download the hg38.fa file and index it using samtools:
samtools faidx hg38.faNotes for Users:
- Ensure sufficient storage space for large datasets.
- Validate downloaded files (e.g., by comparing checksums).
- Ensure compatibility between the reference genome version and BAM file alignments.
- Follow ethical guidelines and licensing agreements when using public datasets.
Python:
python indelseek.py --ref_genome path/to/reference.fa --input_bam path/to/input.bamC++:
./indelseek --reference path/to/reference.fa --input_bam path/to/input.bam --output_vcf output.vcfRefer to the documents directory for the project presentation, report, and the original authors' paper.