Assembly pipeline for bacterial isolates using SPAdes
Download the zip file or clone the repository with
git clone https://github.com/kneubert/bacterial_assembly
All scripts need to be made executable:
chmod a+x bacterial_assembly/*.sh
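chmod a+x bacterial_assembly/scripts/*
Put the pipeline in your PATH, e.g. in your bashrc (replace SRC_PATH with the directory that contains the pipeline folder; the folder is named 'bacterial_assembly-master' when unpacked from the zip file and 'bacterial_assembly' when cloned):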
export PATH=$PATH:SRC_PATH/bacterial_assembly-master
First, check that all the programs listed under 'Prerequisites' are installed and available in your PATH.
As a first step, create a folder, e.g. 'my_project', somewhere in your working directory and create the configuration file 'parameter.cfg' in it:
mkdir my_project; cd my_project
The configuration file can look like this:
parameter.cfg:
# [general parameter]
# number of threads that can be used by the pipeline
THREADS=32
# [QC & preprocessing]
# the location of the minikraken DB needs to be given
minikrakenDB=/group/ag_abi/kneubert/soft/Kraken/krakenDB/minikraken_20171019_8GB
# [Assembly]
# directory that contains reference assemblies in subdirectories named by accession numbers:
# assembly_accession e.g. GCA_000008985.1
# ___ fasta file e.g. GCA_000008985.1_ASM898v1_genomic.fna
# ___ gene annotation file e.g. GCA_000008985.1_ASM898v1_genomic.gff
# ___ genbank file e.g. GCA_000008985.1_ASM898v1_genomic.gbff
# if 'REFERENCES' is not defined, Genbank references will be downloaded automatically for the given species
REFERENCES=/group/ag_abi/kneubert/References
# PAGIT installation directory
PAGIT_HOME=/group/ag_abi/kneubert/soft/PAGIT
# Pilon executable jar file
PILON_JAR=/group/ag_abi/kneubert/soft/Pilon/pilon-1.22.jar
# the minimum coverage of filtered contigs
MIN_COV=5
# the minimum length of filtered contigs
MIN_LENGTH=500
# [Mapping]
# Bowtie2 directory
bowtie2_dir=/group/ag_abi/kneubert/soft/bowtie2-2.3.3.1-linux-x86_64
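Because the file consists only of comments and KEY=VALUE lines, it is also valid shell syntax. As a minimal sketch, assuming the file can be sourced by a POSIX shell (how the pipeline itself reads the file is not shown here), the numeric settings could be sanity-checked before starting a run. The helper name check_numeric_params is hypothetical and not part of the pipeline:

```shell
# Hypothetical helper, not part of the pipeline: checks that THREADS,
# MIN_COV and MIN_LENGTH in a config file are non-negative integers.
# The function body runs in a subshell, so sourcing the file does not
# change the calling shell. Only use this on config files you trust,
# because sourcing executes the file as shell code.
check_numeric_params() (
    cfg=$1
    [ -r "$cfg" ] || { echo "cannot read $cfg" >&2; exit 1; }
    # '.' searches PATH for names without a slash, so force a path
    case "$cfg" in
        */*) ;;
        *) cfg=./$cfg ;;
    esac
    . "$cfg"
    status=0
    for name in THREADS MIN_COV MIN_LENGTH; do
        eval "value=\${$name-}"
        case "$value" in
            ''|*[!0-9]*)
                echo "$name must be a non-negative integer, got '$value'" >&2
                status=1
                ;;
        esac
    done
    exit "$status"
)
```

Calling check_numeric_params parameter.cfg inside the project folder returns a non-zero exit status and prints a message for each setting that is missing or not a number.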
To run a single sample, call the pipeline script with the sample ID, read directory and species as parameters:
assembly_pipeline_SPAdes.sh [sample-ID] [read directory] [species]
For example:
assembly_pipeline_SPAdes.sh 16T0014 reads 'Francisella tularensis'
It can be useful to write all output to a log file:
assembly_pipeline_SPAdes.sh 16T0014 reads 'Francisella tularensis' 2>&1 |tee -a 16T0014.log
To run multiple samples, create a bash script file, e.g. 'jobs', that contains multiple runs and source it:
assembly_pipeline_SPAdes.sh 16T0014 reads 'Francisella tularensis' 2>&1 |tee -a 16T0014.log
assembly_pipeline_SPAdes.sh 11T0315 reads 'Francisella tularensis' 2>&1 |tee -a 11T0315.log
assembly_pipeline_SPAdes.sh FSC237 reads 'Francisella tularensis' 2>&1 |tee -a FSC237.log
source jobs
It is important that the names of the fastq files match one of the following naming schemes, where _1 and _2 or _R1 and _R2 are the flags for the forward and reverse reads. Files for the same sample from different sequencing runs/lanes are merged.
1.) [project-ID]_[sample-ID]_[library]_[sequencing run/lane]_[x_1/_2].fastq.gz
for example:
NG-11942_08T0013_lib171814_5228_2_1.fastq.gz
NG-11942_08T0013_lib171814_5228_2_2.fastq.gz
2.) [sample-ID]_xy_[library]_[R1/R2]_[lane]_[run/date].fastq.gz
for example:
ES-0001a_S07_L001_R1_001_20161019.fastq.gz
ES-0001a_S07_L001_R2_001_20161019.fastq.gz
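In the examples above, the two schemes can be told apart by the position of the read flag. The following sketch is only an illustration derived from those example names, not the pipeline's own parsing code. The function name parse_fastq_name and the assumption of exactly six underscore-separated fields are hypothetical:

```shell
# Hypothetical helper, not part of the pipeline: prints "<sample-ID> <R1|R2>"
# for a fastq file name that follows one of the two naming schemes,
# and returns a non-zero status otherwise.
# Assumption: names have exactly six underscore-separated fields,
# as in the example file names.
parse_fastq_name() (
    name=${1##*/}                      # strip any leading directory
    case "$name" in
        *.fastq.gz) stem=${name%.fastq.gz} ;;
        *) exit 1 ;;
    esac
    set -f                             # no globbing while splitting
    IFS=_
    set -- $stem                       # split the stem on underscores
    [ "$#" -eq 6 ] || exit 1
    case "$4" in
        # scheme 2: sample ID is field 1, read flag (R1/R2) is field 4
        R1|R2) echo "$1 $4"; exit 0 ;;
    esac
    case "$6" in
        # scheme 1: sample ID is field 2, read flag (1/2) is the last field
        1|2) echo "$2 R$6"; exit 0 ;;
    esac
    exit 1
)
```

Running the function over the files in the read directory before starting the pipeline gives a quick way to spot names that match neither scheme.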
After the runs have finished, start the multiQC script in the project directory to summarize QC statistics before (preQC) and after the assembly (postQC).
This script should produce three folders, preQC, postQC_contigs and postQC_scaffolds, that contain the QC reports in HTML format for the raw data (FastQC, Kraken), the contig assembly and the scaffold assembly (QUAST, QualiMap, Prokka). The HTML reports can be opened with any browser that supports JavaScript.
The following programs need to be installed:
- FastQC (https://www.bioinformatics.babraham.ac.uk/projects/download.html#fastqc)
- Kraken with the mini-kraken DB (8 GB) (https://ccb.jhu.edu/software/kraken/)
- Flexbar 3.0.3 (https://github.com/seqan/flexbar)
- FLASH2 (https://github.com/dstreett/FLASH2)
- SPAdes with options --careful --cov-cutoff auto (http://cab.spbu.ru/software/spades/)
- MUMmer is needed to compute Maximal Unique Matches index (MUMi) values between the de novo assembly and several Genbank reference assemblies (http://mummer.sourceforge.net/)
- PAGIT, the Post Assembly Genome Improvement Toolkit, is used for scaffolding (http://www.sanger.ac.uk/science/tools/pagit)
- Pilon is used for assembly improvement (https://github.com/broadinstitute/pilon)
- Bowtie2 as a prerequisite for Pilon (http://bowtie-bio.sourceforge.net/bowtie2)
- Samtools 1.3.1 is used to convert and index alignment files (https://sourceforge.net/projects/samtools/files/samtools/1.3.1/)
- QUAST to check the quality of the contig and the scaffold assembly separately (http://quast.sourceforge.net/quast.html)
- QualiMap to confirm mapping of reads to the reference assembly (http://qualimap.bioinfo.cipf.es/)
- MultiQC is used to combine quality statistics from different tools in an HTML report (http://multiqc.info/)
- Prokka is used to predict genes from the de novo contig assembly or scaffolds (https://github.com/tseemann/prokka)
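Whether the command-line tools from this list are reachable through your PATH can be checked with a few lines of shell. This is a minimal sketch: the helper name missing_tools is hypothetical, and the executable names in the usage line below are assumptions about typical installations that may differ on your system. PAGIT, Pilon and Bowtie2 are located through parameter.cfg and are therefore not covered by this check.

```shell
# Hypothetical helper, not part of the pipeline: prints every given
# program name that cannot be found in the current PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}
```

For example, missing_tools fastqc kraken flexbar flash2 spades.py nucmer samtools quast.py qualimap multiqc prokka prints nothing if all of these (assumed) executables are found.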