bacterial_assembly

Assembly pipeline for bacterial isolates using SPAdes

Installation

Download the zip file or clone it with

git clone https://github.com/kneubert/bacterial_assembly

All scripts need to be made executable:

chmod a+x bacterial_assembly/*.sh  
chmod a+x bacterial_assembly/scripts/*

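Put the pipeline in your PATH, e.g. by adding the following line to your ~/.bashrc (replace SRC_PATH with the directory that contains the pipeline folder; the folder is named 'bacterial_assembly-master' when unpacked from the zip file and 'bacterial_assembly' after git clone):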
export PATH=$PATH:SRC_PATH/bacterial_assembly-master

Example run

First check that all the programs listed under 'Prerequisites' are installed and available in your PATH.
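A minimal sketch of such a check. The helper `check_tools` is hypothetical (it is not part of the pipeline), and the executable names in the usage comment are assumptions about how the tools are typically installed; the pipeline may call them differently.

```shell
# Hypothetical helper: report every given program that is not found in PATH.
# Returns 0 if all programs are found, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "not found in PATH: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage (assumed executable names, adjust to your installation):
# check_tools fastqc kraken spades.py nucmer bowtie2 samtools quast.py qualimap multiqc prokka
```

PAGIT and Pilon are located through 'parameter.cfg' rather than PATH, so they are not part of this check.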

1. Configuration

As a first step, create a folder, e.g. 'my_project', somewhere in your working directory and create the configuration file 'parameter.cfg' in it:

mkdir my_project; cd my_project

The configuration file 'parameter.cfg' can look like this:

# [general parameter]
# number of threads that can be used by the pipeline
THREADS=32

# [QC & preprocessing]
# the location of the minikraken DB needs to be given
minikrakenDB=/group/ag_abi/kneubert/soft/Kraken/krakenDB/minikraken_20171019_8GB

# [Assembly]
# directory that contains reference assemblies in subdirectories named by accession number:
# assembly_accession e.g. GCA_000008985.1
# ___ fasta file e.g. GCA_000008985.1_ASM898v1_genomic.fna
# ___ gene annotation file e.g. GCA_000008985.1_ASM898v1_genomic.gff
# ___ genbank file e.g. GCA_000008985.1_ASM898v1_genomic.gbff
# if 'REFERENCES' is not defined, Genbank references will be downloaded automatically for the given species
REFERENCES=/group/ag_abi/kneubert/References

# PAGIT installation directory
PAGIT_HOME=/group/ag_abi/kneubert/soft/PAGIT

# Pilon executable jar file
PILON_JAR=/group/ag_abi/kneubert/soft/Pilon/pilon-1.22.jar

# the minimum coverage of filtered contigs
MIN_COV=5

# the minimum length of filtered contigs
MIN_LENGTH=500

# [Mapping]
# Bowtie2 directory
bowtie2_dir=/group/ag_abi/kneubert/soft/bowtie2-2.3.3.1-linux-x86_64
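Because the paths in this example are specific to one system, it can help to verify them before starting a run. The following is a minimal sketch, assuming 'parameter.cfg' is plain KEY=value shell syntax that can be sourced; the function `check_config` is hypothetical and not part of the pipeline.

```shell
# Hypothetical helper: source the config in a subshell and verify that the
# configured paths exist. Returns 0 if everything is found, 1 otherwise.
check_config() (
    cfg=${1:-parameter.cfg}
    # '.' searches PATH for names without a slash, so force a relative path
    case $cfg in */*) ;; *) cfg=./$cfg ;; esac
    [ -f "$cfg" ] || { echo "missing config file: $cfg" >&2; exit 1; }
    . "$cfg"
    status=0
    for dir in "${minikrakenDB:-}" "${PAGIT_HOME:-}" "${bowtie2_dir:-}"; do
        [ -d "$dir" ] || { echo "not a directory: '$dir'" >&2; status=1; }
    done
    [ -f "${PILON_JAR:-}" ] || { echo "Pilon jar not found: '${PILON_JAR:-}'" >&2; status=1; }
    # REFERENCES is optional; only check it when it is set
    if [ -n "${REFERENCES:-}" ] && [ ! -d "$REFERENCES" ]; then
        echo "not a directory: '$REFERENCES'" >&2
        status=1
    fi
    exit "$status"
)

# Usage, from inside the project folder:
# check_config parameter.cfg
```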

2. Run the assembly pipeline

To run a single sample, call the pipeline script with the sample ID, the read directory and the species as parameters:

assembly_pipeline_SPAdes.sh [sample-ID] [read directory] [species]

For example:

assembly_pipeline_SPAdes.sh  16T0014 reads  'Francisella tularensis'

It can be useful to write all outputs to a log file:

assembly_pipeline_SPAdes.sh 16T0014 reads 'Francisella tularensis' 2>&1 |tee -a 16T0014.log

To run multiple samples, create a bash script file, e.g. 'jobs', that contains one pipeline call per sample, and source it:

assembly_pipeline_SPAdes.sh 16T0014 reads 'Francisella tularensis' 2>&1 |tee -a 16T0014.log   
assembly_pipeline_SPAdes.sh 11T0315 reads 'Francisella tularensis' 2>&1 |tee -a 11T0315.log   
assembly_pipeline_SPAdes.sh FSC237 reads 'Francisella tularensis' 2>&1 |tee -a FSC237.log

source jobs
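As an alternative to writing the 'jobs' file by hand, the same calls can be generated in a loop. This is only a sketch: the function `run_samples` and the `PIPELINE` variable are hypothetical names introduced here, and the samples are processed one after another.

```shell
# Command to run per sample; override for a dry run, e.g. PIPELINE=echo
PIPELINE=${PIPELINE:-assembly_pipeline_SPAdes.sh}

# Hypothetical helper: run the pipeline once per sample ID and append the
# output of each run to its own log file [sample-ID].log
run_samples() {
    read_dir=$1
    species=$2
    shift 2
    for sample in "$@"; do
        "$PIPELINE" "$sample" "$read_dir" "$species" 2>&1 | tee -a "$sample.log"
    done
}

# Usage:
# run_samples reads 'Francisella tularensis' 16T0014 11T0315 FSC237
```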

The names of the FASTQ files must match one of the following naming schemes, where _1 and _2 or _R1 and _R2 mark the forward and reverse reads. Files that belong to the same sample but to different sequencing runs/lanes are merged.

1.) [project-ID]_[sample-ID]_[library]_[sequencing run/lane]_[x_1/_2].fastq.gz
for example:
NG-11942_08T0013_lib171814_5228_2_1.fastq.gz
NG-11942_08T0013_lib171814_5228_2_2.fastq.gz

2.) [sample-ID]_xy_[library]_[R1/R2]_[lane]_[run/date].fastq.gz
for example:
ES-0001a_S07_L001_R1_001_20161019.fastq.gz
ES-0001a_S07_L001_R2_001_20161019.fastq.gz
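To see which files in a read directory would be recognized, the two schemes can be approximated with shell patterns. This is an illustrative sketch, not the matching logic of the pipeline itself; the function `read_direction` is hypothetical and only looks at the forward/reverse flag, not at the other fields of the file name.

```shell
# Hypothetical helper: classify a FASTQ file name as forward or reverse read.
# The _R1_/_R2_ flags (scheme 2) are tested before the trailing _1/_2 flags
# (scheme 1). Prints "unknown" and returns 1 if no pattern matches.
read_direction() {
    case "$(basename "$1")" in
        *_R1_*.fastq.gz) echo forward ;;
        *_R2_*.fastq.gz) echo reverse ;;
        *_1.fastq.gz)    echo forward ;;
        *_2.fastq.gz)    echo reverse ;;
        *) echo unknown; return 1 ;;
    esac
}

# Usage: list all files in 'reads' with their detected direction
# for f in reads/*.fastq.gz; do echo "$f: $(read_direction "$f")"; done
```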

3. Summarize results for quality metrics using MultiQC

After the runs have finished, start the MultiQC script (multiqc.sh) in the project directory to summarize QC statistics before (preQC) and after (postQC) the assembly.
This script should produce three folders, preQC, postQC_contigs and postQC_scaffolds, which contain the QC reports in HTML format for the raw data (FastQC, Kraken), the contig assembly and the scaffold assembly (QUAST, QualiMap, Prokka). The HTML reports can be opened with any browser that supports JavaScript.
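A quick way to confirm that the three report folders were written is sketched here. The function `check_reports` is hypothetical; it only checks that the folders exist and counts the HTML files in them, it does not inspect the reports.

```shell
# Hypothetical helper: check that the three expected report folders exist in
# the project directory and count the HTML files in each of them.
check_reports() {
    project_dir=${1:-.}
    status=0
    for dir in preQC postQC_contigs postQC_scaffolds; do
        if [ -d "$project_dir/$dir" ]; then
            count=$(find "$project_dir/$dir" -name '*.html' | wc -l | tr -d ' ')
            echo "$dir: $count HTML report(s)"
        else
            echo "$dir: missing" >&2
            status=1
        fi
    done
    return "$status"
}

# Usage, from inside the project folder:
# check_reports .
```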

Prerequisites

The following programs need to be installed:

De novo assembly (including read error correction)

SPAdes with options --careful --cov-cutoff auto (http://cab.spbu.ru/software/spades/)
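For orientation, a paired-end SPAdes call with these options could look as follows. This is a sketch under assumptions: the function `spades_command` is hypothetical and only prints the command, the input and output names are placeholders, and the exact call made by the pipeline may differ.

```shell
# Hypothetical helper: print a SPAdes command line for one paired-end library.
# $1 = forward reads, $2 = reverse reads, $3 = output directory.
# THREADS is taken from the environment (as in parameter.cfg), default 1.
spades_command() {
    echo spades.py --careful --cov-cutoff auto -t "${THREADS:-1}" \
        -1 "$1" -2 "$2" -o "$3"
}

# Usage: print the command for inspection
# THREADS=32
# spades_command sample_1.fastq.gz sample_2.fastq.gz spades_out
```

Here --careful enables the mismatch correction step of SPAdes and --cov-cutoff auto lets SPAdes choose the coverage cutoff itself.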

Reference-based scaffolding of contigs

MUMmer is needed to compute Maximal Unique Matches index (MUMi) values between the de novo assembly and several Genbank reference assemblies (http://mummer.sourceforge.net/)
PAGIT, the Post Assembly Genome Improvement Toolkit, is used for scaffolding (http://www.sanger.ac.uk/science/tools/pagit)
Pilon is used for assembly improvement (https://github.com/broadinstitute/pilon)
Bowtie2 as a prerequisite for Pilon (http://bowtie-bio.sourceforge.net/bowtie2)
Samtools 1.3.1 is used to convert and index alignment files (https://sourceforge.net/projects/samtools/files/samtools/1.3.1/)
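How Bowtie2, Samtools and Pilon fit together can be sketched as a mapping and polishing chain. This is an illustration under assumptions, not the pipeline's own code: the functions `run` and `polish_with_pilon` and all file names are hypothetical, and the options the pipeline passes to each tool may differ.

```shell
# Set DRY_RUN=1 to print the commands instead of executing them.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then echo "$*"; else "$@"; fi
}

# Hypothetical helper: map the reads back to the assembly and polish it.
# $1 = assembly FASTA, $2 = forward reads, $3 = reverse reads, $4 = output prefix.
# THREADS and PILON_JAR are expected to be set as in parameter.cfg.
polish_with_pilon() {
    assembly=$1 reads_1=$2 reads_2=$3 prefix=$4
    # build the Bowtie2 index and align the paired-end reads
    run bowtie2-build "$assembly" "$prefix.idx"
    run bowtie2 -p "${THREADS:-1}" -x "$prefix.idx" -1 "$reads_1" -2 "$reads_2" -S "$prefix.sam"
    # convert to sorted BAM and index it
    run samtools sort -o "$prefix.bam" "$prefix.sam"
    run samtools index "$prefix.bam"
    # let Pilon correct the assembly based on the alignment
    run java -jar "$PILON_JAR" --genome "$assembly" --frags "$prefix.bam" --output "$prefix.pilon"
}

# Usage (dry run):
# DRY_RUN=1; PILON_JAR=/path/to/pilon-1.22.jar
# polish_with_pilon contigs.fasta sample_1.fastq.gz sample_2.fastq.gz sample
```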

Assembly QC

QUAST to check the quality of the contig and the scaffold assembly separately (http://quast.sourceforge.net/quast.html)
QualiMap to confirm mapping of reads to the reference assembly (http://qualimap.bioinfo.cipf.es/)
MultiQC is used to combine quality statistics from different tools in an HTML report (http://multiqc.info/)

Gene annotation

Prokka is used to predict genes from the de novo contig assembly or scaffolds (https://github.com/tseemann/prokka)
