bacterial_assembly

Assembly pipeline for bacterial isolates using SPAdes

Installation

Download the zip file or clone it with

git clone https://github.com/kneubert/bacterial_assembly

All scripts need to be made executable:

chmod a+x bacterial_assembly/*.sh  
chmod a+x bacterial_assembly/scripts/*  

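Add the pipeline directory to your PATH, e.g. in your ~/.bashrc. Replace SRC_PATH with the directory that contains the pipeline; the folder is named bacterial_assembly-master when unpacked from the zip file and bacterial_assembly when cloned: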
export PATH=$PATH:SRC_PATH/bacterial_assembly-master

Example run

First, check that all programs listed under 'Prerequisites' are installed and available in your PATH.
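A quick way to run this check is a small shell helper like the sketch below. The function name `missing_tools` is hypothetical and not part of the pipeline, and the executable names in the commented example call are assumptions that may differ on your installation.

```shell
# Print every given executable that cannot be found in PATH, one per line.
# Returns 1 if at least one executable is missing, 0 otherwise.
missing_tools() {
    local tool status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            status=1
        fi
    done
    return "$status"
}

# Example call (executable names are assumptions, adjust as needed):
# missing_tools fastqc spades.py quast.py prokka multiqc
```

An empty output means that every listed executable was found.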

1. Configuration

As a first step, create a folder, e.g. 'my_project', somewhere in your working directory, and create the configuration file 'parameter.cfg' inside it.

mkdir my_project; cd my_project  

The configuration file 'parameter.cfg' can look like this:

# [general parameter]
# number of threads that can be used by the pipeline
THREADS=32

# [QC & preprocessing]
# the location of the minikraken DB needs to be given
minikrakenDB=/group/ag_abi/kneubert/soft/Kraken/krakenDB/minikraken_20171019_8GB

# [Assembly]
# directory that contains reference assemblies in subdirectories named by accession numbers:
# assembly_accession e.g. GCA_000008985.1
# ___ fasta file e.g. GCA_000008985.1_ASM898v1_genomic.fna
# ___ gene annotation file e.g. GCA_000008985.1_ASM898v1_genomic.gff
# ___ genbank file e.g. GCA_000008985.1_ASM898v1_genomic.gbff
# if 'REFERENCES' is not defined, Genbank references will be downloaded automatically for the given species
REFERENCES=/group/ag_abi/kneubert/References

# PAGIT installation directory
PAGIT_HOME=/group/ag_abi/kneubert/soft/PAGIT

# Pilon executable jar file
PILON_JAR=/group/ag_abi/kneubert/soft/Pilon/pilon-1.22.jar

# the minimum coverage of filtered contigs
MIN_COV=5

# the minimum length of filtered contigs
MIN_LENGTH=500

# [Mapping]
# Bowtie2 directory
bowtie2_dir=/group/ag_abi/kneubert/soft/bowtie2-2.3.3.1-linux-x86_64
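Because the configuration file consists only of `KEY=VALUE` lines and `#` comments, it is valid shell syntax. The sketch below shows one way such a file could be loaded and checked for missing settings before a run. It is an illustration under that assumption, not the pipeline's actual loading code, and the function name `load_config` is hypothetical.

```shell
# Source a KEY=VALUE configuration file and verify that the named settings
# are present and non-empty.
# Usage: load_config <config file> <required key>...
load_config() {
    local cfg="$1" key value
    shift
    if [ ! -r "$cfg" ]; then
        echo "cannot read config file: $cfg" >&2
        return 1
    fi
    # Without a slash, '.' may search PATH instead of the current directory.
    case $cfg in
        */*) ;;
        *) cfg=./$cfg ;;
    esac
    . "$cfg"
    for key in "$@"; do
        eval "value=\${$key:-}"
        if [ -z "$value" ]; then
            echo "missing setting in $cfg: $key" >&2
            return 1
        fi
    done
}

# Example call using keys from the configuration shown above:
# load_config parameter.cfg THREADS minikrakenDB PAGIT_HOME PILON_JAR bowtie2_dir
```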

2. Run the assembly pipeline

To run a single sample, call the pipeline script with the sample ID, the read directory and the species as parameters:
assembly_pipeline_SPAdes.sh [sample-ID] [read directory] [species]

For example:

assembly_pipeline_SPAdes.sh  16T0014 reads  'Francisella tularensis'   

It can be useful to write all output (stdout and stderr) to a log file:

assembly_pipeline_SPAdes.sh 16T0014 reads 'Francisella tularensis' 2>&1 |tee -a 16T0014.log   

To run multiple samples, create a bash script file, e.g. 'jobs', that contains one run per line, and then source it with 'source jobs':

assembly_pipeline_SPAdes.sh 16T0014 reads 'Francisella tularensis' 2>&1 |tee -a 16T0014.log   
assembly_pipeline_SPAdes.sh 11T0315 reads 'Francisella tularensis' 2>&1 |tee -a 11T0315.log   
assembly_pipeline_SPAdes.sh FSC237 reads 'Francisella tularensis' 2>&1 |tee -a FSC237.log    
source jobs 
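For many samples, the job lines can also be generated instead of typed by hand. The following is a minimal sketch; the helper name `make_jobs` is hypothetical and not part of the pipeline. It prints one call per sample ID in the same form as the lines above.

```shell
# Print one pipeline call per sample ID, each appending to <sample-ID>.log.
# Usage: make_jobs <read directory> <species> <sample-ID>...
make_jobs() {
    local reads="$1" species="$2" sample
    shift 2
    for sample in "$@"; do
        printf "assembly_pipeline_SPAdes.sh %s %s '%s' 2>&1 | tee -a %s.log\n" \
            "$sample" "$reads" "$species" "$sample"
    done
}

# Example: write the job file, then run it.
# make_jobs reads 'Francisella tularensis' 16T0014 11T0315 FSC237 > jobs
# source jobs
```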

The names of the fastq files must match one of the following naming schemes, where _1 and _2 or _R1 and _R2 mark the forward and reverse reads. Files for the same sample that come from different sequencing runs/lanes are merged.

1.) [project-ID]_[sample-ID]_[library]_[sequencing run/lane]_[x_1/_2].fastq.gz
for example:
NG-11942_08T0013_lib171814_5228_2_1.fastq.gz
NG-11942_08T0013_lib171814_5228_2_2.fastq.gz

2.) [sample-ID]_xy_[library]_[R1/R2]_[lane]_[run/date].fastq.gz
for example:
ES-0001a_S07_L001_R1_001_20161019.fastq.gz
ES-0001a_S07_L001_R2_001_20161019.fastq.gz
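To check file names before a run, the two schemes can be told apart with simple pattern matching. The sketch below is not the pipeline's own parsing code, and the function name `parse_fastq_name` is hypothetical. It assumes that fields are separated by `_` and that the sample ID itself contains no underscore.

```shell
# Print "<sample-ID> <read>" for a fastq file name, where <read> is 1 or 2.
parse_fastq_name() {
    local base rest sample mate
    base=$(basename "$1" .fastq.gz)
    case $base in
        # Scheme 2: the sample ID is the first field, R1/R2 marks the direction.
        *_R1_*) sample=${base%%_*}; mate=1 ;;
        *_R2_*) sample=${base%%_*}; mate=2 ;;
        # Scheme 1: the sample ID is the second field, the last field is 1 or 2.
        *_1|*_2)
            rest=${base#*_}
            sample=${rest%%_*}
            mate=${base##*_}
            ;;
        *)
            echo "unrecognized file name: $1" >&2
            return 1
            ;;
    esac
    echo "$sample $mate"
}

# Example: list sample ID and read direction for all files in 'reads'.
# for f in reads/*.fastq.gz; do parse_fastq_name "$f"; done
```

Files that match neither scheme are reported on stderr, which makes naming problems visible before the pipeline is started.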

3. Summarize quality metrics using MultiQC

After the runs have finished, start the MultiQC script in the project directory to summarize the QC statistics before (preQC) and after (postQC) the assembly.
The script should produce three folders, preQC, postQC_contigs and postQC_scaffolds, which contain the QC reports in HTML format for the raw data (FastQC, Kraken), the contig assembly and the scaffold assembly (QUAST, QualiMap, Prokka). The HTML reports can be opened with any browser that supports JavaScript.

Prerequisites

The following programs need to be installed:

Quality check of raw data

Adapter removal

Merge overlapping paired-end reads

De novo assembly (including read error correction)

Reference-based scaffolding of contigs

Assembly QC

Gene annotation
