
VDB Shotgun Pipeline

Prerequisites

  • Snakemake: we currently recommend installing version 7.31.1, as later versions may be inconsistent with this pipeline.
  • Apptainer/Singularity: although we provide conda environments in many cases, containers are the only execution method we support.
  • (optional) A Snakemake Profile: this coordinates the execution of jobs on whatever hardware you are using.

Recommendations

  • Set up a fresh conda virtual environment, install Snakemake 7.31.1 and Python 3.10.9, and use this environment to run all analyses.
  • In your .bashrc, set the $SNAKEMAKE_PROFILE variable to point to the vdblab-profile (a private repo for vdblab members) or to whatever Snakemake profile you will be using. At the same time, add a $TMPDIR environment variable definition to your .bashrc to define where temporary files should go; if working on lilac, we recommend pointing it to a location in your /data/ directory. A sketch of both settings follows this list.
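
For example, a minimal .bashrc sketch (the profile path and temporary directory below are placeholders; substitute your own):

# in ~/.bashrc
export SNAKEMAKE_PROFILE=/path/to/your/profile/  # e.g. your vdblab-profile checkout
export TMPDIR=/data/your_user/tmp                # placeholder; on lilac, use a /data/ location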

Important Notes:

  • Set the environment variable $SNAKEMAKE_PROFILE to the location of your profile (e.g. export SNAKEMAKE_PROFILE=/path/to/your/profile/). We recommend adding this to the .bashrc file in your home directory so the variable is set automatically at startup.
  • For the purposes of the examples, we added the --dry-run flag so you can preview the rules to be executed. Remove this flag to execute the commands.
  • All database paths are configured in config/config.yaml. Change the paths to reflect where the databases can be found on your machine. For a uniform way to fetch and build all the databases, see https://github.com/vdblab/resources
  • If running the analysis on SRA files, set dedup_platform=SRA in your command's config (see the example after this list). If this option is not set, the pipeline will hang indefinitely at the dedup stage.
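
For instance, a kraken-stage run over SRA-derived reads might look like the following (a sketch; the sample name and fastq paths are placeholders):

snakemake \
  --directory tmpsra/ \
  --config \
    sample=SRR0000000 \
    R1=[/path/to/SRR0000000_1.fastq.gz] \
    R2=[/path/to/SRR0000000_2.fastq.gz] \
    dedup_platform=SRA \
    stage=kraken \
  --dry-run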

Simulating test data:

snakemake --snakefile .test/Snakefile --directory .test/simulated/

Main Pipeline

Usage

snakemake \
  --directory tmpout/ \
  --config \
    sample=473 \
    R1=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R1_001.fastq.gz] \
    R2=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R2_001.fastq.gz] \
    nshards=4 \
    stage=all \
  --dry-run
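
Once the dry run looks right, executing for real is the same command without --dry-run, run under your profile; a sketch:

export SNAKEMAKE_PROFILE=/path/to/your/profile/
snakemake \
  --directory tmpout/ \
  --config \
    sample=473 \
    R1=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R1_001.fastq.gz] \
    R2=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R2_001.fastq.gz] \
    nshards=4 \
    stage=all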

Outputs

  • MultiQC-ready reports
  • Microbe relative abundances (MetaPhlAn3, Kraken2)
  • Metabolic pathway relative abundances (HUMAnN3)
  • Metagenome assemblies (MetaSPAdes)
  • AMR profiles with Abricate and RGI
  • MAGs with MetaWRAP (Metabat2, CONCOCT, Maxbin2)
  • Gene prediction and annotation (MetaErg)
  • Secondary metabolite gene clusters (antiSMASH)
  • Antimicrobial resistance and virulence genes (ABRicate, AMRFinderPlus)
  • Carbohydrate active enzyme (CAZyme) annotation (dbCAN3)

Workflow

The rule DAG for a single sample looks like this:

Main Shotgun Pipeline DAG

Different modules of the workflow can be run independently using the stage config entry (preprocess, biobakery, kraken, assembly, annotate, binning, or rgi; stage=all runs everything), as the per-stage commands below show.

MultiQC

Just run MultiQC on a directory; there is no need to use Snakemake.

cp -r tmppre/reports tmpreports
cp tmpassembly/quast/quast_473/report.tsv ./tmpreports/
ver="v1.12"
docker run -v $PWD:$PWD -w $PWD ewels/multiqc:${ver} multiqc \
    --config vdb_shotgun/multiqc_config.yaml --force \
    --title "a multiqc report for some test data" \
    -b "generated by ${ver}" --filename multiqc_report.html \
    tmpreports/ --interactive
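
Since containers are the supported execution route, an equivalent Apptainer/Singularity invocation might look like this (a sketch, assuming apptainer is on your PATH; it pulls the same image from Docker Hub and binds the current directory by default):

apptainer exec docker://ewels/multiqc:${ver} multiqc \
    --config vdb_shotgun/multiqc_config.yaml --force \
    --title "a multiqc report for some test data" \
    -b "generated by ${ver}" --filename multiqc_report.html \
    tmpreports/ --interactive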

Preprocessing

Shotgun Preprocessing Pipeline DAG

snakemake \
  --directory tmppreprocess/ \
  --config \
    sample=473 \
    R1=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R1_001.fastq.gz] \
    R2=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R2_001.fastq.gz] \
    nshards=4 \
    dedup_platform=NovaSeq \
    stage=preprocess \
  --dry-run

Tools used

Biobakery

Shotgun Biobakery Profiling Pipeline DAG

snakemake \
  --directory tmpbiobakery/ \
  --config \
    sample=473 \
    R1=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R1_001.fastq.gz] \
    R2=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R2_001.fastq.gz] \
    stage=biobakery \
  --dry-run

Tools used

Kraken2/Bracken

Shotgun Kraken/Bracken Pipeline DAG

snakemake \
  --directory tmpkraken/ \
  --config \
    sample=473 \
    R1=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R1_001.fastq.gz] \
    R2=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R2_001.fastq.gz] \
    dedup_platform=NovaSeq \
    stage=kraken \
  --dry-run

Tools used

Assembly

Shotgun Assembly Pipeline DAG

snakemake \
  --directory tmpassembly/ \
  --config \
    sample=473 \
    R1=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R1_001.fastq.gz] \
    R2=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R2_001.fastq.gz] \
    stage=assembly \
  --dry-run

Tools used

Annotation

Shotgun Assembly Annotation DAG

snakemake \
  --directory tmpannotate/ \
  --config \
    sample=473 \
    R1=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R1_001.fastq.gz] \
    R2=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R2_001.fastq.gz] \
    assembly=tmpassembly/473.contigs.fasta \
    stage=annotate \
  --dry-run

Tools used

Binning

Shotgun Assembly Binning Pipeline DAG

snakemake \
  --directory tmpbinning/ \
  --config \
    sample=473 \
    R1=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R1_001.fastq.gz] \
    R2=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R2_001.fastq.gz] \
    assembly=tmpassembly/473.contigs.fasta \
    stage=binning \
  --dry-run

RGI

Shotgun RGI Pipeline DAG

snakemake \
  --directory tmprgi/ \
  --config \
    sample=473 \
    R1=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R1_001.fastq.gz] \
    R2=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R2_001.fastq.gz] \
    stage=rgi \
  --dry-run

Tools used

Strainphlan Pipeline

This pipeline runs StrainPhlAn for each specified species. StrainPhlAn requires two inputs: sample-level marker pickle files, and species-level markers extracted from the main database. These are stored in a central subdirectory of the MetaPhlAn database directory to aid re-running. If you provide the .sam.bz2 file for a sample that has already been processed into a .pkl file, the pregenerated result will be reused.

This workflow accepts as input a list of samples' MetaPhlAn sam.bz2 alignment files and a list of species of interest. The config argument strainphlan_markers_dir serves as a central place for storing both the species- and the sample-level marker files; these are specific to a version of the MetaPhlAn database, so we recommend placing that directory within the MetaPhlAn database directory.

Usage

snakemake \
  --snakefile workflow/strainphlan.smk \
  --directory tmpstrain/ \
  --config \
    sams=[path/to/sample1.sam.bz2,path/to/sample2.sam.bz2] \
    strainphlan_markers_dir=/data/brinkvd/resources/dbs/metaphlan/mpa_vJan21_CHOCOPhlAnSGB_202103/marker_outputs/ \
    metaphlan_db=/data/brinkvd/resources/dbs/metaphlan/mpa_vJan21_CHOCOPhlAnSGB_202103/ \
    marker_in_n_samples=2 \
  --dry-run

Outputs

For each input species:

  • Multiple sequence alignment of strains detected in samples
  • Phylogenetic tree of strains detected in samples

Workflow

The rule DAG for two example input species looks like this:

StrainPhlAn Shotgun Pipeline DAG

Testing and Development

Please see development.md.
