- Snakemake: we currently recommend version 7.31.1, as later versions may be incompatible with this pipeline.
- Apptainer/Singularity: although we do provide conda envs in many cases, the only execution method we support is via containers.
- (optional) A Snakemake Profile: this coordinates the execution of jobs on whatever hardware you are using.
- Set up a fresh conda virtual environment with Snakemake 7.31.1 and Python 3.10.9, and use this environment to run all analyses.
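A minimal sketch of that environment setup, assuming conda is installed with access to the conda-forge and bioconda channels (the environment name `shotgun` is an illustrative choice, not part of the pipeline):

```shell
# Sketch: create a dedicated environment with the pinned versions.
# The name "shotgun" and the channel choices are illustrative.
conda create -y -n shotgun -c conda-forge -c bioconda \
    python=3.10.9 snakemake=7.31.1
conda activate shotgun
snakemake --version   # should print 7.31.1
```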
- Configure your `.bashrc` file to set the `$SNAKEMAKE_PROFILE` variable to the vdblab-profile (a private repo for vdblab members) or to point to whichever Snakemake profile you will be using. At the same time, add a `$TMPDIR` environment variable definition to your `.bashrc` to define where temporary files should be written. If doing this on lilac, we recommend pointing this to a location in your /data/ directory.
- Set the location of your profile in the environment variable `$SNAKEMAKE_PROFILE` (e.g. `export SNAKEMAKE_PROFILE=/path/to/your/profile/`). We recommend adding this to the `.bashrc` file in your home directory so that the variable is set on startup.
- For the purposes of the examples, we added the `--dry-run` flag so you can preview the rules to be executed. Remove this flag to execute the commands.
- All database paths are configured in `config/config.yaml`; change the paths to reflect where the databases can be found on your machine. For a uniform way to fetch and build all the databases, see https://github.com/vdblab/resources
- If running the pipeline on SRA files, set `dedup_platform=SRA` in your command's config. If this is not set, the pipeline will hang indefinitely at the dedup stage.
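For example, the two exports above can live together in `~/.bashrc`; the paths below are placeholders, so substitute your own profile directory and scratch location:

```shell
# Illustrative ~/.bashrc additions; both paths are placeholders.
export SNAKEMAKE_PROFILE="$HOME/snakemake-profiles/cluster"  # your profile directory
export TMPDIR="$HOME/tmp"                                    # on lilac, prefer a /data/ path
mkdir -p "$TMPDIR"                                           # make sure it exists
```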
```
snakemake --snakefile .test/Snakefile --directory .test/simulated/
```
```
snakemake \
    --directory tmpout/ \
    --config \
    sample=473 \
    R1=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R1_001.fastq.gz] \
    R2=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R2_001.fastq.gz] \
    nshards=4 \
    stage=all \
    --dry-run
```
- MultiQC-ready reports
- Microbe relative abundances (MetaPhlAn3, Kraken2)
- Metabolic pathway relative abundances (HUMAnN3)
- Metagenome assembled genomes (MetaSPAdes)
- AMR profiles with ABRicate and RGI
- MAGs with MetaWRAP (Metabat2, CONCOCT, Maxbin2)
- Gene prediction and annotation (MetaErg)
- Secondary metabolite gene clusters (antiSMASH)
- Antimicrobial resistance and virulence genes (ABRicate, AMRFinderPlus)
- Carbohydrate active enzyme (CAZyme) annotation (dbCAN3)
The rule DAG for a single sample looks like this:
Different modules of the workflow can be run independently using the `stage` config entry.
Just run MultiQC on a directory; there is no need to use Snakemake:
```
cp -r tmppre/reports tmpreports
cp tmpassembly/quast/quast_473/report.tsv ./tmpreports/
ver="v1.12"
docker run -v $PWD:$PWD -w $PWD ewels/multiqc:${ver} multiqc \
    --config vdb_shotgun/multiqc_config.yaml --force \
    --title "a multiqc report for some test data" \
    -b "generated by ${ver}" --filename multiqc_report.html \
    tmpreports/ --interactive
```

```
snakemake \
    --directory tmppreprocess/ \
    --config \
    sample=473 \
    R1=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R1_001.fastq.gz] \
    R2=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R2_001.fastq.gz] \
    nshards=4 \
    dedup_platform=NovaSeq \
    stage=preprocess \
    --dry-run
```
- BBTools (site | paper)
- SeqKit (site | paper)
- Bowtie2 (site | paper)
- SNAP (site | [paper](https://doi.org/10.1101/2021.11.23.469039))
- SortMeRNA (site | paper)
- FastQC (site)
- 2-step host removal as described here, extended to use both the human and mouse genomes
```
snakemake \
    --directory tmpbiobakery/ \
    --config \
    sample=473 \
    R1=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R1_001.fastq.gz] \
    R2=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R2_001.fastq.gz] \
    stage=biobakery \
    --dry-run
```

```
snakemake \
    --directory tmpkraken/ \
    --config \
    sample=473 \
    R1=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R1_001.fastq.gz] \
    R2=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R2_001.fastq.gz] \
    dedup_platform=NovaSeq \
    stage=kraken \
    --dry-run
```

```
snakemake \
    --directory tmpassembly/ \
    --config \
    sample=473 \
    R1=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R1_001.fastq.gz] \
    R2=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R2_001.fastq.gz] \
    stage=assembly \
    --dry-run
```

```
snakemake \
    --directory tmpannotate/ \
    --config \
    sample=473 \
    R1=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R1_001.fastq.gz] \
    R2=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R2_001.fastq.gz] \
    assembly=tmpassembly/473.contigs.fasta \
    stage=annotate \
    --dry-run
```
- MetaErg (site | paper)
- antiSMASH (site | paper)
- ABRicate (site)
- AMRFinderPlus (site | paper)
- dbCAN (site | paper)
```
snakemake \
    --directory tmpbinning/ \
    --config \
    sample=473 \
    R1=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R1_001.fastq.gz] \
    R2=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R2_001.fastq.gz] \
    assembly=tmpassembly/473.contigs.fasta \
    stage=binning \
    --dry-run
```

```
snakemake \
    --directory tmprgi/ \
    --config \
    sample=473 \
    R1=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R1_001.fastq.gz] \
    R2=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R2_001.fastq.gz] \
    stage=rgi \
    --dry-run
```
This pipeline runs StrainPhlAn for each specified species. StrainPhlAn requires two inputs: sample-level marker pickle files, and species-level markers extracted from the main database. These are stored in a central subdirectory of the MetaPhlAn database directory to aid re-running. If you provide the `.sam.bz2` file for a sample that has already been processed into a `.pkl` file, the pregenerated result will be used.
This workflow accepts as input a list of samples' MetaPhlAn `sam.bz2` alignment files and a list of species of interest. The config argument `strainphlan_markers_dir` serves as a central place for storing both the species- and sample-level marker files; these are specific to a version of the MetaPhlAn database, so we recommend placing that directory within the MetaPhlAn database directory.
```
snakemake \
    --snakefile workflow/strainphlan.smk \
    --directory tmpstrain/ \
    --config \
    sams=[path/to/sample1.sam.bz2,path/to/sample2.sam.bz2] \
    strainphlan_markers_dir=/data/brinkvd/resources/dbs/metaphlan/mpa_vJan21_CHOCOPhlAnSGB_202103/marker_outputs/ \
    metaphlan_db=/data/brinkvd/resources/dbs/metaphlan/mpa_vJan21_CHOCOPhlAnSGB_202103/ \
    marker_in_n_samples=2 \
    --dry-run
```
For each input species:
- Multiple sequence alignment of strains detected in samples
- Phylogenetic tree of strains detected in samples
The rule DAG for two example input species looks like this:
Please see development.md.