Denoising pipeline using the DADA2 algorithm to process raw .gz sequencing files from paired-end Illumina MiSeq sequencing.
The pipeline currently trims standard Illumina adapters from the reads, filters and merges the reads (denoising), determines the taxonomy of the identified ASVs, and aligns them.
Moreover, the pipeline results in the creation of a Phyloseq object, containing the processed samples and their metadata, for further downstream analysis in R.
The intended contents of each directory are explained in separate README.md files.
This approach keeps each analysis run isolated in its own directory.
- Use the helper script to set up a new run:
./setup_run.sh my_run
Or manually create the structure:
mkdir -p runs/my_run/data/{raw_external,db,meta}
cp config_templates/basic.yaml runs/my_run/config.yaml
- Put your raw data, database and metadata in:
- runs/my_run/data/raw_external/ - Your .fastq.gz files
- runs/my_run/data/db/ - SILVA database file
- runs/my_run/data/meta/ - metadata.tsv file
The metadata file should be in .tsv format, and the names of the raw files must follow the convention "{your_sample}.R1.fastq.gz" for the pipeline to work.
- Edit runs/my_run/config.yaml to customize settings if needed.
- Run the pipeline from the repository root:
snakemake --configfile runs/my_run/config.yaml --directory runs/my_run --use-conda --conda-prefix ./.snakemake/conda --cores all all
Note: setup_run.sh sets a default shared conda prefix at ./.snakemake/conda (you can override this by exporting CONDA_PREFIX). Using a shared --conda-prefix prevents Snakemake from downloading duplicate environments into each runs/<id>/.snakemake/conda.
If you use a repo-local shared prefix, add /.snakemake/ to .gitignore to avoid committing environment files.
This will create all outputs (results/, intermediate/, etc.) inside runs/my_run/, keeping your runs isolated.
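Two of the housekeeping points above (the raw file naming convention and the .gitignore entry for a repo-local conda prefix) can be checked with a few lines of shell. This is a minimal sketch, not part of the pipeline: the function names check_raw_names and ignore_snakemake_dir are hypothetical, the sketch assumes that reverse reads use the analogous ".R2.fastq.gz" suffix, and it skips a README.md file if one is present in the directory.

```shell
# Hypothetical helpers; they are not shipped with the pipeline.

# Print every file in a directory whose name does not follow the
# "{your_sample}.R1.fastq.gz" convention. Assumption: reverse reads use
# the matching ".R2.fastq.gz" suffix. Returns 1 if anything is flagged.
check_raw_names() {
    local dir="$1"
    local f name bad=0
    for f in "$dir"/*; do
        [ -e "$f" ] || continue               # directory is empty
        name="$(basename "$f")"
        case "$name" in
            README.md) ;;                     # directory description, not data
            ?*.R1.fastq.gz|?*.R2.fastq.gz) ;; # follows the convention
            *) echo "unexpected file name: $name"; bad=1 ;;
        esac
    done
    return "$bad"
}

# Add /.snakemake/ to a .gitignore file unless that exact line is
# already there. Defaults to .gitignore in the current directory.
ignore_snakemake_dir() {
    local gitignore="${1:-.gitignore}"
    touch "$gitignore"
    if ! grep -qxF '/.snakemake/' "$gitignore"; then
        # Start a new line first if the file does not end with a newline.
        if [ -n "$(tail -c 1 "$gitignore")" ]; then
            echo >> "$gitignore"
        fi
        echo '/.snakemake/' >> "$gitignore"
    fi
}
```

For example, calling check_raw_names runs/my_run/data/raw_external from the repository root prints any file names that would not be picked up, and calling ignore_snakemake_dir without arguments updates the .gitignore in the current directory. Running the second helper repeatedly is safe because the entry is only appended once.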
For backwards compatibility, you can still run from the repository root:
- Put your raw data, database and metadata in:
- data/raw_external
- data/db
- data/meta
- Run the pipeline:
snakemake --use-conda --cores all all
The config file supports the following options:
- preprocess: "yes" or "no" - Run preprocessing steps (FastQC, MultiQC, Trim_galore)
- phylogeny: "yes" or "no" - Build phylogenetic alignment and tree
You can also override these via the command line:
--config preprocess="no" phylogeny="yes"
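To show where the override fits in a full invocation, the sketch below prints a complete command for an isolated run directory so it can be inspected before being executed. The helper name pipeline_cmd is hypothetical. The flags are the ones used in the run command earlier in this README. Snakemake's --config option accepts a list of KEY=VALUE pairs, so the sketch places the target (all) before it rather than after, to avoid the target being read as a config entry.

```shell
# Hypothetical helper; it only prints the command and does not run it.
# Usage: pipeline_cmd <run_dir> KEY=VALUE [KEY=VALUE ...]
pipeline_cmd() {
    local run_dir="$1"
    shift
    # The target "all" comes first; everything after --config is
    # passed through as KEY=VALUE overrides.
    echo snakemake all \
        --configfile "$run_dir/config.yaml" --directory "$run_dir" \
        --use-conda --conda-prefix ./.snakemake/conda --cores all \
        --config "$@"
}

# Example: skip preprocessing but build the phylogeny for runs/my_run.
pipeline_cmd runs/my_run preprocess=no phylogeny=yes
```

Values given with --config take precedence over those in the config file, so runs/my_run/config.yaml can stay unchanged for a one-off run.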
Note that the DADA2 step of the pipeline does not trim the reads, so if preprocessing is skipped, the sequences will not be trimmed at all.
The --cores flag specifies the number of cores to use; choose the value that works best for your machine.
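If you prefer a fixed number over all, one possible starting point is to leave a single core free for the rest of the system. This is only a convention, not a requirement of the pipeline, and the helper name suggest_cores is hypothetical. The sketch assumes getconf _NPROCESSORS_ONLN is available, which is the case on typical Linux and macOS systems.

```shell
# Hypothetical helper: print the number of online cores minus one,
# but never less than 1.
suggest_cores() {
    local total
    total="$(getconf _NPROCESSORS_ONLN)"
    if [ "$total" -gt 1 ]; then
        echo $((total - 1))
    else
        echo 1
    fi
}

# Possible use with the backwards-compatible invocation:
#   snakemake --use-conda --cores "$(suggest_cores)" all
```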