This repository contains a Kraken2-based workflow for analysing low-complexity 16S amplicon sequencing data.
Instead of generating ASVs (e.g. via DADA2, which is comparatively slow), this pipeline:
- trims primers and performs basic quality filtering
- classifies read pairs using Kraken2 with 16S-specific databases
- converts taxonomic profiles into BIOM format
- merges samples into a single abundance table
- imports the result into R as a phyloseq object
```
FASTQ
  ↓
Primer trimming + quality filtering (cutadapt)
  ↓
Taxonomic classification (Kraken2)
  ↓
Per-sample BIOM tables
  ↓
Merged BIOM table
  ↓
Phyloseq object (R)
```
Purpose
Downloads Kraken2-compatible 16S reference databases.
Why
Kraken2 requires pre-built taxonomy databases. This script retrieves:
- SILVA 16S database
- Greengenes 16S database
This allows classification to be performed against both reference systems for comparison.
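The download step can be sketched with kraken2-build's `--special` mode, which fetches and builds these 16S libraries directly. The wrapper function and guard below are assumptions for illustration, not the repository's exact script; the directory names follow the output tree shown in this README.

```bash
#!/usr/bin/env bash
# Sketch: build the two 16S databases with kraken2-build --special.
set -euo pipefail

DB_ROOT=kraken_databases   # assumed output directory

build_16s_db() {           # $1 = special library name, $2 = target dir name
    mkdir -p "$DB_ROOT"
    kraken2-build --special "$1" --db "$DB_ROOT/$2"
}

# Guarded so the sketch is safe to source without Kraken2 installed.
if command -v kraken2-build >/dev/null 2>&1; then
    build_16s_db silva      16S_SILVA138_k2db
    build_16s_db greengenes 16S_Greengenes_k2db
fi
```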
Output
```
kraken_databases/
├── 16S_SILVA138_k2db/
└── 16S_Greengenes_k2db/
```
Purpose
Core processing script.
Processes paired FASTQ files to generate per-sample taxonomic abundance tables.
What it does
For each sample:
- Identifies paired FASTQ files
- Optionally trims primers using cutadapt
- Applies basic filtering:
- removes reads with Ns
- minimum length filtering
- quality trimming
- Runs Kraken2 classification
- Converts Kraken reports into BIOM format
- Merges per-sample BIOMs into a single table
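The per-sample steps above might look like the following sketch. The function name, output layout, and primer sequences (515F/806R shown as placeholders) are assumptions, not the repository's exact script; the flags follow the documented cutadapt, Kraken2, and kraken-biom interfaces.

```bash
#!/usr/bin/env bash
# Sketch: trim, filter, classify, and convert one sample.
set -euo pipefail

FASTQ_DIR=../linked_fastqs
OUT=out_silva                # assumed: one output tree per database
DB=kraken_databases/16S_SILVA138_k2db
FWD=GTGYCAGCMGCCGCGGTAA     # placeholder 515F primer -- replace with yours
REV=GGACTACNVGGGTWTCTAAT    # placeholder 806R primer

process_sample() {           # $1 = R1 file; derives R2 and the sample ID
    local r1=$1 r2=${1/_R1/_R2}
    local sample; sample=$(basename "$r1"); sample=${sample%%_R1*}
    mkdir -p "$OUT"/{trimmed,kraken2,biom}

    # Primer removal plus basic filtering: no Ns, quality trim, min length.
    cutadapt -g "$FWD" -G "$REV" \
        --max-n 0 -q 20 -m 100 \
        -o "$OUT/trimmed/${sample}_R1.fastq.gz" \
        -p "$OUT/trimmed/${sample}_R2.fastq.gz" \
        "$r1" "$r2"

    # Paired-end classification against the chosen 16S database.
    kraken2 --db "$DB" --paired \
        --report "$OUT/kraken2/${sample}.report" \
        --output "$OUT/kraken2/${sample}.kraken" \
        "$OUT/trimmed/${sample}_R1.fastq.gz" "$OUT/trimmed/${sample}_R2.fastq.gz"

    # Convert the Kraken report to a per-sample JSON BIOM table.
    kraken-biom "$OUT/kraken2/${sample}.report" \
        --fmt json -o "$OUT/biom/${sample}.biom.json"
}

# Guarded so the sketch is safe to source without the tools installed.
if command -v cutadapt >/dev/null 2>&1; then
    for r1 in "$FASTQ_DIR"/*_R1*.fastq.gz; do
        process_sample "$r1"
    done
fi
```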
Why
This replaces ASV-based workflows for low-complexity datasets, mixtures of known taxa, and fast taxonomic profiling.
Outputs
```
out_<db>/
├── trimmed/
├── kraken2/
├── biom/
│   ├── <sample>.biom.json
│   └── merged.biom.json
├── logs/
└── metadata/
```
Purpose
Batch driver script for the full analysis.
What it does
Runs the pipeline across:
- SILVA database
- Greengenes database
For each database:
- Executes trimming + Kraken classification
- Produces merged BIOM table
- Imports into phyloseq
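A minimal sketch of such a driver, assuming hypothetical script names `run_pipeline.sh` and `import_phyloseq.R` (substitute the repository's actual files):

```bash
#!/usr/bin/env bash
# Sketch: run the full pipeline once per reference database.
set -euo pipefail

declare -A DBS=(
    [silva]=kraken_databases/16S_SILVA138_k2db
    [greengenes]=kraken_databases/16S_Greengenes_k2db
)

run_all() {
    local name db
    for name in "${!DBS[@]}"; do
        db=${DBS[$name]}
        # Trimming + Kraken2 classification + BIOM merging (hypothetical name).
        ./run_pipeline.sh --db "$db" --out "out_$name"
        # Import the merged table into phyloseq (hypothetical name).
        Rscript import_phyloseq.R "out_$name/biom/merged.biom.json" \
            "phyloseq_${name}.rds"
    done
}
# Usage: source this file, then call run_all
```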
Why
Running both databases enables reference comparison and taxonomic consistency validation.
Purpose
Converts merged BIOM output into an R phyloseq object.
What it does
- Reads merged BIOM table
- Loads sample metadata
- Aligns sample IDs
- Constructs phyloseq object containing:
- OTU table
- taxonomy table
- sample metadata
- Saves:
phyloseq_<db>.rds
Why
Phyloseq provides a unified structure for diversity analysis, ordination, community composition analysis, and treatment comparisons.
The conda environment can be created from the file in `envs/`:

```bash
conda env create -f kraken16s.yml
```
- Python ≥ 3.8
- R ≥ 4.0
- cutadapt
- Kraken2
- kraken-biom
Install the R packages. Both phyloseq and biomformat are distributed through Bioconductor, so plain `install.packages()` will not find them:

```r
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("phyloseq", "biomformat"))
```

Paired-end reads must be located in:

```
../linked_fastqs/
```

File naming must support pairing:

```
*_R1*.fastq.gz
*_R2*.fastq.gz
```
Tab-separated file with sample IDs.
Example:

```
SampleID    Treatment    Site
ED001       Live         Field
ED002       Control      Field
```
Final outputs:

```
phyloseq_silva.rds
phyloseq_greengenes.rds
```
These are ready for downstream microbiome analysis.
Load the phyloseq object:

```r
library(phyloseq)
ps <- readRDS("phyloseq_silva.rds")
```

Check its structure:

```r
nsamples(ps)
ntaxa(ps)
sample_data(ps)
```

Convert counts to relative abundance:

```r
ps_rel <- transform_sample_counts(ps, function(x) x / sum(x))
```

Ordination:

```r
ord <- ordinate(ps_rel, method = "PCoA", distance = "bray")
plot_ordination(ps_rel, ord, color = "Treatment")
```

Recommended for:
- synthetic communities
- low diversity datasets
- targeted microbiome experiments
- rapid taxonomic profiling
Less suitable for:
- high-diversity microbiomes
- strain-level resolution
- sequence-variant-based ecological modelling
| DADA2 | Kraken Workflow |
|---|---|
| ASV-based | Taxonomy-based |
| Sequence resolution | Taxonomic resolution |
| Slower | Fast |
| High precision | Suitable for simple communities |
This pipeline provides a fast, reproducible route from raw FASTQ files to an analysis-ready phyloseq object using Kraken2 classification.