bartongroup/PT_kraken2_16s_classification


Kraken 16S → Phyloseq Pipeline

Overview

This repository contains a Kraken2-based workflow for analysing low-complexity 16S amplicon sequencing data.

Instead of inferring amplicon sequence variants (ASVs), e.g. with DADA2, which is comparatively slow, this pipeline:

  • trims primers and performs basic quality filtering
  • classifies read pairs using Kraken2 with 16S-specific databases
  • converts taxonomic profiles into BIOM format
  • merges samples into a single abundance table
  • imports the result into R as a phyloseq object

Pipeline Summary

FASTQ
  ↓
Primer trimming + quality filtering (cutadapt)
  ↓
Taxonomic classification (Kraken2)
  ↓
Per-sample BIOM tables
  ↓
Merged BIOM table
  ↓
Phyloseq object (R)

Scripts

download_kraken_DB.sh

Purpose

Downloads Kraken2-compatible 16S reference databases.

Why

Kraken2 requires pre-built taxonomy databases. This script retrieves:

  • SILVA 16S database
  • Greengenes 16S database

This allows classification to be performed against both reference systems for comparison.

Output

kraken_databases/
  ├── 16S_SILVA138_k2db/
  └── 16S_Greengenes_k2db/

run_kraken16s_cutadapt.py

Purpose

Core processing script.

Processes paired FASTQ files to generate per-sample taxonomic abundance tables.

What it does

For each sample:

  1. Identifies paired FASTQ files
  2. Optionally trims primers using cutadapt
  3. Applies basic filtering:
    • removes reads with Ns
    • minimum length filtering
    • quality trimming
  4. Runs Kraken2 classification
  5. Converts Kraken reports into BIOM format
  6. Merges per-sample BIOMs into a single table
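
As a rough illustration of steps 2 to 4, the sketch below builds the cutadapt and Kraken2 command lines for one read pair. It is not the code in run_kraken16s_cutadapt.py: the function names, output file names, default thresholds, and the dummy primer sequences are all assumptions made for the example. Only standard cutadapt and Kraken2 options are used.

```python
def cutadapt_cmd(r1, r2, out_dir, sample, fwd_primer, rev_primer,
                 min_len=100, qual=20):
    """Build a cutadapt call for one read pair (hypothetical helper).

    min_len and qual are illustrative defaults, not the pipeline's values.
    """
    return [
        "cutadapt",
        "-g", fwd_primer,                   # 5' primer expected on R1
        "-G", rev_primer,                   # 5' primer expected on R2
        "--max-n", "0",                     # discard reads containing any N
        "--minimum-length", str(min_len),   # minimum length filtering
        "--quality-cutoff", str(qual),      # 3' quality trimming
        "-o", f"{out_dir}/{sample}_R1.trimmed.fastq.gz",
        "-p", f"{out_dir}/{sample}_R2.trimmed.fastq.gz",
        r1, r2,
    ]


def kraken2_cmd(db, r1, r2, out_dir, sample, threads=4):
    """Build a Kraken2 call for one trimmed read pair (hypothetical helper)."""
    return [
        "kraken2",
        "--db", db,
        "--paired",
        "--gzip-compressed",
        "--threads", str(threads),
        "--report", f"{out_dir}/{sample}.report.txt",   # per-taxon summary
        "--output", f"{out_dir}/{sample}.kraken.txt",   # per-read assignments
        r1, r2,
    ]


if __name__ == "__main__":
    # Dummy primer sequences; substitute the primers used for your amplicon.
    trim = cutadapt_cmd("ED001_R1.fastq.gz", "ED001_R2.fastq.gz",
                        "out_silva/trimmed", "ED001", "ACGTACGT", "TGCATGCA")
    print(" ".join(trim))
```

The lists can be passed to subprocess.run(cmd, check=True) once the tools are on the PATH; building them separately makes it easy to log or dry-run each step.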

Why

This approach is an alternative to ASV-based workflows for low-complexity datasets, mixtures of known taxa, and fast taxonomic profiling.

Outputs

out_<db>/
  ├── trimmed/
  ├── kraken2/
  ├── biom/
  │     ├── <sample>.biom.json
  │     └── merged.biom.json
  ├── logs/
  └── metadata/
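
One possible way to produce the files under biom/ is with kraken-biom, which accepts one or more Kraken reports and writes a single BIOM table. The sketch below only assembles the commands. The report file names and the helper name are assumptions, and the actual script may merge the per-sample tables differently.

```python
def build_biom_commands(reports, biom_dir):
    """Return kraken-biom commands for per-sample and merged JSON BIOM tables.

    reports: dict mapping sample ID -> path of its Kraken2 report
             (the report naming is hypothetical).
    """
    commands = []
    samples = sorted(reports)
    # One BIOM table per sample.
    for sample in samples:
        commands.append([
            "kraken-biom", reports[sample],
            "--fmt", "json",
            "-o", f"{biom_dir}/{sample}.biom.json",
        ])
    # One merged table: kraken-biom combines all reports given on the command line.
    commands.append([
        "kraken-biom", *[reports[s] for s in samples],
        "--fmt", "json",
        "-o", f"{biom_dir}/merged.biom.json",
    ])
    return commands


if __name__ == "__main__":
    demo = {
        "ED001": "out_silva/kraken2/ED001.report.txt",
        "ED002": "out_silva/kraken2/ED002.report.txt",
    }
    for cmd in build_biom_commands(demo, "out_silva/biom"):
        print(" ".join(cmd))
```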

run_kraken.sh

Purpose

Batch driver script for the full analysis.

What it does

Runs the pipeline across:

  • SILVA database
  • Greengenes database

For each database:

  1. Executes trimming + Kraken classification
  2. Produces merged BIOM table
  3. Imports into phyloseq
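
The per-database loop can be pictured as follows. This is a planning sketch in Python rather than the shell driver itself. The database paths come from the download step, while the short labels "silva" and "greengenes" are inferred from the final output file names and are assumed to match the out_<db> directory names.

```python
# Database label -> Kraken2 database directory (labels are an assumption).
DATABASES = {
    "silva": "kraken_databases/16S_SILVA138_k2db",
    "greengenes": "kraken_databases/16S_Greengenes_k2db",
}


def plan_runs(databases=DATABASES):
    """List the inputs and expected outputs of each per-database run."""
    runs = []
    for label, db_path in databases.items():
        out_dir = f"out_{label}"
        runs.append({
            "label": label,
            "db": db_path,
            "out_dir": out_dir,
            "merged_biom": f"{out_dir}/biom/merged.biom.json",
            "phyloseq_rds": f"phyloseq_{label}.rds",
        })
    return runs


if __name__ == "__main__":
    for run in plan_runs():
        print(f"{run['label']}: {run['db']} -> {run['phyloseq_rds']}")
```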

Why

Classifying against both databases allows the two reference taxonomies to be compared and shows whether taxonomic assignments are consistent between them.


import_kraken_biom_to_phyloseq.R

Purpose

Converts merged BIOM output into an R phyloseq object.

What it does

  1. Reads merged BIOM table
  2. Loads sample metadata
  3. Aligns sample IDs
  4. Constructs phyloseq object containing:
    • OTU table
    • taxonomy table
    • sample metadata
  5. Saves:
phyloseq_<db>.rds

Why

Phyloseq provides a unified structure for diversity analysis, ordination, community composition analysis, and treatment comparisons.


Requirements

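The dependencies can be installed with conda using the environment file in envs (run the command from that directory, or adjust the path to the file):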

conda env create -f kraken16s.yml

Software

  • Python ≥ 3.8
  • R ≥ 4.0
  • cutadapt
  • Kraken2
  • kraken-biom

R Packages

Install in R:

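# phyloseq and biomformat are distributed via Bioconductor, not CRAN
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("phyloseq", "biomformat"))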

Input Requirements

FASTQ Files

Paired-end reads located in:

../linked_fastqs/

Naming must support pairing:

*_R1*.fastq.gz
*_R2*.fastq.gz
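
A minimal sketch of how files matching these patterns could be paired is shown below. The regular expression and the rule that the sample ID is everything before the first _R1 or _R2 are assumptions for illustration, and the pipeline's own pairing logic may differ. Sample names that themselves contain _R1 or _R2 would be split incorrectly by this simple rule.

```python
import re
from collections import defaultdict

# Sample ID = text before the first "_R1"/"_R2" (an assumed convention).
PAIR_RE = re.compile(r"^(?P<sample>.+?)_R(?P<read>[12]).*\.fastq\.gz$")


def pair_fastqs(filenames):
    """Group FASTQ file names into (R1, R2) pairs keyed by sample ID.

    Returns (pairs, unpaired), where unpaired lists sample IDs for which
    only one mate was found. Names that do not match are ignored.
    """
    found = defaultdict(dict)
    for name in sorted(filenames):
        match = PAIR_RE.match(name)
        if match:
            found[match["sample"]]["R" + match["read"]] = name
    pairs = {s: (f["R1"], f["R2"]) for s, f in found.items() if len(f) == 2}
    unpaired = sorted(s for s, f in found.items() if len(f) != 2)
    return pairs, unpaired


if __name__ == "__main__":
    names = ["ED001_R1_001.fastq.gz", "ED001_R2_001.fastq.gz",
             "ED002_R1_001.fastq.gz"]
    pairs, unpaired = pair_fastqs(names)
    print(pairs)
    print("unpaired:", unpaired)
```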

Metadata

Tab-separated file with sample IDs.

Example:

SampleID   Treatment   Site
ED001      Live        Field
ED002      Control     Field
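
Because the import step aligns sample IDs between the metadata and the abundance table, it can help to check the two sets before running the pipeline. The sketch below is one way to do that, assuming the ID column is named SampleID as in the example above. The helper names are hypothetical.

```python
import csv
import io


def read_sample_ids(handle, id_column="SampleID"):
    """Read the sample ID column from a tab-separated metadata file."""
    reader = csv.DictReader(handle, delimiter="\t")
    if id_column not in (reader.fieldnames or []):
        raise ValueError(f"metadata has no '{id_column}' column")
    return [row[id_column] for row in reader]


def compare_ids(metadata_ids, fastq_ids):
    """Report IDs present on only one side."""
    meta, fastq = set(metadata_ids), set(fastq_ids)
    return {
        "missing_from_metadata": sorted(fastq - meta),
        "missing_from_fastq": sorted(meta - fastq),
    }


if __name__ == "__main__":
    text = ("SampleID\tTreatment\tSite\n"
            "ED001\tLive\tField\n"
            "ED002\tControl\tField\n")
    ids = read_sample_ids(io.StringIO(text))
    print(compare_ids(ids, ["ED001", "ED003"]))
```

With a real file, pass an open file handle (for example open(path, newline="")) instead of the StringIO object.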

Outputs

Final outputs:

phyloseq_silva.rds
phyloseq_greengenes.rds

These are ready for downstream microbiome analysis.


Example Downstream Analysis

Load phyloseq object:

library(phyloseq)
ps <- readRDS("phyloseq_silva.rds")

Check structure:

nsamples(ps)
ntaxa(ps)
sample_data(ps)

Convert to relative abundance:

ps_rel <- transform_sample_counts(ps, function(x) x / sum(x))

Ordination:

ord <- ordinate(ps_rel, method="PCoA", distance="bray")
plot_ordination(ps_rel, ord, color="Treatment")
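
For readers who want to see what the two transformations above compute, here is a small worked example in plain Python. It illustrates the arithmetic only and is not part of the pipeline. The counts are made up, and the zero-total handling is a choice made for this sketch (the R expression x / sum(x) would yield NaN for an empty sample).

```python
def relative_abundance(counts):
    """Divide each count by the sample total; all zeros if the total is zero."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity: sum|a_i - b_i| / sum(a_i + b_i)."""
    numerator = sum(abs(x - y) for x, y in zip(a, b))
    denominator = sum(a) + sum(b)
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    # Made-up read counts for three taxa in two samples.
    sample_a = relative_abundance([90, 10, 0])
    sample_b = relative_abundance([50, 50, 0])
    print(round(bray_curtis(sample_a, sample_b), 3))
```

For relative abundances each sample sums to 1, so the dissimilarity reduces to half the summed absolute differences between the two profiles.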

When to Use This Pipeline

Recommended for:

  • synthetic communities
  • low diversity datasets
  • targeted microbiome experiments
  • rapid taxonomic profiling

Less suitable for:

  • high-diversity microbiomes
  • strain-level resolution
  • sequence-variant-based ecological modelling

Conceptual Comparison

DADA2                 Kraken Workflow
ASV-based             Taxonomy-based
Sequence resolution   Taxonomic resolution
Slower                Fast
High precision        Suitable for simple communities

Summary

This pipeline provides a fast, reproducible route from raw FASTQ files to an analysis-ready phyloseq object using Kraken2 classification.
