A scalable Nextflow pipeline for automated WGS analysis, optimized for Oxford Nanopore and hybrid ONT–Illumina assemblies in clinical microbiology research.

ONT_BACTERIAL_ANALYSIS

Introduction

This Nextflow pipeline provides an automated, reproducible and scalable solution for whole-genome sequencing (WGS) analysis in clinical microbiology research, optimised for Oxford Nanopore Technologies (ONT) data. It also supports Illumina data for de novo hybrid assemblies.

Pipeline summary

All modes in the pipeline include the following steps:

  1. Long-read QC and trimming: Read quality is assessed before and after filtering with NanoPlot. Filtlong removes short and low-quality reads, and Porechop then removes ONT adapter sequences. All NanoPlot reports are summarised with NanoComp.

  2. Contaminant sequence removal: The taxonomic sequence classifier Kraken2 identifies contaminant (non-bacterial) reads, and seqtk then filters out all reads flagged as contaminants.

From this point onwards, two modes are available. If only ONT data are available, select --mode assemble; if both ONT and Illumina data are available, select --mode hybrid to perform a hybrid assembly.

--mode assemble

  1. Assembly: De novo assembly using the single-molecule assembler Flye followed by multiple rounds of polishing and the construction of a consensus sequence using Medaka. Genome assemblies are then reoriented using dnaapler.

    • Polishing process: The optimal number of polishing rounds is determined automatically using the CART algorithm. The prediction is based on multiple parameters, which include error rate, N50/L50, genome coverage, Total Length of Matches, Average Occurrences, Distinct Minimizers, and processing time per round.

      | Source | Parameter | Description |
      | --- | --- | --- |
      | Minimap2 | DistinctMinimizers | Number of unique minimizers found (Minimap2 value); change < 0.1% in distinct minimizers |
      | Minimap2 | AverageOccurrences | Average occurrences of minimizers (Minimap2); change < 0.01 in average occurrences |
      | Minimap2 | TotalLengthMatches | Total length of aligned matches; change < 0.1% |
      | Minimap2 | ProcessingTime | Total execution time per round (Racon or Minimap2); change < 5% |
      | RACON | Processing Time | Change < 5% |
      | QUAST | N50/L50 | Minimum contig length that covers 50% of the assembly; change < 100 bp |
      | QUAST/MEDAKA | ErrorRate | Error rate in the sequence after each polishing round |
      | BUSCO | Completeness (BUSCO) | Change < 1% in complete genes |
      | Target Value | Optional Rounds | Optimal number of rounds needed to achieve convergence |
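The convergence criteria in the table can be read as a plain threshold rule between two consecutive polishing rounds. The sketch below only illustrates those thresholds; it is not the pipeline's CART model. `round_converged` is a hypothetical helper, the five-value input layout is an assumption, and the BUSCO threshold is interpreted here as percentage points.

```shell
# Hypothetical helper, not part of the pipeline: compare the metrics of two
# consecutive polishing rounds against the thresholds listed in the table.
# Each argument is a space-separated string with this assumed layout:
#   "DistinctMinimizers AverageOccurrences TotalLengthMatches N50 BUSCO_complete_pct"
round_converged() {
    awk -v prev="$1" -v curr="$2" '
        function abs(x) { return x < 0 ? -x : x }
        # Relative change from a to b, in percent.
        function pct(a, b) { return a == 0 ? (b == 0 ? 0 : 100) : abs(b - a) / a * 100 }
        BEGIN {
            # Exit status 2 signals malformed input (not exactly five values).
            if (split(prev, p, " ") != 5 || split(curr, c, " ") != 5) exit 2
            if (pct(p[1], c[1]) >= 0.1)   exit 1   # DistinctMinimizers: change < 0.1%
            if (abs(c[2] - p[2]) >= 0.01) exit 1   # AverageOccurrences: change < 0.01
            if (pct(p[3], c[3]) >= 0.1)   exit 1   # TotalLengthMatches: change < 0.1%
            if (abs(c[4] - p[4]) >= 100)  exit 1   # N50: change < 100 bp
            if (abs(c[5] - p[5]) >= 1)    exit 1   # BUSCO: change < 1 (assumed percentage points)
            exit 0
        }'
}
```

The function returns 0 when every metric has stabilised, so it could be used as `round_converged "$prev_metrics" "$curr_metrics" && echo "no further polishing needed"`, where both variables are placeholders for values collected by the user.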

--mode hybrid

  1. Assembly: The Autocycler tool is used to generate a consensus de novo long-read assembly by combining multiple alternative assemblies produced by different assemblers (e.g. Canu, Flye, NextDenovo). Afterwards, the consensus long-read genome is reoriented using dnaapler and then polished with the short Illumina reads following these steps:

    • Short reads QC and trimming: Trimming and filtering of low-quality bases and short reads are performed with Fastp. Short read quality is assessed before and after trimming using FastQC, and summarised using MultiQC.

    • Mapping and polishing: Short reads are mapped to the consensus genome assembly using BWA-MEM, followed by a filtering and polishing step to improve the assembly using Polypolish.


After the consensus genome assemblies have been generated, all assemblies are processed using the same workflow:

  1. Assembly and genome QC: Structural quality metrics are evaluated with QUAST, and genome completeness is assessed using BUSCO. A final combined report is generated with MultiQC.

  2. Annotation: Genome annotation is performed using both Prokka and Bakta. The resulting GFF annotation files from both annotation tools are cleaned and combined using AGAT.

  3. Post-assembly analyses:

    • Mass screening of contigs for antimicrobial resistance and virulence genes using ABRicate.
    • Identification of antimicrobial resistance genes and point mutations in protein and/or assembled nucleotide sequences using AMRFinder.
    • Screening of genomes against traditional PubMLST schemes using MLST.
    • Plasmid analysis: If the --plasmid option is added to the command line, MOB-suite is used to predict and identify the plasmid sequences in the assemblies.

Installation

Prerequisites to run the pipeline:

Clone the Repository:

# Clone the workflow repository
git clone https://github.com/AMRmicrobiology/ONT_BACTERIAL_ANALYSIS.git

# Move inside the main directory
cd ONT_BACTERIAL_ANALYSIS

Local (Singularity)

If you are running the pipeline locally, remember to define the path for Singularity temporary files and cache:

export SINGULARITY_TMPDIR=/PATH/singularity/tmp
export SINGULARITY_CACHEDIR=/PATH/singularity/cache
export TMPDIR=/PATH/singularity/tmp
export NXF_SINGULARITY_CACHEDIR=/PATH/singularity/cache

e.g.:

export SINGULARITY_TMPDIR=/mnt/dades/singularity/tmp
export SINGULARITY_CACHEDIR=/mnt/dades/singularity/cache
export TMPDIR=/mnt/dades/singularity/tmp
export NXF_SINGULARITY_CACHEDIR=/mnt/dades/singularity/cache

export APPTAINER_TMPDIR=/mnt/dades/singularity/tmp
export APPTAINER_CACHEDIR=/mnt/dades/singularity/cache
export APPTAINERENV_NXF_TASK_WORKDIR=/mnt/dades/singularity/tmp
export APPTAINERENV_TMPDIR=/mnt/dades/singularity/tmp
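To avoid typing each path by hand, the variables can be derived from a single base directory. This is a minimal sketch: `setup_container_dirs` is a hypothetical helper (not shipped with the pipeline) and the tmp/cache layout simply mirrors the example above.

```shell
# Hypothetical helper: create the tmp and cache directories under one base
# path and export the Singularity/Apptainer/Nextflow variables shown above.
setup_container_dirs() {
    base="$1"
    mkdir -p "$base/tmp" "$base/cache" || return 1
    export SINGULARITY_TMPDIR="$base/tmp"
    export SINGULARITY_CACHEDIR="$base/cache"
    export APPTAINER_TMPDIR="$base/tmp"
    export APPTAINER_CACHEDIR="$base/cache"
    export NXF_SINGULARITY_CACHEDIR="$base/cache"
    export TMPDIR="$base/tmp"
}
```

For instance, `setup_container_dirs /mnt/dades/singularity` run in the same shell session before `nextflow run` would reproduce the paths from the example.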

Note

Conda environments are listed and created but have not been tested.

How to use it?

Inside the ONT_BACTERIAL_ANALYSIS directory, modify the file barcode_info.csv to add the expected genome size (bp) and sample code you want to assign to each barcode:

Important

The sample code names should not include "-".

e.g.

barcode,genome_size,sample_code
barcode01,3000000,306
barcode02,4500000,C2_72
barcode03,5000000,C2_75
barcode04,4000000,C2_76
barcode05,3200000,ST89
barcode06,3500000,ST23
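Before launching a run, it can help to check the file against the rules above (three columns, genome size in bp, no "-" in sample codes). The following is a sketch only; `validate_barcode_csv` is a hypothetical helper, and the whitespace check is an extra precaution that the pipeline itself may not require.

```shell
# Hypothetical helper: sanity-check barcode_info.csv before running the pipeline.
# Prints one message per problem and returns non-zero if any was found.
validate_barcode_csv() {
    awk -F',' '
        # Tolerate Windows line endings.
        { sub(/\r$/, "") }
        NR == 1 {
            if ($0 != "barcode,genome_size,sample_code") {
                printf "line 1: unexpected header: %s\n", $0
                bad = 1
            }
            next
        }
        # Skip blank lines.
        /^[ \t]*$/ { next }
        NF != 3 {
            printf "line %d: expected 3 fields, found %d\n", NR, NF
            bad = 1
            next
        }
        $2 !~ /^[0-9]+$/ {
            printf "line %d: genome_size is not a whole number of bp: %s\n", NR, $2
            bad = 1
        }
        $3 ~ /-/ {
            printf "line %d: sample_code contains \"-\": %s\n", NR, $3
            bad = 1
        }
        $3 == "" || $3 ~ /[ \t]/ {
            printf "line %d: sample_code is empty or contains whitespace\n", NR
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}
```

A typical call would be `validate_barcode_csv barcode_info.csv && echo "barcode_info.csv looks fine"`.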

Run the pipeline with one of the following commands, adjusting the parameters as needed:

ASSEMBLE

nextflow run main.nf --mode assemble --genome_size_file barcode_info.csv --input '/path/to/data/barcode*' -profile <docker/singularity/conda>

HYBRID

nextflow run main.nf --mode hybrid --genome_size_file barcode_info.csv --input '/path/to/data/barcode*' --short_reads '/path/to/data/*_{1,2}.fastq.gz' -profile <docker/singularity/conda>
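Since hybrid mode expects paired-end files named *_{1,2}.fastq.gz, a quick check that every forward file has its mate can catch naming problems before the run starts. This is a sketch under that naming assumption; `check_read_pairs` is a hypothetical helper and not part of the pipeline.

```shell
# Hypothetical helper: verify that each *_1.fastq.gz in a directory has a
# matching *_2.fastq.gz. Returns non-zero if a mate is missing or if no
# forward files are found at all.
check_read_pairs() {
    local dir="$1" status=0 found=0 r1 r2
    for r1 in "$dir"/*_1.fastq.gz; do
        # When the glob matches nothing, the literal pattern is returned.
        [ -e "$r1" ] || continue
        found=1
        r2="${r1%_1.fastq.gz}_2.fastq.gz"
        if [ ! -e "$r2" ]; then
            echo "missing mate for $r1 (expected $r2)"
            status=1
        fi
    done
    if [ "$found" -eq 0 ]; then
        echo "no *_1.fastq.gz files found in $dir"
        status=1
    fi
    return "$status"
}
```

For example, `check_read_pairs /path/to/data` could be run before the HYBRID command above.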

Usage and parameters

Usage: nextflow run main.nf [--help] [--mode VAR] [--genome_size_file VAR] [--input VAR] [--short_reads VAR] [--outdir VAR] [--organism VAR] [--min_length VAR] [--min_mean_q VAR] [--keep_percent VAR] [--plasmid] [--bakta_db_define VAR] [--db_select VAR] [--abricate_db VAR] [-w VAR] [-profile VAR]
  
Input data arguments
  --mode             TEXT        Selection of the pipeline assemble/hybrid [required]
  --input            PATH        Input barcode* folder(s) containing the long reads .fastq.gz files [required]
  --genome_size_file PATH        Path to the .csv file with barcode, size and sample name information [required]
  Pipeline specific
  --short_reads      PATH        (--mode hybrid) Input FASTQ paired-end files named *_{1,2} (.fastq.gz format) [required]
  
Nextflow arguments
  -profile           TEXT        Selection of execution profile (docker, singularity or conda) [required]
  -w                 PATH        Path to the work dir. where temporary files will be written [default: ./work ]
  
Output arguments
  --outdir           PATH        Directory to write the output [default: ./out]
  
Optional arguments 
  --help                         Show this message and exit      
  --plasmid          BOOLEAN     Add this parameter to identify and type plasmid sequences in your assembly [default: false]
  
Long-read filtering arguments
  --min_length       INTEGER     Minimum length threshold (bp) [default: 1000]
  --min_mean_q       INTEGER     Minimum mean quality threshold [default: 10]
  --keep_percent     INTEGER     Throw out the worst (100-x)% of read bases [default: 90].
  
AMR arguments
  --organism         TEXT        By default, ABRicate searches the following databases: vfdb_full, resfinder, plasmidfinder, and card. If Escherichia coli or Klebsiella pneumoniae is specified, ecoli_vf and argannot will be searched, respectively, instead of vfdb_full [default: ""]. 
  
Databases arguments
  --bakta_db_define  PATH        Define the path to the user downloaded database to be used by Bakta. By default the database is downloaded if no argument is added. Another option is to copy-paste the database directly to the "./bakta_db" directory
  --db_select        TEXT/PATH   Kraken2 database to use for taxonomy classification. The options "db_16GB" or "db_full_60GB" are downloaded automatically if specified. Alternatively, a path to a user-provided database may be supplied. Another option is to copy-paste the database directly into the "./kraken_db" directory [default: "db_16GB"]
  --abricate_db      PATH        Path to the user downloaded databases to be used by Abricate
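The database selection described for --organism can be restated as a small lookup. This is only a reading of the help text above, not the pipeline's code: `abricate_dbs` is a hypothetical function, and the exact strings that --organism accepts are an assumption.

```shell
# Hypothetical restatement of the --organism rule from the help text:
# vfdb_full is swapped for an organism-specific database when one is named.
# The accepted organism strings are assumed, not taken from the pipeline.
abricate_dbs() {
    case "$1" in
        "Escherichia coli")      echo "ecoli_vf resfinder plasmidfinder card" ;;
        "Klebsiella pneumoniae") echo "argannot resfinder plasmidfinder card" ;;
        *)                       echo "vfdb_full resfinder plasmidfinder card" ;;
    esac
}
```

With no organism given, the function prints the four default databases listed in the help text.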

Output

The output directory has the following folder structure and content:

| Folder | Subfolder | Description |
| --- | --- | --- |
| 1-QC | data_QC | Individual initial NanoPlot result folders and NanoComp summary of all reports before and after trimming |
| | genome_QC | Individual BUSCO and QUAST report folders and MultiQC combined report of all samples |
| 2-Assembly | | All final consensus genome assemblies ("sample_ID"_consensus_wrapped.fasta) |
| | 1-Fly_structural | Nanostats results and Flye output directories containing the graph files |
| | 2-Medaka_results | Medaka output directories |
| | 3-Annotations | Combined files produced by AGAT from the Bakta and Prokka annotations, plus the Bakta and Prokka output directories |
| 3-AMR | ABRICATE | ABRicate search results |
| | AMRFinder | AMRFinder search results |
| 4-MLST | | MLST results |
| 5-Plasmids | | MOB-suite output directories |
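After a run, the presence of the top-level folders can be checked with a few lines of shell. The sketch assumes these folders are direct children of the --outdir directory and that 5-Plasmids only appears when --plasmid was used; `check_outdir` is a hypothetical helper.

```shell
# Hypothetical helper: report which expected top-level output folders are
# missing. Pass "true" as the second argument if the run used --plasmid.
check_outdir() {
    local outdir="$1" with_plasmids="${2:-false}" missing=0 dir
    local expected="1-QC 2-Assembly 3-AMR 4-MLST"
    if [ "$with_plasmids" = "true" ]; then
        expected="$expected 5-Plasmids"
    fi
    # Folder names contain no spaces, so word splitting is safe here.
    for dir in $expected; do
        if [ ! -d "$outdir/$dir" ]; then
            echo "missing: $outdir/$dir"
            missing=1
        fi
    done
    return "$missing"
}
```

For the default output location this would be `check_outdir ./out`, or `check_outdir ./out true` for a run with --plasmid.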

References

Benchmarking reveals superiority of deep learning variant callers on bacterial Nanopore sequence data

How low can you go? Short-read polishing of Oxford Nanopore bacterial genome assemblies

Evaluation of the accuracy of bacterial genome reconstruction with Oxford Nanopore R10.4.1 long-read-only sequencing

Assembling the perfect bacterial genome using Oxford Nanopore and Illumina sequencing

Autocycler
