This Nextflow pipeline provides an automated, reproducible and scalable solution for whole-genome sequencing (WGS) analysis in clinical microbiology research, optimised for Oxford Nanopore Technologies (ONT) data. It also supports Illumina data for de novo hybrid assemblies.
All modes in the pipeline include the following steps:
- Long-read QC and trimming: Assessment of read quality before and after filtering using NanoPlot. Low-quality bases and short reads are filtered with Filtlong, followed by removal of ONT adapter sequences with Porechop. All NanoPlot reports are summarised with NanoComp.
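For orientation, this QC stage corresponds roughly to the following commands (a sketch with illustrative file names; the actual thresholds come from the `--min_length`, `--min_mean_q` and `--keep_percent` pipeline parameters):

```shell
# Per-barcode long-read QC sketch (file names are illustrative)
NanoPlot --fastq barcode01.fastq.gz -o qc/nanoplot_raw              # quality report before filtering
filtlong --min_length 1000 --min_mean_q 10 --keep_percent 90 \
    barcode01.fastq.gz | gzip > filtered.fastq.gz                   # drop short/low-quality reads
porechop -i filtered.fastq.gz -o trimmed.fastq.gz                   # remove ONT adapter sequences
NanoPlot --fastq trimmed.fastq.gz -o qc/nanoplot_trimmed            # quality report after filtering
NanoComp --fastq barcode01.fastq.gz trimmed.fastq.gz -o qc/nanocomp # summarise before/after
```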
- Contaminant sequence removal: The taxonomic sequence classifier Kraken2 identifies non-bacterial contaminant reads, and seqtk then filters out all reads flagged as contaminants.
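In outline, the decontamination step looks like this (a sketch; the read-selection logic is handled internally by the pipeline):

```shell
# Classify reads, then keep only those not flagged as contaminants
kraken2 --db kraken_db --output kraken.out --report kraken.report trimmed.fastq.gz
# build the list of read IDs to keep (real filtering inspects the taxon column)
cut -f2 kraken.out > keep_ids.txt   # illustrative only
seqtk subseq trimmed.fastq.gz keep_ids.txt | gzip > clean.fastq.gz
```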
From this point onwards, two modes are available: if only ONT data are available, select --mode assemble; if both ONT and Illumina data are available, select --mode hybrid to perform a hybrid assembly.
- Assembly: De novo assembly with the single-molecule assembler Flye, followed by multiple rounds of polishing and construction of a consensus sequence with Medaka. Genome assemblies are then reoriented using dnaapler.
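The assembly step can be pictured as the following commands (a sketch with illustrative file names; the genome size is taken per barcode from barcode_info.csv, and Medaka model selection is left to its defaults):

```shell
flye --nano-hq clean.fastq.gz --genome-size 3m --out-dir flye_out --threads 8
medaka_consensus -i clean.fastq.gz -d flye_out/assembly.fasta -o medaka_out -t 8
dnaapler all -i medaka_out/consensus.fasta -o reoriented -t 8
```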
- Polishing process: The optimal number of polishing rounds is determined automatically using a CART (classification and regression tree) algorithm. The prediction is based on several parameters, including error rate, N50/L50, genome coverage, total length of matches, average occurrences, distinct minimizers, and processing time per round.
| Source | Parameter | Description / convergence criterion |
|---|---|---|
| Minimap2 | DistinctMinimizers | Number of unique minimizers found; change < 0.1% |
| Minimap2 | AverageOccurrences | Average occurrences of minimizers; change < 0.01 |
| Minimap2 | TotalLengthMatches | Total length of aligned matches; change < 0.1% |
| Minimap2 / Racon | ProcessingTime | Total execution time per round; change < 5% |
| QUAST | N50/L50 | Minimum contig length that covers 50% of the assembly; change < 100 bp |
| QUAST / Medaka | ErrorRate | Error rate in the sequence after each polishing round |
| BUSCO | Completeness | Change < 1% in complete genes |
| Target value | OptimalRounds | Optimal number of rounds needed to achieve convergence |
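Each long-read polishing round in this loop reduces to a minimap2 overlap step followed by Racon (a sketch with illustrative file names); the minimizer statistics above are reported by minimap2 on stderr, and the per-round timings come from the tools' runtimes:

```shell
minimap2 -x map-ont -t 8 draft.fasta clean.fastq.gz > overlaps.paf 2> minimap2.log
racon -t 8 clean.fastq.gz overlaps.paf draft.fasta > polished.fasta
```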
- Assembly: The Autocycler tool generates a consensus de novo long-read assembly by combining multiple alternative assemblies produced by different assemblers (e.g. Canu, Flye, NextDenovo). The consensus long-read genome is then reoriented using dnaapler and polished with the short Illumina reads in the following steps:
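As a hedged outline, Autocycler's consensus workflow follows its documented subcommand sequence, roughly as below; subcommand names and paths should be checked against the Autocycler version pinned by the pipeline:

```shell
autocycler subsample --reads clean.fastq.gz --out_dir subsets --genome_size 3000000
# run the alternative assemblers (e.g. Canu, Flye, NextDenovo) on each read subset,
# collecting their outputs in an assemblies/ directory, then:
autocycler compress -i assemblies -a autocycler_out
autocycler cluster -a autocycler_out
autocycler trim -c autocycler_out/clustering/qc_pass/cluster_001
autocycler resolve -c autocycler_out/clustering/qc_pass/cluster_001
autocycler combine -a autocycler_out -i autocycler_out/clustering/qc_pass/cluster_*/5_final.gfa
```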
- Short-read QC and trimming: Trimming and filtering of low-quality bases and short reads are performed with fastp. Short-read quality is assessed before and after trimming using FastQC and summarised with MultiQC.
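This short-read QC stage corresponds roughly to the following (illustrative file names):

```shell
fastqc reads_1.fastq.gz reads_2.fastq.gz -o qc/raw
fastp -i reads_1.fastq.gz -I reads_2.fastq.gz -o trim_1.fastq.gz -O trim_2.fastq.gz
fastqc trim_1.fastq.gz trim_2.fastq.gz -o qc/trimmed
multiqc qc/ -o qc/multiqc
```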
- Mapping and polishing: Short reads are mapped to the consensus genome assembly using BWA-MEM, followed by a filtering and polishing step with Polypolish to improve the assembly.
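A sketch of this step, assuming a recent Polypolish with the filter/polish subcommands (older releases used a separate insert-filter script); Polypolish expects each read file aligned separately with all alignments reported (`bwa mem -a`):

```shell
bwa index consensus.fasta
bwa mem -t 8 -a consensus.fasta trim_1.fastq.gz > aln_1.sam
bwa mem -t 8 -a consensus.fasta trim_2.fastq.gz > aln_2.sam
polypolish filter --in1 aln_1.sam --in2 aln_2.sam --out1 filt_1.sam --out2 filt_2.sam
polypolish polish consensus.fasta filt_1.sam filt_2.sam > polished.fasta
```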
After the consensus genome assemblies have been generated, all assemblies are processed using the same workflow:
- Assembly and genome QC: Structural quality metrics are evaluated with QUAST, and genome completeness is assessed using BUSCO. A final combined report is generated with MultiQC.
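In command form, this QC step is roughly as follows (the BUSCO lineage dataset shown is an assumption; the pipeline may select it automatically):

```shell
quast.py polished.fasta -o quast_out
busco -i polished.fasta -m genome -l bacteria_odb10 -o busco_out
multiqc quast_out busco_out -o genome_qc_report
```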
- Annotation: Genome annotation is performed using both Prokka and Bakta. The resulting GFF annotation files from both annotation tools are cleaned and combined using AGAT.
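A sketch of the dual annotation step, with illustrative file names; the AGAT merge script shown is one of several AGAT tools that can combine GFF files, so the exact script choice is an assumption:

```shell
prokka --outdir prokka_out --prefix sample polished.fasta
bakta --db ./bakta_db --output bakta_out polished.fasta
agat_sp_merge_annotations.pl --gff prokka_out/sample.gff --gff bakta_out/sample.gff3 --out combined.gff
```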
- Post-assembly analyses:
- Mass screening of contigs for antimicrobial resistance and virulence genes using ABRicate.
- Identification of antimicrobial resistance genes and point mutations in protein and/or assembled nucleotide sequences using AMRFinder.
- Screening of genomes against traditional PubMLST schemes using MLST.
- Plasmid analysis: if the --plasmid option is added on the command line, the MOB-suite tool is used to predict and identify plasmid sequences in the assemblies.
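These screens map onto commands like the following (a sketch; ABRicate is run once per configured database, and the AMRFinder organism flag is only added when --organism is set):

```shell
abricate --db resfinder polished.fasta > abricate_resfinder.tab
amrfinder -n polished.fasta -o amrfinder.tsv          # add -O <organism> for point mutations
mlst polished.fasta > mlst.tsv
mob_recon --infile polished.fasta --outdir mob_out    # only when --plasmid is set
```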
Prerequisites to run the pipeline:
- Install Nextflow (Ver. ≥ 25.10.0).
- Install Docker or Singularity for container support.
- Ensure that Java 8 or a more recent version is installed.
Clone the Repository:
# Clone the workflow repository
git clone https://github.com/AMRmicrobiology/ONT_BACTERIAL_ANALYSIS.git
# Move inside the main directory
cd ONT_BACTERIAL_ANALYSIS
If you are running the pipeline locally, remember to define the paths for the Singularity temporary files and cache:
export SINGULARITY_TMPDIR=/PATH/singularity/tmp
export SINGULARITY_CACHEDIR=/PATH/singularity/cache
export TMPDIR=/PATH/singularity/tmp
export NXF_SINGULARITY_CACHEDIR=/PATH/singularity/cache
e.g.:
export SINGULARITY_TMPDIR=/mnt/dades/singularity/tmp
export SINGULARITY_CACHEDIR=/mnt/dades/singularity/tmp
export TMPDIR=/mnt/dades/singularity/tmp
export APPTAINER_TMPDIR=/mnt/dades/singularity/tmp
export APPTAINER_CACHEDIR=/mnt/dades/singularity/cache
export NXF_SINGULARITY_CACHEDIR=/mnt/dades/singularity/cache
export APPTAINERENV_NXF_TASK_WORKDIR=/mnt/dades/singularity/tmp
export APPTAINERENV_TMPDIR=/mnt/dades/singularity/tmp
Note
Conda environments are listed and created but have not been tested.
Inside the ONT_BACTERIAL_ANALYSIS directory, modify the file barcode_info.csv to add the expected genome size (bp) and the sample code you want to assign to each barcode:
Important
The sample code names should not include "-".
e.g.
barcode,genome_size,sample_code
barcode01,3000000,306
barcode02,4500000,C2_72
barcode03,5000000,C2_75
barcode04,4000000,C2_76
barcode05,3200000,ST89
barcode06,3500000,ST23
Run the pipeline using the following command, adjusting the parameters as needed:
ASSEMBLE
nextflow run main.nf --mode assemble --genome_size_file barcode_info.csv --input '/path/to/data/barcode*' -profile <docker/singularity/conda>
HYBRID
nextflow run main.nf --mode hybrid --genome_size_file barcode_info.csv --input '/path/to/data/barcode*' --short_reads '/path/to/data/*_{1,2}.fastq.gz' -profile <docker/singularity/conda>
Usage: nextflow run main.nf [--help] [--mode VAR] [--genome_size_file VAR] [--input VAR] [--short_reads VAR] [--outdir VAR] [--organism VAR] [--min_length VAR] [--min_mean_q VAR] [--keep_percent VAR] [--plasmid] [--bakta_db_define VAR] [--db_select VAR] [--abricate_db VAR] [-w VAR] [-profile VAR]
Input data arguments
--mode TEXT Selection of the pipeline assemble/hybrid [required]
--input PATH Input barcode* folder(s) containing the long reads .fastq.gz files [required]
--genome_size_file PATH Path to the .csv file with barcode, size and sample name information [required]
Pipeline specific
--short_reads PATH (--mode hybrid) Input FASTQ paired-end files named *_{1,2} (.fastq.gz format) [required]
Nextflow arguments
-profile TEXT Selection of execution profile (docker, singularity or conda) [required]
-w PATH Path to the work dir. where temporary files will be written [default: ./work ]
Output arguments
--outdir PATH Directory to write the output [default: ./out]
Optional arguments
--help Show this message and exit
--plasmid BOOLEAN Add this parameter to identify and type plasmid sequences in your assembly [default: false]
Long-read filtering arguments
--min_length INTEGER Minimum length threshold (bp) [default: 1000]
--min_mean_q INTEGER Minimum mean quality threshold [default: 10]
--keep_percent INTEGER Throw out the worst (100-x)% of read bases [default: 90].
AMR arguments
--organism TEXT By default, ABRicate searches the following databases: vfdb_full, resfinder, plasmidfinder, and card. If Escherichia coli or Klebsiella pneumoniae is specified, ecoli_vf and argannot will be searched, respectively, instead of vfdb_full [default: ""].
Databases arguments
--bakta_db_define PATH Define the path to the user downloaded database to be used by Bakta. By default the database is downloaded if no argument is added. Another option is to copy-paste the database directly to the "./bakta_db" directory
--db_select TEXT/PATH Kraken2 database to use for taxonomy classification. The options "db_16GB" or "db_full_60GB" are downloaded automatically if specified. Alternatively, a path to a user-provided database may be supplied. Another option is to copy-paste the database directly into the "./kraken_db" directory [default: "db_16GB"]
--abricate_db PATH Path to the user downloaded databases to be used by Abricate
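To stage the Bakta database ahead of the first run, the downloader that ships with Bakta can be used, e.g.:

```shell
bakta_db download --output ./bakta_db --type full
```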
This is the folder architecture and the content of the output data directory:
| Folder | Subfolder | Description |
|---|---|---|
| 1-QC | data_QC | Individual initial NanoPlot results folders and NanoComp summary of all reports before and after trimming |
| | genome_QC | Individual BUSCO and QUAST report folders and MultiQC combined report of all samples |
| 2-Assembly | | All final consensus genome assemblies ("sample_ID"_consensus_wrapped.fasta) |
| | 1-Fly_structural | NanoStats results and Flye output directories containing the graph files |
| | 2-Medaka_results | Medaka output directories |
| | 3-Annotations | All combined files produced by AGAT from the Bakta and Prokka annotation tools, plus the Bakta and Prokka output directories |
| 3-AMR | ABRICATE | ABRicate search results |
| | AMRFinder | AMRFinder search results |
| 4-MLST | | MLST results |
| 5-Plasmids | | MOB-suite plasmid tool output directories |
References:
- Benchmarking reveals superiority of deep learning variant callers on bacterial Nanopore sequence data
- How low can you go? Short-read polishing of Oxford Nanopore bacterial genome assemblies
- Assembling the perfect bacterial genome using Oxford Nanopore and Illumina sequencing