This Nextflow pipeline provides an automated, reproducible and scalable solution for whole-genome sequencing (WGS) analysis in clinical microbiology research, optimised for Oxford Nanopore Technologies (ONT) data. It also supports Illumina data for de novo hybrid assemblies.
All modes in the pipeline include the following steps:
- Long-read QC and trimming: Assessment of read quality before and after filtering using NanoPlot. Low-quality bases and short reads are filtered with Filtlong, followed by removal of ONT adapter sequences with Porechop. All NanoPlot reports are summarised with NanoComp.
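For orientation, this QC stage corresponds roughly to the following commands (a sketch with illustrative file names; the actual thresholds come from the `--min_length`, `--min_mean_q` and `--keep_percent` pipeline parameters):

```shell
# Per-barcode long-read QC sketch (file names are illustrative)
NanoPlot --fastq barcode01.fastq.gz -o qc/nanoplot_raw              # quality report before filtering
filtlong --min_length 1000 --min_mean_q 10 --keep_percent 90 \
    barcode01.fastq.gz | gzip > filtered.fastq.gz                   # drop short/low-quality reads
porechop -i filtered.fastq.gz -o trimmed.fastq.gz                   # remove ONT adapter sequences
NanoPlot --fastq trimmed.fastq.gz -o qc/nanoplot_trimmed            # quality report after filtering
NanoComp --fastq barcode01.fastq.gz trimmed.fastq.gz -o qc/nanocomp # summarise before/after
```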
- Contaminant sequence removal: The taxonomic sequence classifier Kraken2 identifies non-bacterial contaminant reads, and seqtk then filters out all reads flagged as contaminants.
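In outline, the decontamination step looks like this (a sketch; the read-selection logic is handled internally by the pipeline):

```shell
# Classify reads, then keep only those not flagged as contaminants
kraken2 --db kraken_db --output kraken.out --report kraken.report trimmed.fastq.gz
# build the list of read IDs to keep (real filtering inspects the taxon column)
cut -f2 kraken.out > keep_ids.txt   # illustrative only
seqtk subseq trimmed.fastq.gz keep_ids.txt | gzip > clean.fastq.gz
```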
From this point onwards, two modes are available: if only ONT data are available, select --mode assemble; if both ONT and Illumina data are available, select --mode hybrid to perform a hybrid assembly.
- Assembly: De novo assembly with the single-molecule assembler Flye, followed by multiple rounds of polishing and construction of a consensus sequence with Medaka. Genome assemblies are then reoriented using dnaapler.
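The assembly step can be pictured as the following commands (a sketch with illustrative file names; the genome size is taken per barcode from barcode_info.csv, and Medaka model selection is left to its defaults):

```shell
flye --nano-hq clean.fastq.gz --genome-size 3m --out-dir flye_out --threads 8
medaka_consensus -i clean.fastq.gz -d flye_out/assembly.fasta -o medaka_out -t 8
dnaapler all -i medaka_out/consensus.fasta -o reoriented -t 8
```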
- Polishing process: The optimal number of polishing rounds is determined automatically using a CART (classification and regression tree) algorithm. The prediction is based on several parameters, including error rate, N50/L50, genome coverage, total length of matches, average occurrences, distinct minimizers, and processing time per round.
| Source | Parameter | Description / convergence criterion |
|---|---|---|
| Minimap2 | DistinctMinimizers | Number of unique minimizers found; change < 0.1% |
| Minimap2 | AverageOccurrences | Average occurrences of minimizers; change < 0.01 |
| Minimap2 | TotalLengthMatches | Total length of aligned matches; change < 0.1% |
| Minimap2 / Racon | ProcessingTime | Total execution time per round; change < 5% |
| QUAST | N50/L50 | Minimum contig length that covers 50% of the assembly; change < 100 bp |
| QUAST / Medaka | ErrorRate | Error rate in the sequence after each polishing round |
| BUSCO | Completeness | Change < 1% in complete genes |
| Target value | OptimalRounds | Optimal number of rounds needed to achieve convergence |
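Each long-read polishing round in this loop reduces to a minimap2 overlap step followed by Racon (a sketch with illustrative file names); the minimizer statistics above are reported by minimap2 on stderr, and the per-round timings come from the tools' runtimes:

```shell
minimap2 -x map-ont -t 8 draft.fasta clean.fastq.gz > overlaps.paf 2> minimap2.log
racon -t 8 clean.fastq.gz overlaps.paf draft.fasta > polished.fasta
```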
- Assembly: The Autocycler tool generates a consensus de novo long-read assembly by combining multiple alternative assemblies produced by different assemblers (e.g. Canu, Flye, NextDenovo). The consensus long-read genome is then reoriented using dnaapler and polished with the short Illumina reads in the following steps:
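As a hedged outline, Autocycler's consensus workflow follows its documented subcommand sequence, roughly as below; subcommand names and paths should be checked against the Autocycler version pinned by the pipeline:

```shell
autocycler subsample --reads clean.fastq.gz --out_dir subsets --genome_size 3000000
# run the alternative assemblers (e.g. Canu, Flye, NextDenovo) on each read subset,
# collecting their outputs in an assemblies/ directory, then:
autocycler compress -i assemblies -a autocycler_out
autocycler cluster -a autocycler_out
autocycler trim -c autocycler_out/clustering/qc_pass/cluster_001
autocycler resolve -c autocycler_out/clustering/qc_pass/cluster_001
autocycler combine -a autocycler_out -i autocycler_out/clustering/qc_pass/cluster_*/5_final.gfa
```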
- Short-read QC and trimming: Trimming and filtering of low-quality bases and short reads are performed with fastp. Short-read quality is assessed before and after trimming using FastQC and summarised with MultiQC.
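This short-read QC stage corresponds roughly to the following (illustrative file names):

```shell
fastqc reads_1.fastq.gz reads_2.fastq.gz -o qc/raw
fastp -i reads_1.fastq.gz -I reads_2.fastq.gz -o trim_1.fastq.gz -O trim_2.fastq.gz
fastqc trim_1.fastq.gz trim_2.fastq.gz -o qc/trimmed
multiqc qc/ -o qc/multiqc
```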
- Mapping and polishing: Short reads are mapped to the consensus genome assembly using BWA-MEM, followed by a filtering and polishing step with Polypolish to improve the assembly.
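A sketch of this step, assuming a recent Polypolish with the filter/polish subcommands (older releases used a separate insert-filter script); Polypolish expects each read file aligned separately with all alignments reported (`bwa mem -a`):

```shell
bwa index consensus.fasta
bwa mem -t 8 -a consensus.fasta trim_1.fastq.gz > aln_1.sam
bwa mem -t 8 -a consensus.fasta trim_2.fastq.gz > aln_2.sam
polypolish filter --in1 aln_1.sam --in2 aln_2.sam --out1 filt_1.sam --out2 filt_2.sam
polypolish polish consensus.fasta filt_1.sam filt_2.sam > polished.fasta
```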
After the consensus genome assemblies have been generated, all assemblies are processed using the same workflow:
- Assembly and genome QC: Structural quality metrics are evaluated with QUAST, and genome completeness is assessed using BUSCO. A final combined report is generated with MultiQC.
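In command form, this QC step is roughly as follows (the BUSCO lineage dataset shown is an assumption; the pipeline may select it automatically):

```shell
quast.py polished.fasta -o quast_out
busco -i polished.fasta -m genome -l bacteria_odb10 -o busco_out
multiqc quast_out busco_out -o genome_qc_report
```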
- Annotation: Genome annotation is performed using both Prokka and Bakta. The resulting GFF annotation files from both annotation tools are cleaned and combined using AGAT.
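A sketch of the dual annotation step, with illustrative file names; the AGAT merge script shown is one of several AGAT tools that can combine GFF files, so the exact script choice is an assumption:

```shell
prokka --outdir prokka_out --prefix sample polished.fasta
bakta --db ./bakta_db --output bakta_out polished.fasta
agat_sp_merge_annotations.pl --gff prokka_out/sample.gff --gff bakta_out/sample.gff3 --out combined.gff
```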
- Post-assembly analyses:
- Mass screening of contigs for antimicrobial resistance and virulence genes using ABRicate.
- Identification of antimicrobial resistance genes and point mutations in protein and/or assembled nucleotide sequences using AMRFinder.
- Screening of genomes against traditional PubMLST schemes using MLST.
- Plasmid analysis: if the --plasmid option is added on the command line, the MOB-suite tool is used to predict and identify plasmid sequences in the assemblies.
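These screens map onto commands like the following (a sketch; ABRicate is run once per configured database, and the AMRFinder organism flag is only added when --organism is set):

```shell
abricate --db resfinder polished.fasta > abricate_resfinder.tab
amrfinder -n polished.fasta -o amrfinder.tsv          # add -O <organism> for point mutations
mlst polished.fasta > mlst.tsv
mob_recon --infile polished.fasta --outdir mob_out    # only when --plasmid is set
```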
Prerequisites to run the pipeline:
- Install Nextflow (Ver. ≥ 25.10.0).
- Install Docker or Singularity for container support.
- Ensure that Java 8 or a more recent version is installed.
Clone the Repository:
# Clone the workflow repository
git clone https://github.com/AMRmicrobiology/ONT_BACTERIAL_ANALYSIS.git
# Move inside the main directory
cd ONT_BACTERIAL_ANALYSIS
If you are running the pipeline locally, remember to define the paths for the Singularity temporary files and cache:
export SINGULARITY_TMPDIR=/PATH/singularity/tmp
export SINGULARITY_CACHEDIR=/PATH/singularity/cache
export TMPDIR=/PATH/singularity/tmp
export NXF_SINGULARITY_CACHEDIR=/PATH/singularity/cache
e.g.:
export SINGULARITY_TMPDIR=/mnt/dades/singularity/tmp
export SINGULARITY_CACHEDIR=/mnt/dades/singularity/tmp
export TMPDIR=/mnt/dades/singularity/tmp
export APPTAINER_TMPDIR=/mnt/dades/singularity/tmp
export APPTAINER_CACHEDIR=/mnt/dades/singularity/cache
export NXF_SINGULARITY_CACHEDIR=/mnt/dades/singularity/cache
export APPTAINERENV_NXF_TASK_WORKDIR=/mnt/dades/singularity/tmp
export APPTAINERENV_TMPDIR=/mnt/dades/singularity/tmp
Note
Conda environments are listed and created but have not been tested.
Inside the ONT_BACTERIAL_ANALYSIS directory, modify the file barcode_info.csv to add the expected genome size (bp) and the sample code you want to assign to each barcode:
Important
The sample code names should not include "-".
e.g.
barcode,genome_size,sample_code
barcode01,3000000,306
barcode02,4500000,C2_72
barcode03,5000000,C2_75
barcode04,4000000,C2_76
barcode05,3200000,ST89
barcode06,3500000,ST23
Run the pipeline using the following command, adjusting the parameters as needed:
ASSEMBLE
nextflow run main.nf --mode assemble --genome_size_file barcode_info.csv --input '/path/to/data/barcode*' -profile <docker/singularity/conda>
HYBRID
nextflow run main.nf --mode hybrid --genome_size_file barcode_info.csv --input '/path/to/data/barcode*' --short_reads '/path/to/data/*_{1,2}.fastq.gz' -profile <docker/singularity/conda>
Usage: nextflow run main.nf [--help] [--mode VAR] [--genome_size_file VAR] [--input VAR] [--short_reads VAR] [--outdir VAR] [--organism VAR] [--min_length VAR] [--min_mean_q VAR] [--keep_percent VAR] [--plasmid] [--bakta_db_define VAR] [--db_select VAR] [--abricate_db VAR] [-w VAR] [-profile VAR]
Input data arguments
--mode TEXT Selection of the pipeline assemble/hybrid [required]
--input PATH Input barcode* folder(s) containing the long reads .fastq.gz files [required]
--genome_size_file PATH Path to the .csv file with barcode, size and sample name information [required]
Pipeline specific
--short_reads PATH (--mode hybrid) Input FASTQ paired-end files named *_{1,2} (.fastq.gz format) [required]
Nextflow arguments
-profile TEXT Selection of execution profile (docker, singularity or conda) [required]
-w PATH Path to the work dir. where temporary files will be written [default: ./work ]
Output arguments
--outdir PATH Directory to write the output [default: ./out]
Optional arguments
--help Show this message and exit
--plasmid BOOLEAN Add this parameter to identify and type plasmid sequences in your assembly [default: false]
Long-read filtering arguments
--min_length INTEGER Minimum length threshold (bp) [default: 1000]
--min_mean_q INTEGER Minimum mean quality threshold [default: 10]
--keep_percent INTEGER Throw out the worst (100-x)% of read bases [default: 90].
AMR arguments
--organism TEXT By default, ABRicate searches the following databases: vfdb_full, resfinder, plasmidfinder, and card. If Escherichia coli or Klebsiella pneumoniae is specified, ecoli_vf and argannot will be searched, respectively, instead of vfdb_full [default: ""].
Databases arguments
--bakta_db_define PATH Define the path to the user downloaded database to be used by Bakta. By default the database is downloaded if no argument is added. Another option is to copy-paste the database directly to the "./bakta_db" directory
--db_select TEXT/PATH Kraken2 database to use for taxonomy classification. The options "db_16GB" or "db_full_60GB" are downloaded automatically if specified. Alternatively, a path to a user-provided database may be supplied. Another option is to copy-paste the database directly into the "./kraken_db" directory [default: "db_16GB"]
--abricate_db PATH Path to the user downloaded databases to be used by Abricate
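To stage the Bakta database ahead of the first run, the downloader that ships with Bakta can be used, e.g.:

```shell
bakta_db download --output ./bakta_db --type full
```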
This is the folder architecture and the content of the output data directory:
| Folder | Subfolder | Description |
|---|---|---|
| 1-QC | data_QC | Individual initial NanoPlot results folders and NanoComp summary of all reports before and after trimming |
| | genome_QC | Individual BUSCO and QUAST report folders and MultiQC combined report of all samples |
| 2-Assembly | | All final consensus genome assemblies ("sample_ID"_consensus_wrapped.fasta) |
| | 1-Fly_structural | NanoStats results and Flye output directories containing the graph files |
| | 2-Medaka_results | Medaka output directories |
| | 3-Annotations | All combined files produced by AGAT from the Bakta and Prokka annotation tools, plus the Bakta and Prokka output directories |
| 3-AMR | ABRICATE | ABRicate search results |
| | AMRFinder | AMRFinder search results |
| 4-MLST | | MLST results |
| 5-Plasmids | | MOB-suite plasmid tool output directories |
References:
- Benchmarking reveals superiority of deep learning variant callers on bacterial Nanopore sequence data
- How low can you go? Short-read polishing of Oxford Nanopore bacterial genome assemblies
- Assembling the perfect bacterial genome using Oxford Nanopore and Illumina sequencing