NanoPulse

Production-ready Nextflow DSL2 pipeline for de novo clustering and consensus building of Oxford Nanopore amplicon sequencing data (16S, 18S, ITS, and other amplicons).

About NanoPulse

NanoPulse is a production-ready Nextflow pipeline for species-level analysis of Oxford Nanopore Technologies (ONT) amplicon sequencing data. It performs de novo clustering using UMAP/PaCMAP dimensionality reduction (switchable) and HDBSCAN clustering, followed by consensus sequence generation and taxonomic classification.

This is a modernized fork of NanoCLUST with significant enhancements:

Complete DSL2 migration - Modern Nextflow syntax and modular structure
Production-ready - 11 critical bugs fixed through real-data testing
Updated dependencies - All tools updated to latest versions (Nextflow 25.10.0+)
Comprehensive testing - 99/99 tests passing (100% pass rate)
General amplicon support - 16S, 18S, ITS, and other amplicon types
Multiple classifiers - Kraken2 and BLAST support
Novel organism detection - Probabilistic classification with rescue analysis
Active maintenance - Ongoing development and bug fixes

Relationship to NanoCLUST

NanoPulse is based on the excellent NanoCLUST pipeline developed by Hector Rodriguez-Perez, Laura Ciuffreda, and Carlos Flores. We are deeply grateful for their foundational work and scientific validation.

Original Publication:

Rodríguez-Pérez H, Ciuffreda L, Flores C. NanoCLUST: a species-level analysis of 16S rRNA nanopore sequencing data. Bioinformatics. 2021;37(11):1600-1601. doi:10.1093/bioinformatics/btaa900

What NanoPulse Adds:

Nextflow DSL2 syntax (modernized from DSL1)
Critical production bug fixes (11 issues resolved)
Updated tool versions (all 38 dependencies)
Real-world data validation (5,147 ONT reads tested)
nf-core best practices implementation
Multiple classification backends (Kraken2, BLAST, FastANI)
Novel organism detection with probabilistic classification
Enhanced QC reporting (NanoPlot, MultiQC)
Phylogenetic analysis integration (optional phyloseq objects)

Pipeline Overview

The pipeline performs the following steps:

K-mer frequency calculation - Extract k-mer features from reads
UMAP/PaCMAP dimensionality reduction - Reduce k-mer space to 3D (switchable)
HDBSCAN clustering - Identify read clusters
Per-cluster assembly - Generate consensus sequences:
- Raven error correction
- FastANI draft selection
- Racon polishing (4 rounds)
- Medaka neural network polishing
Taxonomic classification - Optional classifiers:
- BLAST against NCBI databases
- Kraken2 classification
Abundance calculation - Generate abundance tables and diversity metrics
Visualization - Interactive HTML reports with UMAP plots
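
To make the first step concrete, here is a minimal sketch of per-read k-mer counting. It is an illustration only, not the pipeline's own implementation: the function name read_kmer_counts is hypothetical, and it assumes uncompressed four-line FASTQ records on standard input.

```shell
# Print "read_id<TAB>kmer<TAB>count" for every k-mer found in each read.
# Assumption: input is plain FASTQ with exactly four lines per record.
read_kmer_counts() {
    local k="${1:-9}"   # 9 matches the documented --kmer_size default
    awk -v k="$k" '
        NR % 4 == 1 { id = substr($1, 2) }       # header line: drop the leading "@"
        NR % 4 == 2 {                            # sequence line
            seq = toupper($0)
            split("", counts)                    # portable way to empty the array
            for (i = 1; i + k - 1 <= length(seq); i++)
                counts[substr(seq, i, k)]++
            for (kmer in counts)
                print id "\t" kmer "\t" counts[kmer]
        }
    ' | LC_ALL=C sort
}

# Usage: zcat sample1.fastq.gz | read_kmer_counts 9 | head
```

Each read becomes a vector of k-mer counts; it is vectors of this kind that a dimensionality reduction step would then project to a few dimensions before clustering.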

Quick Start

Prerequisites

Install Nextflow (≥25.10.0)

curl -s https://get.nextflow.io | bash
chmod +x nextflow
mkdir -p ~/bin && mv nextflow ~/bin/  # Or move it to any directory on your PATH

Install Docker or Conda

Test Run

Test the pipeline with included test data:

nextflow run FOI-Bioinformatics/NanoPulse \
    -profile test,docker \
    --outdir results_test

Running with Your Data

1. Prepare Input Samplesheet

Create a CSV file with your samples:

sample,fastq
sample1,/path/to/sample1.fastq.gz
sample2,/path/to/sample2.fastq.gz
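
For a directory with many FASTQ files, the samplesheet can be generated rather than typed by hand. The sketch below is one possible approach: the function name make_samplesheet is hypothetical, and it assumes that each sample name is the file name without the .fastq.gz suffix and that paths contain no commas.

```shell
# Write a samplesheet (header plus one row per *.fastq.gz file) to standard output.
make_samplesheet() {
    local fastq_dir="$1"
    local fq
    echo "sample,fastq"
    for fq in "$fastq_dir"/*.fastq.gz; do
        [ -e "$fq" ] || continue   # skip the unexpanded glob when nothing matches
        # Absolute paths keep the samplesheet valid from any working directory.
        echo "$(basename "$fq" .fastq.gz),$(cd "$(dirname "$fq")" && pwd)/$(basename "$fq")"
    done
}

# Usage: make_samplesheet /path/to/fastq_dir > samplesheet.csv
```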

2. Basic Run (No Classification)

nextflow run FOI-Bioinformatics/NanoPulse \
    -profile docker \
    --input samplesheet.csv \
    --outdir results \
    --enable_blast false \
    --enable_kraken2 false

3. Run with BLAST Classification

First, download a BLAST database (example for 16S rRNA):

mkdir -p db/blast db/taxdb
wget https://ftp.ncbi.nlm.nih.gov/blast/db/16S_ribosomal_RNA.tar.gz
tar -xzvf 16S_ribosomal_RNA.tar.gz -C db/blast
wget https://ftp.ncbi.nlm.nih.gov/blast/db/taxdb.tar.gz
tar -xzvf taxdb.tar.gz -C db/taxdb

Then run with BLAST enabled:

nextflow run FOI-Bioinformatics/NanoPulse \
    -profile docker \
    --input samplesheet.csv \
    --outdir results \
    --enable_blast true \
    --blast_db db/blast/16S_ribosomal_RNA \
    --blast_taxdb db/taxdb
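
Note that --blast_db in the command above is a database prefix (db/blast/16S_ribosomal_RNA), not a directory. A quick pre-flight check can catch a wrong prefix before the pipeline starts. This is a sketch with a hypothetical function name, blast_db_exists; it relies on the standard BLAST file layout, where a single-volume nucleotide database has a .nsq file and a multi-volume one has a .nal alias file.

```shell
# Return success if a nucleotide BLAST database exists at the given prefix.
blast_db_exists() {
    local prefix="$1"
    # <prefix>.nsq: sequence file of a single-volume database
    # <prefix>.nal: alias file that ties together a multi-volume database
    [ -e "$prefix.nsq" ] || [ -e "$prefix.nal" ]
}

# Usage:
#   blast_db_exists db/blast/16S_ribosomal_RNA || echo "BLAST database not found" >&2
```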

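4. Run with Kraken2 Classification

The example below runs Kraken2 alongside BLAST. To run Kraken2 alone, set --enable_blast false and omit the BLAST paths: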

nextflow run FOI-Bioinformatics/NanoPulse \
    -profile docker \
    --input samplesheet.csv \
    --outdir results \
    --enable_blast true \
    --blast_db /path/to/blast/db \
    --blast_taxdb /path/to/taxdb \
    --enable_kraken2 true \
    --kraken2_db /path/to/kraken2/db

Key Parameters

Input/Output

--input - Path to input samplesheet (CSV format)
--outdir - Output directory for results (default: ./results)

Classification Options

--enable_blast - Enable BLAST classification (default: true)
--blast_db - Path to BLAST database
--blast_taxdb - Path to BLAST taxonomy database
--enable_kraken2 - Enable Kraken2 classification (default: false)
--kraken2_db - Path to Kraken2 database

Clustering Parameters

--kmer_size - K-mer size for feature extraction (default: 9)
--umap_dimensions - UMAP output dimensions (default: 3)
--umap_neighbors - UMAP n_neighbors parameter (default: 15)
--umap_min_dist - UMAP min_dist parameter (default: 0.1)
--min_cluster_size - Minimum cluster size for HDBSCAN (default: 50)
--min_samples - Minimum samples for HDBSCAN (default: 5)

Assembly Parameters

--genome_size - Expected amplicon size (default: "1.5k")
--polishing_reads - Reads per cluster for polishing (default: 100)
--racon_rounds - Racon polishing rounds (default: 4)
--medaka_model - Medaka basecalling model (default: "r941_min_high_g303")

Resource Limits

--max_cpus - Maximum CPUs (default: 16)
--max_memory - Maximum memory (default: 128.GB)
--max_time - Maximum time (default: 240.h)

Computing Requirements

Memory Considerations

The UMAP/PaCMAP clustering step is memory-intensive:

Default settings (umap_set_size = 100,000): 32-36 GB RAM
Reduced settings (umap_set_size = 50,000): 10-13 GB RAM

If you encounter out-of-memory errors (exit status 137), reduce umap_set_size:

nextflow run FOI-Bioinformatics/NanoPulse \
    --umap_set_size 50000 \
    ...other options...
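
The choice of umap_set_size can be derived from the machine's memory using the figures quoted above. The following is a rough sketch, not part of the pipeline: the function names are hypothetical, the thresholds take the upper end of each quoted range, and total_mem_gb works on Linux only because it reads /proc/meminfo.

```shell
# Suggest a --umap_set_size value for a given amount of RAM (whole GB).
suggest_umap_set_size() {
    local mem_gb="$1"
    if [ "$mem_gb" -ge 36 ]; then
        echo 100000      # default setting: 32-36 GB RAM
    elif [ "$mem_gb" -ge 13 ]; then
        echo 50000       # reduced setting: 10-13 GB RAM
    else
        echo "Less than 13 GB RAM: even the reduced setting may run out of memory" >&2
        return 1
    fi
}

# Total RAM in whole GB (Linux only: MemTotal in /proc/meminfo is reported in kB).
total_mem_gb() {
    awk '/^MemTotal:/ { print int($2 / 1024 / 1024) }' /proc/meminfo
}

# Usage: suggest_umap_set_size "$(total_mem_gb)"
```

Total RAM is an optimistic bound, since other processes also need memory; on a shared machine, pass the amount actually available to the job instead.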

CPU Utilization

By default, Nextflow's local executor schedules tasks across all available CPUs. More cores enable:

Parallel cluster processing
Faster consensus generation
Reduced overall runtime

Test Profile Requirements

Minimum for test profile:

4 CPU cores
16 GB RAM

Output Files

The pipeline generates the following key outputs in --outdir:

results/
├── consensus/
│   └── {sample}_consensus.fasta         # Final consensus sequences
├── annotations/
│   └── {sample}_annotations.tsv         # Taxonomic annotations
├── abundances/
│   └── {sample}_abundances.csv          # Cluster abundances
├── diversity/
│   └── {sample}_diversity.txt           # Diversity metrics
├── plots/
│   └── {sample}_dimreduction_plot.png   # Dimensionality reduction plot (UMAP or PaCMAP)
├── html_reports/
│   └── {sample}_report.html            # Interactive HTML report
├── multiqc/
│   └── multiqc_report.html             # MultiQC report (if enabled)
└── pipeline_info/
    ├── execution_report.html           # Nextflow execution report
    ├── execution_timeline.html         # Execution timeline
    └── execution_trace.txt             # Resource usage trace
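
As a quick sanity check after a run, the number of consensus sequences per sample can be read straight from the consensus/ directory shown above. This is a small sketch with a hypothetical function name, count_consensus; it assumes the {sample}_consensus.fasta naming from the tree and counts FASTA header lines.

```shell
# Print "sample<TAB>number of consensus sequences" for each sample in an output directory.
count_consensus() {
    local outdir="$1"
    local fasta
    for fasta in "$outdir"/consensus/*_consensus.fasta; do
        [ -e "$fasta" ] || continue   # skip the unexpanded glob when nothing matches
        # Every FASTA record starts with a ">" header line.
        printf '%s\t%s\n' "$(basename "$fasta" _consensus.fasta)" "$(grep -c '^>' "$fasta")"
    done
}

# Usage: count_consensus results
```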

Profiles

Execution Profiles

docker - Use Docker containers (recommended)
singularity - Use Singularity containers
conda - Use Conda environments

Test Profiles

test - Minimal test dataset
test_full - Full-size test dataset

Example Usage

# Docker with test data
nextflow run . -profile test,docker

# Singularity on HPC
nextflow run . -profile singularity --input data.csv --outdir results

# Conda environment
nextflow run . -profile conda --input data.csv --outdir results

Troubleshooting

Docker Permission Issues

If you encounter Docker permission errors, add your user to the docker group:

sudo usermod -aG docker $USER
# Log out and back in for changes to take effect

Memory Issues

If processes fail with exit status 137 (out of memory):

Reduce umap_set_size:
```
--umap_set_size 50000
```
Reduce cluster size threshold:
```
--min_cluster_size 30
```
Limit resources explicitly:
```
--max_memory '32.GB' --max_cpus 8
```

Conda Environment Issues

If you experience issues with Conda profiles, try:

Use Docker profile instead (recommended)
Clear Conda cache:
```
conda clean --all
```
Use mamba for faster dependency resolution:
```
conda install mamba -c conda-forge
```

Resume Failed Runs

Nextflow can resume interrupted runs:

nextflow run . -profile docker --input data.csv -resume

Citations and Credits

NanoPulse Development

Maintainer: FOI-Bioinformatics Team
Repository: https://github.com/FOI-Bioinformatics/NanoPulse
License: MIT License

Original NanoCLUST Development

Original Authors: Hector Rodriguez-Perez, Laura Ciuffreda, Carlos Flores
Original Repository: https://github.com/genomicsITER/NanoCLUST
Institution: Instituto Tecnológico y de Energías Renovables (ITER), Canary Islands, Spain

Publication:

Rodríguez-Pérez H, Ciuffreda L, Flores C. NanoCLUST: a species-level analysis of 16S rRNA nanopore sequencing data. Bioinformatics. 2021;37(11):1600-1601. doi:10.1093/bioinformatics/btaa900

Funding (Original NanoCLUST): This work was supported by Instituto de Salud Carlos III [PI14/00844, PI17/00610, and FI18/00230] and co-financed by the European Regional Development Funds, "A way of making Europe" from the European Union; Ministerio de Ciencia e Innovación [RTC-2017-6471-1, AEI/FEDER, UE]; Cabildo Insular de Tenerife [CGIEU0000219140]; Fundación Canaria Instituto de Investigación Sanitaria de Canarias [PIFUN48/18]; and by the agreement with Instituto Tecnológico y de Energías Renovables (ITER) to strengthen scientific and technological education, training, research, development and innovation in Genomics, Personalized Medicine and Biotechnology [OA17/008].

Contributions and Support

We welcome contributions to NanoPulse! Please see the contributing guidelines for details.

To report issues or request features, please use the GitHub issue tracker.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Both NanoCLUST and NanoPulse are MIT licensed, allowing free use, modification, and distribution with proper attribution.

Acknowledgments

We acknowledge and thank:

The original NanoCLUST developers for their pioneering work
The Nextflow community for excellent workflow tools
The nf-core community for best practices and modules
All contributors to the open-source tools used in this pipeline
