Production-ready Nextflow DSL2 pipeline for de novo clustering and consensus building of Oxford Nanopore amplicon sequencing data (16S, 18S, ITS, and other amplicons).
NanoPulse is a production-ready Nextflow pipeline for species-level analysis of Oxford Nanopore Technologies (ONT) amplicon sequencing data. It performs de novo clustering using UMAP/PaCMAP dimensionality reduction (switchable) and HDBSCAN clustering, followed by consensus sequence generation and taxonomic classification.
This is a modernized fork of NanoCLUST with significant enhancements:
- Complete DSL2 migration - Modern Nextflow syntax and modular structure
- Production-ready - 11 critical bugs fixed through real-data testing
- Updated dependencies - All tools updated to latest versions (Nextflow 25.10.0+)
- Comprehensive testing - 99/99 tests passing (100% pass rate)
- General amplicon support - 16S, 18S, ITS, and other amplicon types
- Multiple classifiers - Kraken2 and BLAST support
- Novel organism detection - Probabilistic classification with rescue analysis
- Active maintenance - Ongoing development and bug fixes
NanoPulse is based on the excellent NanoCLUST pipeline developed by Hector Rodriguez-Perez, Laura Ciuffreda, and Carlos Flores. We are deeply grateful for their foundational work and scientific validation.
Original Publication:
Rodríguez-Pérez H, Ciuffreda L, Flores C. NanoCLUST: a species-level analysis of 16S rRNA nanopore sequencing data. Bioinformatics. 2021;37(11):1600-1601. doi:10.1093/bioinformatics/btaa900
What NanoPulse Adds:
- Nextflow DSL2 syntax (modernized from DSL1)
- Critical production bug fixes (11 issues resolved)
- Updated tool versions (all 38 dependencies)
- Real-world data validation (5,147 ONT reads tested)
- nf-core best practices implementation
- Multiple classification backends (Kraken2, BLAST, FastANI)
- Novel organism detection with probabilistic classification
- Enhanced QC reporting (NanoPlot, MultiQC)
- Phylogenetic analysis integration (optional phyloseq objects)
The pipeline performs the following steps:
- K-mer frequency calculation - Extract k-mer features from reads
- UMAP or PaCMAP dimensionality reduction - Reduce k-mer space to 3D (the method is switchable)
- HDBSCAN clustering - Identify read clusters
- Per-cluster assembly - Generate consensus sequences:
  - Raven error correction
  - FastANI draft selection
  - Racon polishing (4 rounds)
  - Medaka neural network polishing
- Taxonomic classification - Optional classifiers:
  - BLAST against NCBI databases
  - Kraken2 classification
- Abundance calculation - Generate abundance tables and diversity metrics
- Visualization - Interactive HTML reports with UMAP plots
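To make the first step concrete, here is a minimal sketch of k-mer frequency extraction in plain Python. It is not taken from the NanoPulse code base: the function name `kmer_frequencies`, the choice to skip windows containing characters other than A, C, G, T, and the normalization to relative frequencies are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq: str, k: int = 3) -> list[float]:
    """Return normalized k-mer frequencies in lexicographic k-mer order.

    Illustrative only; the pipeline's real feature extraction may differ.
    """
    seq = seq.upper()
    # Count every overlapping window that contains only A, C, G, T.
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= set("ACGT")
    )
    total = sum(counts.values())
    # Fixed ordering of all 4**k possible k-mers, so every read
    # maps to a vector of the same length.
    kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return [counts[km] / total if total else 0.0 for km in kmers]


vec = kmer_frequencies("ACGTACGTAC", k=2)
print(len(vec))  # 16, one entry per possible 2-mer
```

Each read becomes a vector of length 4^k, so the default `--kmer_size` of 9 listed among the parameters would give 4^9 = 262,144 entries per read. That is the kind of high-dimensional space the dimensionality-reduction step is meant to compress before clustering.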
- Install Nextflow (≥25.10.0)
curl -s https://get.nextflow.io | bash
mv nextflow ~/bin/ # Or add to your PATH
Test the pipeline with included test data:
nextflow run FOI-Bioinformatics/NanoPulse \
-profile test,docker \
--outdir results_test
Create a CSV file with your samples:
sample,fastq
sample1,/path/to/sample1.fastq.gz
sample2,/path/to/sample2.fastq.gz
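A samplesheet in the layout above can be sanity-checked before launching a run. The sketch below is a hypothetical helper, not part of NanoPulse: it assumes the header must be exactly `sample,fastq` and that sample names should be unique, neither of which is stated in this README, and it does not check that the FASTQ paths exist.

```python
import csv
import io


def check_samplesheet(text: str) -> list[str]:
    """Return a list of problems found in samplesheet CSV text (empty if none)."""
    problems = []
    reader = csv.DictReader(io.StringIO(text))
    # Assumption: the header is exactly the two columns shown in the example.
    if reader.fieldnames != ["sample", "fastq"]:
        return [f"unexpected header: {reader.fieldnames}"]
    seen = set()
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        name, path = row["sample"], row["fastq"]
        if not name or not path:
            problems.append(f"line {line_no}: empty field")
        if name in seen:
            problems.append(f"line {line_no}: duplicate sample '{name}'")
        seen.add(name)
    return problems


sheet = "sample,fastq\nsample1,/path/to/sample1.fastq.gz\n"
print(check_samplesheet(sheet))  # []
```

To check a file on disk, pass `open("samplesheet.csv").read()` as the argument.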
nextflow run FOI-Bioinformatics/NanoPulse \
-profile docker \
--input samplesheet.csv \
--outdir results \
--enable_blast false \
--enable_kraken2 false
First, download a BLAST database (example for 16S rRNA):
mkdir -p db/blast db/taxdb
wget https://ftp.ncbi.nlm.nih.gov/blast/db/16S_ribosomal_RNA.tar.gz
tar -xzvf 16S_ribosomal_RNA.tar.gz -C db/blast
wget https://ftp.ncbi.nlm.nih.gov/blast/db/taxdb.tar.gz
tar -xzvf taxdb.tar.gz -C db/taxdb
Then run with BLAST enabled:
nextflow run FOI-Bioinformatics/NanoPulse \
-profile docker \
--input samplesheet.csv \
--outdir results \
--enable_blast true \
--blast_db db/blast/16S_ribosomal_RNA \
--blast_taxdb db/taxdb
Run with both BLAST and Kraken2 enabled:
nextflow run FOI-Bioinformatics/NanoPulse \
-profile docker \
--input samplesheet.csv \
--outdir results \
--enable_blast true \
--blast_db /path/to/blast/db \
--blast_taxdb /path/to/taxdb \
--enable_kraken2 true \
--kraken2_db /path/to/kraken2/db
- --input: Path to input samplesheet (CSV format)
- --outdir: Output directory for results (default: ./results)
- --enable_blast: Enable BLAST classification (default: true)
- --blast_db: Path to BLAST database
- --blast_taxdb: Path to BLAST taxonomy database
- --enable_kraken2: Enable Kraken2 classification (default: false)
- --kraken2_db: Path to Kraken2 database
- --kmer_size: K-mer size for feature extraction (default: 9)
- --umap_dimensions: UMAP output dimensions (default: 3)
- --umap_neighbors: UMAP n_neighbors parameter (default: 15)
- --umap_min_dist: UMAP min_dist parameter (default: 0.1)
- --min_cluster_size: Minimum cluster size for HDBSCAN (default: 50)
- --min_samples: Minimum samples for HDBSCAN (default: 5)
- --genome_size: Expected amplicon size (default: "1.5k")
- --polishing_reads: Reads per cluster for polishing (default: 100)
- --racon_rounds: Racon polishing rounds (default: 4)
- --medaka_model: Medaka basecalling model (default: "r941_min_high_g303")
- --max_cpus: Maximum CPUs (default: 16)
- --max_memory: Maximum memory (default: 128.GB)
- --max_time: Maximum time (default: 240.h)
The UMAP/PaCMAP clustering step is memory-intensive:
- Default settings (umap_set_size = 100,000): 32-36 GB RAM
- Reduced settings (umap_set_size = 50,000): 10-13 GB RAM
If you encounter out-of-memory errors (exit status 137), reduce umap_set_size:
nextflow run FOI-Bioinformatics/NanoPulse \
--umap_set_size 50000 \
...other options...
Nextflow automatically uses all available CPUs. More cores enable:
- Parallel cluster processing
- Faster consensus generation
- Reduced overall runtime
Minimum for test profile:
- 4 CPU cores
- 16 GB RAM
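The two memory figures quoted for `umap_set_size` can be turned into a simple pre-flight check. The helper below is a hypothetical sketch, not a pipeline feature: it only chooses between the two documented settings, using the upper end of each quoted RAM range, and makes no attempt to extrapolate to other values.

```python
from typing import Optional

# Upper ends of the RAM ranges quoted in this README, in GB,
# keyed by the corresponding umap_set_size value.
DOCUMENTED_PEAK_GB = {100_000: 36, 50_000: 13}


def pick_umap_set_size(available_gb: float) -> Optional[int]:
    """Return the largest documented umap_set_size that fits, else None."""
    fitting = [n for n, gb in DOCUMENTED_PEAK_GB.items() if gb <= available_gb]
    return max(fitting) if fitting else None


print(pick_umap_set_size(16))  # 50000
```

A return value of `None` means that neither documented setting fits the given memory. What a smaller, untested `umap_set_size` would need is not something this README specifies.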
The pipeline generates the following key outputs in --outdir:
results/
├── consensus/
│   └── {sample}_consensus.fasta        # Final consensus sequences
├── annotations/
│   └── {sample}_annotations.tsv        # Taxonomic annotations
├── abundances/
│   └── {sample}_abundances.csv         # Cluster abundances
├── diversity/
│   └── {sample}_diversity.txt          # Diversity metrics
├── plots/
│   └── {sample}_dimreduction_plot.png  # UMAP visualization
├── html_reports/
│   └── {sample}_report.html            # Interactive HTML report
├── multiqc/
│   └── multiqc_report.html             # MultiQC report (if enabled)
└── pipeline_info/
    ├── execution_report.html           # Nextflow execution report
    ├── execution_timeline.html         # Execution timeline
    └── execution_trace.txt             # Resource usage trace
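After a run, the per-sample files in the tree above can be checked with a short script. This is a sketch under assumptions: the function name `missing_outputs` is hypothetical, and the path patterns are copied from the tree rather than from the pipeline's publishing rules. Some files may legitimately be absent, for example annotations when no classifier is enabled.

```python
from pathlib import Path

# Per-sample outputs as laid out in the tree above; "{sample}" is filled in below.
EXPECTED = [
    "consensus/{sample}_consensus.fasta",
    "annotations/{sample}_annotations.tsv",
    "abundances/{sample}_abundances.csv",
    "diversity/{sample}_diversity.txt",
    "plots/{sample}_dimreduction_plot.png",
    "html_reports/{sample}_report.html",
]


def missing_outputs(outdir: str, sample: str) -> list[str]:
    """Return the expected per-sample files that are absent under outdir."""
    root = Path(outdir)
    return [
        rel
        for rel in (pattern.format(sample=sample) for pattern in EXPECTED)
        if not (root / rel).is_file()
    ]


# Example: list whatever is missing for sample1 under ./results.
for rel in missing_outputs("results", "sample1"):
    print("missing:", rel)
```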
- docker: Use Docker containers (recommended)
- singularity: Use Singularity containers
- conda: Use Conda environments
- test: Minimal test dataset
- test_full: Full-size test dataset
# Docker with test data
nextflow run . -profile test,docker
# Singularity on HPC
nextflow run . -profile singularity --input data.csv --outdir results
# Conda environment
nextflow run . -profile conda --input data.csv --outdir results
If you encounter Docker permission errors, add your user to the docker group:
sudo usermod -aG docker $USER
# Log out and back in for changes to take effect
If processes fail with exit status 137 (out of memory):
- Reduce umap_set_size: --umap_set_size 50000
- Reduce cluster size threshold: --min_cluster_size 30
- Limit resources explicitly: --max_memory '32.GB' --max_cpus 8
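Exit status 137 follows the common shell convention of reporting 128 plus the signal number when a process is killed by a signal: 137 - 128 = 9, which is SIGKILL, the signal the Linux out-of-memory killer sends. The small decoder below illustrates that arithmetic. The function name is hypothetical, and signal names are only resolved on POSIX systems. SIGKILL can also come from other sources, such as a scheduler enforcing a limit, so 137 is a strong hint of memory exhaustion rather than proof.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Explain a shell-style exit status, decoding 128 + N as signal N."""
    if status > 128:
        try:
            name = signal.Signals(status - 128).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {status - 128}"
        return f"terminated by {name}"
    return f"exited with code {status}"


print(describe_exit_status(137))  # terminated by SIGKILL (on Linux/macOS)
```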
If you experience issues with Conda profiles, try:
- Use Docker profile instead (recommended)
- Clear Conda cache:
conda clean --all
- Use mamba for faster dependency resolution:
conda install mamba -c conda-forge
Nextflow can resume interrupted runs:
nextflow run . -profile docker --input data.csv -resume
Maintainer: FOI-Bioinformatics Team
Repository: https://github.com/FOI-Bioinformatics/NanoPulse
License: MIT License
Original Authors: Hector Rodriguez-Perez, Laura Ciuffreda, Carlos Flores
Original Repository: https://github.com/genomicsITER/NanoCLUST
Institution: Instituto Tecnológico y de Energías Renovables (ITER), Canary Islands, Spain
Funding (Original NanoCLUST): This work was supported by Instituto de Salud Carlos III [PI14/00844, PI17/00610, and FI18/00230] and co-financed by the European Regional Development Funds, "A way of making Europe" from the European Union; Ministerio de Ciencia e Innovación [RTC-2017-6471-1, AEI/FEDER, UE]; Cabildo Insular de Tenerife [CGIEU0000219140]; Fundación Canaria Instituto de Investigación Sanitaria de Canarias [PIFUN48/18]; and by the agreement with Instituto Tecnológico y de Energías Renovables (ITER) to strengthen scientific and technological education, training, research, development and innovation in Genomics, Personalized Medicine and Biotechnology [OA17/008].
We welcome contributions to NanoPulse! Please see the contributing guidelines for details.
To report issues or request features, please use the GitHub issue tracker.
This project is licensed under the MIT License - see the LICENSE file for details.
Both NanoCLUST and NanoPulse are MIT licensed, allowing free use, modification, and distribution with proper attribution.
We acknowledge and thank:
- The original NanoCLUST developers for their pioneering work
- The Nextflow community for excellent workflow tools
- The nf-core community for best practices and modules
- All contributors to the open-source tools used in this pipeline