A Python CLI tool for dereplicating and filtering viral contigs (vOTUs - viral Operational Taxonomic Units) using the CheckV method.
A small toolkit developed for the EBAME workshop with subcommands:
- derep: Remove redundant viral sequences using BLAST-based ANI clustering
- filter: Filter viral contigs based on quality, completeness, and other metrics from the CheckV TSV output
- tabulate: Generate CSV tables from paired-end sequencing read directories (for Nextflow)
- trainingdata: Fetch viral assembly datasets for training purposes
- getdbs: Download geNomad and CheckV databases
- splitcoverm: Split a CoverM TSV by metric into separate TSVs, one per metric
- Python >= 3.10
- BLAST+ toolkit (specifically blastn and makeblastdb)
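The BLAST+ requirement can be checked from Python before running the tool. This is a minimal sketch using only the standard library; `missing_tools` is a hypothetical helper and not part of votuderep.

```python
import shutil

# Executables named in the requirements list.
REQUIRED_TOOLS = ("blastn", "makeblastdb")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on PATH (hypothetical helper)."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH: " + ", ".join(missing))
else:
    print("BLAST+ executables found.")
```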
# Clone the repository
git clone https://github.com/yourusername/votuderep.git
cd votuderep
# Install in development mode
pip install -e .
# Or install normally
pip install .
votuderep requires BLAST+ to be installed and available in your PATH:
# Using conda (recommended)
conda install -c bioconda blast
# On Ubuntu/Debian
sudo apt-get install ncbi-blast+
# On macOS
brew install blast
votuderep provides the following subcommands: derep, filter, tabulate, trainingdata, getdbs, and splitcoverm.
Remove redundant sequences using BLAST and ANI clustering:
votuderep derep -i input.fasta -o dereplicated.fasta
Options:
- -i, --input: Input FASTA file [required]
- -o, --output: Output FASTA file [default: dereplicated_vOTUs.fasta]
- -t, --threads: Number of threads for BLAST [default: 2]
- --tmp: Temporary directory [default: $TEMP or /tmp]
- --min-ani: Minimum ANI threshold (0-100) [default: 95]
- --min-tcov: Minimum target coverage (0-100) [default: 85]
- --keep: Keep temporary directory with intermediate files
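To picture how the --min-ani and --min-tcov thresholds could interact, here is a minimal sketch of greedy, longest-first clustering in the style of the CheckV clustering scripts. It is not votuderep's code: the function name, the `(query, target, ani, tcov)` tuple layout, and the toy values are assumptions made for illustration, and the pairwise ANI values are taken as already computed from BLAST hits.

```python
from collections import defaultdict


def greedy_cluster(lengths, hits, min_ani=95.0, min_tcov=85.0):
    """Greedy longest-first clustering (illustrative, hypothetical helper).

    lengths: dict of sequence id -> length
    hits: iterable of (query, target, ani, tcov) tuples, percentages in 0-100
    Returns a dict of representative id -> list of members (representative first).
    """
    # Keep only the pairs that satisfy both thresholds.
    edges = defaultdict(list)
    for query, target, ani, tcov in hits:
        if query != target and ani >= min_ani and tcov >= min_tcov:
            edges[query].append(target)

    clusters = {}
    assigned = set()
    # Visit the longest sequences first so that they become representatives.
    for seq_id in sorted(lengths, key=lambda s: (-lengths[s], s)):
        if seq_id in assigned:
            continue
        clusters[seq_id] = [seq_id]
        assigned.add(seq_id)
        for member in edges[seq_id]:
            if member not in assigned:
                clusters[seq_id].append(member)
                assigned.add(member)
    return clusters


# Toy data: three contigs and their pairwise hits.
lengths = {"a": 10000, "b": 8000, "c": 5000}
hits = [("a", "b", 98.0, 90.0), ("a", "c", 96.0, 60.0), ("b", "c", 99.0, 95.0)]
print(greedy_cluster(lengths, hits))
```

With the default thresholds, "b" joins the cluster represented by "a", while "c" stays on its own because "a" covers only 60% of it. Sequences that are already cluster members do not recruit further members in this sketch.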
Example:
# Basic dereplication
votuderep derep -i viral_contigs.fasta -o dereplicated.fasta
# With custom parameters
votuderep derep -i viral_contigs.fasta -o dereplicated.fasta \
--min-ani 97 --min-tcov 90 -t 8
# Keep intermediate files for inspection
votuderep derep -i viral_contigs.fasta -o dereplicated.fasta \
--keep --tmp ./temp_dir
Filter viral contigs based on CheckV quality metrics:
votuderep filter input.fasta checkv_output.tsv -o filtered.fasta
Required Arguments:
- FASTA: Input FASTA file with viral contigs
- CHECKV_OUT: TSV output file from CheckV
Options:
Length filters:
- -m, --min-len: Minimum contig length [default: 0]
- --max-len: Maximum contig length, 0 = unlimited [default: 0]
Quality filters:
- --min-quality: Minimum quality level: low, medium, or high [default: low]
- --complete: Only keep complete genomes
- --exclude-undetermined: Exclude contigs where quality is "Not-determined"
Metrics filters:
- -c, --min-completeness: Minimum completeness percentage (0-100)
- --max-contam: Maximum contamination percentage (0-100)
- --no-warnings: Only keep contigs with no warnings
Other filters:
- --provirus: Only select proviruses (provirus == "Yes")
- -o, --output: Output FASTA file [default: STDOUT]
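The length, metric, and provirus filters can be pictured as row-wise checks on the CheckV table. The sketch below is an illustration rather than votuderep's implementation. It assumes the column names of CheckV's quality_summary.tsv (`contig_id`, `contig_length`, `provirus`, `completeness`, `contamination`, `warnings`), and it assumes that a missing ("NA") value fails the corresponding filter, which may differ from what the tool does. `select_contigs` is a hypothetical name.

```python
import csv
import io


def select_contigs(tsv_text, min_len=0, max_len=0, min_completeness=None,
                   max_contam=None, no_warnings=False, provirus=False):
    """Return the contig ids whose CheckV row passes every requested filter."""
    kept = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        length = int(row["contig_length"])
        # A max_len of 0 means "unlimited", mirroring the option description.
        if length < min_len or (max_len > 0 and length > max_len):
            continue
        if min_completeness is not None:
            # Assumption: a missing ("NA") completeness fails the filter.
            value = row["completeness"]
            if value == "NA" or float(value) < min_completeness:
                continue
        if max_contam is not None:
            # Assumption: a missing ("NA") contamination fails the filter.
            value = row["contamination"]
            if value == "NA" or float(value) > max_contam:
                continue
        if no_warnings and row["warnings"].strip():
            continue
        if provirus and row["provirus"] != "Yes":
            continue
        kept.append(row["contig_id"])
    return kept
```

The returned ids would then be used to pick the matching records from the input FASTA file.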
Examples:
# Basic filtering - minimum quality
votuderep filter viral_contigs.fasta checkv_output.tsv -o filtered.fasta
# High-quality sequences only
votuderep filter viral_contigs.fasta checkv_output.tsv \
--min-quality high -o high_quality.fasta
# Complete genomes with minimum length
votuderep filter viral_contigs.fasta checkv_output.tsv \
--complete --min-len 5000 -o complete_genomes.fasta
# Complex filtering
votuderep filter viral_contigs.fasta checkv_output.tsv \
--min-quality medium \
--min-completeness 80 \
--max-contam 5 \
--no-warnings \
--min-len 3000 \
-o high_confidence.fasta
# Output to stdout (for piping)
votuderep filter viral_contigs.fasta checkv_output.tsv > filtered.fasta
Quality Levels:
CheckV assigns quality levels to viral contigs:
- Complete: Complete genomes (highest quality)
- High-quality: Estimated completeness above 90%
- Medium-quality: Estimated completeness of 50-90%
- Low-quality: Estimated completeness below 50%
- Not-determined: Completeness could not be estimated
The --min-quality option filters inclusively:
- low: Includes Low, Medium, High, and Complete (default)
- medium: Includes Medium, High, and Complete
- high: Includes High and Complete only
Note: "Not-determined" sequences are included by default unless --exclude-undetermined is used.
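The inclusive behaviour of --min-quality can be expressed as a rank comparison. The following is a small sketch of that reading, not the tool's source. It uses the quality labels listed above, and the function and dictionary names are hypothetical. It treats "Not-determined" as passing at every level unless it is explicitly excluded, which is one interpretation of the note above.

```python
# Rank of each CheckV quality label; a higher rank is a better tier.
QUALITY_RANK = {
    "Low-quality": 1,
    "Medium-quality": 2,
    "High-quality": 3,
    "Complete": 4,
}

# Lowest rank accepted by each --min-quality value.
MIN_RANK = {"low": 1, "medium": 2, "high": 3}


def passes_quality(label, min_quality="low", exclude_undetermined=False):
    """Return True if a contig with this quality label would be kept."""
    if label == "Not-determined":
        # Assumption: kept at every level unless explicitly excluded.
        return not exclude_undetermined
    return QUALITY_RANK[label] >= MIN_RANK[min_quality]


for label in ("Complete", "High-quality", "Medium-quality", "Low-quality"):
    print(label, passes_quality(label, min_quality="medium"))
```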
Generate a CSV table from a directory containing paired-end sequencing reads:
votuderep tabulate reads/ -o samples.csv
Required Arguments:
- INPUT_DIR: Directory containing sequencing read files
Options:
- -o, --output: Output CSV file [default: STDOUT]
- -d, --delimiter: Field separator [default: ,]
- -1, --for-tag: Forward read identifier [default: _R1]
- -2, --rev-tag: Reverse read identifier [default: _R2]
- -s, --strip: Remove string from sample names (can be used multiple times)
- -e, --extension: Only process files with this extension
- -a, --absolute: Use absolute paths in output
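The pairing controlled by the read tags, --strip, and --extension can be sketched roughly as follows. This is an assumption about the general approach, not the tool's actual logic. `pair_reads` is a hypothetical helper: it derives the sample name from the text before the read tag and drops samples with only one mate, which votuderep may handle differently.

```python
def pair_reads(filenames, for_tag="_R1", rev_tag="_R2", strip=(), extension=None):
    """Group file names into {sample: [forward, reverse]} (hypothetical helper)."""
    samples = {}
    for name in sorted(filenames):
        if extension and not name.endswith(extension):
            continue
        for index, tag in enumerate((for_tag, rev_tag)):
            if tag in name:
                # Assumption: the sample name is the text before the read tag.
                sample = name.split(tag)[0]
                for pattern in strip:
                    sample = sample.replace(pattern, "")
                samples.setdefault(sample, [None, None])[index] = name
                break
    # Keep only the samples for which both mates were found.
    return {s: pair for s, pair in samples.items() if None not in pair}


files = ["Sample_A_R1.fastq.gz", "Sample_A_R2.fastq.gz", "Sample_B_R1.fastq.gz"]
print(pair_reads(files, strip=["Sample_"]))
```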
Examples:
# Basic usage - generate CSV table
votuderep tabulate reads/ -o samples.csv
# Custom read tags and extension
votuderep tabulate reads/ --for-tag _1 --rev-tag _2 --extension .fq.gz
# Strip patterns from sample names and use absolute paths
votuderep tabulate reads/ --strip "Sample_" --strip ".filtered" -a
Download viral assembly and sequencing reads for training purposes:
votuderep trainingdata -o ./ebame-virome/
Options:
- -o, --outdir: Output directory [default: ./ebame-virome/]
Example:
# Download to default directory
votuderep trainingdata
# Download to custom directory
votuderep trainingdata -o ./training_data/
Split a CoverM TSV by metric into separate TSVs, one per metric.
Reads a CoverM output table containing multiple metrics across samples and splits it into individual TSV files, one for each metric. Each output file will have the format: <basename>_<metric>.tsv
The input TSV is expected to have columns formatted as: "Contig", "<sample> <metric>", "<sample> <metric>", ...
votuderep splitcoverm -i coverage.tsv -o output/cov
Options:
- -i, --input: Input CoverM TSV (optionally gzipped: .gz) [required]
- -o, --output-basename: Output basename/prefix for generated files [required]
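The splitting step can be illustrated with a short sketch. It is not votuderep's code. It assumes that every column after "Contig" is named as a sample name, a single space, and a metric name, and that sample names contain no spaces (metric names may). `split_by_metric` is a hypothetical function that works on text already read into memory.

```python
import csv
import io
from collections import defaultdict


def split_by_metric(tsv_text):
    """Return {metric: rows}, where each rows list starts with a header row."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    # Map each metric to the column indices and sample names that carry it.
    columns = defaultdict(list)
    for index, name in enumerate(header[1:], start=1):
        # Assumption: the sample name ends at the first space.
        sample, metric = name.split(" ", 1)
        columns[metric].append((index, sample))
    # One table per metric, with the sample names as column headers.
    tables = {m: [[header[0]] + [s for _, s in cols]] for m, cols in columns.items()}
    for row in reader:
        for metric, cols in columns.items():
            tables[metric].append([row[0]] + [row[i] for i, _ in cols])
    return tables


example = "Contig\tS1 Mean\tS1 TPM\tS2 Mean\tS2 TPM\nc1\t1.5\t10\t2.5\t20\n"
for metric, rows in split_by_metric(example).items():
    print(metric, rows)
```

Each table returned here corresponds to one of the per-metric output files described above.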
Examples:
# Basic usage
votuderep splitcoverm -i coverage.tsv -o output/cov
# With gzipped input
votuderep splitcoverm -i coverage.tsv.gz -o results/sample
Global options:
- -v, --verbose: Enable verbose logging
- --version: Show version and exit
- --help: Show help message
MIT License - See LICENSE file for details
Contributions are welcome! Please feel free to submit a Pull Request.
Andrea Telatin & QIB Core Bioinformatics
©️ Quadram Institute Bioscience 2025
