VirNucPro is a viral sequence classifier using DNABERT_S and ESM2-3B language models for identifying short viral sequences (300bp or 500bp). This standalone Docker container provides GPU-accelerated classification with multi-GPU parallel processing and automatic CPU fallback, and is designed for cloud pipeline deployment (GCE/dsub/Cromwell/WDL).
Features (v2.0):
- 6.2x speedup over v1.0 through async DataLoader and sequence packing
- Multi-GPU scaling with 93.7% efficiency on ESM-2
- FlashAttention varlen for packed attention without cross-sequence contamination
- FP16 precision with >0.99 cosine similarity to FP32
- Checkpoint-based resume for interrupted runs (SIGTERM handling)
- Backward-compatible BAM input interface
- v1.0 fallback mode for exact reproduction
User (Cloud Pipeline)
|
| docker run virnucpro:latest input.bam output.tsv
v
+---------------------------+
| Docker Container |
| |
| +-----------------+ |
| | CLI Entry Point | |
| | (virnucpro_cli.py)| |
| +-----------------+ |
| | |
| v |
| +-----------------+ |
| | Core Module | |
| | (virnucpro.py) | |
| +-----------------+ |
| | |
| | BAM input |
| v |
| +-----------------+ |
| | pysam | |
| | BAM → FASTA | |
| +-----------------+ |
| | |
| | FASTA (with dups)|
| v |
| +-----------------+ |
| | Deduplication | |
| | _ensure_unique | |
| +-----------------+ |
| | |
| | FASTA (unique) |
| v |
| +-----------------+ |
| | VirNucPro CLI | |
| | subprocess call | |
| +-----------------+ |
| | |
| v |
| +------------------+ |
| | python -m | |
| | virnucpro predict| |
| +------------------+ |
| | |
| | Multi-GPU |
| | Parallel |
| v |
| +------------------+ |
| | DNABERT_S + | |
| | ESM2-3B | |
| | Feature Extract | |
| +------------------+ |
| | |
| | Classification |
| v |
| +------------------+ |
| | Results TSV | |
| +------------------+ |
| | |
+---------------------------+
|
| output.tsv
v
User receives TSV
Input: BAM file (unaligned reads, may have duplicate IDs)
|
| pysam converts to FASTA
v
FASTA (nucleotide sequences, possibly duplicate IDs)
|
| _ensure_unique_fasta_ids() adds _N suffix to duplicates
v
FASTA (unique sequence IDs required by ESM model)
|
| Written to temp directory
v
Subprocess: python -m virnucpro predict input.fasta --model-type 300|500
|
| VirNucPro performs (with optional multi-GPU parallel):
| - Sequence chunking to expected length
| - Six-frame translation (CPU parallel)
| - DNABERT_S embedding (nucleotide, GPU parallel)
| - ESM2-3B embedding (amino acid, GPU parallel)
| - Feature concatenation (CPU parallel)
| - MLP classification
v
Intermediate: input_unique_merged/prediction_results.txt
|
| Copy to user-specified output path
v
Output: TSV file
Sequence_ID Prediction score1 score2
read1 virus 0.95 0.05
read1_1 non-virus 0.12 0.88
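The deduplication step in the flow above can be illustrated with a minimal sketch. It assumes the first occurrence of an ID is kept unchanged and later repeats receive an `_N` suffix, matching the `read1` / `read1_1` rows in the sample output. The function name `ensure_unique_ids` and the `(id, sequence)` tuple layout are hypothetical and are not taken from the VirNucPro source.

```python
from typing import Dict, Iterable, Iterator, Set, Tuple


def ensure_unique_ids(records: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """Yield (id, sequence) pairs, renaming repeated IDs with an _N suffix.

    Hypothetical helper: illustrates the idea only, not the real
    _ensure_unique_fasta_ids() implementation.
    """
    counts: Dict[str, int] = {}   # next suffix to try for each original ID
    emitted: Set[str] = set()     # every ID already written out
    for seq_id, seq in records:
        n = counts.get(seq_id, 0)
        candidate = seq_id if n == 0 else f"{seq_id}_{n}"
        # Skip suffixes that would collide with an ID already emitted,
        # e.g. an input that really contains a read named "read1_1".
        while candidate in emitted:
            n += 1
            candidate = f"{seq_id}_{n}"
        counts[seq_id] = n + 1
        emitted.add(candidate)
        yield candidate, seq


reads = [("read1", "ACGT"), ("read1", "TTTT"), ("read2", "GGGG")]
unique_ids = [seq_id for seq_id, _ in ensure_unique_ids(reads)]
```

The collision check matters because a suffixed name could otherwise duplicate an ID that was already present in the input.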
# Basic usage (500bp, auto-detect GPU)
docker run -v $(pwd):/data virnucpro:latest /opt/virnucpro_cli.py /data/input.bam /data/output.tsv
# Specify 300bp model
docker run -v $(pwd):/data virnucpro:latest /opt/virnucpro_cli.py /data/input.bam /data/output.tsv --expected-length 300
# Force CPU mode
docker run -v $(pwd):/data virnucpro:latest /opt/virnucpro_cli.py /data/input.bam /data/output.tsv --no-gpu
# Force GPU mode
docker run --gpus all -v $(pwd):/data virnucpro:latest /opt/virnucpro_cli.py /data/input.bam /data/output.tsv --use-gpu
For large datasets, leverage multi-GPU parallel processing for 150-380x speedup:
# Use all available GPUs in parallel mode
docker run --gpus all -v $(pwd):/data virnucpro:latest /opt/virnucpro_cli.py /data/input.bam /data/output.tsv --parallel
# Specify specific GPUs
docker run --gpus all -v $(pwd):/data virnucpro:latest /opt/virnucpro_cli.py /data/input.bam /data/output.tsv --gpus 0,1,2,3 --parallel
# Custom batch sizes for memory-constrained systems
docker run --gpus all -v $(pwd):/data virnucpro:latest /opt/virnucpro_cli.py /data/input.bam /data/output.tsv --dnabert-batch-size 1024 --esm-batch-size 512
# Specify CPU threads for translation and merge steps
docker run --gpus all -v $(pwd):/data virnucpro:latest /opt/virnucpro_cli.py /data/input.bam /data/output.tsv --threads 16
# Resume interrupted run (v2.0)
docker run --gpus all -v $(pwd):/data virnucpro:latest /opt/virnucpro_cli.py /data/input.bam /data/output.tsv --parallel --resume
# Use v1.0 architecture for exact match with older results
docker run --gpus all -v $(pwd):/data virnucpro:latest /opt/virnucpro_cli.py /data/input.bam /data/output.tsv --parallel --v1-fallback

| Option | Description |
|---|---|
| --expected-length | Expected sequence length: 300 or 500 (default: 500) |
| --use-gpu | Force GPU usage |
| --no-gpu | Force CPU usage (disable GPU) |
| --gpus | Comma-separated GPU IDs (e.g., "0,1,2") |
| --parallel | Enable multi-GPU parallel processing (v2.0 async architecture) |
| --batch-size | Batch size for prediction DataLoader |
| --dnabert-batch-size | Token batch size for DNABERT-S (default: 2048) |
| --esm-batch-size | Token batch size for ESM-2 (default: 2048) |
| --threads | CPU threads for translation and merge |
| --persistent-models | Keep models in GPU memory between stages |
| --resume | Resume from checkpoint if available (v2.0) |
| --v1-fallback | Use v1.0 multi-worker architecture for ESM-2 (v2.0) |
| --v1-attention | Use v1.0-compatible standard attention for exact match (v2.0) |
| --verbose | Enable debug logging |
| --virnucpro-path | Path to VirNucPro installation |

BAM File (unaligned reads):
- Unmapped BAM format (SAM flags indicate unmapped status)
- Paired-end reads supported (duplicate read names handled automatically)
- Empty BAM files produce header-only TSV output
TSV File (tab-separated values):
- Header: Sequence_ID\tPrediction\tscore1\tscore2
- Columns:
  - Sequence_ID: Original read ID (with _N suffix if deduplicated)
  - Prediction: Classification result (virus or non-virus)
  - score1: Confidence score for virus class
  - score2: Confidence score for non-virus class
Example output:
Sequence_ID Prediction score1 score2
read1 virus 0.95 0.05
read1_1 non-virus 0.12 0.88
read2 virus 0.87 0.13
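Downstream tools can read this output with the standard `csv` module. The sketch below assumes only the four-column, tab-separated layout described above; the function name `read_predictions` is hypothetical.

```python
import csv
import io
from typing import Iterator, TextIO, Tuple


def read_predictions(handle: TextIO) -> Iterator[Tuple[str, str, float, float]]:
    """Yield (sequence_id, prediction, score1, score2) from a VirNucPro-style TSV."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        yield (
            row["Sequence_ID"],
            row["Prediction"],
            float(row["score1"]),  # confidence for the virus class
            float(row["score2"]),  # confidence for the non-virus class
        )


# The example rows from above, with real tab separators.
example = (
    "Sequence_ID\tPrediction\tscore1\tscore2\n"
    "read1\tvirus\t0.95\t0.05\n"
    "read1_1\tnon-virus\t0.12\t0.88\n"
    "read2\tvirus\t0.87\t0.13\n"
)
rows = list(read_predictions(io.StringIO(example)))
viral_ids = [seq_id for seq_id, prediction, _, _ in rows if prediction == "virus"]
```

A header-only file (the empty-BAM case) simply yields no rows, so the same reader works without special handling.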
- RAM: Minimum 8GB (16GB+ recommended for large datasets)
- GPU: Optional; CUDA 12.6+ compatible GPU (V100/T4/A100/H100) for accelerated inference
- CUDA: 12.6+ (upgraded from 11.8 for PyTorch 2.8.0 support)
- PyTorch: 2.8.0+ (required for v2.0 async architecture)
- Storage: ~4GB for Docker image
- CPU Fallback: Automatic via CUDA_VISIBLE_DEVICES="-1" when GPU unavailable
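One way a wrapper could apply the CPU fallback is to set `CUDA_VISIBLE_DEVICES` in the environment passed to the child process. This is a sketch under that assumption; `child_env` is a hypothetical helper, and the actual wrapper may set the variable elsewhere.

```python
import os
from typing import Dict, Mapping, Optional


def child_env(use_gpu: bool, base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return an environment for the inference subprocess.

    Hypothetical helper: when use_gpu is False, CUDA devices are hidden from
    the child by setting CUDA_VISIBLE_DEVICES to "-1".
    """
    env = dict(os.environ if base is None else base)
    if not use_gpu:
        env["CUDA_VISIBLE_DEVICES"] = "-1"
    return env


cpu_env = child_env(use_gpu=False, base={"PATH": "/usr/bin"})
gpu_env = child_env(use_gpu=True, base={"PATH": "/usr/bin"})
```

Copying the environment rather than mutating `os.environ` keeps the setting scoped to the subprocess that receives it.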
- virnucpro_cli.py: Thin entry point handling argparse, logging setup, exception handling
- virnucpro.py: Reusable VirNucPro class with classify() method
- Benefit: Testing core logic without CLI overhead; reusable VirNucPro class in other scripts; clear responsibility boundary
VirNucPro uses python -m virnucpro predict for command-line invocation. Subprocess execution provides:
- Isolation preventing PyTorch memory leaks in wrapper process
- Consistency with viral-classify pattern for tool integration
- Negligible overhead (~50-100ms process spawn) for minute-scale PyTorch inference
- Clean separation between BAM handling (wrapper) and ML inference (VirNucPro)
BAM files store paired-end reads with the same query name. The _bam_to_fasta() function adds standard /1 and /2 suffixes based on BAM flags:
- First in pair: read1 → read1/1
- Second in pair: read1 → read1/2
- Unpaired reads: no suffix added
- Matches standard conventions (e.g., samtools fasta output)
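The suffix rule can be expressed directly on the SAM flag bits (0x1 paired, 0x40 first in pair, 0x80 second in pair, as defined by the SAM specification). The sketch below works on plain integers so it runs without pysam; `fasta_name` is a hypothetical name, not the body of `_bam_to_fasta()`.

```python
FLAG_PAIRED = 0x1   # template has multiple segments
FLAG_READ1 = 0x40   # first segment in the template
FLAG_READ2 = 0x80   # last segment in the template


def fasta_name(query_name: str, flag: int) -> str:
    """Return the FASTA ID for a read, adding /1 or /2 for paired reads."""
    if flag & FLAG_PAIRED:
        if flag & FLAG_READ1:
            return f"{query_name}/1"
        if flag & FLAG_READ2:
            return f"{query_name}/2"
    return query_name  # unpaired reads keep their original name


# Typical flags in an unmapped BAM: 77 and 141 for a pair, 4 for a single read.
names = [fasta_name("read1", 77), fasta_name("read1", 141), fasta_name("read2", 4)]
```

With pysam, the same decisions would typically use the `is_paired`, `is_read1`, and `is_read2` attributes of each aligned segment instead of raw bit masks.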
- Stage 1 (builder): Install build dependencies (gcc, git), clone VirNucPro, install Python packages
- Stage 2 (runtime): Copy only /opt/VirNucPro and Python packages, exclude build tools
- Result: Image size reduced from ~5GB to ~3.5GB; faster cloud VM startup; lower registry storage costs
- VIRNUCPRO_VERSION: Git commit SHA for version tracking
- VIRNUCPRO_PATH: Installation directory (/opt/VirNucPro)
- CUDA_VISIBLE_DEVICES: GPU device selection, "-1" for CPU mode
- Pattern from beast2-beagle-cuda allows version matrix builds
All FASTA sequence IDs must be unique within a file; the ESM model crashes if duplicate IDs are present. Paired-end reads are made unique by adding /1 and /2 suffixes based on BAM flags.
expected_length parameter (300 or 500) must match model file:
- 300bp sequences require 300_model.pth
- 500bp sequences require 500_model.pth
- Mismatch produces invalid predictions (VirNucPro silently accepts but results meaningless)
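Because a mismatch is accepted silently, a wrapper may want to validate the length before launching inference. The following is a minimal sketch of such a guard; `MODEL_FILES` and `model_file_for` are hypothetical names, and only the two file names come from the list above.

```python
MODEL_FILES = {300: "300_model.pth", 500: "500_model.pth"}


def model_file_for(expected_length: int) -> str:
    """Return the model file matching expected_length, or raise ValueError."""
    try:
        return MODEL_FILES[expected_length]
    except KeyError:
        supported = ", ".join(str(length) for length in sorted(MODEL_FILES))
        raise ValueError(
            f"expected_length must be one of {supported}; got {expected_length}"
        ) from None


selected = model_file_for(500)
```

Failing early with a clear message is cheaper than discovering meaningless predictions after a long inference run.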
VirNucPro creates subdirectories: {prefix}_nucleotide/, {prefix}_protein/, {prefix}_merged/. All temporary files must be cleaned up after classify() completes to prevent disk space exhaustion in long-running cloud jobs.
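A common way to guarantee this cleanup is to run each classification inside a `tempfile.TemporaryDirectory`, which removes the whole tree on exit even when an exception is raised. The sketch below assumes that approach; `run_in_scratch` and the `sample` prefix are hypothetical and do not describe how classify() is actually written.

```python
import tempfile
from pathlib import Path
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def run_in_scratch(work: Callable[[Path], T]) -> Tuple[T, Path]:
    """Run work(scratch_dir) and delete scratch_dir afterwards, even on error."""
    with tempfile.TemporaryDirectory(prefix="virnucpro_") as tmp:
        scratch = Path(tmp)
        result = work(scratch)
    # The directory and everything under it has been removed at this point.
    return result, scratch


def fake_inference(scratch: Path) -> list:
    # Stand-in for VirNucPro creating its intermediate subdirectories.
    for suffix in ("nucleotide", "protein", "merged"):
        (scratch / f"sample_{suffix}").mkdir()
    return sorted(p.name for p in scratch.iterdir())


created, scratch_path = run_in_scratch(fake_inference)
```

Any result that must survive (such as the prediction file) has to be copied out of the scratch directory before the `with` block ends.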
Empty BAM must produce TSV with header line only:
- Header format: Sequence_ID\tPrediction\tscore1\tscore2\n
- Zero data rows acceptable for downstream tools
- Prevents pipeline failures on empty input splits
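The header-only behavior falls out naturally if the header is always written before any rows. A minimal sketch, with `write_predictions` as a hypothetical helper name:

```python
import csv
import io
from typing import Iterable, Sequence, TextIO

HEADER = ["Sequence_ID", "Prediction", "score1", "score2"]


def write_predictions(handle: TextIO, rows: Iterable[Sequence[object]]) -> None:
    """Write the TSV header, then zero or more prediction rows."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)


# Empty input: the output still contains the header line.
empty_output = io.StringIO()
write_predictions(empty_output, [])
```

Setting `lineterminator="\n"` avoids the `\r\n` line endings that `csv.writer` emits by default, so the header matches the format above byte for byte.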
Check both the process return code and stderr for "Traceback": a Python tool can log a traceback from a caught exception and still exit with code 0. Pattern from viral-classify classify/kb.py.
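The double check can be sketched with `subprocess.run`. This is an illustration of the idea, not the code from viral-classify; `ToolError` and `run_checked` are hypothetical names. The demo command is a child process that catches an exception, prints its traceback, and still exits with code 0.

```python
import subprocess
import sys
from typing import List


class ToolError(RuntimeError):
    """Raised when the child process failed or printed a traceback."""


def run_checked(cmd: List[str]) -> subprocess.CompletedProcess:
    """Run cmd and fail on a non-zero exit code or a traceback in stderr."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        raise ToolError(f"exit code {proc.returncode}: {proc.stderr.strip()}")
    if "Traceback" in proc.stderr:
        raise ToolError(f"traceback despite exit code 0: {proc.stderr.strip()}")
    return proc


# Child that logs a caught exception and exits successfully.
silent_failure = [
    sys.executable,
    "-c",
    "import traceback\ntry:\n    1/0\nexcept ZeroDivisionError:\n    traceback.print_exc()\n",
]

try:
    run_checked(silent_failure)
    detected = False
except ToolError:
    detected = True
```

Checking the return code alone would treat `silent_failure` as a success; scanning stderr catches that case.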
- Cost: More complex Dockerfile, longer build time (two stages)
- Benefit: Image size reduced ~30% (5GB → 3.5GB), faster cloud VM startup
- Decision: Cloud deployment benefits outweigh build complexity
- Cost: Adds /1 and /2 to paired-end read names
- Benefit: Matches standard conventions (samtools fasta), clearer than _N suffixes
- Decision: Standard conventions over custom deduplication scheme
- Cost: Python adds ~200 lines vs ~50 lines bash, requires Python testing
- Benefit: Better string manipulation, code reuse from viral-classify, easier testing
- Decision: Maintainability prioritized, Python not significantly more complex
- Cost: 14MB in image, models frozen at build time
- Benefit: No internet dependency, immediate execution, reproducible
- Decision: 14MB negligible relative to 2.5GB PyTorch, offline preferred
- Cost: ~50MB added to image
- Benefit: Users don't need separate BAM→FASTA conversion step
- Decision: UX improvement worth minimal size increase
# Clone repository
git clone <repository-url>
cd virnucpro-cuda
# Build Docker image
docker build -t virnucpro:latest .
# Verify build
docker run virnucpro:latest python --version
docker run virnucpro:latest ls /opt/VirNucPro
Unit Tests (requires pytest):
# Install dependencies
pip install pytest pytest-mock pysam
# Run unit tests
pytest tests/test_virnucpro.py tests/test_cli.py
Integration Tests (requires Docker):
# Build test image
docker build -t virnucpro:test .
# Run integration tests
pytest tests/integration/
# Or use helper script
./tests/integration/run_integration.sh
# Example dsub command
dsub \
--provider google-v2 \
--project <project-id> \
--regions us-central1 \
--logging gs://<bucket>/logs \
--image quay.io/broadinstitute/virnucpro:latest \
--input INPUT_BAM=gs://<bucket>/input.bam \
--output OUTPUT_TSV=gs://<bucket>/output.tsv \
--command '/opt/virnucpro_cli.py ${INPUT_BAM} ${OUTPUT_TSV} --expected-length 500 --no-gpu'
Docker images tagged with git commit SHA and latest:
- Latest: quay.io/broadinstitute/virnucpro:latest (main branch)
- Specific version: quay.io/broadinstitute/virnucpro:<commit-sha>
- VIRNUCPRO_VERSION: Environment variable contains git commit SHA at build time
To pin to specific version:
docker pull quay.io/broadinstitute/virnucpro:<commit-sha>
docker run quay.io/broadinstitute/virnucpro:<commit-sha> ...
GPU-enabled VMs:
# Auto-detect GPU (default)
docker run --gpus all -v $(pwd):/data virnucpro:latest /opt/virnucpro_cli.py /data/input.bam /data/output.tsv
# Force GPU usage (fail if GPU unavailable)
docker run --gpus all -v $(pwd):/data virnucpro:latest /opt/virnucpro_cli.py /data/input.bam /data/output.tsv --use-gpu
CPU-only VMs:
# Force CPU mode (no GPU required)
docker run -v $(pwd):/data virnucpro:latest /opt/virnucpro_cli.py /data/input.bam /data/output.tsv --no-gpu
Impact: Integration tests may fail in GitHub Actions CI but pass locally.
Details: The CI workflow sets VIRNUCPRO_DOCKER_IMAGE=virnucpro-cuda:test (.github/workflows/test.yml:63), but the integration test fixture builds the image as virnucpro:test (tests/integration/test_integration.py:23). This mismatch causes the tests to build a new image rather than using the CI-provided one.
Workaround: Tests will still pass by building the image themselves. This only affects CI pipeline efficiency, not functionality.
Status: Accepted as low-priority technical debt (QR Option B). Does not affect production usage.
See LICENSE file for licensing information.
- VirNucPro (Broad refactored v2.0): https://github.com/broadinstitute/virnucpro-broad
- VirNucPro (original): https://github.com/Li-Jing-1997/VirNucPro
- DNABERT_S: Language model for nucleotide sequences
- ESM2-3B: Protein language model from fair-esm