VirNucPro Standalone Docker Container

VirNucPro is a viral sequence classifier that uses the DNABERT_S and ESM2-3B language models to identify short viral sequences (300 bp or 500 bp). This standalone Docker container provides GPU-accelerated classification with multi-GPU parallel processing and automatic CPU fallback, and is designed for cloud pipeline deployment (GCE/dsub/Cromwell/WDL).

Features (v2.0):

6.2x speedup over v1.0 through async DataLoader and sequence packing
Multi-GPU scaling with 93.7% efficiency on ESM-2
FlashAttention varlen for packed attention without cross-sequence contamination
FP16 precision with >0.99 cosine similarity to FP32
Checkpoint-based resume for interrupted runs (SIGTERM handling)
Backward-compatible BAM input interface
v1.0 fallback mode for exact reproduction

Architecture

User (Cloud Pipeline)
    |
    | docker run virnucpro:latest input.bam output.tsv
    v
+---------------------------+
| Docker Container          |
|                           |
|  +-------------------+    |
|  | CLI Entry Point   |    |
|  | (virnucpro_cli.py)|    |
|  +-------------------+    |
|        |                  |
|        v                  |
|  +-----------------+      |
|  | Core Module     |      |
|  | (virnucpro.py)  |      |
|  +-----------------+      |
|        |                  |
|        | BAM input        |
|        v                  |
|  +-----------------+      |
|  | pysam           |      |
|  | BAM → FASTA     |      |
|  +-----------------+      |
|        |                  |
|        | FASTA (with dups)|
|        v                  |
|  +-----------------+      |
|  | Deduplication   |      |
|  | _ensure_unique  |      |
|  +-----------------+      |
|        |                  |
|        | FASTA (unique)   |
|        v                  |
|  +-----------------+      |
|  | VirNucPro CLI   |      |
|  | subprocess call |      |
|  +-----------------+      |
|        |                  |
|        v                  |
|  +------------------+     |
|  | python -m        |     |
|  | virnucpro predict|     |
|  +------------------+     |
|        |                  |
|        | Multi-GPU        |
|        | Parallel         |
|        v                  |
|  +------------------+     |
|  | DNABERT_S +      |     |
|  | ESM2-3B          |     |
|  | Feature Extract  |     |
|  +------------------+     |
|        |                  |
|        | Classification   |
|        v                  |
|  +------------------+     |
|  | Results TSV      |     |
|  +------------------+     |
|        |                  |
+---------------------------+
         |
         | output.tsv
         v
    User receives TSV

Data Flow

Input: BAM file (unaligned reads, may have duplicate IDs)
   |
   | pysam converts to FASTA
   v
FASTA (nucleotide sequences, possibly duplicate IDs)
   |
   | _ensure_unique_fasta_ids() adds _N suffix to duplicates
   v
FASTA (unique sequence IDs required by ESM model)
   |
   | Written to temp directory
   v
Subprocess: python -m virnucpro predict input.fasta --model-type 300|500
   |
   | VirNucPro performs (with optional multi-GPU parallel):
   |  - Sequence chunking to expected length
   |  - Six-frame translation (CPU parallel)
   |  - DNABERT_S embedding (nucleotide, GPU parallel)
   |  - ESM2-3B embedding (amino acid, GPU parallel)
   |  - Feature concatenation (CPU parallel)
   |  - MLP classification
   v
Intermediate: input_unique_merged/prediction_results.txt
   |
   | Copy to user-specified output path
   v
Output: TSV file
Sequence_ID    Prediction    score1    score2
read1          virus         0.95      0.05
read1_1        non-virus     0.12      0.88
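
The `_N` suffixing step in the flow above can be sketched as follows. This is an illustration only, not the wrapper's actual `_ensure_unique_fasta_ids()`: the helper name `make_unique_ids` is hypothetical, and the numbering (the first repeat becomes `_1`) is an assumption inferred from the example output.

```python
from collections import defaultdict


def make_unique_ids(ids):
    """Return IDs in order, giving repeats an _N suffix (hypothetical helper)."""
    repeats = defaultdict(int)  # how many suffixes were tried per original ID
    used = set()                # every ID already emitted, to avoid collisions
    unique = []
    for seq_id in ids:
        candidate = seq_id
        # Keep bumping the suffix until the candidate has not been emitted yet.
        while candidate in used:
            repeats[seq_id] += 1
            candidate = f"{seq_id}_{repeats[seq_id]}"
        used.add(candidate)
        unique.append(candidate)
    return unique


print(make_unique_ids(["read1", "read1", "read2", "read1"]))
```

Tracking every emitted ID in `used` also covers the case where an input already contains a name such as `read1_1`.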

Usage

Basic Classification

# Basic usage (500bp, auto-detect GPU)
docker run -v $(pwd):/data virnucpro:latest /opt/virnucpro_cli.py /data/input.bam /data/output.tsv

# Specify 300bp model
docker run -v $(pwd):/data virnucpro:latest /opt/virnucpro_cli.py /data/input.bam /data/output.tsv --expected-length 300

# Force CPU mode
docker run -v $(pwd):/data virnucpro:latest /opt/virnucpro_cli.py /data/input.bam /data/output.tsv --no-gpu

# Force GPU mode
docker run --gpus all -v $(pwd):/data virnucpro:latest /opt/virnucpro_cli.py /data/input.bam /data/output.tsv --use-gpu

Multi-GPU Parallel Processing

For large datasets, leverage multi-GPU parallel processing for 150-380x speedup:

# Use all available GPUs in parallel mode
docker run --gpus all -v $(pwd):/data virnucpro:latest /opt/virnucpro_cli.py /data/input.bam /data/output.tsv --parallel

# Specify specific GPUs
docker run --gpus all -v $(pwd):/data virnucpro:latest /opt/virnucpro_cli.py /data/input.bam /data/output.tsv --gpus 0,1,2,3 --parallel

# Custom batch sizes for memory-constrained systems
docker run --gpus all -v $(pwd):/data virnucpro:latest /opt/virnucpro_cli.py /data/input.bam /data/output.tsv --dnabert-batch-size 1024 --esm-batch-size 512

# Specify CPU threads for translation and merge steps
docker run --gpus all -v $(pwd):/data virnucpro:latest /opt/virnucpro_cli.py /data/input.bam /data/output.tsv --threads 16

# Resume interrupted run (v2.0)
docker run --gpus all -v $(pwd):/data virnucpro:latest /opt/virnucpro_cli.py /data/input.bam /data/output.tsv --parallel --resume

# Use v1.0 architecture for exact match with older results
docker run --gpus all -v $(pwd):/data virnucpro:latest /opt/virnucpro_cli.py /data/input.bam /data/output.tsv --parallel --v1-fallback

CLI Options

Option	Description
`--expected-length`	Expected sequence length: 300 or 500 (default: 500)
`--use-gpu`	Force GPU usage
`--no-gpu`	Force CPU usage (disable GPU)
`--gpus`	Comma-separated GPU IDs (e.g., "0,1,2")
`--parallel`	Enable multi-GPU parallel processing (v2.0 async architecture)
`--batch-size`	Batch size for prediction DataLoader
`--dnabert-batch-size`	Token batch size for DNABERT-S (default: 2048)
`--esm-batch-size`	Token batch size for ESM-2 (default: 2048)
`--threads`	CPU threads for translation and merge
`--persistent-models`	Keep models in GPU memory between stages
`--resume`	Resume from checkpoint if available (v2.0)
`--v1-fallback`	Use v1.0 multi-worker architecture for ESM-2 (v2.0)
`--v1-attention`	Use v1.0-compatible standard attention for exact match (v2.0)
`--verbose`	Enable debug logging
`--virnucpro-path`	Path to VirNucPro installation
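
For readers wiring the container into their own tooling, the sketch below mirrors a few of the options in the table with `argparse`. It is not the contents of `virnucpro_cli.py`: the function name `build_parser` is hypothetical, only a subset of options is shown, and treating `--use-gpu` and `--no-gpu` as mutually exclusive is an assumption.

```python
import argparse


def build_parser():
    """Build a parser for a subset of the documented options (sketch only)."""
    parser = argparse.ArgumentParser(prog="virnucpro_cli.py")
    parser.add_argument("input_bam")
    parser.add_argument("output_tsv")
    parser.add_argument("--expected-length", type=int, choices=[300, 500], default=500)
    # Assumption: the two GPU flags contradict each other, so reject both together.
    gpu = parser.add_mutually_exclusive_group()
    gpu.add_argument("--use-gpu", action="store_true")
    gpu.add_argument("--no-gpu", action="store_true")
    parser.add_argument("--gpus", help='comma-separated GPU IDs, e.g. "0,1,2"')
    parser.add_argument("--parallel", action="store_true")
    parser.add_argument("--resume", action="store_true")
    return parser


args = build_parser().parse_args(
    ["in.bam", "out.tsv", "--expected-length", "300", "--no-gpu"]
)
print(args.expected_length, args.no_gpu, args.use_gpu)
```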

Input Format

BAM File (unaligned reads):

Unmapped BAM format (SAM flags indicate unmapped status)
Paired-end reads supported (duplicate read names handled automatically)
Empty BAM files produce header-only TSV output

Output Format

TSV File (tab-separated values):

Header: Sequence_ID\tPrediction\tscore1\tscore2
Columns:
- Sequence_ID: Original read ID (with _N suffix if deduplicated)
- Prediction: Classification result (virus or non-virus)
- score1: Confidence score for virus class
- score2: Confidence score for non-virus class

Example output:

Sequence_ID    Prediction    score1    score2
read1          virus         0.95      0.05
read1_1        non-virus     0.12      0.88
read2          virus         0.87      0.13
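
A downstream step could read this TSV with the standard `csv` module. The sketch below assumes the four column names shown above and real tab separators; the helper name `read_predictions` is hypothetical.

```python
import csv
import io


def read_predictions(handle):
    """Yield one dict per data row, with the two scores converted to float."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        yield {
            "Sequence_ID": row["Sequence_ID"],
            "Prediction": row["Prediction"],
            "score1": float(row["score1"]),
            "score2": float(row["score2"]),
        }


# In-memory stand-in for an output file produced by the container.
example = io.StringIO(
    "Sequence_ID\tPrediction\tscore1\tscore2\n"
    "read1\tvirus\t0.95\t0.05\n"
    "read1_1\tnon-virus\t0.12\t0.88\n"
)
rows = list(read_predictions(example))
viral = [r["Sequence_ID"] for r in rows if r["Prediction"] == "virus"]
print(viral)
```

A header-only file (the empty-input case) simply yields no rows.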

System Requirements

RAM: Minimum 8GB (16GB+ recommended for large datasets)
GPU: Optional; CUDA 12.6+ compatible GPU (V100/T4/A100/H100) for accelerated inference
CUDA: 12.6+ (upgraded from 11.8 for PyTorch 2.8.0 support)
PyTorch: 2.8.0+ (required for v2.0 async architecture)
Storage: ~4GB for Docker image
CPU Fallback: Automatic via CUDA_VISIBLE_DEVICES="-1" when GPU unavailable

Design Decisions

Separate CLI and Core Module

virnucpro_cli.py: Thin entry point handling argparse, logging setup, exception handling
virnucpro.py: Reusable VirNucPro class with classify() method
Benefit: Testing core logic without CLI overhead; reusable VirNucPro class in other scripts; clear responsibility boundary

VirNucPro as Subprocess

VirNucPro uses python -m virnucpro predict for command-line invocation. Subprocess execution provides:

Isolation preventing PyTorch memory leaks in wrapper process
Consistency with viral-classify pattern for tool integration
Negligible overhead (~50-100ms process spawn) for minute-scale PyTorch inference
Clean separation between BAM handling (wrapper) and ML inference (VirNucPro)
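
The invocation itself might be assembled roughly as below. The `predict` subcommand and the `--model-type` flag come from the Data Flow section; the helper name `build_predict_command` is hypothetical, and the real wrapper may pass additional arguments.

```python
import sys


def build_predict_command(fasta_path, expected_length):
    """Assemble the argv list for the VirNucPro subprocess (sketch)."""
    return [
        sys.executable, "-m", "virnucpro", "predict",
        str(fasta_path),
        "--model-type", str(expected_length),
    ]


cmd = build_predict_command("/tmp/work/input_unique.fasta", 500)
print(cmd[1:])
```

Passing the command as a list to `subprocess.run` avoids shell quoting problems with unusual file names.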

Paired-End Read Handling

BAM files store paired-end reads with the same query name. The _bam_to_fasta() function adds standard /1 and /2 suffixes based on BAM flags:

First in pair: read1 → read1/1
Second in pair: read1 → read1/2
Unpaired reads: no suffix added
Matches standard conventions (e.g., samtools fasta output)
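
The suffix rule can be expressed directly on the SAM flag bits. This is a sketch of the rule described above, not the body of `_bam_to_fasta()`: the helper name `fasta_id` is hypothetical, and it works on plain integers so it runs without pysam.

```python
# Flag bits as defined in the SAM specification.
FLAG_PAIRED = 0x1
FLAG_READ1 = 0x40
FLAG_READ2 = 0x80


def fasta_id(query_name, flag):
    """Return the FASTA ID for a read given its name and SAM flag (sketch)."""
    if flag & FLAG_PAIRED:
        if flag & FLAG_READ1:
            return f"{query_name}/1"
        if flag & FLAG_READ2:
            return f"{query_name}/2"
    # Unpaired reads keep their original name.
    return query_name


# 77 = paired + unmapped + mate unmapped + first in pair; 141 = same, second in pair.
print(fasta_id("read1", 77), fasta_id("read1", 141), fasta_id("read2", 4))
```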

Multi-Stage Docker Build

Stage 1 (builder): Install build dependencies (gcc, git), clone VirNucPro, install Python packages
Stage 2 (runtime): Copy only /opt/VirNucPro and Python packages, exclude build tools
Result: Image size reduced from ~5GB to ~3.5GB; faster cloud VM startup; lower registry storage costs

Environment Variable Configuration

VIRNUCPRO_VERSION: Git commit SHA for version tracking
VIRNUCPRO_PATH: Installation directory (/opt/VirNucPro)
CUDA_VISIBLE_DEVICES: GPU device selection, "-1" for CPU mode
Pattern from beast2-beagle-cuda allows version matrix builds
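
One way a wrapper could apply the CPU-mode setting is to build the child environment before launching the subprocess. The sketch below assumes that approach; the helper name `child_env` and its parameters are hypothetical.

```python
import os


def child_env(use_gpu, base_env=None):
    """Return the environment for the VirNucPro subprocess (sketch)."""
    env = dict(os.environ if base_env is None else base_env)
    if not use_gpu:
        # "-1" hides every CUDA device from the child, forcing CPU execution.
        env["CUDA_VISIBLE_DEVICES"] = "-1"
    return env


print(child_env(False, {"PATH": "/usr/bin"}))
```

Copying the environment rather than mutating `os.environ` keeps the setting local to the child process.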

Invariants

FASTA ID Uniqueness

All FASTA sequence IDs must be unique within a file; the ESM model crashes if duplicate IDs are present. Paired-end reads are made unique by adding /1 and /2 suffixes based on BAM flags.

Sequence Length Matching

The expected_length parameter (300 or 500) must match the model file:

300bp sequences require 300_model.pth
500bp sequences require 500_model.pth
A mismatch produces invalid predictions: VirNucPro accepts the input silently, but the results are meaningless.
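
Because a mismatch fails silently, a wrapper can guard this invariant up front. The sketch below is an assumption about how such a check might look: the helper name `model_path_for` is hypothetical, and placing the model files directly under `/opt/VirNucPro` is a guess, not something the source confirms.

```python
from pathlib import Path

# Model file names taken from the list above.
MODEL_FILES = {300: "300_model.pth", 500: "500_model.pth"}


def model_path_for(expected_length, virnucpro_dir="/opt/VirNucPro"):
    """Return the model file for a given expected_length, or raise (sketch)."""
    if expected_length not in MODEL_FILES:
        raise ValueError(
            f"expected_length must be 300 or 500, got {expected_length!r}"
        )
    # Assumption: model files sit at the top level of the install directory.
    return Path(virnucpro_dir) / MODEL_FILES[expected_length]


print(model_path_for(300).name)
```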

Temp Directory Cleanup

VirNucPro creates subdirectories: {prefix}_nucleotide/, {prefix}_protein/, {prefix}_merged/. All temporary files must be cleaned up after classify() completes to prevent disk space exhaustion in long-running cloud jobs.
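
A context manager is one simple way to guarantee this cleanup even when classification raises. The sketch below uses `tempfile.TemporaryDirectory`; the helper names `run_in_scratch` and `fake_work` are hypothetical stand-ins, and the real `classify()` may manage its directories differently.

```python
import tempfile
from pathlib import Path


def run_in_scratch(work):
    """Run work(scratch_dir), then delete the scratch directory and its contents."""
    with tempfile.TemporaryDirectory(prefix="virnucpro_") as scratch:
        scratch_dir = Path(scratch)
        result = work(scratch_dir)
    # Leaving the with-block removes scratch_dir, including any
    # {prefix}_nucleotide/, {prefix}_protein/ and {prefix}_merged/ subdirectories.
    return result, scratch_dir


def fake_work(scratch_dir):
    # Stand-in for the pipeline: create one intermediate directory.
    merged = scratch_dir / "input_merged"
    merged.mkdir()
    return merged.exists()


created, scratch_dir = run_in_scratch(fake_work)
print(created, scratch_dir.exists())
```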

Empty Input Handling

An empty BAM must produce a TSV containing only the header line:

Header format: Sequence_ID\tPrediction\tscore1\tscore2\n
Zero data rows acceptable for downstream tools
Prevents pipeline failures on empty input splits
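
A minimal sketch of the header-only case follows. The header string is taken from the format above; the helper name `write_header_only` is hypothetical.

```python
import os
import tempfile

HEADER = "Sequence_ID\tPrediction\tscore1\tscore2\n"


def write_header_only(output_path):
    """Write a TSV that contains the header line and no data rows."""
    # newline="" keeps the line ending as "\n" on every platform.
    with open(output_path, "w", newline="") as handle:
        handle.write(HEADER)


with tempfile.TemporaryDirectory() as tmp:
    out = os.path.join(tmp, "output.tsv")
    write_header_only(out)
    with open(out, newline="") as handle:
        content = handle.read()
print(repr(content))
```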

Subprocess Error Detection

The wrapper must check both the process return code and stderr for "Traceback", because a Python exception that is caught and logged inside the child process can still leave the exit code at zero. The pattern comes from viral-classify classify/kb.py.
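
The combined return-code and stderr check can be sketched as below. This is an illustration under stated assumptions, not the code from viral-classify: the helper name `run_checked` is hypothetical, and small `python -c` children stand in for the real `python -m virnucpro predict` call.

```python
import subprocess
import sys


def run_checked(cmd):
    """Run cmd; raise if it exits non-zero or writes a traceback to stderr."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0 or "Traceback" in proc.stderr:
        raise RuntimeError(
            f"command failed (exit {proc.returncode}): {proc.stderr.strip()[-500:]}"
        )
    return proc.stdout


# A well-behaved child.
ok = run_checked([sys.executable, "-c", "print('done')"])

# A child that catches an exception, prints the traceback, and still exits 0.
sneaky = (
    "import traceback\n"
    "try:\n"
    "    1 / 0\n"
    "except ZeroDivisionError:\n"
    "    traceback.print_exc()\n"
)
try:
    run_checked([sys.executable, "-c", sneaky])
    caught = False
except RuntimeError:
    caught = True

print(ok.strip(), caught)
```

Checking only the return code would have missed the second child, which is the case this invariant exists to cover.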

Tradeoffs

Multi-Stage Build (Size vs Complexity)

Cost: More complex Dockerfile, longer build time (two stages)
Benefit: Image size reduced ~30% (5GB → 3.5GB), faster cloud VM startup
Decision: Cloud deployment benefits outweigh build complexity

Paired-End Suffix Convention (/1 /2 vs _N deduplication)

Cost: Adds /1 and /2 to paired-end read names
Benefit: Matches standard conventions (samtools fasta), clearer than _N suffixes
Decision: Standard conventions over custom deduplication scheme

Python Wrapper vs Bash (Maintainability vs Simplicity)

Cost: Python adds ~200 lines vs ~50 lines bash, requires Python testing
Benefit: Better string manipulation, code reuse from viral-classify, easier testing
Decision: Maintainability prioritized, Python not significantly more complex

Bundled Models vs Download-on-Demand (Image Size vs Offline)

Cost: 14MB in image, models frozen at build time
Benefit: No internet dependency, immediate execution, reproducible
Decision: 14MB negligible relative to 2.5GB PyTorch, offline preferred

Include Samtools (Image Size vs UX)

Cost: ~50MB added to image
Benefit: Users don't need separate BAM→FASTA conversion step
Decision: UX improvement worth minimal size increase

Development

Building Locally

# Clone repository
git clone <repository-url>
cd virnucpro-cuda

# Build Docker image
docker build -t virnucpro:latest .

# Verify build
docker run virnucpro:latest python --version
docker run virnucpro:latest ls /opt/VirNucPro

Running Tests

Unit Tests (requires pytest):

# Install dependencies
pip install pytest pytest-mock pysam

# Run unit tests
pytest tests/test_virnucpro.py tests/test_cli.py

Integration Tests (requires Docker):

# Build test image
docker build -t virnucpro:test .

# Run integration tests
pytest tests/integration/

# Or use helper script
./tests/integration/run_integration.sh

Deployment

Cloud Pipelines (GCE/dsub)

# Example dsub command
dsub \
  --provider google-v2 \
  --project <project-id> \
  --regions us-central1 \
  --logging gs://<bucket>/logs \
  --image quay.io/broadinstitute/virnucpro:latest \
  --input INPUT_BAM=gs://<bucket>/input.bam \
  --output OUTPUT_TSV=gs://<bucket>/output.tsv \
  --command '/opt/virnucpro_cli.py ${INPUT_BAM} ${OUTPUT_TSV} --expected-length 500 --no-gpu'

Version Management

Docker images tagged with git commit SHA and latest:

Latest: quay.io/broadinstitute/virnucpro:latest (main branch)
Specific version: quay.io/broadinstitute/virnucpro:<commit-sha>
VIRNUCPRO_VERSION: Environment variable contains git commit SHA at build time

To pin to specific version:

docker pull quay.io/broadinstitute/virnucpro:<commit-sha>
docker run quay.io/broadinstitute/virnucpro:<commit-sha> ...

GPU Configuration

GPU-enabled VMs:

# Auto-detect GPU (default)
docker run --gpus all -v $(pwd):/data virnucpro:latest /opt/virnucpro_cli.py /data/input.bam /data/output.tsv

# Force GPU usage (fail if GPU unavailable)
docker run --gpus all -v $(pwd):/data virnucpro:latest /opt/virnucpro_cli.py /data/input.bam /data/output.tsv --use-gpu

CPU-only VMs:

# Force CPU mode (no GPU required)
docker run -v $(pwd):/data virnucpro:latest /opt/virnucpro_cli.py /data/input.bam /data/output.tsv --no-gpu

Known Issues

CI/CD Docker Image Tag Mismatch

Impact: Integration tests may fail in GitHub Actions CI but pass locally.

Details: The CI workflow sets VIRNUCPRO_DOCKER_IMAGE=virnucpro-cuda:test (.github/workflows/test.yml:63), but the integration test fixture builds the image as virnucpro:test (tests/integration/test_integration.py:23). This mismatch causes the tests to build a new image rather than using the CI-provided one.

Workaround: Tests will still pass by building the image themselves. This only affects CI pipeline efficiency, not functionality.

Status: Accepted as low-priority technical debt (QR Option B). Does not affect production usage.

License

See LICENSE file for licensing information.

References

VirNucPro (Broad refactored v2.0): https://github.com/broadinstitute/virnucpro-broad
VirNucPro (original): https://github.com/Li-Jing-1997/VirNucPro
DNABERT_S: Language model for nucleotide sequences
ESM2-3B: Protein language model from fair-esm
