| 1 | +# CLAUDE.md |
| 2 | + |
| 3 | +This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. |
| 4 | + |
| 5 | +## Project Overview |
| 6 | + |
| 7 | +OneRoof is a Nextflow-based bioinformatics pipeline for basecalling, variant calling, and consensus calling of amplicon sequencing data. It supports both Nanopore (pod5/BAM/FASTQ) and Illumina (paired-end FASTQ) data, with a particular focus on SARS-CoV-2 and H5N1 influenza genomic surveillance. |
| 8 | + |
| 9 | +## Key Commands |
| 10 | + |
| 11 | +### Development Environment Setup |
| 12 | +```bash |
| 13 | +# For environments with conda dependencies (full pipeline) |
| 14 | +pixi install --frozen |
| 15 | +pixi shell --frozen |
| 16 | + |
| 17 | +# For PyPI-only environments (Python development) |
| 18 | +uv venv |
| 19 | +source .venv/bin/activate # or .venv\Scripts\activate on Windows |
| 20 | +uv sync |
| 21 | +``` |
| 22 | + |
| 23 | +### Running the Pipeline |
| 24 | +```bash |
| 25 | +# Nanopore data from raw POD5s |
| 26 | +nextflow run . \ |
| 27 | + --pod5_dir my_pod5_dir \ |
| 28 | + --primer_bed my_primers.bed \ |
| 29 | + --refseq my_ref.fasta \ |
| 30 | + --ref_gbk my_ref.gbk \ |
| 31 | + --kit "SQK-NBD114-24" |
| 32 | + |
| 33 | +# Illumina data |
| 34 | +nextflow run . \ |
| 35 | + --illumina_fastq_dir my_illumina_reads/ \ |
| 36 | + --primer_bed my_primers.bed \ |
| 37 | + --refseq my_ref.fasta \ |
| 38 | + --ref_gbk my_ref.gbk |
| 39 | + |
| 40 | +# Run without containers (requires pixi environment) |
| 41 | +nextflow run . -profile containerless [options] |
| 42 | +``` |
| 43 | + |
| 44 | +### Code Quality & Testing |
| 45 | +```bash |
| 46 | +# Python linting and formatting |
| 47 | +ruff check . --exit-zero --fix --unsafe-fixes |
| 48 | +ruff format . |
| 49 | + |
| 50 | +# Run Python tests (using uv for speed) |
| 51 | +uv run pytest bin/test_*.py |
| 52 | +# Or run tests with tox for multiple environments |
| 53 | +tox |
| 54 | + |
| 55 | +# Build documentation |
| 56 | +just docs |
| 57 | + |
| 58 | +# IMPORTANT: Modifying README.md |
| 59 | +# The README.md in the project root is generated from docs/index.qmd |
| 60 | +# NEVER edit README.md directly - it will be overwritten |
| 61 | +# Always edit docs/index.qmd and re-render: |
| 62 | +just make-readme # or: just docs |
| 63 | + |
| 64 | +# Docker operations |
| 65 | +just docker-build |
| 66 | +just docker-push |
| 67 | +``` |
| 68 | + |
| 69 | +## Architecture |
| 70 | + |
| 71 | +### Directory Structure |
| 72 | +- `main.nf` - Main workflow entry point that orchestrates platform-specific workflows |
| 73 | +- `workflows/` - Platform-specific workflows (nanopore.nf, illumina.nf) |
| 74 | +- `subworkflows/` - Reusable workflow components (alignment, variant_calling, primer_handling, etc.) |
| 75 | +- `modules/` - Individual process definitions for tools (dorado, minimap2, ivar, etc.) |
| 76 | +- `bin/` - Python utility scripts with PEP 723 inline dependencies (fully portable with uv) |
| 77 | +- `conf/` - Configuration files for different platforms and tools |
| 78 | + |
| 79 | +### Key Workflow Components |
| 80 | + |
| 81 | +1. **Data Ingestion** - Handles multiple input formats (pod5, BAM, FASTQ) with optional remote file watching |
| 82 | +2. **Primer Handling** - Validates primers, trims reads, and ensures complete amplicons |
| 83 | +3. **Alignment & Variant Calling** - Platform-specific alignment and variant calling using minimap2 and ivar/bcftools |
| 84 | +4. **Quality Control** - FastQC, MultiQC, and custom coverage plotting |
| 85 | +5. **Consensus Generation** - Creates consensus sequences with configurable frequency thresholds |
| 86 | +6. **Optional Features** - Metagenomics (Sylph), phylogenetics (Nextclade), haplotyping (Devider) |
| 87 | + |
| 88 | +### Technology Stack |
| 89 | +- **Workflow Engine**: Nextflow DSL2 |
| 90 | +- **Container Support**: Docker, Singularity/Apptainer |
| 91 | +- **Environment Management**: Pixi (combines conda and PyPI dependencies), uv (fast Python package management) |
| 92 | +- **Languages**: Nextflow (Groovy), Python 3.10+ |
| 93 | +- **Key Tools**: Dorado (basecalling), minimap2 (alignment), ivar/bcftools (variants), FastQC/MultiQC (QC) |
| 94 | + |
| 95 | +### Configuration Philosophy |
| 96 | +- Parameters are primarily set via command line arguments |
| 97 | +- Platform-specific configs (nanopore.config, illumina.config) are auto-loaded based on input data type |
| 98 | +- Container profiles (docker, singularity, apptainer, containerless) control execution environment |
| 99 | +- Advanced users can modify nextflow.config for fine-tuning |
| 100 | + |
| 101 | +### Important Parameters |
| 102 | +- `--pod5_batch_size`: Controls GPU memory usage during basecalling |
| 103 | +- `--min_variant_frequency`: Platform-specific defaults (0.05 for Illumina, 0.10 for Nanopore) |
| 104 | +- `--downsample_to`: Manages computational resources by limiting coverage depth |
| 105 | +- `--model`: Nanopore basecalling model (defaults to sup@latest) |
| 106 | + |
| 107 | +## Dependency Management |
| 108 | + |
| 109 | +### Python Package Management |
| 110 | +- **Always use `uv` instead of `pip`** for any Python package installation; it resolves and installs packages significantly faster and more reproducibly |
| 111 | +- **Use `uv` for PyPI-only environments**: When working with Python scripts that only need PyPI dependencies |
| 112 | +- **Use `pixi` for mixed environments**: When conda dependencies are required (e.g., for the full pipeline) |
| 113 | +- **Script execution**: Always use `uv run` instead of `python3` to execute Python scripts |
| 114 | + ```bash |
| 115 | + # Good - uses inline dependencies from PEP 723 headers |
| 116 | + uv run bin/some_script.py |
| 117 | + |
| 118 | + # Avoid - doesn't guarantee dependencies |
| 119 | + python3 bin/some_script.py |
| 120 | + ``` |
| 121 | +- **Portable scripts**: All scripts in `bin/` include PEP 723 inline dependencies, making them fully portable with uv |
| 122 | +- **Benefits**: Declaring dependencies inline gives each script a consistent, reproducible environment and avoids most Python dependency conflicts |
| 123 | + |
| 124 | +### Testing Infrastructure |
| 125 | +- **Test coverage**: Python scripts in `bin/` are extensively tested with pytest |
| 126 | +- **Test execution**: Tests run quickly with `uv` because they need only PyPI dependencies |
| 127 | + ```bash |
| 128 | + # Run all tests |
| 129 | + uv run pytest bin/test_*.py |
| 130 | + |
| 131 | + # Run specific test |
| 132 | + uv run pytest bin/test_specific_module.py |
| 133 | + ``` |
| 134 | +- **CI/CD**: The continuous integration pipeline uses `uv` instead of `pip` for improved speed and reliability |
| 135 | +- **Test organization**: Test files follow the pattern `test_*.py` and are colocated with the scripts they test |
| 136 | + |
| 137 | +## Development Notes |
| 138 | + |
| 139 | +1. **Testing**: Python scripts have comprehensive test coverage; Nextflow workflow tests are planned but not yet implemented |
| 140 | +2. **GPU Requirements**: Nanopore basecalling requires CUDA-capable GPUs |
| 141 | +3. **Memory Management**: Use `--low_memory` flag for resource-constrained environments |
| 142 | +4. **Slack Integration**: Optional alerts can be configured for pipeline completion |
| 143 | +5. **Dependency Management**: Always use `uv` for Python operations to ensure fast, reliable dependency resolution |
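
The PEP 723 inline dependencies and the colocated `test_*.py` pattern described under Dependency Management can be sketched roughly as follows. This is a minimal illustration, not a script from the repository: the file name `bin/example_script.py`, the `gc_fraction` function, and the test are all hypothetical. The sketch uses only the standard library, so its dependency list is empty; a real script would list its PyPI packages there.

```python
# Hypothetical bin/example_script.py; the name and function are illustrative only.
#
# The block below is PEP 723 inline metadata. `uv run` reads it and prepares a
# matching environment before executing the script.

# /// script
# requires-python = ">=3.10"
# dependencies = []
# ///
"""Report the GC fraction of one or more nucleotide sequences."""

import sys


def gc_fraction(sequence: str) -> float:
    """Return the fraction of G and C bases in `sequence` (case-insensitive)."""
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


def test_gc_fraction() -> None:
    # In the layout described above, this test would live in a colocated
    # bin/test_example_script.py and be collected by pytest.
    assert gc_fraction("GGCC") == 1.0
    assert gc_fraction("atgc") == 0.5
    assert gc_fraction("") == 0.0


if __name__ == "__main__":
    # Print one tab-separated line per sequence given on the command line.
    for seq in sys.argv[1:]:
        print(f"{seq}\t{gc_fraction(seq):.3f}")
```

Under these assumptions, the script would be run with `uv run bin/example_script.py ACGT`, and its test with `uv run pytest bin/test_example_script.py` once the test function is moved into that file.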