VEuPathDB · jbrestel · Jan 24, 2026 · Jan 24, 2026 · Jan 24, 2026 · Jan 25, 2026
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -1,188 +1,97 @@
 # CLAUDE.md
 
-This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
-
-## Overview
-
-This is a Nextflow-based DNA sequencing analysis pipeline for processing genomic variation data. The pipeline processes raw sequencing reads (FASTQ) through alignment, variant calling, and CNV analysis, then merges results across multiple strains for downstream database loading and analysis.
+Nextflow DSL2 pipeline for DNA sequencing analysis: FASTQ → alignment → variant calling → CNV → multi-strain merge → GUS database loading.
 
 **Status**: Under construction, not used in production.
 
 ## Running Workflows
 
-### Run the default workflow (processSingleExperiment)
 ```bash
+# Default (processSingleExperiment)
 nextflow run main.nf -profile processSingleExperiment
-```
 
-### Run specific workflows
-```bash
-# Process single experiment (alignment, variant calling, CNV)
+# Named entry points
 nextflow run main.nf -entry processSingleExperiment -profile processSingleExperiment
-
-# Merge results across multiple experiments
-nextflow run main.nf -entry mergeExperiments -profile mergeExperiments
-
-# Load single experiment results to database
-nextflow run main.nf -entry loadSingleExperiment -profile loadSingleExperiment
-
-# Run test suite
-nextflow run main.nf -entry runTests -profile tests
+nextflow run main.nf -entry mergeExperiments        -profile mergeExperiments
+nextflow run main.nf -entry loadSingleExperiment    -profile loadSingleExperiment
+nextflow run main.nf -entry runTests                -profile tests
 ```
 
-### Docker execution
-All workflows are designed to run in Docker containers. The profile configurations enable Docker by default (see `nextflow.config`).
+Docker is enabled by default in all profiles (see `nextflow.config`).
 
 ## Architecture
 
-### Workflow Organization
-
-The codebase follows a three-tier structure:
-
-1. **main.nf** - Entry point defining four named workflows: `processSingleExperiment`, `mergeExperiments`, `loadSingleExperiment`, and `runTests`
-2. **workflows/** - High-level workflow orchestration that composes modules
-3. **modules/** - Reusable process definitions grouped by function
-
-### Three Primary Workflows
-
-#### 1. processSingleExperiment (ps)
-**Purpose**: Per-strain analysis from raw reads to variant calls and CNV data
-**Input**: FASTQ files for individual strains (via nf-core samplesheet format)
-**Output**: Consensus FASTA, VCF files, indel tables, ploidy estimates, gene CNVs, coverage bigWigs
-
-**Pipeline stages**:
-- Preprocessing: Quality control (FastQC), trimming (Trimmomatic)
-- Alignment: BWA-MEM alignment, Picard deduplication, GATK realignment
-- Variant calling: FreeBayes → SNP/indel separation → consensus genome generation
-- CNV analysis: Coverage calculation, gene copy number estimation, ploidy determination
-- Windowed analysis: SNP density, heterozygous SNP density, normalized coverage
-
-**Key modules**: `preprocessing.nf`, `alignment.nf`, `snp.nf`, `cnv.nf`
-
-#### 2. mergeExperiments (me)
-**Purpose**: Combine multi-strain outputs and prepare for database loading
-**Input**: Consensus FASTAs and VCF files from processSingleExperiment
-**Output**: Merged VCF, annotated variation files, database load files, SnpEff annotations
+Three-tier structure: `main.nf` → `workflows/` → `modules/`
 
-**Pipeline stages**:
-- Merge VCFs across all strains
-- Process sequence variations using `bin/processSequenceVariations.jl` (Julia implementation replacing legacy Perl)
-- Annotate variants with transcript/gene features
-- Generate database load files (variation, product, allele tables)
-- Run SnpEff for functional annotation
+### Workflows
 
-**Key modules**: `mergeExperiments.nf`
+| Workflow | Purpose | Key modules |
+|---|---|---|
+| `processSingleExperiment` | Per-strain: FASTQ → consensus FASTA + VCF + coverage | `preprocessing.nf`, `alignment.nf`, `snp.nf`, `cnv.nf` |
+| `mergeExperiments` | Multi-strain: merge VCFs, annotate variants, generate DB load files | `mergeExperiments.nf` |
+| `loadSingleExperiment` | Load indel/ploidy/CNV data into GUS database | `loadSingleExperiment.nf` |
+| `runTests` | Perl `Test2::V0` test suite | `runTests.nf` |
 
-#### 3. loadSingleExperiment (ls)
-**Purpose**: Load per-strain indel and CNV data into GUS database
-**Input**: Indel TSV files, ploidy files, gene CNV files
-**Key modules**: `loadSingleExperiment.nf`
+### processSingleExperiment stages
+1. QC: FastQC, Trimmomatic
+2. Alignment: BWA-MEM → Picard dedup → GATK indel realignment
+3. Variant calling: FreeBayes → indel TSV → consensus + masked genome
+4. CNV: bedtools coverage → htseq-count → TPM → ploidy + gene CNV
+5. Windowed: SNP density, heterozygous SNP density, normalized coverage BigWigs
 
-### Module Structure
+### mergeExperiments stages
+1. Merge VCFs across strains (bcftools)
+2. `bin/processSequenceVariations.jl` — annotates variants via SQLite transcript/indel DBs; outputs cache + variation/allele/product DAT files
+3. Add GUS feature IDs, generate DB load files
+4. SnpEff functional annotation
 
-Modules are organized by analysis stage:
-- **preprocessing.nf**: QC and trimming
-- **alignment.nf**: Read alignment and BAM processing
-- **snp.nf**: Variant calling and consensus generation
-- **cnv.nf**: Copy number variation and coverage analysis
-- **mergeExperiments.nf**: Multi-strain merging and annotation
-- **loadSingleExperiment.nf**: Database loading
-- **runTests.nf**: Test execution
-
-### Key Processing Scripts
-
-The `bin/` directory contains Perl and Julia scripts used by processes:
-
-- **processSequenceVariations.jl**: Core variation annotation script (Julia rewrite, replaces processSequenceVariationsNew.pl)
-  - Merges SNP file with cache file
-  - Annotates coding variants with codon/product information via SQLite
-  - Uses transcript and indel databases
-  - Outputs: cache, snpFeature.dat, allele.dat, product.dat
-
-- **Variant processing**: maskGenome.pl, makeSnpFile.pl, fixSeqId.pl
-- **CNV calculation**: calculatePloidy.pl, calculateGeneCNVs.pl
-- **Database utilities**: addFeatureIdsToVariation.pl, addExtDbRlsIdToVariation.pl
-
-### Data Flow
+## Key Files
 
 ```
-FASTQ files (via samplesheet)
-    ↓ (processSingleExperiment)
-Per-strain: consensus FASTA + VCF + coverage
-    ↓ (mergeExperiments)
-Merged VCF + annotated variations + database files
-    ↓ (loadSingleExperiment or database loading)
-Populated GUS database
+main.nf                          # Entry point, samplesheet parsing
+nextflow.config                  # All profiles and parameters
+workflows/
+  processSingleExperiment.nf
+  mergeExperiments.nf
+  loadSingleExperiment.nf
+modules/
+  preprocessing.nf alignment.nf snp.nf cnv.nf
+  mergeExperiments.nf loadSingleExperiment.nf runTests.nf
+bin/
+  processSequenceVariations.jl   # Core variation annotation (Julia)
+  makeSnpFile.pl maskGenome.pl fixSeqId.pl
+  calculatePloidy.pl calculateGeneCNVs.pl
+  addFeatureIdsToVariation.pl addExtDbRlsIdToVariation.pl
+testing/t/                       # Perl test files
+testing/lib/                     # Test utilities
 ```
 
-### Configuration
-
-All parameters are defined in `nextflow.config` under profile-specific sections:
-- Input/output directories
-- Tool parameters (coverage thresholds, ploidy, variant calling parameters)
-- Reference files (genome FASTA, GTF, footprints)
-- Database connection details (for merge/load workflows)
+## Configuration
 
-Key parameters:
-- `samplesheet`: Path to nf-core format CSV samplesheet (sample, fastq_1, fastq_2 columns)
-- `minCoverage`: Minimum coverage threshold for variant calling and masking
-- `ploidy`: Expected ploidy level
-- `freebayesMinAltFraction`: Minimum allele frequency for variant calls
+Key parameters in `nextflow.config` (profile-scoped):
 
-## Development
+| Parameter | Description |
+|---|---|
+| `samplesheet` | nf-core CSV (sample, fastq_1, fastq_2) |
+| `genomeFastaFile` | Reference genome FASTA |
+| `gtfFile` | Gene annotation GTF |
+| `footprintFile` | Gene footprints for CNV |
+| `minCoverage` | Min coverage for variant calling/masking |
+| `ploidy` | Expected ploidy |
+| `freebayesMinAltFraction` | Min alternate-allele fraction for FreeBayes calls |
 
-### Container and Dependencies
+## Containers
 
-The Docker image (`veupathdb/shortreadaligner:1.0.0`) includes:
-- Alignment tools: BWA, samtools, Picard, GATK
-- Variant callers: FreeBayes, bcftools
-- Analysis tools: bedtools, bedGraphToBigWig, htseq-count
-- Languages: Perl (with BioPerl), Julia 1.10.10, Python
-- VEuPathDB GUS framework components (for database loading)
-- SnpEff for variant annotation
+Each process declares its own Docker image. Key images:
+- `veupathdb/shortreadaligner:1.0.0` — BWA, samtools, Picard, GATK3, FreeBayes, bcftools, bedtools, Julia 1.10.10, Perl/BioPerl, SnpEff
+- `veupathdb/dnaseqanalysis:1.0.0` — Trimmomatic, htseq-count
 
-Julia dependencies (precompiled in image): SQLite.jl
+Julia deps (precompiled): `SQLite.jl`
 
-### Testing
+## Testing
 
-Tests are located in `testing/t/` and use Perl's Test2::V0 framework:
 ```bash
 nextflow run main.nf -entry runTests -profile tests
 ```
 
-Test utilities are in `testing/lib/`.
-
-### Recent Refactoring
-
-The Julia implementation (`bin/processSequenceVariations.jl`) was recently refactored to break up a 512-line main() function into modular functions. The variant calling has also been migrated from Varscan to FreeBayes.
-
-## Input Data Requirements
-
-### processSingleExperiment
-
-**Samplesheet** (CSV format, nf-core standard):
-- `sample`: Sample identifier (required, no spaces)
-- `fastq_1`: Path to R1/forward reads file (required)
-- `fastq_2`: Path to R2/reverse reads file (optional - leave empty for single-end)
-
-Example samplesheet.csv:
-```csv
-sample,fastq_1,fastq_2
-7G8,/path/to/7G8_R1.fastq.gz,/path/to/7G8_R2.fastq.gz
-CS2,/path/to/CS2_R1.fastq.gz,/path/to/CS2_R2.fastq.gz
-5.1,/path/to/5.1_SE.fastq.gz,
-```
-
-**Other required files**:
-- Reference genome FASTA
-- Gene annotation GTF file
-- Gene footprints file
-- Trimmomatic adapters file (optional, defaults to built-in adapters)
-
-### mergeExperiments
-- Consensus FASTA files (*.fa.gz) from processSingleExperiment
-- VCF files (result.vcf.gz) from processSingleExperiment
-- Coverage files (*.coverage.txt)
-- Transcript SQLite database
-- Indel SQLite database
-- Cache file, undoneStrains file, gusConfig file
+Tests in `testing/t/` use Perl's `Test2::V0` framework and are run via `prove`.
diff --git a/Dockerfile b/Dockerfile
@@ -5,7 +5,19 @@ ENV TABIX_VERSION=0.2.6
 
 ENV DEBIAN_FRONTEND=noninteractive
 
-RUN apt-get update && apt-get install -y git ant build-essential wget unzip bcftools python3 tabix samtools perl default-jre unzip cpanminus bioperl emacs libjson-perl libmodule-install-rdf-perl libxml-parser-perl libdate-manip-perl libtext-csv-perl libstatistics-descriptive-perl libtree-dagnode-perl libxml-simple-perl bwa trimmomatic openjdk-21-jre-headless && apt-get clean && apt-get purge && rm -rf /var/lib/apt/lists/* /tmp/*
+RUN apt-get update && apt-get install -y git ant build-essential wget unzip bcftools python3 tabix samtools perl default-jre unzip cpanminus bioperl emacs libjson-perl libmodule-install-rdf-perl libxml-parser-perl libdate-manip-perl libtext-csv-perl libstatistics-descriptive-perl libtree-dagnode-perl libxml-simple-perl bwa trimmomatic openjdk-21-jre-headless sqlite3 && apt-get clean && apt-get purge && rm -rf /var/lib/apt/lists/* /tmp/*
+
+ENV JULIA_VERSION=1.10.10
+RUN wget -q https://julialang-s3.julialang.org/bin/linux/x64/1.10/julia-${JULIA_VERSION}-linux-x86_64.tar.gz \
+    && tar xzf julia-${JULIA_VERSION}-linux-x86_64.tar.gz \
+    && mv julia-${JULIA_VERSION} /opt/julia \
+    && rm julia-${JULIA_VERSION}-linux-x86_64.tar.gz
+ENV PATH=/opt/julia/bin:$PATH
+ENV JULIA_DEPOT_PATH=/opt/julia_depot
+
+RUN mkdir -p /opt/julia_depot \
+ && julia -e 'using Pkg; Pkg.add("SQLite"); Pkg.precompile()'
+ENV JULIA_PROJECT=@v1.10
 
 WORKDIR /gusApp/gus_home/lib/perl