Draft
42 commits
78d28b8
remove templates and whole workflow
jbrestel Jan 24, 2026
bb82cf6
reorganize modules and workflows
jbrestel Jan 24, 2026
f9ab1f5
Replace HISAT2 with BWA-MEM for read alignment
jbrestel Jan 24, 2026
5b48f76
Replace VarScan with FreeBayes for variant calling
jbrestel Jan 25, 2026
89113cf
Rewrite processSequenceVariations in Julia
jbrestel Feb 1, 2026
73104c9
Update Dockerfile to use BWA-MEM and FreeBayes
jbrestel Feb 2, 2026
061af27
Merge branch 'refactor-01-26' into merge-experiments-refactor
jbrestel Feb 2, 2026
31b379c
Fix Julia download URL and update to v1.10.10
jbrestel Feb 2, 2026
852fa67
Refactor processSequenceVariations.jl: break up 512-line main() into …
jbrestel Feb 2, 2026
943ccb7
Add debug logging to processSequenceVariations.jl
jbrestel Feb 2, 2026
6ef9788
ProcessSingleExperiment functional. Using unmasked reference sequence…
rdemko2332 Feb 6, 2026
a5809a8
Refactor processSingleExperiment to use nf-core samplesheet format
jbrestel Feb 10, 2026
31b8246
Merge branch 'refactor-01-26' into merge-experiments-refactor
jbrestel Feb 10, 2026
5fa4d3e
Updating base image
rdemko2332 Feb 10, 2026
09df4f8
wip
jbrestel Feb 13, 2026
cb3e6de
Adding forward slash to ADD line for perl
rdemko2332 Feb 20, 2026
042fde5
Resolving perl module issues in workflow and container
rdemko2332 Feb 20, 2026
2d1c718
remove subsampling and better documentation for some alignment methods
jbrestel Feb 20, 2026
fcc67c3
merge
jbrestel Feb 20, 2026
dbc092d
remove the samtools depth step
jbrestel Feb 20, 2026
5f5c470
Adding new freebayes argument
rdemko2332 Feb 20, 2026
b0b9359
Removing unneeded output declaration from freebayes
rdemko2332 Feb 20, 2026
0b249a6
add stats
jbrestel Feb 20, 2026
6b3d9fb
Merge branch 'refactor-01-26' of github.com:VEuPathDB/dnaseq-nextflow…
jbrestel Feb 20, 2026
c9f6346
Simplify variant pipeline: output unfiltered VCF from freebayes, remo…
jbrestel Feb 20, 2026
5c4c671
Tidy output artifacts: add dedicated normaliseCoverageToBigWig proces…
jbrestel Feb 20, 2026
3b451ef
no chunk for vcf
jbrestel Feb 27, 2026
41804ad
vcf file for indels is compressed
jbrestel Feb 27, 2026
350f47e
Separate coverage gVCF from variant calling, consolidate indel output…
jbrestel Feb 27, 2026
9c91fc4
do not publish some extra files. reorganize some outputs
jbrestel Mar 3, 2026
42d51fe
Merge branch 'refactor-01-26' into merge-experiments-refactor
jbrestel Mar 3, 2026
3e2a886
Merge branch 'main' into refactor-01-26
jbrestel Mar 3, 2026
cd1c6b4
Merge branch 'refactor-01-26' into merge-experiments-refactor
jbrestel Mar 3, 2026
2e83b99
Merge branch 'main' into merge-experiments-refactor
jbrestel Mar 3, 2026
08de997
update claude.md
jbrestel Mar 4, 2026
860a1dc
Refactor mergeExperiments params: replace inputDir glob with explicit…
jbrestel Mar 6, 2026
8f7ab50
Add CDS-prep pipeline: makeGenomicIndelDb, makeCodingData, GtfUtils.jl
jbrestel Mar 6, 2026
3eb8d3a
Fix parse_gtf exon_number consistency bug; clean up test structure
jbrestel Mar 6, 2026
9579ccc
Refactor mergeVcfs/gVcfs: fix indexing, add gVCF merge, remove addSam…
jbrestel Mar 10, 2026
0b1508a
gu+r
jbrestel Mar 10, 2026
ead1cd8
Resolving nextflow and julia interaction issues
rdemko2332 Mar 12, 2026
cb5c18c
Removing unneeded data files
rdemko2332 Mar 12, 2026
213 changes: 61 additions & 152 deletions CLAUDE.md
@@ -1,188 +1,97 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Overview

This is a Nextflow-based DNA sequencing analysis pipeline for processing genomic variation data. The pipeline processes raw sequencing reads (FASTQ) through alignment, variant calling, and CNV analysis, then merges results across multiple strains for downstream database loading and analysis.
Nextflow DSL2 pipeline for DNA sequencing analysis: FASTQ → alignment → variant calling → CNV → multi-strain merge → GUS database loading.

**Status**: Under construction, not used in production.

## Running Workflows

### Run the default workflow (processSingleExperiment)
```bash
# Default (processSingleExperiment)
nextflow run main.nf -profile processSingleExperiment
```

### Run specific workflows
```bash
# Process single experiment (alignment, variant calling, CNV)
# Named entry points
nextflow run main.nf -entry processSingleExperiment -profile processSingleExperiment

# Merge results across multiple experiments
nextflow run main.nf -entry mergeExperiments -profile mergeExperiments

# Load single experiment results to database
nextflow run main.nf -entry loadSingleExperiment -profile loadSingleExperiment

# Run test suite
nextflow run main.nf -entry runTests -profile tests
nextflow run main.nf -entry mergeExperiments -profile mergeExperiments
nextflow run main.nf -entry loadSingleExperiment -profile loadSingleExperiment
nextflow run main.nf -entry runTests -profile tests
```

### Docker execution
All workflows are designed to run in Docker containers. The profile configurations enable Docker by default (see `nextflow.config`).
Docker is enabled by default in all profiles.

## Architecture

### Workflow Organization

The codebase follows a three-tier structure:

1. **main.nf** - Entry point defining four named workflows: `processSingleExperiment`, `mergeExperiments`, `loadSingleExperiment`, and `runTests`
2. **workflows/** - High-level workflow orchestration that composes modules
3. **modules/** - Reusable process definitions grouped by function

### Three Primary Workflows

#### 1. processSingleExperiment (ps)
**Purpose**: Per-strain analysis from raw reads to variant calls and CNV data
**Input**: FASTQ files for individual strains (via nf-core samplesheet format)
**Output**: Consensus FASTA, VCF files, indel tables, ploidy estimates, gene CNVs, coverage bigWigs

**Pipeline stages**:
- Preprocessing: Quality control (FastQC), trimming (Trimmomatic)
- Alignment: BWA-MEM alignment, Picard deduplication, GATK realignment
- Variant calling: FreeBayes → SNP/indel separation → consensus genome generation
- CNV analysis: Coverage calculation, gene copy number estimation, ploidy determination
- Windowed analysis: SNP density, heterozygous SNP density, normalized coverage
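
The windowed SNP density step can be pictured as counting variant records per fixed-size genomic window. The following is a minimal sketch, not the pipeline's implementation: the function name `snp_density`, the window size argument, and the BED-like output layout (0-based, half-open windows) are assumptions made for illustration. It counts every non-header VCF record, so it expects input that has already been restricted to SNPs.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count VCF records per fixed-size window.
# usage: snp_density <uncompressed.vcf> <window_bp>
# output: chrom <TAB> window_start <TAB> window_end <TAB> record_count
snp_density() {
    awk -F'\t' -v w="$2" '
        /^#/ { next }                          # skip VCF header lines
        {
            # VCF POS is 1-based; convert to a 0-based window start
            start = int(($2 - 1) / w) * w
            n[$1 "\t" start "\t" (start + w)]++
        }
        END { for (k in n) print k "\t" n[k] }
    ' "$1" | sort -k1,1 -k2,2n                 # order by chromosome, then start
}

# Example: snp_density sample.snps.vcf 10000 > sample.snp_density.bed
```

Windows without any records are simply absent from the output; a real track would likely need them filled with zeros before conversion to bigWig.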

**Key modules**: `preprocessing.nf`, `alignment.nf`, `snp.nf`, `cnv.nf`

#### 2. mergeExperiments (me)
**Purpose**: Combine multi-strain outputs and prepare for database loading
**Input**: Consensus FASTAs and VCF files from processSingleExperiment
**Output**: Merged VCF, annotated variation files, database load files, SnpEff annotations
Three-tier structure: `main.nf` → `workflows/` → `modules/`

**Pipeline stages**:
- Merge VCFs across all strains
- Process sequence variations using `bin/processSequenceVariations.jl` (Julia implementation replacing legacy Perl)
- Annotate variants with transcript/gene features
- Generate database load files (variation, product, allele tables)
- Run SnpEff for functional annotation
### Workflows

**Key modules**: `mergeExperiments.nf`
| Workflow | Purpose | Key modules |
|---|---|---|
| `processSingleExperiment` | Per-strain: FASTQ → consensus FASTA + VCF + coverage | preprocessing.nf, alignment.nf, snp.nf, cnv.nf |
| `mergeExperiments` | Multi-strain: merge VCFs, annotate variants, generate DB load files | mergeExperiments.nf |
| `loadSingleExperiment` | Load indel/ploidy/CNV data into GUS database | loadSingleExperiment.nf |
| `runTests` | Perl Test2::V0 test suite | runTests.nf |

#### 3. loadSingleExperiment (ls)
**Purpose**: Load per-strain indel and CNV data into GUS database
**Input**: Indel TSV files, ploidy files, gene CNV files
**Key modules**: `loadSingleExperiment.nf`
### processSingleExperiment stages
1. QC: FastQC, Trimmomatic
2. Alignment: BWA-MEM → Picard dedup → GATK indel realignment
3. Variant calling: FreeBayes → indel TSV → consensus + masked genome
4. CNV: bedtools coverage → htseq-count → TPM → ploidy + gene CNV
5. Windowed: SNP density, heterozygous SNP density, normalized coverage BigWigs
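
Stage 4 normalizes per-gene read counts to TPM before estimating ploidy and gene CNV. As a rough illustration of that normalization only, the sketch below applies the standard TPM formula (length-normalized rate divided by the sum of all rates, times one million). The function name `counts_to_tpm` and the three-column input layout are assumptions; this is not the logic of `calculatePloidy.pl` or `calculateGeneCNVs.pl`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: convert per-gene read counts to TPM.
# Assumed input (illustrative layout): TSV of gene_id, gene_length_bp, read_count
# usage: counts_to_tpm <counts.tsv>
counts_to_tpm() {
    awk -F'\t' '
        # keep rows with a positive numeric length (this also skips a header row)
        $2 + 0 > 0 {
            n++
            id[n]   = $1
            rate[n] = $3 / $2          # reads per base of gene length
            total  += rate[n]
        }
        END {
            for (i = 1; i <= n; i++)
                printf "%s\t%.2f\n", id[i], (total > 0 ? rate[i] / total * 1e6 : 0)
        }
    ' "$1"
}

# Example: counts_to_tpm gene_counts.tsv > gene_tpm.tsv
```

Because TPM values sum to one million within a sample, they are comparable across genes in that sample, which is what a relative copy-number estimate needs.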

### Module Structure
### mergeExperiments stages
1. Merge VCFs across strains (bcftools)
2. `bin/processSequenceVariations.jl` — annotates variants via SQLite transcript/indel DBs; outputs cache + variation/allele/product DAT files
3. Add GUS feature IDs, generate DB load files
4. SnpEff functional annotation
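
For orientation, stage 1 could look roughly like the sketch below. The wrapper name `merge_vcfs`, the `DRY_RUN` switch, and the file names are hypothetical, and the options the pipeline actually passes to bcftools may differ. `bcftools merge` expects bgzip-compressed, indexed inputs and at least two files.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: merge per-strain VCFs into one compressed, indexed VCF.
# usage: merge_vcfs <out.vcf.gz> <in1.vcf.gz> <in2.vcf.gz> [...]
# Set DRY_RUN=1 to print the commands instead of running them.
merge_vcfs() {
    local out=$1
    shift
    if [ "$#" -lt 2 ]; then
        echo "merge_vcfs: need at least two input VCFs" >&2
        return 1
    fi
    # --output-type z writes bgzip-compressed VCF; --tbi builds a tabix index
    ${DRY_RUN:+echo} bcftools merge --output-type z --output "$out" "$@" &&
    ${DRY_RUN:+echo} bcftools index --tbi "$out"
}

# Example: DRY_RUN=1 merge_vcfs merged.vcf.gz strainA.vcf.gz strainB.vcf.gz
```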

Modules are organized by analysis stage:
- **preprocessing.nf**: QC and trimming
- **alignment.nf**: Read alignment and BAM processing
- **snp.nf**: Variant calling and consensus generation
- **cnv.nf**: Copy number variation and coverage analysis
- **mergeExperiments.nf**: Multi-strain merging and annotation
- **loadSingleExperiment.nf**: Database loading
- **runTests.nf**: Test execution

### Key Processing Scripts

The `bin/` directory contains Perl and Julia scripts used by processes:

- **processSequenceVariations.jl**: Core variation annotation script (Julia rewrite, replaces processSequenceVariationsNew.pl)
- Merges SNP file with cache file
- Annotates coding variants with codon/product information via SQLite
- Uses transcript and indel databases
- Outputs: cache, snpFeature.dat, allele.dat, product.dat

- **Variant processing**: maskGenome.pl, makeSnpFile.pl, fixSeqId.pl
- **CNV calculation**: calculatePloidy.pl, calculateGeneCNVs.pl
- **Database utilities**: addFeatureIdsToVariation.pl, addExtDbRlsIdToVariation.pl

### Data Flow
## Key Files

```
FASTQ files (via samplesheet)
↓ (processSingleExperiment)
Per-strain: consensus FASTA + VCF + coverage
↓ (mergeExperiments)
Merged VCF + annotated variations + database files
↓ (loadSingleExperiment or database loading)
Populated GUS database
main.nf # Entry point, samplesheet parsing
nextflow.config # All profiles and parameters
workflows/
processSingleExperiment.nf
mergeExperiments.nf
loadSingleExperiment.nf
modules/
preprocessing.nf alignment.nf snp.nf cnv.nf
mergeExperiments.nf loadSingleExperiment.nf runTests.nf
bin/
processSequenceVariations.jl # Core variation annotation (Julia)
makeSnpFile.pl maskGenome.pl fixSeqId.pl
calculatePloidy.pl calculateGeneCNVs.pl
addFeatureIdsToVariation.pl addExtDbRlsIdToVariation.pl
testing/t/ # Perl test files
testing/lib/ # Test utilities
```

### Configuration

All parameters are defined in `nextflow.config` under profile-specific sections:
- Input/output directories
- Tool parameters (coverage thresholds, ploidy, variant calling parameters)
- Reference files (genome FASTA, GTF, footprints)
- Database connection details (for merge/load workflows)
## Configuration

Key parameters:
- `samplesheet`: Path to nf-core format CSV samplesheet (sample, fastq_1, fastq_2 columns)
- `minCoverage`: Minimum coverage threshold for variant calling and masking
- `ploidy`: Expected ploidy level
- `freebayesMinAltFraction`: Minimum allele frequency for variant calls
Key parameters in `nextflow.config` (profile-scoped):

## Development
| Parameter | Description |
|---|---|
| `samplesheet` | nf-core CSV (sample, fastq_1, fastq_2) |
| `genomeFastaFile` | Reference genome FASTA |
| `gtfFile` | Gene annotation GTF |
| `footprintFile` | Gene footprints for CNV |
| `minCoverage` | Min coverage for variant calling/masking |
| `ploidy` | Expected ploidy |
| `freebayesMinAltFraction` | Min allele frequency for FreeBayes calls |

### Container and Dependencies
## Containers

The Docker image (`veupathdb/shortreadaligner:1.0.0`) includes:
- Alignment tools: BWA, samtools, Picard, GATK
- Variant callers: FreeBayes, bcftools
- Analysis tools: bedtools, bedGraphToBigWig, htseq-count
- Languages: Perl (with BioPerl), Julia 1.10.10, Python
- VEuPathDB GUS framework components (for database loading)
- SnpEff for variant annotation
Each process declares its own Docker image. Key images:
- `veupathdb/shortreadaligner:1.0.0` — BWA, samtools, Picard, GATK3, FreeBayes, bcftools, bedtools, Julia 1.10.10, Perl/BioPerl, SnpEff
- `veupathdb/dnaseqanalysis:1.0.0` — Trimmomatic, htseq-count

Julia dependencies (precompiled in image): SQLite.jl
Julia deps (precompiled): `SQLite.jl`

### Testing
## Testing

Tests are located in `testing/t/` and use Perl's Test2::V0 framework:
```bash
nextflow run main.nf -entry runTests -profile tests
```

Test utilities are in `testing/lib/`.

### Recent Refactoring

The Julia implementation (`bin/processSequenceVariations.jl`) was recently refactored to break up a 512-line `main()` function into modular functions. Variant calling has also been migrated from VarScan to FreeBayes.

## Input Data Requirements

### processSingleExperiment

**Samplesheet** (CSV format, nf-core standard):
- `sample`: Sample identifier (required, no spaces)
- `fastq_1`: Path to R1/forward reads file (required)
- `fastq_2`: Path to R2/reverse reads file (optional - leave empty for single-end)

Example samplesheet.csv:
```csv
sample,fastq_1,fastq_2
7G8,/path/to/7G8_R1.fastq.gz,/path/to/7G8_R2.fastq.gz
CS2,/path/to/CS2_R1.fastq.gz,/path/to/CS2_R2.fastq.gz
5.1,/path/to/5.1_SE.fastq.gz,
```
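
Before launching a run, it can help to check a samplesheet against the column rules above. The following is a minimal sketch under stated assumptions: `validate_samplesheet` is a hypothetical helper, not part of the pipeline, and it assumes a plain comma-separated file with no quoted fields. It reports each sample as `paired` or `single` and exits non-zero on the first class of problem it can detect.

```bash
#!/usr/bin/env bash
# Hypothetical helper: sanity-check an nf-core style samplesheet.
# usage: validate_samplesheet <samplesheet.csv>
# output: sample <TAB> paired|single ; exit status 1 if any row is invalid
validate_samplesheet() {
    awk -F',' '
        NR == 1 {
            if ($0 != "sample,fastq_1,fastq_2") {
                print "bad header: " $0 > "/dev/stderr"
                bad = 1
                exit                     # jumps to END, which returns the status
            }
            next
        }
        $0 == "" { next }                # tolerate blank lines
        $1 == "" || $1 ~ / / {
            print "line " NR ": sample id is empty or contains spaces" > "/dev/stderr"
            bad = 1
            next
        }
        $2 == "" {
            print "line " NR ": fastq_1 is required" > "/dev/stderr"
            bad = 1
            next
        }
        { print $1 "\t" ($3 == "" ? "single" : "paired") }
        END { exit bad }
    ' "$1"
}

# Example: validate_samplesheet samplesheet.csv && nextflow run main.nf -profile processSingleExperiment
```

The check does not verify that the FASTQ paths exist; it only covers the structural rules for the three columns.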

**Other required files**:
- Reference genome FASTA
- Gene annotation GTF file
- Gene footprints file
- Trimmomatic adapters file (optional, defaults to built-in adapters)

### mergeExperiments
- Consensus FASTA files (*.fa.gz) from processSingleExperiment
- VCF files (result.vcf.gz) from processSingleExperiment
- Coverage files (*.coverage.txt)
- Transcript SQLite database
- Indel SQLite database
- Cache file, undoneStrains file, gusConfig file
Tests in `testing/t/` use Perl's `Test2::V0` framework, run via `prove`.
14 changes: 13 additions & 1 deletion Dockerfile
@@ -5,7 +5,19 @@ ENV TABIX_VERSION=0.2.6

ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y git ant build-essential wget unzip bcftools python3 tabix samtools perl default-jre unzip cpanminus bioperl emacs libjson-perl libmodule-install-rdf-perl libxml-parser-perl libdate-manip-perl libtext-csv-perl libstatistics-descriptive-perl libtree-dagnode-perl libxml-simple-perl bwa trimmomatic openjdk-21-jre-headless && apt-get clean && apt-get purge && rm -rf /var/lib/apt/lists/* /tmp/*
RUN apt-get update && apt-get install -y git ant build-essential wget unzip bcftools python3 tabix samtools perl default-jre unzip cpanminus bioperl emacs libjson-perl libmodule-install-rdf-perl libxml-parser-perl libdate-manip-perl libtext-csv-perl libstatistics-descriptive-perl libtree-dagnode-perl libxml-simple-perl bwa trimmomatic openjdk-21-jre-headless sqlite3 && apt-get clean && apt-get purge && rm -rf /var/lib/apt/lists/* /tmp/*

ENV JULIA_VERSION=1.10.10
RUN wget -q https://julialang-s3.julialang.org/bin/linux/x64/1.10/julia-${JULIA_VERSION}-linux-x86_64.tar.gz \
&& tar xzf julia-${JULIA_VERSION}-linux-x86_64.tar.gz \
&& mv julia-${JULIA_VERSION} /opt/julia \
&& rm julia-${JULIA_VERSION}-linux-x86_64.tar.gz
ENV PATH=/opt/julia/bin:$PATH
ENV JULIA_DEPOT_PATH=/opt/julia_depot

RUN mkdir -p /opt/julia_depot \
&& julia -e 'using Pkg; Pkg.add("SQLite"); Pkg.precompile()'
ENV JULIA_PROJECT=@v1.10

WORKDIR /gusApp/gus_home/lib/perl
