Rewrite processSequenceVariations by jbrestel · Pull Request #4 · VEuPathDB/dnaseq-nextflow

jbrestel · 2026-02-01T22:45:10Z

Replace bin/processSequenceVariationsNew.pl with bin/processSequenceVariations.jl. Sequence access moves from per-call samtools faidx subprocess spawns to a single SQLite bulk-fetch per transcript cached as an in-memory Dict. The O(N²) sed-based line reading in getVariations is replaced with a single-pass sorted merge over two open file handles. CDS lookup switches from a linear scan of all transcripts to binary search on a sorted interval array. Per-strain coordinate shifting is eliminated — the upstream transcript-prep process pre-splices CDS sequences into transcript coordinates, and a lightweight indel-shift SQL query handles position adjustment. Frameshift detection is precomputed at startup from the indels DB.
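To illustrate the single-pass sorted merge mentioned above, here is a minimal Python sketch (the PR itself implements this in Julia). The three-column record layout, the `parse` and `merged_positions` names, and the "cache"/"new" source labels are all assumptions made for illustration; the only idea taken from the description is that two already-sorted inputs are each read once and walked in lockstep.

```python
import heapq
import io

def parse(line, source):
    # Hypothetical record layout: seq_id <TAB> position <TAB> payload
    seq_id, pos, payload = line.rstrip("\n").split("\t")
    return (seq_id, int(pos)), source, payload

def merged_positions(cache_handle, new_handle):
    """Yield ((seq_id, pos), [(source, payload), ...]) in sorted order.

    Both handles must already be sorted by (seq_id, pos); each is read once.
    """
    a = (parse(line, "cache") for line in cache_handle)
    b = (parse(line, "new") for line in new_handle)
    # heapq.merge lazily interleaves the two sorted streams.
    merged = heapq.merge(a, b, key=lambda record: record[0])
    current_key, group = None, []
    for key, source, payload in merged:
        if key != current_key and group:
            yield current_key, group
            group = []
        current_key = key
        group.append((source, payload))
    if group:
        yield current_key, group

cache = io.StringIO("chr1\t10\tA\nchr1\t30\tG\n")
new = io.StringIO("chr1\t10\tT\nchr1\t20\tC\n")
merged_result = list(merged_positions(cache, new))
```

Because neither input is re-scanned for each position, the work grows linearly with the total number of lines, which is the property the description contrasts with the sed-based approach.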

Nextflow changes: processSeqVars drops genomeFasta/consensusFasta inputs and adds transcriptDb/indelDb (placeholder channels pending upstream wiring). The Dockerfile adds Julia 1.10.8 with SQLite.jl precompiled.
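The per-transcript bulk fetch described in this PR could look roughly like the following Python sketch, which uses the standard-library `sqlite3` module in place of SQLite.jl. The `coding_sequence` table, its columns, and the `SequenceCache` and `demo_connection` names are hypothetical; the actual schema of transcriptDb is not given here.

```python
import sqlite3

class SequenceCache:
    """Caches every strain's sequence for one transcript at a time."""

    def __init__(self, conn):
        self.conn = conn
        self.transcript_id = None
        self.by_strain = {}
        self.queries = 0  # counts bulk fetches, for illustration only

    def get(self, transcript_id, strain):
        if transcript_id != self.transcript_id:
            # One query loads all strains for the transcript (assumed schema).
            rows = self.conn.execute(
                "SELECT strain, sequence FROM coding_sequence "
                "WHERE transcript_id = ?",
                (transcript_id,),
            ).fetchall()
            self.by_strain = dict(rows)
            self.transcript_id = transcript_id
            self.queries += 1
        return self.by_strain.get(strain)

def demo_connection():
    # In-memory database with made-up rows, only for trying the cache out.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE coding_sequence "
        "(transcript_id TEXT, strain TEXT, sequence TEXT)"
    )
    conn.executemany(
        "INSERT INTO coding_sequence VALUES (?, ?, ?)",
        [
            ("t1", "strainA", "ATGAAA"),
            ("t1", "strainB", "ATGAAG"),
            ("t2", "strainA", "ATGCCC"),
        ],
    )
    return conn
```

Repeated lookups for the same transcript are served from the dict, so the number of queries tracks the number of transcripts visited rather than the number of calls.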

Includes julia-rewrite-plan.md documenting the architecture and design decisions.
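As a rough illustration of the CDS lookup change described above (binary search on a sorted interval array instead of a linear scan), here is a Python sketch. The `CdsInterval` fields and the function names are hypothetical, and the sketch assumes non-overlapping intervals on a single sequence with 1-based inclusive coordinates; the real GTF handling may differ.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional

class CdsInterval(NamedTuple):
    start: int          # 1-based, inclusive (assumed convention)
    end: int            # 1-based, inclusive
    transcript_id: str

def build_index(intervals):
    # Sort once at startup; keep a parallel list of starts for bisect.
    ordered = sorted(intervals)
    starts = [iv.start for iv in ordered]
    return ordered, starts

def find_cds(ordered, starts, pos) -> Optional[CdsInterval]:
    # The rightmost interval whose start is <= pos is the only candidate
    # when intervals do not overlap.
    i = bisect_right(starts, pos) - 1
    if i >= 0 and ordered[i].start <= pos <= ordered[i].end:
        return ordered[i]
    return None

ordered, starts = build_index(
    [CdsInterval(300, 400, "t2"), CdsInterval(100, 200, "t1")]
)
```

Each lookup costs O(log n) comparisons after the one-time sort. Overlapping transcripts would need a different structure, such as an interval tree.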

- Update bwaIndex process to use bwa index instead of hisat2-build
- Update bwaMem process to use bwa mem with read group tags
- Remove HISAT2-specific flags and quality encoding logic
- Update workflow to use new BWA-MEM processes

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Replace mpileup + varscan processes with single freebayes process
- FreeBayes outputs single VCF which is split into SNPs and indels
- Rename processes for genericity (makeCombinedVarscanIndex → makeCombinedVariantIndex)
- Update all workflow references from varscan to freebayes
- Update configuration: varscanPValue/varscanMinVarFreq → freebayesMinAltFraction
- Rename varscan_directory → coverage_directory throughout
- Update processSequenceVariationsNew.pl to use coverage_directory flag
- Update BWA-related config (hisat2Threads → bwaThreads, hisat2Index → bwaIndex)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Replace bin/processSequenceVariationsNew.pl with bin/processSequenceVariations.jl. Sequence access moves from per-call samtools faidx subprocess spawns to a single SQLite bulk-fetch per transcript cached as an in-memory Dict. The O(N²) sed-based line reading in getVariations is replaced with a single-pass sorted merge over two open file handles. CDS lookup switches from a linear scan of all transcripts to binary search on a sorted interval array. Per-strain coordinate shifting is eliminated — the upstream transcript-prep process pre-splices CDS sequences into transcript coordinates, and a lightweight indel-shift SQL query handles position adjustment. Frameshift detection is precomputed at startup from the indels DB. Nextflow changes: processSeqVars drops genomeFasta/consensusFasta inputs and adds transcriptDb/indelDb (placeholder channels pending upstream wiring). The Dockerfile adds Julia 1.10.8 with SQLite.jl precompiled. Includes julia-rewrite-plan.md documenting the architecture and design decisions. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

jbrestel · 2026-02-01T22:45:43Z

This is a WIP and not ready for review, but the "md" file here is worth a look to get a sense of what we're doing.

Replace VarScan with FreeBayes and add BWA-MEM to match the updated Nextflow workflow. Also update the snpEff download URL from the deprecated Azure blob storage to the current AWS S3 location. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Update the Julia download URL from the deprecated julialang-releases.github.io domain to the official julialang-s3.julialang.org endpoint. Also bump Julia version to 1.10.10 (latest LTS). Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

…modular functions

Reduced main() from 512 lines to 38 lines by extracting focused helper functions.

New data structures:
- ProcessingContext: bundles all read-only reference data
- OutputWriters: encapsulates output file handles
- PositionAnnotation: groups annotation data for a position
- TranscriptSequenceCache: manages transcript sequence caching

New functions (17 total):
- Resource management (5): init/close context, open/close writers, finalize files
- Position processing (3): determine next position, collect variations, check variation
- Annotation logic (4): annotate position, annotate variations, build reference, fill gaps
- Output writing (3): write cache, SNP feature, allele/product files
- Main loop (2): process single position, process all positions

Benefits:
- Readability: main() is now a clear 10-step pipeline
- Maintainability: each function has a single responsibility
- Testability: functions can be unit tested independently
- Performance: no impact - same algorithms, just reorganized
- Functional equivalence: all original logic preserved

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Adds toggleable debug output via --debug flag to track processing pipeline stages including GTF parsing, frameshift computation, per-position processing, transcript loading, and output writing. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Simplifies input handling by replacing multiple input methods (BAM files, local FASTQ discovery, SRA downloads) with a single standardized CSV samplesheet format following nf-core conventions.

Key changes:
- Add samplesheet parser that auto-detects paired-end vs single-end reads
- Remove download processes (downloadBAMFromEBI, downloadFiles)
- Remove obsolete parameters: fromBAM, local, isPaired, createIndex, ebiFtpUser, ebiFtpPassword, organismAbbrev, bwaIndex
- Simplify bwaIndex process to always create index from genome FASTA
- Update process signatures to remove fromBAM checks
- Add CLAUDE.md documentation with samplesheet format examples
- Include test samplesheets (samplesheet_chr1.csv, samplesheet_mixed.csv)

Net result: -109 lines of code, clearer separation of concerns, better alignment with nf-core ecosystem standards.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
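The paired-end vs single-end auto-detection mentioned in this commit can be sketched in Python as below (the pipeline's own parser lives in Nextflow). The column names `sample`, `fastq_1`, and `fastq_2` follow the common nf-core samplesheet layout and are an assumption here, as is the `parse_samplesheet` name; the repository's CLAUDE.md documents the actual format.

```python
import csv
import io

def parse_samplesheet(handle):
    """Return one dict per sample; a row is paired-end when fastq_2 is non-empty."""
    samples = []
    for row in csv.DictReader(handle):
        fastq_2 = (row.get("fastq_2") or "").strip()
        reads = [row["fastq_1"].strip()]
        if fastq_2:
            reads.append(fastq_2)
        samples.append(
            {"sample": row["sample"].strip(), "paired": bool(fastq_2), "reads": reads}
        )
    return samples

sheet = io.StringIO(
    "sample,fastq_1,fastq_2\n"
    "A,a_1.fq.gz,a_2.fq.gz\n"
    "B,b.fq.gz,\n"
)
samples = parse_samplesheet(sheet)
```

Deriving the layout per row is what allows a mixed samplesheet to work without a global isPaired parameter.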

…ve redundant concat/index/filter steps

- freebayes now outputs ${sampleName}.vcf.gz (unfiltered, sample-named for merge uniqueness) alongside the existing split snps/indels VCFs
- Remove concatSnpsAndIndels, makeCombinedVariantIndex, filterIndels processes; downstream steps use the unfiltered VCF directly via a channel map
- makeIndelTSV wired to freebayes.indels.vcf.gz, bypassing the redundant vcftools filter step
- Update makeSnpDensity and getHeterozygousSNPs input tuples to include unfiltered VCF fields; remove stale genomeMaskedFasta from getHeterozygousSNPs input

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…s, remove stale publishDirs

- Add normaliseCoverageToBigWig as a dedicated process in cnv.nf publishing to outputDir/CNVs as ${sampleName}_normalisedCoverage.bw, replacing the alias of bedGraphToBigWig that caused a name collision with the raw coverage bigwig
- Remove dead publishDir from freebayes (coverage.txt was never generated)
- Remove publishDir from gatk (BAM/BAI no longer published as pipeline artifacts)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…s, fix Dockerfile

- Remove --gvcf from freebayes (crashes with ploidy=1); add dedicated bcftoolsMpileupGvcf process for per-base coverage gVCF generation
- Add mergeGvcfs process to combine per-sample coverage gVCFs
- Fix findValues.pl to decompress gzipped indel VCF via zcat
- Merge per-sample indel TSVs into single indels.tsv using collectFile
- Remove bin/* COPY from Dockerfile; Nextflow mounts bin/ at runtime

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

… file params, remove GUS load steps

- Replace inputDir/coverageFilePath with vcfFiles, gVcfFiles, indelsFiles, relativeConsensusFilePattern
- Rename cacheFile -> vcfCacheFile; remove cacheFileDir and other legacy params
- Remove addFeatureIdsToVariation, insertVariation, insertProduct, insertAllele processes
- Remove BAM/BigWig/coverage trigger channels from processSeqVars
- Add gvcfs_qch to mergeExperiments workflow take block

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Remove makeSnpFile; processSeqVars now takes merged VCF directly (--vcf_file)
- Add makeGenomicIndelDb process: loads per-strain genomic indels TSV into SQLite
- Add makeCodingData process + bin/makeCodingData.jl: splices CDS sequences per strain from consensus FASTAs + GTF, projects genomic indels to CDS coords, outputs codingSequences.db and codingIndels.db
- Add bin/GtfUtils.jl: shared GTF parsing, CDS interval binary search, position_in_cds, and IUPAC-aware reverse_complement
- Update julia-rewrite-plan.md: rename transcript -> codingSequence throughout, bring CDS-prep pipeline in scope, document new processes
- Add sqlite3 to Dockerfile apt-get install
- Add Julia tests: testing/t/GtfUtils.jl and testing/t/makeCodingData.jl

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
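For readers unfamiliar with what "IUPAC-aware" means for reverse_complement, here is a small Python sketch of the idea. It is not the code in bin/GtfUtils.jl; it only shows the standard IUPAC nucleotide complement pairs (for example R pairs with Y, K with M, B with V, D with H), with case handling added as an assumption.

```python
# Standard IUPAC nucleotide codes and their complements, position by position.
_IUPAC = "ACGTRYSWKMBDHVN"
_COMP = "TGCAYRSWMKVHDBN"
_TABLE = str.maketrans(_IUPAC + _IUPAC.lower(), _COMP + _COMP.lower())

def reverse_complement(seq):
    """Complement every base (including ambiguity codes), then reverse."""
    return seq.translate(_TABLE)[::-1]
```

Ambiguity codes matter here because consensus sequences can carry codes such as R or Y at heterozygous sites, and a plain A/C/G/T complement would leave them unchanged on minus-strand genes.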

- Build intervals array after exon_number reassignment in parse_gtf so intervals and by_transcript are consistent; fixes position_in_cds returning total CDS length instead of 0 for the first exon when exon_number is absent
- Guard main() in makeCodingData.jl with PROGRAM_FILE check so the file can be included in tests without triggering execution
- Rewrite makeCodingData test to include bin files directly (single type namespace) instead of a wrapper module, eliminating CdsExon type conflicts
- All 83 GtfUtils tests and 24 makeCodingData tests now pass

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…pleToDefline

- mergeVcfs: drop publishDir, use bcftools index instead of tabix round-trip, fix vcfFiles/gVcfFiles globs to match only .vcf.gz (not .tbi)
- mergeExperiments workflow: alias mergeVcfs as mergeGvcfs for gVCF merging, branch on single/multiple files to skip merge when only one input
- bcftoolsConsensusAndMask: bgzip output as ${sampleName}_consensus.fa.gz, absorb publishDir from removed addSampleToDefline process
- Remove addSampleToDefline process, include, call, and bin/addSampleToDefline.pl; defline renaming was causing seq ID mismatch in makeCodingData

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

jbrestel and others added 5 commits January 24, 2026 16:28

remove templates and whole workflow

78d28b8

reorganize modules and workflows

bb82cf6

jbrestel requested review from kathryncrouch and rdemko2332 February 1, 2026 22:45

jbrestel and others added 22 commits February 1, 2026 22:59

Merge branch 'refactor-01-26' into merge-experiments-refactor

061af27

ProcessSingleExperiment functional. Using unmasked reference sequence…

6ef9788

… for bcftools consensus, and then masking later.

Merge branch 'refactor-01-26' into merge-experiments-refactor

31b8246

Updating base image

5fa4d3e

wip

09df4f8

Adding forward slash to ADD line for perl

cb3e6de

Resolving perl module issues in workflow and container

042fde5

remove subsampling and better documentation for some alignment methods

2d1c718

merge

fcc67c3

remove the samtools depth step

dbc092d

Adding new freebayes argument

5f5c470

Removing unneeded output declaration from freebayes

b0b9359

add stats

0b249a6

Merge branch 'refactor-01-26' of github.com:VEuPathDB/dnaseq-nextflow…

6b3d9fb

… into refactor-01-26

no chunk for vcf

3b451ef

jbrestel and others added 7 commits February 27, 2026 10:33

vcf file for indels is compressed

41804ad

do not publish some extra files. reorganize some outputs

9c91fc4

Merge branch 'refactor-01-26' into merge-experiments-refactor

42d51fe

Merge branch 'main' into refactor-01-26

3e2a886

Merge branch 'refactor-01-26' into merge-experiments-refactor

cd1c6b4

Merge branch 'main' into merge-experiments-refactor

2e83b99

jbrestel changed the base branch from refactor-01-26 to main March 3, 2026 15:40

jbrestel and others added 8 commits March 3, 2026 22:25

update claude.md

08de997

gu+r

0b1508a

Resolving nextflow and julia interaction issues

ead1cd8

Removing unneeded data files

cb5c18c

jbrestel wants to merge 42 commits into main from merge-experiments-refactor

2 participants
