Rewrite processSequenceVariations #4

Draft
jbrestel wants to merge 42 commits into main from merge-experiments-refactor

jbrestel commented Feb 1, 2026

Replace bin/processSequenceVariationsNew.pl with bin/processSequenceVariations.jl. Sequence access moves from per-call samtools faidx subprocess spawns to a single SQLite bulk-fetch per transcript cached as an in-memory Dict. The O(N²) sed-based line reading in getVariations is replaced with a single-pass sorted merge over two open file handles. CDS lookup switches from a linear scan of all transcripts to binary search on a sorted interval array. Per-strain coordinate shifting is eliminated — the upstream transcript-prep process pre-splices CDS sequences into transcript coordinates, and a lightweight indel-shift SQL query handles position adjustment. Frameshift detection is precomputed at startup from the indels DB.
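
To illustrate the CDS lookup change, here is a minimal Python sketch (not the PR's Julia code) of a binary search over a sorted interval array. The names `CdsInterval`, `build_index`, and `find_cds` are hypothetical, and the sketch assumes intervals on one sequence do not overlap.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Optional, Tuple


class CdsInterval(NamedTuple):
    """Hypothetical CDS record; coordinates are 1-based and inclusive."""
    start: int
    end: int
    transcript_id: str


def build_index(intervals: List[CdsInterval]) -> Tuple[List[CdsInterval], List[int]]:
    """Sort intervals by start once and keep a parallel list of starts for bisect."""
    ordered = sorted(intervals, key=lambda iv: iv.start)
    starts = [iv.start for iv in ordered]
    return ordered, starts


def find_cds(ordered: List[CdsInterval], starts: List[int], pos: int) -> Optional[CdsInterval]:
    """Return the interval containing pos, or None. Assumes non-overlapping intervals."""
    # Index of the last interval whose start is <= pos.
    i = bisect_right(starts, pos) - 1
    if i >= 0 and ordered[i].start <= pos <= ordered[i].end:
        return ordered[i]
    return None


ordered, starts = build_index([
    CdsInterval(300, 400, "t2"),
    CdsInterval(100, 200, "t1"),
])
hit = find_cds(ordered, starts, 150)
```

Each lookup costs O(log N) after a one-time sort, instead of a scan over every transcript. Overlapping CDS intervals would need an interval tree or a short backward scan from the bisect index, which this sketch does not attempt.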

Nextflow changes: processSeqVars drops genomeFasta/consensusFasta inputs and adds transcriptDb/indelDb (placeholder channels pending upstream wiring). The Dockerfile adds Julia 1.10.8 with SQLite.jl precompiled.
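
As a rough picture of how a transcriptDb input could replace per-call subprocess spawns, the following Python sketch fetches every strain's sequence for one transcript in a single query and caches the result in a dict. The table name `transcript_sequence`, its columns, and the `SequenceCache` class are assumptions made for illustration; the PR does not show its actual schema.

```python
import sqlite3
from typing import Dict, Optional


def load_transcript_sequences(conn: sqlite3.Connection, transcript_id: str) -> Dict[str, str]:
    """Bulk-fetch all strain sequences for one transcript (hypothetical schema)."""
    rows = conn.execute(
        "SELECT strain, sequence FROM transcript_sequence WHERE transcript_id = ?",
        (transcript_id,),
    )
    return {strain: sequence for strain, sequence in rows}


class SequenceCache:
    """Keeps only the most recently requested transcript in memory."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        self.transcript_id: Optional[str] = None
        self.by_strain: Dict[str, str] = {}

    def get(self, transcript_id: str, strain: str) -> Optional[str]:
        # One SQL round trip per transcript; later lookups hit the dict.
        if transcript_id != self.transcript_id:
            self.by_strain = load_transcript_sequences(self.conn, transcript_id)
            self.transcript_id = transcript_id
        return self.by_strain.get(strain)
```

Caching a single transcript at a time suits input that is processed in coordinate order, since positions in the same transcript arrive together.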

Includes julia-rewrite-plan.md documenting the architecture and design decisions.

jbrestel and others added 5 commits January 24, 2026 16:28
- Update bwaIndex process to use bwa index instead of hisat2-build
- Update bwaMem process to use bwa mem with read group tags
- Remove HISAT2-specific flags and quality encoding logic
- Update workflow to use new BWA-MEM processes

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Replace mpileup + varscan processes with single freebayes process
- FreeBayes outputs single VCF which is split into SNPs and indels
- Rename processes for genericity (makeCombinedVarscanIndex → makeCombinedVariantIndex)
- Update all workflow references from varscan to freebayes
- Update configuration: varscanPValue/varscanMinVarFreq → freebayesMinAltFraction
- Rename varscan_directory → coverage_directory throughout
- Update processSequenceVariationsNew.pl to use coverage_directory flag
- Update BWA-related config (hisat2Threads → bwaThreads, hisat2Index → bwaIndex)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Replace bin/processSequenceVariationsNew.pl with bin/processSequenceVariations.jl.
Sequence access moves from per-call samtools faidx subprocess spawns to a single
SQLite bulk-fetch per transcript cached as an in-memory Dict. The O(N²) sed-based
line reading in getVariations is replaced with a single-pass sorted merge over two
open file handles. CDS lookup switches from a linear scan of all transcripts to
binary search on a sorted interval array. Per-strain coordinate shifting is
eliminated — the upstream transcript-prep process pre-splices CDS sequences into
transcript coordinates, and a lightweight indel-shift SQL query handles position
adjustment. Frameshift detection is precomputed at startup from the indels DB.
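
The single-pass sorted merge mentioned above can be sketched in Python as follows. This is an illustration under assumptions, not the PR's implementation: both inputs are taken to be tab-separated `position<TAB>value` lines, sorted ascending by position, with at most one line per position in each file. The function names are hypothetical.

```python
import io
from typing import Iterator, Optional, TextIO, Tuple


def read_records(handle: TextIO) -> Iterator[Tuple[int, str]]:
    """Yield (position, value) from 'position<TAB>value' lines, skipping blanks."""
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        pos, value = line.split("\t", 1)
        yield int(pos), value


def merge_by_position(
    left: TextIO, right: TextIO
) -> Iterator[Tuple[int, Optional[str], Optional[str]]]:
    """Walk both sorted inputs once, yielding (pos, left_value, right_value)."""
    li, ri = read_records(left), read_records(right)
    l, r = next(li, None), next(ri, None)
    while l is not None or r is not None:
        if r is None or (l is not None and l[0] < r[0]):
            yield l[0], l[1], None          # position only in the left input
            l = next(li, None)
        elif l is None or r[0] < l[0]:
            yield r[0], None, r[1]          # position only in the right input
            r = next(ri, None)
        else:
            yield l[0], l[1], r[1]          # position present in both inputs
            l, r = next(li, None), next(ri, None)


left = io.StringIO("10\tA\n20\tC\n")
right = io.StringIO("20\tT\n30\tG\n")
merged = list(merge_by_position(left, right))
```

Each line is read exactly once, so the cost is linear in the combined input size rather than quadratic.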

Nextflow changes: processSeqVars drops genomeFasta/consensusFasta inputs and adds
transcriptDb/indelDb (placeholder channels pending upstream wiring). The Dockerfile
adds Julia 1.10.8 with SQLite.jl precompiled.

Includes julia-rewrite-plan.md documenting the architecture and design decisions.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

jbrestel commented Feb 1, 2026

This is a WIP and not ready for review, but the "md" file here is worth a look to get a sense of what we're doing.

jbrestel and others added 22 commits February 1, 2026 22:59
Replace VarScan with FreeBayes and add BWA-MEM to match the updated Nextflow workflow. Also update the snpEff download URL from the deprecated Azure blob storage to the current AWS S3 location.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Update the Julia download URL from the deprecated julialang-releases.github.io domain to the official julialang-s3.julialang.org endpoint. Also bump Julia version to 1.10.10 (latest LTS).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…modular functions

Reduced main() from 512 lines to 38 lines by extracting focused helper functions.

New data structures:
- ProcessingContext: bundles all read-only reference data
- OutputWriters: encapsulates output file handles
- PositionAnnotation: groups annotation data for a position
- TranscriptSequenceCache: manages transcript sequence caching

New functions (17 total):
- Resource management (5): init/close context, open/close writers, finalize files
- Position processing (3): determine next position, collect variations, check variation
- Annotation logic (4): annotate position, annotate variations, build reference, fill gaps
- Output writing (3): write cache, SNP feature, allele/product files
- Main loop (2): process single position, process all positions

Benefits:
- Readability: main() is now a clear 10-step pipeline
- Maintainability: each function has a single responsibility
- Testability: functions can be unit tested independently
- Performance: no impact - same algorithms, just reorganized
- Functional equivalence: all original logic preserved

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Adds toggleable debug output via --debug flag to track processing pipeline stages including GTF parsing, frameshift computation, per-position processing, transcript loading, and output writing.
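
A toggleable debug flag of this kind can be sketched in Python as below. This is only an illustration of the pattern; the helper names `parse_args` and `make_debug_logger` and the `[DEBUG]` message format are assumptions, and the Julia script's actual option parsing is not shown in this PR.

```python
import argparse
import sys
from typing import Callable, List


def parse_args(argv: List[str]) -> argparse.Namespace:
    """Parse a single optional --debug switch (off by default)."""
    parser = argparse.ArgumentParser(description="debug flag sketch")
    parser.add_argument("--debug", action="store_true",
                        help="print per-stage diagnostics to stderr")
    return parser.parse_args(argv)


def make_debug_logger(enabled: bool) -> Callable[[str, str], None]:
    """Return a function that prints stage messages only when enabled."""
    def debug(stage: str, message: str) -> None:
        if enabled:
            # stderr keeps diagnostics out of any data written to stdout.
            print(f"[DEBUG] {stage}: {message}", file=sys.stderr)
    return debug


args = parse_args(["--debug"])
debug = make_debug_logger(args.debug)
debug("gtf", "parsing started")
```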

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
… for bcftools consensus, and then masking later.
Simplifies input handling by replacing multiple input methods (BAM files,
local FASTQ discovery, SRA downloads) with a single standardized CSV
samplesheet format following nf-core conventions.

Key changes:
- Add samplesheet parser that auto-detects paired-end vs single-end reads
- Remove download processes (downloadBAMFromEBI, downloadFiles)
- Remove obsolete parameters: fromBAM, local, isPaired, createIndex,
  ebiFtpUser, ebiFtpPassword, organismAbbrev, bwaIndex
- Simplify bwaIndex process to always create index from genome FASTA
- Update process signatures to remove fromBAM checks
- Add CLAUDE.md documentation with samplesheet format examples
- Include test samplesheets (samplesheet_chr1.csv, samplesheet_mixed.csv)

Net result: -109 lines of code, clearer separation of concerns, better
alignment with nf-core ecosystem standards.
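
For a sense of how paired-end versus single-end auto-detection can work, here is a small Python sketch. It assumes the common nf-core column names `sample`, `fastq_1`, and `fastq_2`, with an empty `fastq_2` meaning single-end; the pipeline's actual parser lives in Nextflow and may use different columns, so treat the names here as hypothetical.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, Optional, TextIO


@dataclass(frozen=True)
class Sample:
    name: str
    fastq_1: str
    fastq_2: Optional[str]

    @property
    def is_paired(self) -> bool:
        # Paired-end is inferred purely from the presence of a second FASTQ.
        return self.fastq_2 is not None


def parse_samplesheet(handle: TextIO) -> List[Sample]:
    """Read a CSV samplesheet with header sample,fastq_1,fastq_2 (assumed layout)."""
    samples = []
    for row in csv.DictReader(handle):
        name = (row.get("sample") or "").strip()
        fastq_1 = (row.get("fastq_1") or "").strip()
        fastq_2 = (row.get("fastq_2") or "").strip() or None
        if not name or not fastq_1:
            raise ValueError(f"samplesheet row needs sample and fastq_1: {row}")
        samples.append(Sample(name, fastq_1, fastq_2))
    return samples


sheet = io.StringIO(
    "sample,fastq_1,fastq_2\n"
    "A,a_1.fq.gz,a_2.fq.gz\n"
    "B,b.fq.gz,\n"
)
samples = parse_samplesheet(sheet)
```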

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…ve redundant concat/index/filter steps

- freebayes now outputs ${sampleName}.vcf.gz (unfiltered, sample-named for merge uniqueness)
  alongside the existing split snps/indels VCFs
- Remove concatSnpsAndIndels, makeCombinedVariantIndex, filterIndels processes;
  downstream steps use the unfiltered VCF directly via a channel map
- makeIndelTSV wired to freebayes.indels.vcf.gz, bypassing the redundant vcftools filter step
- Update makeSnpDensity and getHeterozygousSNPs input tuples to include unfiltered VCF fields;
  remove stale genomeMaskedFasta from getHeterozygousSNPs input

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…s, remove stale publishDirs

- Add normaliseCoverageToBigWig as a dedicated process in cnv.nf publishing to
  outputDir/CNVs as ${sampleName}_normalisedCoverage.bw, replacing the alias of
  bedGraphToBigWig that caused a name collision with the raw coverage bigwig
- Remove dead publishDir from freebayes (coverage.txt was never generated)
- Remove publishDir from gatk (BAM/BAI no longer published as pipeline artifacts)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
jbrestel and others added 7 commits February 27, 2026 10:33
…s, fix Dockerfile

- Remove --gvcf from freebayes (crashes with ploidy=1); add dedicated
  bcftoolsMpileupGvcf process for per-base coverage gVCF generation
- Add mergeGvcfs process to combine per-sample coverage gVCFs
- Fix findValues.pl to decompress gzipped indel VCF via zcat
- Merge per-sample indel TSVs into single indels.tsv using collectFile
- Remove bin/* COPY from Dockerfile; Nextflow mounts bin/ at runtime

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jbrestel jbrestel changed the base branch from refactor-01-26 to main March 3, 2026 15:40
jbrestel and others added 8 commits March 3, 2026 22:25
… file params, remove GUS load steps

- Replace inputDir/coverageFilePath with vcfFiles, gVcfFiles, indelsFiles, relativeConsensusFilePattern
- Rename cacheFile -> vcfCacheFile; remove cacheFileDir and other legacy params
- Remove addFeatureIdsToVariation, insertVariation, insertProduct, insertAllele processes
- Remove BAM/BigWig/coverage trigger channels from processSeqVars
- Add gvcfs_qch to mergeExperiments workflow take block

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove makeSnpFile; processSeqVars now takes merged VCF directly (--vcf_file)
- Add makeGenomicIndelDb process: loads per-strain genomic indels TSV into SQLite
- Add makeCodingData process + bin/makeCodingData.jl: splices CDS sequences
  per strain from consensus FASTAs + GTF, projects genomic indels to CDS coords,
  outputs codingSequences.db and codingIndels.db
- Add bin/GtfUtils.jl: shared GTF parsing, CDS interval binary search,
  position_in_cds, and IUPAC-aware reverse_complement
- Update julia-rewrite-plan.md: rename transcript -> codingSequence throughout,
  bring CDS-prep pipeline in scope, document new processes
- Add sqlite3 to Dockerfile apt-get install
- Add Julia tests: testing/t/GtfUtils.jl and testing/t/makeCodingData.jl
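
An IUPAC-aware reverse complement, as listed for bin/GtfUtils.jl, can be sketched in Python like this. The code is an illustration rather than a port of the Julia function; it uses the standard IUPAC nucleotide complement pairs and leaves any character outside that alphabet unchanged, which is a choice made here and not something the PR specifies.

```python
# Standard IUPAC nucleotide codes and their complements, position by position:
# A<->T, C<->G, R<->Y, K<->M, B<->V, D<->H; S, W and N are self-complementary.
_IUPAC = "ACGTRYSWKMBDHVN"
_COMPLEMENT = "TGCAYRSWMKVHDBN"
_TABLE = str.maketrans(_IUPAC + _IUPAC.lower(), _COMPLEMENT + _COMPLEMENT.lower())


def reverse_complement(seq: str) -> str:
    """Complement each base (including ambiguity codes), then reverse the string."""
    return seq.translate(_TABLE)[::-1]


example = reverse_complement("AARN")
```

Handling ambiguity codes matters when the input is a consensus sequence, where heterozygous or masked sites are likely to appear as codes such as R, Y, or N.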

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Build intervals array after exon_number reassignment in parse_gtf so
  intervals and by_transcript are consistent; fixes position_in_cds returning
  total CDS length instead of 0 for the first exon when exon_number is absent
- Guard main() in makeCodingData.jl with PROGRAM_FILE check so the file
  can be included in tests without triggering execution
- Rewrite makeCodingData test to include bin files directly (single type
  namespace) instead of a wrapper module, eliminating CdsExon type conflicts
- All 83 GtfUtils tests and 24 makeCodingData tests now pass
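
The first-exon case from the fix above can be made concrete with a Python sketch of a genomic-to-CDS offset calculation. The signature and return convention here are assumptions: this hypothetical `position_in_cds` takes 1-based inclusive exon coordinates and returns a 0-based offset into the spliced CDS, so the first coding base maps to 0. The real function in bin/GtfUtils.jl may differ.

```python
from typing import List, Optional, Tuple


def position_in_cds(exons: List[Tuple[int, int]], strand: str, pos: int) -> Optional[int]:
    """Map a genomic position to a 0-based offset in the spliced CDS.

    exons are (start, end) pairs, 1-based inclusive, non-overlapping.
    Returns None when pos falls outside every exon.
    """
    # Visit exons in transcription order: ascending on '+', descending on '-'.
    ordered = sorted(exons, reverse=(strand == "-"))
    offset = 0  # total length of the exons already passed
    for start, end in ordered:
        if start <= pos <= end:
            within = pos - start if strand == "+" else end - pos
            return offset + within
        offset += end - start + 1
    return None


exons = [(100, 109), (200, 209)]
first_base_plus = position_in_cds(exons, "+", 100)
first_base_minus = position_in_cds(exons, "-", 209)
```

Deriving the exon order from coordinates and strand, rather than from an exon_number attribute, avoids depending on a field that may be missing from the GTF.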

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…pleToDefline

- mergeVcfs: drop publishDir, use bcftools index instead of tabix round-trip,
  fix vcfFiles/gVcfFiles globs to match only .vcf.gz (not .tbi)
- mergeExperiments workflow: alias mergeVcfs as mergeGvcfs for gVCF merging,
  branch on single/multiple files to skip merge when only one input
- bcftoolsConsensusAndMask: bgzip output as ${sampleName}_consensus.fa.gz,
  absorb publishDir from removed addSampleToDefline process
- Remove addSampleToDefline process, include, call, and bin/addSampleToDefline.pl;
  defline renaming was causing seq ID mismatch in makeCodingData

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>