Draft
Conversation
- Update bwaIndex process to use bwa index instead of hisat2-build
- Update bwaMem process to use bwa mem with read group tags
- Remove HISAT2-specific flags and quality encoding logic
- Update workflow to use new BWA-MEM processes

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
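As a minimal illustration of the read-group tagging mentioned above (the function name and defaults here are hypothetical, not the workflow's actual code), the string passed to `bwa mem -R` is an `@RG` header line in which the tab separators must appear as literal `\t` in the argument:

```python
# Illustrative sketch: building the read-group string for `bwa mem -R`.
# The sample name doubles as ID and SM; platform defaults are an assumption.
def read_group(sample: str, platform: str = "ILLUMINA") -> str:
    """Return an @RG header line suitable for bwa mem -R (tabs escaped as \\t)."""
    return r"@RG\tID:{0}\tSM:{0}\tPL:{1}".format(sample, platform)
```

In a Nextflow script block the same string would typically be interpolated from `${sampleName}`.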
- Replace mpileup + varscan processes with single freebayes process
- FreeBayes outputs single VCF which is split into SNPs and indels
- Rename processes for genericity (makeCombinedVarscanIndex → makeCombinedVariantIndex)
- Update all workflow references from varscan to freebayes
- Update configuration: varscanPValue/varscanMinVarFreq → freebayesMinAltFraction
- Rename varscan_directory → coverage_directory throughout
- Update processSequenceVariationsNew.pl to use coverage_directory flag
- Update BWA-related config (hisat2Threads → bwaThreads, hisat2Index → bwaIndex)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
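The SNP/indel split described above is done by variant class, the same distinction `bcftools view -v snps` / `-v indels` draws: equal single-base REF and ALT is a SNP, differing lengths an indel. A sketch in Python (record shape is hypothetical; the pipeline does this with VCF tooling, not this code):

```python
# Illustrative sketch: partition (chrom, pos, ref, alt) records into SNPs and
# indels by comparing REF and ALT allele lengths.
def classify_variant(ref: str, alt: str) -> str:
    """Return 'snp', 'indel', or 'mnp' for a REF/ALT allele pair."""
    if len(ref) == 1 and len(alt) == 1:
        return "snp"
    if len(ref) != len(alt):
        return "indel"
    return "mnp"  # same length, more than one base changed

def split_vcf_records(records):
    """Split an iterable of (chrom, pos, ref, alt) tuples into SNP and indel lists."""
    snps, indels = [], []
    for rec in records:
        kind = classify_variant(rec[2], rec[3])
        if kind == "snp":
            snps.append(rec)
        elif kind == "indel":
            indels.append(rec)
    return snps, indels
```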
Replace bin/processSequenceVariationsNew.pl with bin/processSequenceVariations.jl. Sequence access moves from per-call samtools faidx subprocess spawns to a single SQLite bulk-fetch per transcript cached as an in-memory Dict. The O(N²) sed-based line reading in getVariations is replaced with a single-pass sorted merge over two open file handles. CDS lookup switches from a linear scan of all transcripts to binary search on a sorted interval array. Per-strain coordinate shifting is eliminated: the upstream transcript-prep process pre-splices CDS sequences into transcript coordinates, and a lightweight indel-shift SQL query handles position adjustment. Frameshift detection is precomputed at startup from the indels DB.

Nextflow changes: processSeqVars drops genomeFasta/consensusFasta inputs and adds transcriptDb/indelDb (placeholder channels pending upstream wiring). The Dockerfile adds Julia 1.10.8 with SQLite.jl precompiled.

Includes julia-rewrite-plan.md documenting the architecture and design decisions.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
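The single-pass sorted merge that replaces the O(N²) sed-based reading can be sketched as follows (a Python illustration with a hypothetical record shape; the real implementation is Julia over two open file handles). Because both inputs are sorted by position, each stream is read exactly once:

```python
# Illustrative sketch: merge two position-sorted streams in one pass,
# emitting a combined record whenever both streams carry the same position.
def merge_by_position(left, right):
    """left/right: iterables of (pos, payload) sorted by pos.
    Returns [(pos, left_payload, right_payload)] for shared positions."""
    left, right = iter(left), iter(right)
    l = next(left, None)
    r = next(right, None)
    out = []
    while l is not None and r is not None:
        if l[0] == r[0]:
            out.append((l[0], l[1], r[1]))
            l = next(left, None)
            r = next(right, None)
        elif l[0] < r[0]:
            l = next(left, None)   # left stream is behind; advance it
        else:
            r = next(right, None)  # right stream is behind; advance it
    return out
```

Total work is O(N + M) instead of re-scanning one file for every line of the other.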
This is a WIP. Not ready for review, but the "md" file here is worth a look to get a sense of what we're doing.
Replace VarScan with FreeBayes and add BWA-MEM to match the updated Nextflow workflow. Also update the snpEff download URL from the deprecated Azure blob storage to the current AWS S3 location. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Update the Julia download URL from the deprecated julialang-releases.github.io domain to the official julialang-s3.julialang.org endpoint. Also bump Julia version to 1.10.10 (latest LTS). Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…modular functions

Reduced main() from 512 lines to 38 lines by extracting focused helper functions.

New data structures:
- ProcessingContext: bundles all read-only reference data
- OutputWriters: encapsulates output file handles
- PositionAnnotation: groups annotation data for a position
- TranscriptSequenceCache: manages transcript sequence caching

New functions (17 total):
- Resource management (5): init/close context, open/close writers, finalize files
- Position processing (3): determine next position, collect variations, check variation
- Annotation logic (4): annotate position, annotate variations, build reference, fill gaps
- Output writing (3): write cache, SNP feature, allele/product files
- Main loop (2): process single position, process all positions

Benefits:
- Readability: main() is now a clear 10-step pipeline
- Maintainability: each function has a single responsibility
- Testability: functions can be unit tested independently
- Performance: no impact - same algorithms, just reorganized
- Functional equivalence: all original logic preserved

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Adds toggleable debug output via --debug flag to track processing pipeline stages including GTF parsing, frameshift computation, per-position processing, transcript loading, and output writing. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
… for bcftools consensus, and then masking later.
Simplifies input handling by replacing multiple input methods (BAM files, local FASTQ discovery, SRA downloads) with a single standardized CSV samplesheet format following nf-core conventions.

Key changes:
- Add samplesheet parser that auto-detects paired-end vs single-end reads
- Remove download processes (downloadBAMFromEBI, downloadFiles)
- Remove obsolete parameters: fromBAM, local, isPaired, createIndex, ebiFtpUser, ebiFtpPassword, organismAbbrev, bwaIndex
- Simplify bwaIndex process to always create index from genome FASTA
- Update process signatures to remove fromBAM checks
- Add CLAUDE.md documentation with samplesheet format examples
- Include test samplesheets (samplesheet_chr1.csv, samplesheet_mixed.csv)

Net result: -109 lines of code, clearer separation of concerns, better alignment with nf-core ecosystem standards.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
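The paired/single-end auto-detection works off whether the `fastq_2` column is populated. A sketch under the nf-core column convention `sample,fastq_1,fastq_2` (the actual parser lives in the Nextflow workflow; this Python version is only illustrative):

```python
# Illustrative sketch: parse a CSV samplesheet and flag each row as
# paired-end (fastq_2 present) or single-end (fastq_2 empty/missing).
import csv
import io

def parse_samplesheet(text):
    """Return a list of (sample, reads, is_paired) tuples from CSV text."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        fq2 = (row.get("fastq_2") or "").strip()
        reads = [row["fastq_1"]] + ([fq2] if fq2 else [])
        rows.append((row["sample"], reads, bool(fq2)))
    return rows
```

A mixed samplesheet (like samplesheet_mixed.csv presumably exercises) would then yield both paired and single-end tuples from one file.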
… into refactor-01-26
…ve redundant concat/index/filter steps
- freebayes now outputs ${sampleName}.vcf.gz (unfiltered, sample-named for merge uniqueness)
alongside the existing split snps/indels VCFs
- Remove concatSnpsAndIndels, makeCombinedVariantIndex, filterIndels processes;
downstream steps use the unfiltered VCF directly via a channel map
- makeIndelTSV wired to freebayes.indels.vcf.gz, bypassing the redundant vcftools filter step
- Update makeSnpDensity and getHeterozygousSNPs input tuples to include unfiltered VCF fields;
remove stale genomeMaskedFasta from getHeterozygousSNPs input
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…s, remove stale publishDirs
- Add normaliseCoverageToBigWig as a dedicated process in cnv.nf publishing to
outputDir/CNVs as ${sampleName}_normalisedCoverage.bw, replacing the alias of
bedGraphToBigWig that caused a name collision with the raw coverage bigwig
- Remove dead publishDir from freebayes (coverage.txt was never generated)
- Remove publishDir from gatk (BAM/BAI no longer published as pipeline artifacts)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…s, fix Dockerfile

- Remove --gvcf from freebayes (crashes with ploidy=1); add dedicated bcftoolsMpileupGvcf process for per-base coverage gVCF generation
- Add mergeGvcfs process to combine per-sample coverage gVCFs
- Fix findValues.pl to decompress gzipped indel VCF via zcat
- Merge per-sample indel TSVs into single indels.tsv using collectFile
- Remove bin/* COPY from Dockerfile; Nextflow mounts bin/ at runtime

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
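The findValues.pl fix is the classic "script reads a now-gzipped file as plain text" bug, solved there by piping through zcat. The equivalent pattern in Python, shown only as an illustration of the fix's intent:

```python
# Illustrative sketch: open a file transparently, decompressing when the
# name ends in .gz (mirrors piping through zcat in the Perl fix).
import gzip

def open_maybe_gzip(path):
    """Return a text-mode file handle, gunzipping .gz files on the fly."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")
```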
… file params, remove GUS load steps

- Replace inputDir/coverageFilePath with vcfFiles, gVcfFiles, indelsFiles, relativeConsensusFilePattern
- Rename cacheFile -> vcfCacheFile; remove cacheFileDir and other legacy params
- Remove addFeatureIdsToVariation, insertVariation, insertProduct, insertAllele processes
- Remove BAM/BigWig/coverage trigger channels from processSeqVars
- Add gvcfs_qch to mergeExperiments workflow take block

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove makeSnpFile; processSeqVars now takes merged VCF directly (--vcf_file)
- Add makeGenomicIndelDb process: loads per-strain genomic indels TSV into SQLite
- Add makeCodingData process + bin/makeCodingData.jl: splices CDS sequences per strain from consensus FASTAs + GTF, projects genomic indels to CDS coords, outputs codingSequences.db and codingIndels.db
- Add bin/GtfUtils.jl: shared GTF parsing, CDS interval binary search, position_in_cds, and IUPAC-aware reverse_complement
- Update julia-rewrite-plan.md: rename transcript -> codingSequence throughout, bring CDS-prep pipeline in scope, document new processes
- Add sqlite3 to Dockerfile apt-get install
- Add Julia tests: testing/t/GtfUtils.jl and testing/t/makeCodingData.jl

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
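"IUPAC-aware" reverse complement means ambiguity codes are complemented too, not just A/C/G/T: R (A or G) maps to Y (C or T), K (G or T) to M (A or C), and so on. A Python rendering of the idea (the actual function is Julia in bin/GtfUtils.jl):

```python
# Illustrative sketch: reverse-complement a DNA string while mapping IUPAC
# ambiguity codes to the complement of the bases they denote.
_COMP = str.maketrans("ACGTRYSWKMBDHVNacgtryswkmbdhvn",
                      "TGCAYRSWMKVHDBNtgcayrswmkvhdbn")

def reverse_complement(seq: str) -> str:
    """Reverse-complement, preserving IUPAC codes (R<->Y, K<->M, S/W/N fixed)."""
    return seq.translate(_COMP)[::-1]
```

Consensus FASTAs from bcftools consensus can carry such codes at heterozygous or masked sites, which is why plain A/C/G/T complementing is not enough here.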
- Build intervals array after exon_number reassignment in parse_gtf so intervals and by_transcript are consistent; fixes position_in_cds returning total CDS length instead of 0 for the first exon when exon_number is absent
- Guard main() in makeCodingData.jl with PROGRAM_FILE check so the file can be included in tests without triggering execution
- Rewrite makeCodingData test to include bin files directly (single type namespace) instead of a wrapper module, eliminating CdsExon type conflicts
- All 83 GtfUtils tests and 24 makeCodingData tests now pass

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
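The bug above matters because position_in_cds depends on each exon interval carrying the cumulative CDS length of the exons before it; if the first exon is ordered last, its offset is the total CDS length rather than 0. A forward-strand-only Python sketch of the lookup (interval shape and names are hypothetical; the real code is Julia with strand handling):

```python
# Illustrative sketch: binary search for a genomic position over CDS exon
# intervals sorted by start, each as (start, end, cds_offset) where
# cds_offset is the CDS length preceding the exon (0 for the first exon).
from bisect import bisect_right

def position_in_cds(intervals, pos):
    """Return the 0-based CDS coordinate of genomic pos, or -1 if intronic/outside."""
    # Find the rightmost interval whose start is <= pos.
    i = bisect_right(intervals, (pos, float("inf"), float("inf"))) - 1
    if i < 0:
        return -1
    start, end, offset = intervals[i]
    if pos > end:
        return -1  # falls in an intron or past the last exon
    return offset + (pos - start)
```

With the fix, a position at the very start of the first exon returns 0, not the total CDS length.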
…pleToDefline
- mergeVcfs: drop publishDir, use bcftools index instead of tabix round-trip,
fix vcfFiles/gVcfFiles globs to match only .vcf.gz (not .tbi)
- mergeExperiments workflow: alias mergeVcfs as mergeGvcfs for gVCF merging,
branch on single/multiple files to skip merge when only one input
- bcftoolsConsensusAndMask: bgzip output as ${sampleName}_consensus.fa.gz,
absorb publishDir from removed addSampleToDefline process
- Remove addSampleToDefline process, include, call, and bin/addSampleToDefline.pl;
defline renaming was causing seq ID mismatch in makeCodingData
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>