The BLAST workflow currently .collect()s all per-sample prepared CSVs into a single list and hands them to one LABKEY_UPLOAD_BLAST and one LABKEY_UPLOAD_FASTA process invocation. Those processes discover files via os.listdir() and loop over them. This means no upload can start until every sample finishes preparation, and a failure in any batch fails the entire upload. The GOTTCHA2 workflow already uses the eager pattern: each upload process receives a per-sample queue channel tuple, fires as soon as that sample is ready, and emits its own log. Logs are .mix()ed together for downstream gating. This commit brings BLAST uploads in line with that pattern. The change is entirely in the Nextflow wiring (bundle_blast_for_labkey.nf and stat_blast_workflow.nf). The Python upload scripts are unchanged — they already handle the single-file case correctly because their os.listdir() loop naturally finds one file when Nextflow stages one file.
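The eager per-sample pattern described above might be wired roughly like this (an illustrative Nextflow sketch, not the actual contents of `stat_blast_workflow.nf`; the upstream channel, script name, and the `LABKEY_UPLOAD_FASTA` wiring are assumed):

```nextflow
// Sketch: per-sample upload that fires as soon as each sample is ready,
// instead of .collect()-ing all CSVs into one batch invocation.

process LABKEY_UPLOAD_BLAST {
    input:
    tuple val(sample_id), path(prepared_csv)   // one sample per task

    output:
    path("${sample_id}.upload.log"), emit: log

    script:
    // The Python script's os.listdir() loop finds the single staged file.
    """
    upload_blast.py ${prepared_csv} > ${sample_id}.upload.log
    """
}

workflow {
    // Assumed upstream process emitting tuple(sample_id, csv) per sample.
    ch_prepared = PREPARE_BLAST_CSV.out

    LABKEY_UPLOAD_BLAST(ch_prepared)

    // Mix per-sample logs (FASTA uploads wired the same way) so
    // downstream gating sees one channel of logs.
    ch_upload_logs = LABKEY_UPLOAD_BLAST.out.log
        .mix(LABKEY_UPLOAD_FASTA.out.log)
}
```

Because each task stages exactly one CSV, a failed upload retries or fails for that sample alone rather than failing the whole batch.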
The v2.4.0 schema was released with that tag and should not be modified. New params from feature branches (deacon, GOTTCHA2, etc.) need a new schema version to land in.
Introduces deacon modules, host depletion subworkflow, config params, Pydantic model fields, CLI options, JSON schema, and publishDir entry. Not yet wired into the preprocessing workflow.
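The config side of such a commit might look like the following sketch. Param names are the ones this PR introduces; the defaults shown later in the PR description are used where given, and everything else (the `scrub_host_reads` default, the `deacon_index_url` default, the exact block layout in `nextflow.config`) is assumed:

```nextflow
// Sketch of the deacon parameter block in nextflow.config.
params {
    scrub_host_reads          = true    // existing toggle, reused as the gate (default assumed)
    deacon_index              = null    // path to a pre-built .idx file
    deacon_index_url          = null    // URL to fetch an index from (default assumed)
    deacon_contaminants_fasta = null    // extra contaminant FASTA to fold into the index
    deacon_kmer_size          = 31
    deacon_window_size        = 15
    deacon_abs_threshold      = 2
    deacon_rel_threshold      = 0.01
}
```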
Replaces STAT-based SCRUB_HOST_READS with deacon-based HOST_DEPLETION in preprocess_reads.nf. The scrub_host_reads param continues to gate the step; deacon is used when any deacon index config is available. The host_depletion subworkflow uses declarative channel ternaries and empty-channel gating rather than procedural if/else blocks. This ensures Nextflow constructs the same DAG structure regardless of which params are set, which matters for -resume cache consistency and makes the dataflow easier to reason about. Processes that aren't needed receive Channel.empty() inputs and simply don't execute, rather than being conditionally excluded from the DAG. Index union routing is determined at parse time from params via a def, avoiding runtime .branch/.size()/.map gymnastics that are fragile in Groovy's type system.
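A hedged sketch of this wiring style, under stated assumptions (process, channel, and param names are illustrative; the real `host_depletion.nf` may differ in names and details):

```nextflow
workflow HOST_DEPLETION_SKETCH {
    take:
    ch_reads    // tuple(sample_id, reads)

    main:
    // Declarative ternaries: unused sources become Channel.empty(),
    // so the DAG has the same shape whichever params are set.
    ch_local_index = params.deacon_index
        ? Channel.fromPath(params.deacon_index)
        : Channel.empty()

    ch_index_url = params.deacon_index_url
        ? Channel.of(params.deacon_index_url)
        : Channel.empty()

    ch_contaminants = params.deacon_contaminants_fasta
        ? Channel.fromPath(params.deacon_contaminants_fasta)
        : Channel.empty()

    // Always present in the DAG; each runs only if its input channel
    // actually emits a value. No when: guards, no if/else.
    DEACON_FETCH_INDEX(ch_index_url)
    DEACON_BUILD_INDEX(ch_contaminants)

    ch_base_index = ch_local_index.mix(DEACON_FETCH_INDEX.out)

    // Parse-time routing: plain Groovy on params, decided before the
    // DAG runs -- no runtime .branch/.size()/.map gymnastics.
    def needs_union = params.deacon_contaminants_fasta as boolean

    // Union always receives its (possibly single-element) input via
    // .collect(); its output simply goes unused when no union is needed.
    DEACON_UNION_INDEXES(
        ch_base_index.mix(DEACON_BUILD_INDEX.out).collect()
    )

    // .first() turns the chosen index into a value channel so it can
    // .combine() with every sample without being consumed after one use.
    ch_index = (needs_union ? DEACON_UNION_INDEXES.out : ch_base_index)
        .first()

    DEACON_DEPLETE(ch_reads.combine(ch_index))

    emit:
    depleted = DEACON_DEPLETE.out
}
```

The `needs_union` condition here is a guess at the actual predicate; the point is that it is evaluated once from `params` at parse time, so `-resume` sees an identical DAG on every run with the same configuration.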
Summary
This PR replaces STAT-based host read scrubbing in the preprocessing workflow with deacon, a fast alignment-free decontamination tool. The clumpify/SRA submission workflow is unchanged and continues to use STAT.
Deacon solves three problems with the current STAT approach:
- STAT filtering leaves reads unpaired, requiring `repair.sh` to re-pair reads after filtering, which in turn is required for SPAdes paired-end assembly.
- Deacon's index set operations (`deacon index union/diff/intersect`) let us build an nvd-specific contaminant index by combining the panhuman-1 human pangenome with discovered false positives, without rebuilding from scratch.

Structure
The PR is split into two commits to make review and bisection easier:
Commit 1: Add deacon infrastructure. Creates `modules/deacon.nf` (four processes: `DEACON_BUILD_INDEX`, `DEACON_FETCH_INDEX`, `DEACON_UNION_INDEXES`, `DEACON_DEPLETE`) and `subworkflows/host_depletion.nf`. Adds deacon parameters to `nextflow.config`, `NvdParams`, the CLI, the JSON schema, and the CI validation script. Adds `deacon` as a pixi dependency. At this point the code exists but nothing calls it.

Commit 2: Wire it up. A single-file change to `workflows/preprocess_reads.nf` that swaps the `SCRUB_HOST_READS` import for `HOST_DEPLETION` and replaces the STAT scrub block with the deacon equivalent. The existing `scrub_host_reads` param continues to gate the step; no new toggle was introduced.

Architectural decisions
The `HOST_DEPLETION` subworkflow uses declarative channel ternaries and empty-channel gating rather than procedural `if/else` blocks. This ensures Nextflow constructs the same DAG structure regardless of which params are set, which matters for `-resume` cache consistency. For example, `DEACON_FETCH_INDEX` takes a `val url` input channel instead of being a zero-input process, so when the URL channel is empty the process sits in the DAG but never executes; no `when:` guard is needed.

Index union routing is determined at parse time via a `def needs_union` check on params, avoiding runtime `.branch`/`.size()`/`.map` gymnastics that are fragile in Groovy's type system. `DEACON_UNION_INDEXES` always receives its input via `.collect()`, but its output is only used when `needs_union` is true. The final `ch_index` is converted to a value channel via `.first()` so it can safely `.combine()` with every sample without being consumed.

New parameters
The `scrub_host_reads` toggle is reused; no new boolean param. The deacon-specific params control how scrubbing is performed:

| Param | Default | CLI option |
|---|---|---|
| `deacon_index` | `null` | `--deacon-index` (expects a `.idx` file) |
| `deacon_index_url` | | `--deacon-index-url` |
| `deacon_contaminants_fasta` | `null` | `--deacon-contaminants-fasta` |
| `deacon_kmer_size` | `31` | |
| `deacon_window_size` | `15` | |
| `deacon_abs_threshold` | `2` | |
| `deacon_rel_threshold` | `0.01` | |

What's not included
- The clumpify/SRA submission workflow (`workflows/clumpify.nf`) is untouched: `SCRUB_HUMAN_READS` and `sra_human_db` remain as-is.
- The `SCRUB_HOST_READS` process remains in `modules/stat.nf` but is no longer imported by the preprocessing workflow.