Add deacon-based host read scrubbing #14

Open
nrminor wants to merge 4 commits into main from deacon-impl

Conversation


nrminor commented Feb 12, 2026

Summary

This PR replaces STAT-based host read scrubbing in the preprocessing workflow with deacon, a fast alignment-free decontamination tool. The clumpify/SRA submission workflow is unchanged and continues to use STAT.

Deacon solves three problems with the current STAT approach:

  • Header preservation. Deacon passes FASTQ records through unmodified, preserving CASAVA 1.8+ headers. This is required for repair.sh to re-pair reads after filtering, which in turn is required for SPAdes paired-end assembly.
  • Composable indexes. Deacon's set algebra (deacon index union/diff/intersect) lets us build an nvd-specific contaminant index by combining the panhuman-1 human pangenome with discovered false positives, without rebuilding from scratch (see the sketch after this list).
  • Speed. SIMD-accelerated minimizer matching at gigabases per second with ~5 GB RAM for the panhuman index, versus STAT's multi-step approach (FASTA conversion → alignment → grep).
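
To make the set algebra concrete, here is a minimal sketch of the union step as a Nextflow process. The deacon index union subcommand is named above; the argument handling and stdout redirection are assumptions about the CLI, not confirmed behavior.

```groovy
// Hypothetical sketch of index composition with deacon's set algebra.
// Writing the combined index via stdout redirection is an assumption.
process DEACON_UNION_INDEXES {
    input:
    path indexes   // staged .idx files collected from upstream channels

    output:
    path "combined.idx", emit: index

    script:
    """
    deacon index union ${indexes} > combined.idx
    """
}
```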

Structure

The core change is split into two commits to make review and bisection easier:

Commit 1: Add deacon infrastructure. Creates modules/deacon.nf (four processes: DEACON_BUILD_INDEX, DEACON_FETCH_INDEX, DEACON_UNION_INDEXES, DEACON_DEPLETE) and subworkflows/host_depletion.nf. Adds deacon parameters to nextflow.config, NvdParams, the CLI, the JSON schema, and the CI validation script. Adds deacon as a pixi dependency. At this point the code exists but nothing calls it.
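
For orientation, here is a minimal sketch of the depletion process from that module. Only the process name and the two threshold params appear in this PR; the deacon filter flag names and the output handling are assumptions for illustration.

```groovy
// Hypothetical sketch of DEACON_DEPLETE from modules/deacon.nf.
// The flag names below are assumptions; only the process name and the
// threshold params come from this PR.
process DEACON_DEPLETE {
    input:
    tuple val(sample_id), path(reads)
    path index

    output:
    tuple val(sample_id), path("${sample_id}.depleted.fastq.gz"), emit: reads

    script:
    """
    deacon filter \\
        --abs-threshold ${params.deacon_abs_threshold} \\
        --rel-threshold ${params.deacon_rel_threshold} \\
        ${index} ${reads} \\
        | gzip > ${sample_id}.depleted.fastq.gz
    """
}
```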

Commit 2: Wire it up. A single-file change to workflows/preprocess_reads.nf that swaps the SCRUB_HOST_READS import for HOST_DEPLETION and replaces the STAT scrub block with the deacon equivalent. The existing scrub_host_reads param continues to gate the step — no new toggle was introduced.
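
The swap itself is small. A sketch of the new wiring, assuming illustrative channel and emit names; only the import swap and the scrub_host_reads gate are dictated by the commit:

```groovy
// workflows/preprocess_reads.nf (sketch; channel and emit names illustrative)
include { HOST_DEPLETION } from '../subworkflows/host_depletion'

workflow PREPROCESS_READS {
    take:
    ch_reads

    main:
    // the existing boolean still gates the step; no new toggle
    ch_to_scrub = params.scrub_host_reads ? ch_reads : Channel.empty()
    HOST_DEPLETION(ch_to_scrub)
    ch_scrubbed = params.scrub_host_reads ? HOST_DEPLETION.out.reads : ch_reads

    emit:
    reads = ch_scrubbed
}
```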

Architectural decisions

The HOST_DEPLETION subworkflow uses declarative channel ternaries and empty-channel gating rather than procedural if/else blocks. This ensures Nextflow constructs the same DAG structure regardless of which params are set, which matters for -resume cache consistency. For example, DEACON_FETCH_INDEX takes a val url input channel instead of being a zero-input process, so when the URL channel is empty the process sits in the DAG but never executes — no when: guard needed.
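
For instance, the fetch gating can be expressed as a channel ternary like the following (a sketch with illustrative channel names):

```groovy
// Fetch a prebuilt index only when no local index path was supplied.
ch_index_url = (!params.deacon_index && params.deacon_index_url)
    ? Channel.of(params.deacon_index_url)
    : Channel.empty()

// DEACON_FETCH_INDEX declares `val url` as its input, so the process is
// always part of the DAG; with an empty input channel it simply never fires.
DEACON_FETCH_INDEX(ch_index_url)
```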

Index union routing is determined at parse time via a def needs_union check on params, avoiding runtime .branch/.size()/.map gymnastics that are fragile in Groovy's type system. DEACON_UNION_INDEXES always receives its input via .collect(), but its output is only used when needs_union is true. The final ch_index is converted to a value channel via .first() so it can safely .combine() with every sample without being consumed.
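
A sketch of that routing, again with illustrative channel names:

```groovy
// Decided once at parse time from params, not from channel contents.
def needs_union = params.deacon_contaminants_fasta != null

// The union process always receives a collected input; when nothing needs
// unioning the collected channel is empty and the process never runs.
DEACON_UNION_INDEXES(ch_all_indexes.collect())

// Select the final index and convert it to a value channel so it can be
// combined with every sample without being consumed.
ch_index = (needs_union ? DEACON_UNION_INDEXES.out.index : ch_base_index)
    .first()

ch_scrub_input = ch_reads.combine(ch_index)
```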

New parameters

The scrub_host_reads toggle is reused — no new boolean param. The deacon-specific params control how scrubbing is performed:

| Parameter | Default | CLI flag | Description |
| --- | --- | --- | --- |
| deacon_index | null | --deacon-index | Path to a prebuilt .idx file |
| deacon_index_url | panhuman-1 Zenodo URL | --deacon-index-url | URL to download a prebuilt index |
| deacon_contaminants_fasta | null | --deacon-contaminants-fasta | Additional contaminant FASTA to union with the base index |
| deacon_kmer_size | 31 | params-file only | K-mer length for index building |
| deacon_window_size | 15 | params-file only | Minimizer window size |
| deacon_abs_threshold | 2 | params-file only | Minimum absolute minimizer hits |
| deacon_rel_threshold | 0.01 | params-file only | Minimum relative minimizer proportion |
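
In nextflow.config these land roughly as follows. The defaults come from the table above; the Zenodo URL is left as a placeholder here because the link itself is not given in this description.

```groovy
// Sketch of the params block added to nextflow.config. The deacon_index_url
// default is the panhuman-1 Zenodo link; a placeholder stands in for it here.
params {
    deacon_index              = null
    deacon_index_url          = '<panhuman-1 Zenodo URL>'
    deacon_contaminants_fasta = null
    deacon_kmer_size          = 31
    deacon_window_size        = 15
    deacon_abs_threshold      = 2
    deacon_rel_threshold      = 0.01
}
```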

What's not included

  • The clumpify/SRA workflow (workflows/clumpify.nf) is untouched. SCRUB_HUMAN_READS and sra_human_db remain as-is.
  • No documentation updates beyond code comments. Docs will be updated in a follow-up.
  • The SCRUB_HOST_READS process remains in modules/stat.nf but is no longer imported by the preprocessing workflow.

Commit messages

The BLAST workflow currently .collect()s all per-sample prepared CSVs into a single list and hands them to one LABKEY_UPLOAD_BLAST and one LABKEY_UPLOAD_FASTA process invocation. Those processes discover files via os.listdir() and loop over them. This means no upload can start until every sample finishes preparation, and a failure on any single file fails the entire upload.

The GOTTCHA2 workflow already uses the eager pattern: each upload process receives a per-sample queue channel tuple, fires as soon as that sample is ready, and emits its own log. Logs are .mix()ed together for downstream gating.

This commit brings BLAST uploads in line with that pattern. The change is entirely in the Nextflow wiring (bundle_blast_for_labkey.nf and stat_blast_workflow.nf). The Python upload scripts are unchanged — they already handle the single-file case correctly because their os.listdir() loop naturally finds one file when Nextflow stages one file.
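
The shape of the wiring change, with illustrative channel and emit names:

```groovy
// Before: LABKEY_UPLOAD_BLAST(ch_prepared_csvs.collect()) ran once, after
// every sample had finished preparation.

// After: each upload fires as soon as its sample's tuple arrives.
LABKEY_UPLOAD_BLAST(ch_prepared_csvs)      // tuple(sample_id, csv), one per sample
LABKEY_UPLOAD_FASTA(ch_prepared_fastas)    // tuple(sample_id, fasta), one per sample

// Merge per-sample upload logs for downstream gating, as in GOTTCHA2.
ch_upload_logs = LABKEY_UPLOAD_BLAST.out.log
    .mix(LABKEY_UPLOAD_FASTA.out.log)
```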
The v2.4.0 schema was released with that tag and should not be modified. New params from feature branches (deacon, GOTTCHA2, etc.) need a new schema version to land in.

Introduces deacon modules, host depletion subworkflow, config params, Pydantic model fields, CLI options, JSON schema, and publishDir entry. Not yet wired into the preprocessing workflow.

Replaces STAT-based SCRUB_HOST_READS with deacon-based HOST_DEPLETION in preprocess_reads.nf. The scrub_host_reads param continues to gate the step; deacon is used when any deacon index config is available.

