Add deacon-based host read scrubbing #14

Open
nrminor wants to merge 4 commits into main from deacon-impl

Conversation


nrminor commented Feb 12, 2026

Summary

This PR replaces STAT-based host read scrubbing in the preprocessing workflow with deacon, a fast alignment-free decontamination tool. The clumpify/SRA submission workflow is unchanged and continues to use STAT.

Deacon solves three problems with the current STAT approach:

  • Header preservation. Deacon passes FASTQ records through unmodified, preserving CASAVA 1.8+ headers. This is required for repair.sh to re-pair reads after filtering, which in turn is required for SPAdes paired-end assembly.
  • Composable indexes. Deacon's set algebra (deacon index union/diff/intersect) lets us build an nvd-specific contaminant index by combining the panhuman-1 human pangenome with discovered false positives, without rebuilding from scratch (see the sketch after this list).
  • Speed. SIMD-accelerated minimizer matching at gigabases per second with ~5 GB RAM for the panhuman index, versus STAT's multi-step approach (FASTA conversion → alignment → grep).
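
To make the set algebra concrete, here is a minimal sketch of the union step as a Nextflow process. The deacon index union subcommand is named above; the argument handling and stdout redirection are assumptions about the CLI, not confirmed behavior.

```groovy
// Hypothetical sketch of index composition with deacon's set algebra.
// Writing the combined index via stdout redirection is an assumption.
process DEACON_UNION_INDEXES {
    input:
    path indexes   // staged .idx files collected from upstream channels

    output:
    path "combined.idx", emit: index

    script:
    """
    deacon index union ${indexes} > combined.idx
    """
}
```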

Structure

The core change is split into two commits to make review and bisection easier:

Commit 1: Add deacon infrastructure. Creates modules/deacon.nf (four processes: DEACON_BUILD_INDEX, DEACON_FETCH_INDEX, DEACON_UNION_INDEXES, DEACON_DEPLETE) and subworkflows/host_depletion.nf. Adds deacon parameters to nextflow.config, NvdParams, the CLI, the JSON schema, and the CI validation script. Adds deacon as a pixi dependency. At this point the code exists but nothing calls it.
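
For orientation, here is a minimal sketch of the depletion process from that module. Only the process name and the two threshold params appear in this PR; the deacon filter flag names and the output handling are assumptions for illustration.

```groovy
// Hypothetical sketch of DEACON_DEPLETE from modules/deacon.nf.
// The flag names below are assumptions; only the process name and the
// threshold params come from this PR.
process DEACON_DEPLETE {
    input:
    tuple val(sample_id), path(reads)
    path index

    output:
    tuple val(sample_id), path("${sample_id}.depleted.fastq.gz"), emit: reads

    script:
    """
    deacon filter \\
        --abs-threshold ${params.deacon_abs_threshold} \\
        --rel-threshold ${params.deacon_rel_threshold} \\
        ${index} ${reads} \\
        | gzip > ${sample_id}.depleted.fastq.gz
    """
}
```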

Commit 2: Wire it up. A single-file change to workflows/preprocess_reads.nf that swaps the SCRUB_HOST_READS import for HOST_DEPLETION and replaces the STAT scrub block with the deacon equivalent. The existing scrub_host_reads param continues to gate the step — no new toggle was introduced.
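
The swap itself is small. A sketch of the new wiring, assuming illustrative channel and emit names; only the import swap and the scrub_host_reads gate are dictated by the commit:

```groovy
// workflows/preprocess_reads.nf (sketch; channel and emit names illustrative)
include { HOST_DEPLETION } from '../subworkflows/host_depletion'

workflow PREPROCESS_READS {
    take:
    ch_reads

    main:
    // the existing boolean still gates the step; no new toggle
    ch_to_scrub = params.scrub_host_reads ? ch_reads : Channel.empty()
    HOST_DEPLETION(ch_to_scrub)
    ch_scrubbed = params.scrub_host_reads ? HOST_DEPLETION.out.reads : ch_reads

    emit:
    reads = ch_scrubbed
}
```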

Architectural decisions

The HOST_DEPLETION subworkflow uses declarative channel ternaries and empty-channel gating rather than procedural if/else blocks. This ensures Nextflow constructs the same DAG structure regardless of which params are set, which matters for -resume cache consistency. For example, DEACON_FETCH_INDEX takes a val url input channel instead of being a zero-input process, so when the URL channel is empty the process sits in the DAG but never executes — no when: guard needed.
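
For instance, the fetch gating can be expressed as a channel ternary like the following (a sketch with illustrative channel names):

```groovy
// Fetch a prebuilt index only when no local index path was supplied.
ch_index_url = (!params.deacon_index && params.deacon_index_url)
    ? Channel.of(params.deacon_index_url)
    : Channel.empty()

// DEACON_FETCH_INDEX declares `val url` as its input, so the process is
// always part of the DAG; with an empty input channel it simply never fires.
DEACON_FETCH_INDEX(ch_index_url)
```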

Index union routing is determined at parse time via a def needs_union check on params, avoiding runtime .branch/.size()/.map gymnastics that are fragile in Groovy's type system. DEACON_UNION_INDEXES always receives its input via .collect(), but its output is only used when needs_union is true. The final ch_index is converted to a value channel via .first() so it can safely .combine() with every sample without being consumed.
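
A sketch of that routing, again with illustrative channel names:

```groovy
// Decided once at parse time from params, not from channel contents.
def needs_union = params.deacon_contaminants_fasta != null

// The union process always receives a collected input; when nothing needs
// unioning the collected channel is empty and the process never runs.
DEACON_UNION_INDEXES(ch_all_indexes.collect())

// Select the final index and convert it to a value channel so it can be
// combined with every sample without being consumed.
ch_index = (needs_union ? DEACON_UNION_INDEXES.out.index : ch_base_index)
    .first()

ch_scrub_input = ch_reads.combine(ch_index)
```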

New parameters

The scrub_host_reads toggle is reused — no new boolean param. The deacon-specific params control how scrubbing is performed:

| Parameter | Default | CLI flag | Description |
| --- | --- | --- | --- |
| deacon_index | null | --deacon-index | Path to a prebuilt .idx file |
| deacon_index_url | panhuman-1 Zenodo URL | --deacon-index-url | URL to download a prebuilt index |
| deacon_contaminants_fasta | null | --deacon-contaminants-fasta | Additional contaminant FASTA to union with the base index |
| deacon_kmer_size | 31 | params-file only | K-mer length for index building |
| deacon_window_size | 15 | params-file only | Minimizer window size |
| deacon_abs_threshold | 2 | params-file only | Minimum absolute minimizer hits |
| deacon_rel_threshold | 0.01 | params-file only | Minimum relative minimizer proportion |
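
In nextflow.config these land roughly as follows. The defaults come from the table above; the Zenodo URL is left as a placeholder here because the link itself is not given in this description.

```groovy
// Sketch of the params block added to nextflow.config. The deacon_index_url
// default is the panhuman-1 Zenodo link; a placeholder stands in for it here.
params {
    deacon_index              = null
    deacon_index_url          = '<panhuman-1 Zenodo URL>'
    deacon_contaminants_fasta = null
    deacon_kmer_size          = 31
    deacon_window_size        = 15
    deacon_abs_threshold      = 2
    deacon_rel_threshold      = 0.01
}
```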

What's not included

  • The clumpify/SRA workflow (workflows/clumpify.nf) is untouched. SCRUB_HUMAN_READS and sra_human_db remain as-is.
  • No documentation updates beyond code comments. Docs will be updated in a follow-up.
  • The SCRUB_HOST_READS process remains in modules/stat.nf but is no longer imported by the preprocessing workflow.

Commit messages

The BLAST workflow currently .collect()s all per-sample prepared CSVs into a single list and hands them to one LABKEY_UPLOAD_BLAST and one LABKEY_UPLOAD_FASTA process invocation. Those processes discover files via os.listdir() and loop over them. This means no upload can start until every sample finishes preparation, and a failure on any single file fails the entire upload.

The GOTTCHA2 workflow already uses the eager pattern: each upload process receives a per-sample queue channel tuple, fires as soon as that sample is ready, and emits its own log. Logs are .mix()ed together for downstream gating.

This commit brings BLAST uploads in line with that pattern. The change is entirely in the Nextflow wiring (bundle_blast_for_labkey.nf and stat_blast_workflow.nf). The Python upload scripts are unchanged — they already handle the single-file case correctly because their os.listdir() loop naturally finds one file when Nextflow stages one file.
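
The shape of the wiring change, with illustrative channel and emit names:

```groovy
// Before: LABKEY_UPLOAD_BLAST(ch_prepared_csvs.collect()) ran once, after
// every sample had finished preparation.

// After: each upload fires as soon as its sample's tuple arrives.
LABKEY_UPLOAD_BLAST(ch_prepared_csvs)      // tuple(sample_id, csv), one per sample
LABKEY_UPLOAD_FASTA(ch_prepared_fastas)    // tuple(sample_id, fasta), one per sample

// Merge per-sample upload logs for downstream gating, as in GOTTCHA2.
ch_upload_logs = LABKEY_UPLOAD_BLAST.out.log
    .mix(LABKEY_UPLOAD_FASTA.out.log)
```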
The v2.4.0 schema was released with that tag and should not be modified. New params from feature branches (deacon, GOTTCHA2, etc.) need a new schema version to land in.

Introduces deacon modules, host depletion subworkflow, config params, Pydantic model fields, CLI options, JSON schema, and publishDir entry. Not yet wired into the preprocessing workflow.

Replaces STAT-based SCRUB_HOST_READS with deacon-based HOST_DEPLETION in preprocess_reads.nf. The scrub_host_reads param continues to gate the step; deacon is used when any deacon index config is available.

