Split dedup into dedup_seq and dedup_pos #16

Open
nrminor wants to merge 1 commit into schema-v2.5.0 from dedup-decomposition

nrminor (Member) commented Feb 13, 2026

Summary

The pipeline currently has a single dedup parameter controlling two distinct deduplication strategies that run at different stages:

  1. Sequence-based (clumpify): removes exact/near-exact duplicate reads based on sequence content, runs during read preprocessing
  2. Positional (samtools markdup): removes PCR/optical duplicates based on mapping position, runs after alignment, in the minimap2 module

This PR splits them into dedup_seq and dedup_pos so users can enable one without the other. The original dedup parameter is preserved as an umbrella that enables both, maintaining backward compatibility.

Resolution chain

Each flag follows the same precedence pattern used by the other preprocessing flags:

specific param ?: umbrella ?: master switch

Concretely:

  • dedup_seq ?: dedup ?: preprocess — gates clumpify in PREPROCESS_READS
  • dedup_pos ?: dedup ?: preprocess — gates samtools markdup in MAP_READS_TO_CONTIGS

So --dedup still turns both on, --preprocess still turns everything on, and the new flags provide fine-grained control.
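
To make the precedence concrete, here is a minimal Python sketch of the chain described above. It is not the pipeline's code: the function names `resolve` and `resolve_dedup` are hypothetical, and Python's `or` stands in for Groovy's Elvis operator, assuming the chain is written literally as `specific ?: umbrella ?: master`.

```python
from typing import Optional


def resolve(
    specific: Optional[bool],
    umbrella: Optional[bool],
    master: Optional[bool],
) -> bool:
    """Mimic `specific ?: umbrella ?: master` (hypothetical helper).

    Like Groovy's `?:`, Python's `or` skips any falsy value, so in this
    sketch an explicit False behaves the same as an unset (None) value
    and falls through to the next level.
    """
    return bool(specific or umbrella or master)


def resolve_dedup(
    dedup_seq: Optional[bool] = None,
    dedup_pos: Optional[bool] = None,
    dedup: Optional[bool] = None,
    preprocess: Optional[bool] = None,
) -> dict:
    """Return which deduplication steps would run for a given set of params."""
    return {
        # Gates clumpify in PREPROCESS_READS.
        "clumpify": resolve(dedup_seq, dedup, preprocess),
        # Gates samtools markdup in MAP_READS_TO_CONTIGS.
        "markdup": resolve(dedup_pos, dedup, preprocess),
    }


if __name__ == "__main__":
    print(resolve_dedup(dedup=True))       # umbrella enables both
    print(resolve_dedup(dedup_seq=True))   # sequence-based only
    print(resolve_dedup(preprocess=True))  # master switch enables both
```

All four parameters default to `None` here to mirror the null defaults in nextflow.config, which is what lets an unset specific flag defer to the umbrella and then to the master switch.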

Changes

Nextflow layer:

  • nextflow.config — added dedup_seq and dedup_pos params (both default null)
  • workflows/preprocess_reads.nf — resolution chain now checks dedup_seq first
  • modules/minimap2.nf — resolution chain now checks dedup_pos first

Python layer (CLI, model, schema):

  • lib/py_nvd/models.py — added dedup_seq and dedup_pos fields to NvdParams
  • schemas/nvd-params.v2.5.0.schema.json — added schema entries
  • lib/py_nvd/cli/commands/run.py — added --dedup-seq/--no-dedup-seq and --dedup-pos/--no-dedup-pos
  • lib/py_nvd/cli/commands/preset.py — same
  • lib/py_nvd/params.py — added to template generation list
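
As a rough illustration of how paired flags such as --dedup-seq/--no-dedup-seq can stay unset by default, the sketch below uses the standard library's `argparse.BooleanOptionalAction` (Python 3.9+). This is an assumption for illustration only: the PR does not say which CLI framework py_nvd uses, and `build_parser` and `explicit_params` are hypothetical names.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a toy parser with tri-state dedup flags (hypothetical sketch)."""
    parser = argparse.ArgumentParser(prog="nvd-run-sketch")
    for name in ("dedup", "dedup-seq", "dedup-pos"):
        parser.add_argument(
            f"--{name}",
            # Generates both --<name> and --no-<name>.
            action=argparse.BooleanOptionalAction,
            # None means "not specified", so the resolution chain can
            # fall back to the umbrella or master switch.
            default=None,
            help=f"enable or disable {name} (unset by default)",
        )
    return parser


def explicit_params(namespace: argparse.Namespace) -> dict:
    """Keep only the flags the user actually passed on the command line."""
    return {key: value for key, value in vars(namespace).items() if value is not None}


if __name__ == "__main__":
    parser = build_parser()
    print(explicit_params(parser.parse_args(["--dedup-seq"])))
    print(explicit_params(parser.parse_args(["--dedup", "--no-dedup-pos"])))
```

Keeping the default at `None` rather than `False` preserves the distinction between "not specified" and "explicitly disabled" on the Python side, matching the null defaults added to nextflow.config.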

Generated:

  • lib/py_nvd/_fingerprint.json — regenerated (nextflow.config changed)

wkgardner (Collaborator) commented

The precedence order makes sense and won't break backward compatibility for existing users. I am glad that we now have a way to fine-tune the level of deduplication we are using.

It looks like using both dedup_seq and dedup_pos at the same time works the same as the original dedup, and both ternary statements follow the correct resolution logic. ?: Groovy baby!

nrminor (Member, Author) commented Feb 13, 2026

Yes! Gotta love Groovy's Elvis operator (?:)!
