Unjoined pairs rescue by A-meara · Pull Request #189 · qiime2/q2-dada2

A-meara · 2025-12-19T12:58:03Z

Adds a --p-concat parameter to the denoise-paired command to rescue read pairs that fail to merge. When it is true, DADA2 no longer discards unmerged pairs; it concatenates their forward and reverse reads with a 10-nucleotide spacer (NNNNNNNNNN) between them, while still merging all other pairs as normal. This lets downstream kmer-based classifiers generate kmers from both reads independently, without creating artificial bridging kmers across the join. The rescued pairs are output as concatenated sequences in a new UnmergedPairs[Sequences] format, which the classifier consumes.
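To illustrate the bridging-kmer point, here is a minimal Python sketch. It is not the classifier's actual code: it assumes the spacer is exactly the 10-N string described above, and the function name `kmers_without_bridging` is hypothetical.

```python
SPACER = "N" * 10  # the 10-nucleotide spacer described above


def kmers_without_bridging(seq: str, k: int, spacer: str = SPACER) -> list[str]:
    """Return k-mers from each side of the spacer, never spanning the join."""
    kmers = []
    # Splitting on the spacer yields one segment for a merged read and
    # two segments (forward, reverse) for a concatenated pair.
    for segment in seq.split(spacer):
        # A segment shorter than k contributes no k-mers.
        kmers.extend(segment[i:i + k] for i in range(len(segment) - k + 1))
    return kmers


merged = "ACGTACGT"
rescued = "ACGTA" + SPACER + "TTGCA"
print(kmers_without_bridging(merged, 4))
print(kmers_without_bridging(rescued, 4))
```

Because each segment is kmerized on its own, no kmer in the result contains spacer characters or bases from both reads.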

(rebased from latest version, will do the same for q2-feature-classifier)


nbokulich · 2025-12-20T14:52:40Z

hi @A-meara , this is awesome thanks! a few small comments for discussion, and looks like you have some lint errors, please check. Once those tests pass, @ebolyen could you review the code or delegate when you get a chance? This can be slated for a release later in 2026 (i.e., not 2026.1, too soon).

1. The new type would be a kind of FeatureData. Could we call this FeatureData[UnmergedDNASequence] or something like that? @ebolyen any thoughts on the semantics?
2. Instead of a 10N spacer, could we just insert a gap character (-)? The semantic type would ensure that this is handled appropriately, i.e., that the gap represents a gap of unknown length and that this is not an alignment. A run of 10 Ns seems very unlikely to occur naturally, and such a read would usually be discarded as a low-quality sequence in most workflows anyway, but maybe it could crop up with OTU clustering or with some types of sequence data? A gap, by contrast, would never show up except in an alignment. Just trying to think of weird edge cases that could crop up in the future.

fixes #93
fixes #129
(effectively fixes both of these --> other steps needed to handle downstream, tbd in q2-feature-classifier, q2-kmerizer, q2-vsearch)

A-meara · 2025-12-20T22:10:06Z

Hi @nbokulich, I will take a look at those errors. On your two points:

On the type name: certainly open to the best semantic naming we can come up with!
On the spacer: if I recall correctly, it was easiest to use the Ns because they are already supported by the DNA formats and the gap character is not. And if you take a look at the handling in the classifier (see 192), the Ns are stripped and replaced with a blank space. So currently the only method that accepts this new semantic type strips them automatically when processing.

ebolyen · 2025-12-23T12:03:09Z

> The new type would be a kind of FeatureData. Could we call this FeatureData[UnmergedDNASequence] or something like this? @ebolyen any thoughts on the semantics?

> instead of a 10N spacer, could we just insert a gap character? (-) The semantic type would ensure that this is handled appropriately, i.e., that the gap represents a gap of unknown length and that this is not an alignment. A 10N seems very unlikely to occur, usually this would be discarded anyway as a low-quality sequence in most workflows, but maybe this could crop up with OTU clustering or with some types of sequence data? Whereas a gap would never show up except in an alignment. Just trying to think of weird edge cases that could crop up in the future.

Not to throw a wrench in anything, but there is a FeatureData[PairedEndSequence] type, which is not really any different in concept (it's just two fasta files in the same read-order).

It may be better to use this type as it will allow the downstream operation to decide how to represent the sequence without imposing any inline constraint on the alphabet. (e.g. a transformer is implemented which reads both records simultaneously and then puts a gap or spacer between them).
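A rough Python sketch of that idea, under stated assumptions: two FASTA files hold the forward and reverse sequences in the same record order, and the downstream consumer picks the spacer. The helper names `read_fasta` and `join_pairs` are hypothetical, and this is plain Python rather than an actual QIIME 2 transformer.

```python
import io
from typing import Iterator, TextIO


def read_fasta(handle: TextIO) -> Iterator[tuple[str, str]]:
    """Yield (id, sequence) pairs from a FASTA file handle."""
    seq_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # Keep only the first whitespace-delimited token as the ID.
            seq_id, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def join_pairs(fwd: TextIO, rev: TextIO, spacer: str) -> Iterator[tuple[str, str]]:
    """Read both files in lockstep and join each pair with the spacer.

    Assumes both files contain the same number of records.
    """
    for (f_id, f_seq), (r_id, r_seq) in zip(read_fasta(fwd), read_fasta(rev)):
        if f_id != r_id:
            raise ValueError(f"records out of order: {f_id!r} vs {r_id!r}")
        yield f_id, f_seq + spacer + r_seq


fwd = io.StringIO(">asv1\nACGT\n>asv2\nGGCC\n")
rev = io.StringIO(">asv1\nTTAA\n>asv2\nCCGG\n")
joined = list(join_pairs(fwd, rev, spacer="-"))
print(joined)
```

Since the spacer is an argument, the stored files stay free of any inline alphabet constraint; each consumer decides how to represent the join.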

ebolyen · 2025-12-23T12:06:42Z

Looking at the corresponding PR in feature-classifier, I do think using PairedEndSequence is the way to go. It also makes the input type quite simple: FeatureData[Sequence | PairedEndSequence].

It also means we don't need to do anything too special in the R script here, since we can just write the de-duplicated forward and reverse sequences to their own inputs.

nbokulich · 2025-12-23T20:05:06Z

hey @A-meara , I discussed offline with @ebolyen and will summarize some key points here.

The NNNNNNNNNN linker is problematic because if someone exports their data outside of QIIME 2 the type information is lost (and along with it the contextual information that is key to understanding that this is an arbitrary linker). A single gap is less problematic, but could also be misinterpreted as representing an aligned sequence if the data are exported. For this reason we propose to use a single whitespace character " " as the spacer: it would be meaningless after exporting and would cause downstream tools to fail, but would still allow streamlined processing within QIIME 2 (e.g., in q2-feature-classifier we would want a spacer to separate sequence "words").

The type that @ebolyen suggested, FeatureData[PairedEndSequence], does not work because the output here will be a mixture of merged sequence pairs that do overlap, along with non-overlapping pairs that are rescued with the justConcatenate option. Hence the idea to place a spacer character where pairs are concatenated, so we have a way to store these in the same object with merged pairs but identify which sequences are concatenated.

However, this means that we really need to give the type a different name, since UnmergedPairs is not technically correct, as the output does also contain merged sequences. We are looking at something like a FeatureData[MergedAndConcatenatedSequencePairs]. Possibly a FeatureData[SequenceSalad] but that is probably not that much clearer, semantically speaking.
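A minimal Python sketch of how a consumer could tell the two kinds of record apart in such a mixed collection, assuming the single-whitespace spacer proposed above. The function name `split_record` and the example IDs are hypothetical.

```python
SPACER = " "  # the single whitespace spacer proposed above


def split_record(seq: str, spacer: str = SPACER):
    """Return (forward, reverse) for a concatenated pair, or (seq, None) for a merged read."""
    forward, sep, reverse = seq.partition(spacer)
    # `sep` is empty when the spacer is absent, i.e. the pair was merged.
    return (forward, reverse) if sep else (seq, None)


records = {
    "asv1": "ACGTACGTTTGA",   # merged pair: no spacer
    "asv2": "ACGTAC TTGACA",  # rescued pair: spacer marks the join
}
concatenated_ids = sorted(i for i, s in records.items() if SPACER in s)
print(concatenated_ids)
print({i: split_record(s) for i, s in records.items()})
```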

Regarding the spacer, this could probably be done by working with the data.frame output of the first pass of mergePairs. You are using the returnRejects = TRUE parameter, which reports unmerged pairs in the data.frame (correct?). Instead of doing a second pass of mergePairs with justConcatenate = TRUE, you could just find the unmerged pairs in the data.frame and concatenate them with the spacer of choice. For that matter, we could also consider exposing maxMismatch and other parameters to filter this table (to remove any sequence pairs that overlap but fail to merge due to mismatch issues), though that could be left as a future enhancement.
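The single-pass idea could look roughly like the following Python sketch. The real logic would live in the R script; here the merge table is modeled as a list of dicts whose keys (`sequence`, `accept`, `forward`, `reverse`) are assumptions loosely based on the mergePairs data.frame, with 0-based indices into hypothetical lists of denoised forward and reverse sequences. Whether the reverse read should be reverse-complemented before joining is a decision for the real implementation; this sketch does so.

```python
SPACER = " "
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def rescue_rejects(rows, fwd_seqs, rev_seqs, spacer=SPACER):
    """Keep accepted merges; join rejected pairs with the spacer instead."""
    out = []
    for row in rows:
        if row["accept"]:
            out.append(row["sequence"])
        else:
            # Look up the denoised reads for this pair (assumed 0-based indices).
            fwd = fwd_seqs[row["forward"]]
            rev = reverse_complement(rev_seqs[row["reverse"]])
            out.append(fwd + spacer + rev)
    return out


fwd_seqs = ["ACGT", "GGGA"]
rev_seqs = ["TGCA", "TTCC"]
rows = [
    {"sequence": "ACGTTGCA", "accept": True, "forward": 0, "reverse": 0},
    {"sequence": "", "accept": False, "forward": 1, "reverse": 1},
]
rescued = rescue_rejects(rows, fwd_seqs, rev_seqs)
print(rescued)
```

Filtering on mismatch counts, as suggested above, would be an extra condition on each rejected row before it is joined.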

Any thoughts on this? We still need to land on a good name for the type (which will be some flavor of FeatureData), but the spacer issue was clear to us after some discussion.

A-meara added 6 commits December 17, 2025 15:16

exposed concatenate option in denoise paired, which takes rejected merges and concats them with 10Ns in between

85fd7b1

changed output to UnmergedPairs with underlying FASTA format

ba072bd

working version - to be cleaned

902f6f1

incorporate test script into main script

78579ca

testing updates

17d9d52

adjusting tests for new error handling

c80f9b9

q2d2 added this to QIIME 2 - Triage 🚑 Dec 19, 2025

github-project-automation bot moved this to Needs Triage in QIIME 2 - Triage 🚑 Dec 19, 2025

A-meara mentioned this pull request Dec 20, 2025

Unmerged pairs handling qiime2/q2-feature-classifier#219

Open

formatting

25e1ec9

cherman2 moved this from Needs Triage to Awaiting Info in QIIME 2 - Triage 🚑 Jan 7, 2026

gregcaporaso assigned ebolyen and nbokulich Jan 8, 2026

colinvwood removed this from QIIME 2 - Triage 🚑 Jan 30, 2026

colinvwood added this to 2026.4 🌱 Jan 30, 2026

github-project-automation bot moved this to Backlog in 2026.4 🌱 Jan 30, 2026

colinvwood moved this from Backlog to In Development in 2026.4 🌱 Jan 30, 2026

ebolyen assigned colinvwood Mar 12, 2026

A-meara wants to merge 7 commits into qiime2:dev from A-meara:dev
