Skip to content

Unjoined pairs rescue#189

Open
A-meara wants to merge 7 commits intoqiime2:devfrom
A-meara:dev
Open

Unjoined pairs rescue#189
A-meara wants to merge 7 commits intoqiime2:devfrom
A-meara:dev

Conversation

@A-meara
Copy link
Copy Markdown

@A-meara A-meara commented Dec 19, 2025

Adds a --p-concat parameter to the denoise-paired command that rescues read pairs that fail to merge. When true, instead of discarding unmerged pairs, DADA2 concatenates forward and reverse reads with a 10-nucleotide spacer (NNNNNNNNNN) between them (but merges others as normal). This allows downstream kmer-based classifiers to generate kmers from both reads independently without creating artificial bridging kmers across the join. Outputs unmerged pairs as concatenated sequences, and returns them as new UnmergedPairs[Sequences] format which is digested by the classifier.

(rebased from latest version, will do the same for q2-feature-classifier)

@nbokulich
Copy link
Copy Markdown
Member

hi @A-meara , this is awesome thanks! a few small comments for discussion, and looks like you have some lint errors, please check. Once those tests pass, @ebolyen could you review the code or delegate when you get a chance? This can be slated for a release later in 2026 (i.e., not 2026.1, too soon).

  1. The new type would be a kind of FeatureData. Could we call this FeatureData[UnmergedDNASequence] or something like this? @ebolyen any thoughts on the semantics?
  2. instead of a 10N spacer, could we just insert a gap character? (-) The semantic type would ensure that this is handled appropriately, i.e., that the gap represents a gap of unknown length and that this is not an alignment. A 10N seems very unlikely to occur, usually this would be discarded anyway as a low-quality sequence in most workflows, but maybe this could crop up with OTU clustering or with some types of sequence data? Whereas a gap would never show up except in an alignment. Just trying to think of weird edge cases that could crop up in the future.

fixes #93
fixes #129
(effectively fixes both of these --> other steps needed to handle downstream, tbd in q2-feature-classifier, q2-kmerizer, q2-vsearch)

@A-meara
Copy link
Copy Markdown
Author

A-meara commented Dec 20, 2025

Hi @nbokulich, I will take a look at those errors. For:

  1. Certainly open to the best semantic naming we can come up with!
  2. If I recall correctly, it was easiest to use the Ns as they are supported by DNA formats already and the gap char is not. And if you take a look at the handling in the classifier (see 192) they are stripped and replaced with a blank space. So currently the only method that accepts this new semantic type automatically strips them when processing.

@ebolyen
Copy link
Copy Markdown
Member

ebolyen commented Dec 23, 2025

The new type would be a kind of FeatureData. Could we call this FeatureData[UnmergedDNASequence] or something like this? @ebolyen any thoughts on the semantics?

instead of a 10N spacer, could we just insert a gap character? (-) The semantic type would ensure that this is handled appropriately, i.e., that the gap represents a gap of unknown length and that this is not an alignment. A 10N seems very unlikely to occur, usually this would be discarded anyway as a low-quality sequence in most workflows, but maybe this could crop up with OTU clustering or with some types of sequence data? Whereas a gap would never show up except in an alignment. Just trying to think of weird edge cases that could crop up in the future.

Not to throw a wrench in anything, but there is a FeatureData[PairedEndSequence] type, which is not really any different in concept (it's just two fasta files in the same read-order).

It may be better to use this type as it will allow the downstream operation to decide how to represent the sequence without imposing any inline constraint on the alphabet. (e.g. a transformer is implemented which reads both records simultaneously and then puts a gap or spacer between them).

@ebolyen
Copy link
Copy Markdown
Member

ebolyen commented Dec 23, 2025

Looking at the corresponding PR in feature-classifier, I do think using PairedEndSequence is the way to go. It also makes the input type quite simple FeatureData[Sequence | PairedEndSequence].

It also means we don't need to do anything too special in the R script here, since we can just write the de-duplicated forwards and reverse sequences to their own inputs.

@nbokulich
Copy link
Copy Markdown
Member

hey @A-meara , I discussed offline with @ebolyen and will summarize some key points here.

The NNNNNNNNNN linker is problematic because if someone exports their data outside of QIIME 2 the type information is lost (and along with it the contextual information that is key to understanding that this is an arbitrary linker). A single gap is less problematic, but also could be misinterpreted that this represents an aligned sequence if the data are exported. For this reason we propose to use a single whitespace character " " as the spacer, as it would be meaningless after exporting and cause downstream tools to fail, but also allow streamlined processing within QIIME 2 (as in e.g., q2-feature-classifier we would want a spacer to separate sequence "words").

The type that @ebolyen suggested, FeatureData[PairedEndSequence], does not work because the output here will be a mixture of merged sequence pairs that do overlap, along with non-overlapping pairs that are rescued with the justConcatenate option. Hence the idea to place a spacer character where pairs are concatenated, so we have a way to store these in the same object with merged pairs but identify which sequences are concatenated.

However, this means that we really need to give the type a different name, since UnmergedPairs is not technically correct, as the output does also contain merged sequences. We are looking at something like a FeatureData[MergedAndConcatenatedSequencePairs]. Possibly a FeatureData[SequenceSalad] but that is probably not that much clearer, semantically speaking.

Regarding the spacer, this could probably be done by working with the data.frame output of the first pass of mergePairs. You are using the returnRejects = TRUE parameter, which reports unmerged pairs in the data.frame (correct?). Instead of doing a second pass of mergePairs with justConcatenate = True, you could just find the unmerged pairs in the data.frame and concatenate with the spacer of choice. For that matter, we could also consider exposing maxmismatch and other parameters to filter this table (to remove any sequence pairs that overlap but fail to merge due to mismatch issues), though that could be left as a future enhancement.

Any thoughts on this? We still need to land on a good name for the type (which will be some flavor of FeatureData), but the spacer issue was clear to us after some discussion.

@cherman2 cherman2 moved this from Needs Triage to Awaiting Info in QIIME 2 - Triage 🚑 Jan 7, 2026
@github-project-automation github-project-automation bot moved this to Backlog in 2026.4 🌱 Jan 30, 2026
@colinvwood colinvwood moved this from Backlog to In Development in 2026.4 🌱 Jan 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In Development

Development

Successfully merging this pull request may close these issues.

6 participants