…rges and concats them with 10Ns in between
|
Hi @A-meara, this is awesome, thanks! A few small comments for discussion, and it looks like you have some lint errors, so please check. Once those tests pass, @ebolyen could you review the code or delegate when you get a chance? This can be slated for a release later in 2026 (i.e., not 2026.1, which is too soon).
fixes #93
|
Hi @nbokulich, I will take a look at those errors. For:
|
Not to throw a wrench in anything, but there is a It may be better to use this type, as it will allow the downstream operation to decide how to represent the sequence without imposing any inline constraint on the alphabet (e.g., a transformer could be implemented that reads both records simultaneously and then puts a gap or spacer between them).
|
Looking at the corresponding PR in feature-classifier, I do think using It also means we don't need to do anything too special in the R script here, since we can just write the de-duplicated forward and reverse sequences to their own inputs.
|
Hey @A-meara, I discussed offline with @ebolyen and will summarize some key points here.

The NNNNNNNNNN linker is problematic because if someone exports their data outside of QIIME 2, the type information is lost (and along with it the contextual information that is key to understanding that this is an arbitrary linker). A single gap is less problematic, but could also be misinterpreted as representing an aligned sequence if the data are exported. For this reason we propose to use a single whitespace character " " as the spacer: it would be meaningless after exporting and would cause downstream tools to fail, but it would also allow streamlined processing within QIIME 2 (e.g., in q2-feature-classifier we would want a spacer to separate sequence "words").

The type that @ebolyen suggested, However, this means that we really need to give the type a different name, since Regarding the spacer, this could probably be done by working with the data.frame output of the first pass of

Any thoughts on this? We still need to land on a good name for the type (which will be some flavor of
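To make the whitespace-spacer proposal concrete, here is a minimal Python sketch. All names in it (`join_pair`, `split_words`, `is_plain_dna`) are hypothetical and are not taken from q2-dada2 or q2-feature-classifier, and the alphabet check is a simplified stand-in for whatever validation an external tool might do. It only illustrates the two properties discussed above: the record splits cleanly into "words", and a strict DNA check rejects it.

```python
# Hypothetical sketch of the proposed single-whitespace spacer.
# None of these helpers exist in q2-dada2 or q2-feature-classifier.

SPACER = " "                  # the proposed spacer character
DNA_ALPHABET = set("ACGTN")   # simplified alphabet, assumed for illustration


def join_pair(forward: str, reverse: str) -> str:
    """Join an unmerged read pair into one record separated by the spacer."""
    return forward + SPACER + reverse


def split_words(record: str) -> list[str]:
    """Recover the individual sequence "words" from a joined record."""
    return record.split(SPACER)


def is_plain_dna(record: str) -> bool:
    """Mimic a strict external tool: only plain DNA characters are accepted."""
    return set(record) <= DNA_ALPHABET


record = join_pair("ACGTAC", "TTGCA")
print(split_words(record))    # ['ACGTAC', 'TTGCA']
print(is_plain_dna(record))   # False: the spacer is not a DNA character
```

Under these assumptions, a tool that validates the alphabet fails loudly on exported data instead of silently treating the spacer as sequence.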
Adds a --p-concat parameter to the denoise-paired command that rescues read pairs that fail to merge. When true, instead of discarding unmerged pairs, DADA2 concatenates forward and reverse reads with a 10-nucleotide spacer (NNNNNNNNNN) between them (but merges others as normal). This allows downstream kmer-based classifiers to generate kmers from both reads independently without creating artificial bridging kmers across the join. Outputs unmerged pairs as concatenated sequences, and returns them in the new UnmergedPairs[Sequences] format, which is digested by the classifier. (Rebased from latest version, will do the same for q2-feature-classifier.)
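As a rough illustration of the "no bridging kmers" point in the description, the Python sketch below splits a concatenated sequence at the 10-N spacer and generates kmers from each read separately. The function names are hypothetical and this is not the q2-feature-classifier implementation. It also assumes the reads themselves contain no run of ten Ns, so the first such run is the spacer.

```python
# Hypothetical sketch: kmers from each read independently, never across the join.

SPACER = "N" * 10  # the 10-nucleotide spacer described above


def kmers(seq: str, k: int) -> list[str]:
    """Return all overlapping kmers of length k from seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmers_without_bridging(concatenated: str, k: int, spacer: str = SPACER) -> list[str]:
    """Split at the spacer and collect kmers from each side separately."""
    forward, found, reverse = concatenated.partition(spacer)
    if not found:
        # No spacer present: treat the input as an ordinary merged sequence.
        return kmers(concatenated, k)
    return kmers(forward, k) + kmers(reverse, k)


joined = "ACGTA" + SPACER + "GGCT"
print(kmers_without_bridging(joined, 3))
# ['ACG', 'CGT', 'GTA', 'GGC', 'GCT']
```

A naive sliding window over the same string would also emit kmers such as `TAN` and `NNG` that span the join, which is what this approach avoids.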