Allowing DNA and RNA in ProForma

This issue is based on a discussion at the PSI meeting last week.

## Problem
Adding DNA and RNA to ProForma would make it easier to write and report these molecules in MS experiments. For example, it would allow them to be used in spectral libraries stored in the mzSpecLib format (as precursors; if needed, mzPAF should be extended with DNA/RNA fragmentation naming).

At a high level DNA, RNA, and proteins are similar: all are built up from distinct building blocks in a linear chain with a defined backbone. So DNA could be written in ProForma like so: `ATGCTG[+34]A`. The basic sequence block would change from amino acid to base pair, and the modifications would come from a database of base pair modifications. This would still allow for all the complexity of ProForma (modifications of unknown position, labile, global, chimeric, cross-linking, etc.).

One issue that remains is telling DNA/RNA sequences apart from peptides, especially if both can be mixed in the same definition. This should be done with some kind of tag before what is now called the 'peptidoform'. One proposal is to use `<!DNA>`, `<!RNA>`, and `<!AA>`. Examples:
```
<!DNA>ATGTCATCGT
<!RNA>AUGUCAUCGU
<!DNA>ATCG[Formula:OH#XL1]T//<!AA>HYTGC[#XL1]R
```
To my eyes `<>` looks like a namespacy, global, meta tag thing, a bit like an HTML tag. 
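
To make the proposal a bit more concrete, here is a minimal Python sketch of how a reader could split the cross-linked example above on `//` and pick up the molecule type of each part. The function names `split_cross_linked` and `molecule_type` are made up for illustration, and the sketch assumes the tag is the very first thing in each part (so no global modifications in front of it); it is not taken from any existing ProForma parser.

```python
import re

# Hypothetical pattern for the proposed tag; only these three names are assumed.
TYPE_TAG = re.compile(r"^<!(DNA|RNA|AA)>")


def split_cross_linked(definition: str) -> list[str]:
    """Split on '//' but ignore any '//' nested inside [], () or {}."""
    parts, depth, start, i = [], 0, 0, 0
    while i < len(definition):
        ch = definition[i]
        if ch in "[({":
            depth += 1
        elif ch in "])}":
            depth -= 1
        elif depth == 0 and definition.startswith("//", i):
            parts.append(definition[start:i])
            i += 2
            start = i
            continue
        i += 1
    parts.append(definition[start:])
    return parts


def molecule_type(peptidoform: str) -> tuple[str, str]:
    """Return (type, remainder); a missing tag defaults to 'AA'."""
    match = TYPE_TAG.match(peptidoform)
    if match is None:
        return "AA", peptidoform
    return match.group(1), peptidoform[match.end():]


example = "<!DNA>ATCG[Formula:OH#XL1]T//<!AA>HYTGC[#XL1]R"
typed_parts = [molecule_type(part) for part in split_cross_linked(example)]
```

With the third example above, `typed_parts` holds one `("DNA", ...)` entry and one `("AA", ...)` entry, each with the tag stripped from the sequence.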

This tag is unambiguous and easy to parse, as the only things that currently start with `<` are global modifications and global isotope replacements. Global modifications `<[mod]@location>` are easily distinguished by the `[`, and global isotope replacements always use a number (except for `<D>`, but that is also easily recognised). I would propose to put this tag right before the peptidoform name tag, as in `<!DNA>(>My nice sequence 001)ATGCTAGT`, which means that any global modification comes before it. If the tag is missing the sequence defaults to a protein/peptide, which makes this addition easily backwards compatible.
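
The distinction described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the helper names `leading_groups` and `classify_angle_group` are hypothetical, isotope replacements are assumed to start with a digit (or be exactly `<D>`), and `>` is assumed never to occur at bracket depth zero inside a group.

```python
def leading_groups(definition: str) -> tuple[list[str], str]:
    """Collect the consecutive '<...>' groups at the start of a definition."""
    groups, pos = [], 0
    while pos < len(definition) and definition[pos] == "<":
        depth, end = 0, None
        for i in range(pos + 1, len(definition)):
            ch = definition[i]
            if ch == "[":
                depth += 1
            elif ch == "]":
                depth -= 1
            elif ch == ">" and depth == 0:
                end = i
                break
        if end is None:
            raise ValueError("unclosed '<' group")
        groups.append(definition[pos:end + 1])
        pos = end + 1
    return groups, definition[pos:]


def classify_angle_group(group: str) -> str:
    """Classify one '<...>' group by looking only at its first character."""
    inner = group[1:-1]
    if inner.startswith("!"):
        return "molecule_type"        # proposed tag, e.g. <!DNA>
    if inner.startswith("["):
        return "global_modification"  # e.g. <[mod]@location>
    if inner == "D" or inner[:1].isdigit():
        return "global_isotope"       # e.g. <13C>, <D>
    return "unknown"


groups, rest = leading_groups("<13C><!DNA>(>My nice sequence 001)ATGCTAGT")
kinds = [classify_angle_group(g) for g in groups]
```

Because the name tag starts with `(` rather than `<`, the scan stops right after `<!DNA>`, and a definition without any `<!...>` group simply yields no `molecule_type` entry, which a parser could treat as the protein/peptide default.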

## Open questions
* _Is there a good database for DNA/RNA modifications?_
  Joshua: RNA: https://genesilico.pl/modomics/
* _What are all the ambiguous base pair names we allow?_
  https://www.insdc.org/submitting-standards/feature-table/#7.4.3. The only catch is that this table only allows T, while also allowing U might be easier to read and understand.
* _How do we handle global modifications, as currently they only allow for amino acid locations?_
* _Is the tag `<!DNA>` good enough, or can we find a better one?_
* _What are better names for compound peptidoform ion/peptidoform ion/peptidoform now that the whole concept of peptides is left behind?_
* _What to do about double stranded DNA/RNA? Also what to do for nonstandard pairing, or modifications on the other strand?_
* _Do we need specific notation for backbone/base/linker localisation of modifications?_
  We do not have this for amino acids, although the SMILES could provide a start for this. 
* _Do we need to be able to differentiate 5'-to-3' and 3'-to-5' strands?_
* _A DNA strand with some RNA bases, and the other way around, does happen. How do we handle this?_
  Having a modification for these cases might just work.
* _Is there a need for a different system for modifications of the linker?_
  This might be handled by the modification database already.
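
For the ambiguous-name question above, a small Python sketch of the standard IUPAC nucleotide ambiguity codes may help the discussion. The table `IUPAC_DNA` and the function `expand` are hypothetical names, and letting U stand in for T in RNA is an assumption of this sketch, not something the INSDC table defines.

```python
# Standard IUPAC nucleotide ambiguity codes, written with T (DNA alphabet).
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT",
}


def expand(code: str, molecule: str = "DNA") -> str:
    """Return the bases a code can stand for; raises KeyError if not allowed.

    Assumption: for RNA, U is accepted in place of T and is also used
    in the returned set of bases.
    """
    if molecule == "RNA":
        lookup = "T" if code == "U" else code
        return IUPAC_DNA[lookup].replace("T", "U")
    return IUPAC_DNA[code]
```

For example, `expand("R")` gives `"AG"` and `expand("Y", "RNA")` gives `"CU"`, while `expand("U")` for DNA raises a `KeyError`.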

## Answered questions
* _Is it needed to mix peptide and base pair sequences in one definition?_
  Yes, apparently there are experiments conducted with DNA cross-linked to peptides. Allowing both to mix is then the most elegant way to support all of that complexity. (For example: 10.1016/j.cell.2025.04.037)
* _Is there a need to tell apart RNA and DNA or do we need a single type to represent both?_
  Yes, RNA has a different sugar in the backbone, in addition to the T versus U base difference.
* _Is there a better name for `Protein` in the tag, a bit smaller or better fitting in this context?_
  Joshua: `<!AA>`

## Nice figures
<figure>
<img src="https://github.com/user-attachments/assets/2281c874-338f-4535-b3ac-8131cdbf65b2" width="150">
<figcaption>10.1016/j.mcpro.2024.100742</figcaption>
</figure>
