Description
This issue is based on a discussion in the PSI meeting last week.
Problem
Adding DNA and RNA to ProForma would make it easier to write and report these molecules for MS experiments. For example, this would allow their use in spectral libraries stored in the mzSpecLib format (as precursors; if needed, mzPAF should be extended with DNA/RNA fragmentation naming). At a high level DNA, RNA, and proteins are similar: all are built up from distinct building blocks in a linear chain with a defined backbone. So writing DNA in ProForma could be done like so: ATGCTG[+34]A. The basic sequence block would change from amino acid to base pair, and the modifications would come from a database of base pair modifications. This would still allow for all the complexity of ProForma (modifications of unknown position, labile, global, chimeric, cross-linking, etc.). One issue that is left is telling DNA/RNA sequences apart from peptides, especially if both are allowed to be mixed in the same definition. This should be done with some kind of tag before what is now called the 'peptidoform'. One proposal is to use <!DNA>, <!RNA>, and <!AA>. Examples:
<!DNA>ATGTCATCGT
<!RNA>AUGUCAUCGU
<!DNA>ATCG[Formula:OH#XL1]T//<!AA>HYTGC[#XL1]R
To my eyes <> looks like a namespacy, global, meta tag thing, a bit like an HTML tag.
This tag is unambiguous and easy to parse, as the only things that can currently start with < are global modifications and global isotope replacements. Global modifications <[mod]@location> are easily distinguished by the [, and global isotope replacements always use a number (except for <D>, but that is also easily recognised). I would propose to add this tag right before the peptidoform name tag, as in <!DNA>(>My nice sequence 001)ATGCTAGT, which means that any global modification comes before it. If this tag is missing, the sequence defaults to a protein/peptide, which makes this addition backwards compatible.
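To illustrate how little a parser would need for this, here is a minimal Python sketch of the dispatch described above. It is not taken from any existing ProForma implementation: the function names (`split_leading_blocks`, `classify_angle_block`, `molecule_type`) are hypothetical, the tag set is just the three proposed values, and only the leading `<...>` blocks are inspected.

```python
import re

# The three tags proposed in this issue; anything else after "<!" is rejected.
MOLECULE_TAG = re.compile(r"<!(DNA|RNA|AA)>")


def split_leading_blocks(definition):
    """Split off all leading <...> blocks and return (blocks, rest)."""
    blocks = []
    pos = 0
    while pos < len(definition) and definition[pos] == "<":
        depth = 0  # square bracket depth, so a '>' inside [...] is ignored
        for i in range(pos, len(definition)):
            ch = definition[i]
            if ch == "[":
                depth += 1
            elif ch == "]":
                depth -= 1
            elif ch == ">" and depth == 0:
                blocks.append(definition[pos:i + 1])
                pos = i + 1
                break
        else:
            raise ValueError("unterminated '<' block")
    return blocks, definition[pos:]


def classify_angle_block(block):
    """Classify one <...> block by what follows the '<'."""
    if block.startswith("<!"):
        return "molecule_type"
    if block.startswith("<["):
        return "global_modification"
    return "global_isotope"  # e.g. <13C>, <15N>, <D>


def molecule_type(definition):
    """Return 'DNA', 'RNA' or 'AA'; a missing tag defaults to 'AA'."""
    blocks, _rest = split_leading_blocks(definition)
    for block in blocks:
        if classify_angle_block(block) == "molecule_type":
            match = MOLECULE_TAG.fullmatch(block)
            if match is None:
                raise ValueError(f"unknown molecule type tag: {block}")
            return match.group(1)
    return "AA"


print(molecule_type("<!DNA>(>My nice sequence 001)ATGCTAGT"))
print(molecule_type("<13C><!RNA>AUGUCAUCGU"))
print(molecule_type("HYTGCR"))
```

For the cross-linked example, each chain could be handled separately, for instance with `[molecule_type(chain) for chain in definition.split("//")]`. That naive split is only good enough for a sketch, since a real parser would have to ignore any `//` that appears inside a modification.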
Open questions
- Is there a good database for DNA/RNA modifications?
  Joshua: RNA: https://genesilico.pl/modomics/
- What are all the ambiguous base pair names we allow?
  https://www.insdc.org/submitting-standards/feature-table/#7.4.3, only thing is that in this table only T is allowed while allowing U might be easier to read and understand
- How do we handle global modifications, as currently they only allow for amino acid locations?
- Is the tag <!DNA> good enough, or can we find a better one?
- What are better names for compound peptidoform ion/peptidoform ion/peptidoform now that the whole concept of peptides is left behind?
- What to do about double stranded DNA/RNA? Also what to do for nonstandard pairing, or modifications on the other strand?
- Do we need specific notation for backbone/base/linker localisation of modifications?
  We do not have this for amino acids, although the SMILES could provide a start for this.
- Do we need to be able to differentiate 5'-to-3' and 3'-to-5' strands?
- Having a DNA strand with some RNA bases and the other way around does happen, how do we handle this?
  Having a modification for these cases might just work.
- Is there a need for a different system for modifications of the linker?
  This might be handled by the modification database already.
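On the question of ambiguous base names, the following Python sketch shows what a check could look like if the INSDC ambiguity codes were adopted. It rests on two assumptions that have not been decided in this issue: T is accepted only in DNA and U only in RNA, and the ambiguity codes written in terms of T are read as U for RNA. The names `allowed_symbols`, `expand` and `invalid_symbols` are hypothetical, and the input is a bare sequence with no modifications.

```python
# Ambiguity codes from the INSDC feature table (section 7.4.3), upper-cased.
INSDC_AMBIGUITY = {
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def allowed_symbols(molecule_type):
    """Symbols accepted for a bare sequence of the given type."""
    shared = set("ACG") | set(INSDC_AMBIGUITY)
    if molecule_type == "DNA":
        return shared | {"T"}
    if molecule_type == "RNA":
        return shared | {"U"}  # assumption: U replaces T for RNA
    raise ValueError(f"not a nucleic acid type: {molecule_type}")


def expand(symbol, molecule_type):
    """Return the concrete bases a symbol can stand for."""
    if symbol not in allowed_symbols(molecule_type):
        raise ValueError(f"{symbol} is not allowed in {molecule_type}")
    bases = INSDC_AMBIGUITY.get(symbol, symbol)
    # Assumption: the T in an ambiguity code is read as U for RNA.
    return bases.replace("T", "U") if molecule_type == "RNA" else bases


def invalid_symbols(sequence, molecule_type):
    """List (index, symbol) pairs that are not allowed for this type."""
    allowed = allowed_symbols(molecule_type)
    return [(i, s) for i, s in enumerate(sequence) if s not in allowed]


print(invalid_symbols("ATGTCATCGT", "DNA"))
print(invalid_symbols("AUG", "DNA"))
print(expand("W", "RNA"))
```

Note that D is both an ambiguity code here and the global isotope replacement <D>. The two do not collide, because one appears inside the sequence and the other inside a leading <...> block.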
Answered questions
- Is it needed to mix peptide and base pair sequences in one definition?
  Yes, apparently there are experiments conducted with DNA cross-linked to peptides. Allowing both to mix is then the most elegant way to allow all of that complexity. (For example: 10.1016/j.cell.2025.04.037)
- Is there a need to tell apart RNA and DNA or do we need a single type to represent both?
  Yes, RNA has a different sugar in the backbone, and there is the T and U base difference.
- Is there a better name for Protein in the tag, a bit smaller or better fitting in this context?
  Joshua: <!AA>
Nice figures
10.1016/j.mcpro.2024.100742