Description
@adf-ncgr said:
OK, I haven't actually gotten any farther on this, but it reminds me of an issue I wanted to raise in general. Our current marker specification is all about the markers in the context of a genome. This is useful, to be sure, but may be problematic in cases where markers are specified as flanking sequence and some of those sequences have not yet been anchored uniquely in a genome. For example (the ones that got me thinking about this when a user asked about alfalfa marker sequences):
https://alfalfatoolbox.org/filebrowser/download/175
https://alfalfatoolbox.org/filebrowser/download/176
Since mapping marker sequences to a genome "B" may not produce the same result as mapping marker sequences to genome "A" then projecting them to "B" by means of aligning A->B (especially for those that don't map to A in the first place), it may be useful to formalize a marker sequence fasta convention as well as the gff3 representation. We may already have some non-formal examples of this such as https://data.legumeinfo.org/Arachis/GENUS/markers/mixed.mrk.Axiom_Arachis_58K_SNP/
FWIW, I know that the cowpea group has been assiduously reviewing the mappings represented in https://data.legumeinfo.org/Vigna/unguiculata/markers/IT97K-499-35.gnm1.mrk.Cowpea1MSelectedSNPs/
using the flanking sequences from the chip design and will likely publish an updated version (though I'm pretty sure the ones I failed to find in the current mapping will still not be present, since I think they were not from the chip).
to which @cann0010 replied:
About formalizing "a marker sequence fasta convention as well as the gff3 representation": I agree that these should be accommodated, though I imagine this as an optional extra file type -- probably either just a standard fasta file with marker names as the fasta IDs, or with alleles specified with e.g. "[A/G]" at the variant site. I believe we have one such marker file in the DS currently (I am surprised we don't have more).
Phaseolus/vulgaris/genetic/mixed.gen.Blair_Cortés_2018/phavu.mixed.gen.Blair_Cortés_2018.flanking_seq.fna.gz
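To make the second option concrete, here is a minimal sketch of how a flanking-sequence fasta with bracketed alleles might be read. Everything in it is an assumption for illustration, not an agreed convention: the marker names, the example sequences, the restriction to biallelic "[X/Y]" sites, and the function names `parse_flanking_fasta` and `split_at_variant` are all hypothetical.

```python
import re

# Assumed notation: a single biallelic site written as "[A/G]" inside the sequence.
ALLELE_SITE = re.compile(r"\[([ACGTN-]+)/([ACGTN-]+)\]", re.IGNORECASE)


def split_at_variant(name, seq):
    """Split one sequence into (marker_id, left_flank, alleles, right_flank).

    Records without a bracketed site are returned whole, with alleles=None,
    which corresponds to the "plain fasta with marker names as IDs" option.
    """
    match = ALLELE_SITE.search(seq)
    if match is None:
        return (name, seq, None, "")
    alleles = (match.group(1), match.group(2))
    return (name, seq[:match.start()], alleles, seq[match.end():])


def parse_flanking_fasta(text):
    """Parse fasta text and return one tuple per record."""
    records = []
    name, chunks = None, []
    # A trailing bare ">" acts as a sentinel that flushes the last record.
    for line in text.splitlines() + [">"]:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records.append(split_at_variant(name, "".join(chunks)))
            # The marker ID is the first whitespace-delimited token of the header.
            name = line[1:].split()[0] if len(line) > 1 else None
            chunks = []
        elif line:
            chunks.append(line)
    return records


# Made-up example data, not taken from any real marker set.
EXAMPLE = """\
>marker_001 hypothetical SNP marker
ACGTAC[A/G]TTGCA
>marker_002 hypothetical marker without allele notation
GGGTTTAAACCC
"""

records = parse_flanking_fasta(EXAMPLE)
for record in records:
    print(record)
```

Keeping the left and right flanks separate from the alleles means the same file could feed a mapping step against any genome, since nothing in it depends on genomic coordinates.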
to which @adf-ncgr replied:
Thanks @cann0010 ! I was also a bit surprised we don't have more, although I did find a couple of others using some find-based guesswork:
./Cajanus/cajan/markers/mixed.mrk.1drZ/cajca.mixed.mrk.1drZ.cajan_v2_primers.txt.gz
./Arachis/GENUS/markers/mixed.mrk.Axiom_Arachis_58K_SNP/arachis.mixed.mrk.Axiom_Arachis_58K_SNP.flank_seq.tsv.gz
Note that the former is primer sequence pairs, similar to one of the alfalfa examples, whereas the arachis one is more like [A/G] at the variant site (although not as fasta).
Anyway, because this representation would be genome-independent, I'm imagining it would live separately from the marker gffs (as in the above examples, but probably under markers rather than genetic?); and then gff marker mapping files derived using it would be as we currently have them, probably with some explicit reference to the sequence collection in their README. I think we'll be getting some more alfalfa markers from the Breeding Insight group and would like to handle them in some similar way, so we can figure out a good protocol for dealing with mapping to the growing number of autotetraploid genomes.
to which @sammyjava replied:
Yeah, presumably under /markers/. Since these would be marker-only data, not specific to a strain or genome assembly, I'd think the collections would have a name like mixed.mrk.Blair_Cortés_2018 or mixed.mrk.Axiom_Arachis_58K.