Gapless assemblies in OpenGenome2 #46
-
Hi @garykbrixi and Evo2 team - thank you for your inspiring work. I was reading the section on OpenGenome2 and looking at the training data on Hugging Face. My first question is: am I correct that OpenGenome2 consists almost entirely, or entirely, of genetic sequence derived from short-read sequencing? And, as a corollary, are regions like centromeres, segmental duplications / low-copy repeats, telomeres, and so on presently absent from the training data? For instance, I think GRCh38 was used for humans - is this more or less true across the board? Final question - suppose you wanted to include gapless assemblies of various organisms. Right now, would that be challenging for Evo2, given that its training data doesn't include gapless assemblies derived from long-read sequencing? I would imagine a LoRA-based approach or fine-tuning of some kind would have poorer performance on both the old data and the newly introduced data - is that correct?
-
Hello, thanks for your interest! Yes, most of these NCBI genomes were constructed from short reads. Regions annotated as centromeres were excluded from the training data. We have not assessed the model on gapless assemblies, but I expect that Evo 2 should transfer well. Fine-tuning to include additional data should also be a reasonable approach. Note that our context extension uses a mixture of genomes and short, genic-focused data to maintain performance at both long and short context. This can be important; details of the data composition weights are in the methods and supplement of our preprint.
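For readers who want a concrete picture of what mixing long genome windows with short genic data could look like when fine-tuning on new assemblies, here is a minimal sketch. It is not the Evo 2 data pipeline: the function `sample_mixture`, the source names, the toy sequences, and the 0.7 / 0.3 weights are all hypothetical placeholders, and the actual composition weights are the ones given in the preprint.

```python
import random


def sample_mixture(sources, weights, n, seed=0):
    """Draw n (source_name, record) pairs, picking the source of each draw by weight."""
    rng = random.Random(seed)
    names = list(sources)
    # Choose which source each example comes from, proportionally to its weight.
    picks = rng.choices(names, weights=[weights[name] for name in names], k=n)
    # Then draw one record uniformly from the chosen source.
    return [(name, rng.choice(sources[name])) for name in picks]


# Toy stand-ins for real data (assumption: real inputs would be tokenized windows).
sources = {
    "long_genome_windows": ["ACGT" * 50, "TTGA" * 50],
    "short_genic_windows": ["ATGGCC", "ATGAAATAA"],
}
# Hypothetical weights, chosen only for illustration.
weights = {"long_genome_windows": 0.7, "short_genic_windows": 0.3}

batch = sample_mixture(sources, weights, n=1000)
long_fraction = sum(name == "long_genome_windows" for name, _ in batch) / len(batch)
print(f"fraction of long-context examples: {long_fraction:.2f}")
```

Keeping some short genic examples in the stream is the point of the sketch: a fine-tuning run made up only of long windows from new assemblies would have nothing anchoring short-context behavior.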
-
@garykbrixi thank you for your response! I'll tip my hand a bit. The reason I started thinking about this is the statement you made on the main page about your active interest in extending the context window. I think there is some literature suggesting that a graph-based format could be one tool for increasing the context window beyond the impressive length you've already achieved. I understand StripedHyena 2 in broad strokes, but I do not understand it well enough at this point to claim that those ideas might be integrable with the kind of multi-hybrid model you've designed. I'll close by saying that I fully appreciate that this kind of change would likely be substantive, and as such might honestly be more appropriate for Evo3 than Evo2. Having said that, I did want to hint at this to see if you had interest. I am at vlaufer at med dot umich dot edu or laufer at openchromatin dot org if you're interested, though I am sure you are inundated with questions and ideas. Thank you for your foundational (pun intended) work. Vincent