Gapless assemblies in OpenGenome2 #46

LauferVA · 2025-02-28T00:58:04Z

LauferVA
Feb 28, 2025

Hi @garykbrixi and Evo2 team - thank you for your inspiring work.

I was reading the section on OpenGenome2 and looking at the training data in Hugging Face.

My first question is: am I correct that OpenGenome2 consists almost entirely, or entirely, of genetic sequence derived from short-read sequencing? And, as a corollary, are regions like centromeres, segmental duplications / low copy repeats, telomeres, and so on presently absent from the training data?

For instance, I think GRCh38 was used for humans - is this more or less true across the board?

Final question - suppose you wanted to include gapless assemblies of various organisms. Right now, would that be challenging for Evo2, given that its training data doesn't include gapless assemblies derived from long-read sequencing?

I would imagine a LoRA-based approach or fine-tuning of some kind would have poorer performance for both the old data and the newly introduced data - is that correct?

Answered by garykbrixi

Mar 2, 2025


garykbrixi · 2025-03-02T20:41:04Z

garykbrixi
Mar 2, 2025
Maintainer

Hello, thanks for your interest!

Yeah, most of these NCBI genomes were constructed from short reads. Regions annotated as centromeres were excluded from the training data. We have not assessed the model on gapless assemblies, but I expect that Evo 2 should transfer well. Fine-tuning to include additional data should also be a reasonable approach.

Note that our context extension uses a mixture of genomes and short, genic-focused data to maintain performance at both long and short context. This can be important; details of the data composition weights are in the methods and supplement of our preprint.
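For readers who want a concrete picture of what mixing data sources by weight can look like, here is a minimal sketch. It is not the Evo 2 pipeline: the source names and the 50/50 weights are hypothetical placeholders, and the actual composition is the one given in the preprint.

```python
import random

# Hypothetical source names and weights, for illustration only.
# The real data composition is documented in the Evo 2 preprint, not here.
MIXTURE = {
    "long_genome_windows": 0.5,
    "short_genic_windows": 0.5,
}


def sample_sources(mixture, n, seed=0):
    """Draw n source names in proportion to their mixture weights."""
    rng = random.Random(seed)  # seeded so a run can be reproduced
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


# Decide which source each of the next 8 training examples comes from.
batch = sample_sources(MIXTURE, n=8)
```

When fine-tuning on new assemblies, the same idea would apply by adding a further entry to the mixture, so that earlier data types are still sampled alongside the new ones.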

1 reply

LauferVA Mar 6, 2025
Author

I posted a new reply below, but this effectively answers the question. I will close in a few days. Until then, hoping you see the comment below! Thanks.

LauferVA · 2025-03-06T01:28:37Z

LauferVA
Mar 6, 2025
Author

@garykbrixi thank you for your response!

I'll tip my hand a bit. The reason I started thinking about this is the statement you made on the main page about your active interest in extending the context window.

I think there is some literature suggesting that a graph-based format could be one tool for increasing the context window beyond the impressive length you've already achieved.

I understand StripedHyena 2 in broad strokes, but I do not yet understand it well enough to claim that those ideas could be integrated with the kind of multi-hybrid model you've designed.

I suppose I'll close by saying that I fully appreciate this kind of change would likely be substantive, and as such might honestly be more appropriate for Evo3 than Evo2. Having said that, I did want to hint at this to see if you had interest. I am at vlaufer at med dot umich dot edu or laufer at openchromatin dot org if you're interested, though I am sure you are inundated with questions and ideas.

Thank you for your foundational (pun intended) work.

Vincent

0 replies
