This could be very important if we are going to use held-out perplexity to guide model development.
Related approach for protein language models (pLMs) in the ESM2 paper:
> All train sequences which match a validation sequence with 50% sequence identity under this search are removed from the train set.
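A minimal sketch of this ESM2-style filter. The `identity` and `filter_train` helpers are hypothetical, and `difflib.SequenceMatcher` is only a crude stand-in for a real aligner (ESM2 uses a proper homology search); the point is just the shape of the rule: drop any train sequence whose best match to the validation set reaches the identity threshold.

```python
from difflib import SequenceMatcher

def identity(a: str, b: str) -> float:
    # Crude proxy for alignment identity: ratio of matching blocks
    # over total length. A real pipeline would compute this from an
    # actual alignment, not difflib.
    return SequenceMatcher(None, a, b).ratio()

def filter_train(train, valid, threshold=0.5):
    # Drop any train sequence that matches SOME validation sequence
    # at >= threshold identity (the ESM2-style criterion).
    return [t for t in train
            if all(identity(t, v) < threshold for v in valid)]

valid = ["ACGTACGTACGT"]
train = ["ACGTACGTACGT", "TTTTTTTTTTTT"]
clean = filter_train(train, valid)  # the exact duplicate is removed
```

In practice the all-pairs comparison is done with a clustering/search tool rather than a quadratic Python loop, but the accept/reject logic is the same.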
Related approach for gLMs:
> Finally, we removed any training sequence for which both coverage ≥ 5% and identity ≥ 30% in any alignment, yielding a leakage-free training corpus.
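The coverage-and-identity rule can be sketched as a post-processing pass over alignment hits. The record layout and the `leaky_ids` helper below are assumptions for illustration, not any particular tool's output format; the only substantive logic is that a train sequence is removed when any single alignment satisfies both thresholds simultaneously.

```python
# Hypothetical per-alignment records (train sequence vs. validation set).
hits = [
    {"train_id": "seq1", "coverage": 0.10, "identity": 0.45},  # both thresholds met
    {"train_id": "seq2", "coverage": 0.80, "identity": 0.10},  # identity too low
    {"train_id": "seq3", "coverage": 0.02, "identity": 0.90},  # coverage too low
]

def leaky_ids(hits, min_cov=0.05, min_id=0.30):
    # A train sequence leaks if ANY alignment to the validation set
    # satisfies both the coverage and the identity threshold.
    return {h["train_id"] for h in hits
            if h["coverage"] >= min_cov and h["identity"] >= min_id}

print(sorted(leaky_ids(hits)))  # ['seq1']
```

Note the conjunction: a near-exact but tiny alignment (seq3) and a long but dissimilar one (seq2) both survive; only seq1 is flagged.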
Some challenges in DNA:
- Genomic windows used for training typically lack a natural beginning and end; they are cut from the genome at arbitrary offsets. For example, the window [0, 512) and its shifted copy [256, 768) come from the same underlying sequence, so they share 100% sequence identity over 50% of their positions. We therefore need cutoffs for both coverage and identity, not identity alone.
- Repetitive elements should be handled with care: they can produce high-identity alignments between loci that are not otherwise related, inflating apparent leakage.
- Alignment statistics and thresholds should probably be reconsidered every time the context size is modified.
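To make the offset problem above concrete, here is a small sketch (the helper name and return format are hypothetical) computing the overlap statistics for two same-length windows taken from the same genome at different offsets:

```python
def window_overlap_stats(start_a: int, start_b: int, length: int) -> dict:
    # Two equal-length windows cut from the same genome at different
    # offsets: the overlapping region is identical by construction,
    # so the fraction of the window covered by the overlap is the
    # effective leakage.
    overlap = max(0, length - abs(start_a - start_b))
    return {
        "coverage": overlap / length,     # fraction of window positions shared
        "identity_in_overlap": 1.0,       # exact copy within the overlap
    }

# The example from the list above: [0, 512) vs [256, 768).
print(window_overlap_stats(0, 256, 512))
# -> {'coverage': 0.5, 'identity_in_overlap': 1.0}
```

This is why an identity-only threshold like ESM2's 50% does not transfer directly: here identity within the aligned region is always 100%, and only the coverage cutoff distinguishes a harmless small overlap from a near-duplicate window. It also shows why the thresholds need revisiting when the context size changes, since the same fixed-offset shift yields a different coverage fraction at a different window length.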