Use sequence similarity to do better train/test splits #28

@gonzalobenegas

Description

This could be very important if we are going to use held-out perplexity to guide model development.

Related approach for protein language models (pLMs) in ESM2 paper:

All train sequences which match a validation sequence with 50% sequence identity under this search are removed from the train set.
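As a minimal sketch of this kind of filter, assuming alignment hits against the validation set have already been computed (e.g. with an all-vs-all search tool such as MMseqs2) — the hit tuple format and sequence IDs here are hypothetical, and the 0.5 threshold follows the ESM2 description above:

```python
# Sketch: drop train sequences that hit any validation sequence at >= 50% identity.
# `hits` is a hypothetical list of (train_id, valid_id, identity) tuples,
# e.g. parsed from an all-vs-all similarity search.

def filter_train_ids(train_ids, hits, min_identity=0.5):
    # Collect train IDs that match any validation sequence too closely.
    contaminated = {t for t, _v, ident in hits if ident >= min_identity}
    return [t for t in train_ids if t not in contaminated]

train_ids = ["seq1", "seq2", "seq3"]
hits = [("seq2", "val7", 0.62), ("seq3", "val1", 0.31)]
print(filter_train_ids(train_ids, hits))  # ['seq1', 'seq3'] -- seq2 removed, seq3 kept (31% < 50%)
```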

Related approach for gLMs:

Finally, we removed any training sequence for which both coverage ≥ 5% and identity ≥ 30% in any alignment, yielding a leakage-free training corpus

Chao, Kuan-Hao, et al. "Predicting dynamic expression patterns in budding yeast with a fungal DNA language model." bioRxiv (2025)
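A sketch of the combined coverage-and-identity filter from the gLM approach above. The input rows loosely mimic BLAST/MMseqs2 tabular output (query, target, percent identity, alignment length); the exact columns and the `query_lengths` table are assumptions for illustration, and the 5% / 30% thresholds come from the quote:

```python
# Sketch: remove a train sequence if ANY alignment to the validation set has
# coverage >= 5% AND identity >= 30% (both conditions must hold, per the quote).

def leaky_queries(rows, query_lengths, min_cov=0.05, min_ident=30.0):
    leaky = set()
    for query, _target, pident, aln_len in rows:
        coverage = aln_len / query_lengths[query]  # fraction of the query aligned
        if coverage >= min_cov and pident >= min_ident:
            leaky.add(query)
    return leaky

rows = [
    ("trainA", "valX", 85.0, 100),  # cov 10%, ident 85% -> leaky
    ("trainB", "valY", 95.0, 20),   # cov 2% -> kept (coverage below cutoff)
    ("trainC", "valZ", 25.0, 400),  # ident 25% -> kept (identity below cutoff)
]
query_lengths = {"trainA": 1000, "trainB": 1000, "trainC": 1000}
print(sorted(leaky_queries(rows, query_lengths)))  # ['trainA']
```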

Some challenges in DNA:

  • Genomic windows typically used for training don't have a clear beginning and end, but are often created using arbitrary offsets. For example, the same sequence could be shifted [0, 512) -> [256, 768) and have 100% sequence identity for 50% of positions. We need to define cutoffs for both coverage and identity.
  • Repetitive elements should be handled with care.
  • Alignment statistics and thresholds should probably be reconsidered every time the context size is modified.
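The first challenge above can be made concrete with simple interval arithmetic: two windows cut from the same region at different offsets share bases exactly where their intervals overlap, so a train window can match a held-out window with 100% identity over only part of its length. The window coordinates are the example from the text:

```python
# Sketch: why arbitrary window offsets matter for leakage filtering.

def overlap_fraction(a, b):
    """Fraction of half-open window `a` covered by half-open window `b`."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return max(0, end - start) / (a[1] - a[0])

train_window = (0, 512)
valid_window = (256, 768)  # same underlying sequence, shifted by 256
print(overlap_fraction(train_window, valid_window))  # 0.5
# i.e. 100% identity over 50% of positions -- an identity-only cutoff at, say,
# 50% over the full window would miss this, hence the need for a coverage cutoff too.
```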
