Use sequence similarity to do better train/test splits #28

@gonzalobenegas

Description

This could be very important if we are going to use held-out perplexity to guide model development.

Related approach for protein language models (pLMs) in ESM2 paper:

All train sequences which match a validation sequence with 50% sequence identity under this search are removed from the train set.
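As a minimal sketch of this kind of filter, assuming alignment hits against the validation set have already been computed (e.g. with an all-vs-all search tool such as MMseqs2) — the hit tuple format and sequence IDs here are hypothetical, and the 0.5 threshold follows the ESM2 description above:

```python
# Sketch: drop train sequences that hit any validation sequence at >= 50% identity.
# `hits` is a hypothetical list of (train_id, valid_id, identity) tuples,
# e.g. parsed from an all-vs-all similarity search.

def filter_train_ids(train_ids, hits, min_identity=0.5):
    # Collect train IDs that match any validation sequence too closely.
    contaminated = {t for t, _v, ident in hits if ident >= min_identity}
    return [t for t in train_ids if t not in contaminated]

train_ids = ["seq1", "seq2", "seq3"]
hits = [("seq2", "val7", 0.62), ("seq3", "val1", 0.31)]
print(filter_train_ids(train_ids, hits))  # ['seq1', 'seq3'] -- seq2 removed, seq3 kept (31% < 50%)
```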

Related approach for gLMs:

Finally, we removed any training sequence for which both coverage ≥ 5% and identity ≥ 30% in any alignment, yielding a leakage-free training corpus

Chao, Kuan-Hao, et al. "Predicting dynamic expression patterns in budding yeast with a fungal DNA language model." bioRxiv (2025)
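A sketch of the combined coverage-and-identity filter from the gLM approach above. The input rows loosely mimic BLAST/MMseqs2 tabular output (query, target, percent identity, alignment length); the exact columns and the `query_lengths` table are assumptions for illustration, and the 5% / 30% thresholds come from the quote:

```python
# Sketch: remove a train sequence if ANY alignment to the validation set has
# coverage >= 5% AND identity >= 30% (both conditions must hold, per the quote).

def leaky_queries(rows, query_lengths, min_cov=0.05, min_ident=30.0):
    leaky = set()
    for query, _target, pident, aln_len in rows:
        coverage = aln_len / query_lengths[query]  # fraction of the query aligned
        if coverage >= min_cov and pident >= min_ident:
            leaky.add(query)
    return leaky

rows = [
    ("trainA", "valX", 85.0, 100),  # cov 10%, ident 85% -> leaky
    ("trainB", "valY", 95.0, 20),   # cov 2% -> kept (coverage below cutoff)
    ("trainC", "valZ", 25.0, 400),  # ident 25% -> kept (identity below cutoff)
]
query_lengths = {"trainA": 1000, "trainB": 1000, "trainC": 1000}
print(sorted(leaky_queries(rows, query_lengths)))  # ['trainA']
```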

Some challenges in DNA:

  • Genomic windows typically used for training don't have a clear beginning and end, but are often created using arbitrary offsets. For example, the same sequence could be shifted [0, 512) -> [256, 768) and have 100% sequence identity for 50% of positions. We need to define cutoffs for both coverage and identity.
  • Repetitive elements should be handled with care.
  • Alignment statistics and thresholds should probably be reconsidered every time the context size is modified.
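The first challenge above can be made concrete with simple interval arithmetic: two windows cut from the same region at different offsets share bases exactly where their intervals overlap, so a train window can match a held-out window with 100% identity over only part of its length. The window coordinates are the example from the text:

```python
# Sketch: why arbitrary window offsets matter for leakage filtering.

def overlap_fraction(a, b):
    """Fraction of half-open window `a` covered by half-open window `b`."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return max(0, end - start) / (a[1] - a[0])

train_window = (0, 512)
valid_window = (256, 768)  # same underlying sequence, shifted by 256
print(overlap_fraction(train_window, valid_window))  # 0.5
# i.e. 100% identity over 50% of positions -- an identity-only cutoff at, say,
# 50% over the full window would miss this, hence the need for a coverage cutoff too.
```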
