Description
Instead of character-level tokenization, try the k-mer tokenizers https://huggingface.co/bolinas-dna/tokenizer-4-mer and https://huggingface.co/bolinas-dna/tokenizer-8-mer. Train on the promoter dataset and evaluate on zero-shot variant effect prediction (VEP).
Hypothesis or Goal
Tokenization can influence both downstream task performance and training/inference speed. Character-level tokenization is a good default, but it is worth exploring additional options.
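
To make the speed side of the hypothesis concrete, here is a minimal sketch of how k-mer tokenization changes sequence length and vocabulary size relative to character-level tokenization. It assumes non-overlapping k-mers over the alphabet ACGT. The actual scheme used by the linked tokenizers (overlap, special tokens, handling of N) is not stated here and may differ. The function names are hypothetical and are not part of any library.

```python
def char_tokenize(seq):
    """Character-level tokenization: one token per nucleotide."""
    return list(seq.upper())


def kmer_tokenize(seq, k):
    """Split a sequence into non-overlapping k-mers (assumed scheme).

    The final token is shorter than k when len(seq) is not a multiple of k.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq), k)]


def kmer_vocab_size(k, alphabet="ACGT"):
    """Number of distinct full-length k-mers, ignoring special tokens."""
    return len(alphabet) ** k


seq = "ACGTACGTACGTACGT"  # toy 16 bp sequence, not real data
char_tokens = char_tokenize(seq)
four_mers = kmer_tokenize(seq, 4)
eight_mers = kmer_tokenize(seq, 8)

print(len(char_tokens), len(four_mers), len(eight_mers))
print(kmer_vocab_size(1), kmer_vocab_size(4), kmer_vocab_size(8))
```

Under this assumption the token count drops by a factor of k, which is where any speedup would come from, while the vocabulary grows as 4^k. A single-nucleotide variant also changes one whole k-mer token rather than one character, which may matter for zero-shot VEP scoring.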