Case Sensitivity of Models ? #166

eyes-robson · 2025-07-30T21:34:57Z


Doing some evals of Evo2 and noticed that the tokenizer is actually case-sensitive (e.g. 'TCG' and the repeat-masker-masked 'tcg' are NOT the same).

Obviously, this represents extra information that case-insensitive (i.e. all caps) models may not have access to, but I don't know if Evo2's training would make all caps inputs out-of-distribution...?

Does anyone know whether this materially impacts downstream inference? Or if there is a 'preferred' casing strategy?
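To make the observation concrete, here is a minimal sketch in plain Python. It assumes a byte-level tokenizer, modeled here as the ASCII value of each character; it does not call the Evo 2 tokenizer itself, and `encode_bytes` and `soft_masked_fraction` are hypothetical helper names.

```python
def encode_bytes(seq: str) -> list[int]:
    """Map each character to its ASCII byte value.

    Assumption: this stands in for a byte-level tokenizer; it is not the Evo 2 API.
    """
    return list(seq.encode("ascii"))


def soft_masked_fraction(seq: str) -> float:
    """Fraction of positions that are lowercase, i.e. soft-masked repeats."""
    if not seq:
        return 0.0
    return sum(ch.islower() for ch in seq) / len(seq)


upper_ids = encode_bytes("TCG")
lower_ids = encode_bytes("tcg")

# Uppercasing before encoding removes the repeat-mask signal from the input.
normalized_ids = encode_bytes("tcg".upper())

print(upper_ids, lower_ids, normalized_ids)
print(soft_masked_fraction("ACgtAC"))
```

Checking `soft_masked_fraction` on an eval set is a quick way to see how much the casing choice could matter for a given dataset.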


garykbrixi · 2025-07-30T23:27:58Z

Maintainer

Hi @eyes-robson, we used all-uppercase inputs to Evo 2 for every eval in the paper. Details on how casing is treated are in the Methods section:

For any base pair, the model is always tasked with predicting the uppercase character. For the first 3T tokens of pretraining, lowercase tokens are input to the model to add information on which portions of DNA are repetitive. This was done to further help learn different representations for interspersed repeats, which are very common in many eukaryotic genomes. For additional pretraining and for all midtraining, all inputs to the model are uppercase. Loss is masked on special tokens used to condition the model that we do not want to generate, including the stitch tokens ‘@’ and ‘#’, as well as the multi-token phylogenetic tags used during midtraining.
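The casing rule in that passage can be illustrated with a toy next-character setup. This is only a sketch under stated assumptions, not the actual training pipeline: `make_training_pair` and `keep_soft_mask` are hypothetical names, tokens are modeled as single characters, and the multi-token phylogenetic tags are not modeled.

```python
# Stitch tokens named in the Methods excerpt; loss is masked where they are the target.
SPECIAL_TOKENS = {"@", "#"}


def make_training_pair(seq: str, keep_soft_mask: bool):
    """Build (inputs, targets, loss_mask) for next-character prediction.

    keep_soft_mask=True mimics the early phase where lowercase (repeat-masked)
    characters are fed as input; False mimics the later all-uppercase phases.
    Targets are uppercase in both cases.
    """
    source = seq if keep_soft_mask else seq.upper()
    inputs = list(source[:-1])
    targets = [ch.upper() for ch in seq[1:]]
    loss_mask = [target not in SPECIAL_TOKENS for target in targets]
    return inputs, targets, loss_mask


early = make_training_pair("ACgt@AC", keep_soft_mask=True)
late = make_training_pair("ACgt@AC", keep_soft_mask=False)
print(early)
print(late)
```

In both phases the targets and loss mask are identical; only the inputs differ in whether the repeat-mask casing is preserved.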

2 replies

eyes-robson Jul 31, 2025
Author

Got it! Thanks for the elaboration, especially regarding outputs and the paper's internal eval settings.

However, I'm still unsure about one thing: Evo2_1B_base, Evo2_7B_base, and Evo2_7B all see fewer than 3T tokens (1T, 2.1T, and 2.4T tokens, respectively, per Tables 2 and 3). Since they never hit the 3T-token mark, does this mean they have only ever seen eukaryotic genomes with repeat masking present?

Alternatively, is this 3T figure a proportion of Evo2_40B's total tokens that is scaled down for these other models?

garykbrixi Jul 31, 2025
Maintainer

Yes. Evo2_7B saw all-uppercase inputs during the midtraining phase (the long-context extension, which is the final 0.3T tokens), while the 7B base and 1B base models did not, since they are below 3T tokens. Note that all models tie the weights of the embedding and the final projection layer.
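The bookkeeping in this exchange can be written out as a small sketch. The token counts are the ones quoted in the thread (in billions of tokens); the dictionary layout and the helper `saw_uppercase_only_inputs` are hypothetical, and the rule encoded here is only my reading of the answer above.

```python
# Token counts quoted in the thread, in billions of tokens.
CASE_SWITCH_B = 3000  # pretraining inputs switch to all uppercase after 3T tokens
TOTAL_TOKENS_B = {"Evo2_1B_base": 1000, "Evo2_7B_base": 2100, "Evo2_7B": 2400}
MIDTRAINING_B = {"Evo2_7B": 300}  # long-context extension, all-uppercase inputs


def saw_uppercase_only_inputs(model: str) -> bool:
    """True if some training phase of the model used all-uppercase inputs."""
    past_switch = TOTAL_TOKENS_B[model] > CASE_SWITCH_B
    had_midtraining = MIDTRAINING_B.get(model, 0) > 0
    return past_switch or had_midtraining


for name in TOTAL_TOKENS_B:
    print(name, saw_uppercase_only_inputs(name))

# Pretraining share of Evo2_7B: total minus midtraining.
evo2_7b_pretraining_b = TOTAL_TOKENS_B["Evo2_7B"] - MIDTRAINING_B["Evo2_7B"]
print(evo2_7b_pretraining_b)
```

The arithmetic is at least consistent with the figures quoted: 2.4T minus the final 0.3T leaves 2.1T, the same count given for the 7B base model.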

Answer selected by eyes-robson
