Advancing Codon Language Modeling with Synonymous Codon Constrained Masking

This repository contains code for using the model and for reproducing the results of the paper Advancing Codon Language Modeling with Synonymous Codon Constrained Masking.
Unlike other codon language models, SynCodonLM was trained with logit-level control: the logits of non-synonymous codons are masked before the softmax, so each masked codon is predicted only from its synonymous alternatives. This allowed the model to learn codon-specific patterns disentangled from protein-level semantics.
The pre-training dataset of 43 million coding sequences (CDS) is available on Hugging Face here.
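The masking idea described above can be illustrated with a small, standalone sketch. This is not the training code from this repository (see pretrain.py for that); it is a pure-Python toy that assumes a 64-codon vocabulary in standard genetic code order, and the function name `synonymous_softmax` is hypothetical.

```python
import math

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TO_AA = dict(zip(CODONS, AMINO_ACIDS))


def synonymous_softmax(logits, true_codon):
    """Softmax over 64 codon logits, restricted to synonyms of true_codon.

    Hypothetical helper for illustration only: logits of codons that encode
    a different amino acid are set to -inf, so they get probability zero.
    """
    target_aa = CODON_TO_AA[true_codon]
    masked = [
        logit if CODON_TO_AA[codon] == target_aa else -math.inf
        for codon, logit in zip(CODONS, logits)
    ]
    peak = max(masked)  # finite, because true_codon is its own synonym
    exps = [math.exp(x - peak) for x in masked]
    total = sum(exps)
    return {codon: e / total for codon, e in zip(CODONS, exps)}


# With uniform logits, the probability mass is split across the synonyms only.
uniform_logits = [0.0] * len(CODONS)
probs = synonymous_softmax(uniform_logits, "GGG")
glycine = {codon: p for codon, p in probs.items() if p > 0}
print(glycine)  # the four glycine codons, each with probability 0.25
```

Because the amino acid is fixed by the mask, the only thing left for the model to learn at a masked position is which synonymous codon is used, which is the disentanglement the paper describes.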

Installation

git clone https://github.com/Boehringer-Ingelheim/SynCodonLM.git
cd SynCodonLM
pip install -r requirements.txt # may not be necessary, depending on your environment

Usage

SynCodonLM uses token-type IDs to add species-specific codon context.

Before use, find the token type ID (species_token_type) for your species of interest here!

Or use our list of model organisms below.

Embedding a Coding DNA Sequence

from SynCodonLM import CodonEmbeddings

model = CodonEmbeddings() #this loads the model & tokenizer using our built-in functions

seq = 'ATGTCCACCGGGCGGTGA'

mean_pooled_embedding = model.get_mean_embedding(seq, species_token_type=30) #E. coli
#returns --> tensor of shape [768]

raw_output = model.get_raw_embeddings(seq, species_token_type=30) #E. coli
raw_embedding_final_layer = raw_output.hidden_states[-1] #treat this like a typical Hugging Face model dictionary based output!
#returns --> tensor of shape [batch size (1), sequence length, 768]
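Since the model operates on codons, it can help to confirm that a sequence is in frame before embedding it. The helper below is a hypothetical pre-check, not part of the SynCodonLM package; how the built-in tokenizer treats out-of-frame or ambiguous input is not documented here, so treat this as an optional safeguard.

```python
def split_into_codons(cds):
    """Return the codons of an in-frame coding sequence, or raise ValueError.

    Hypothetical helper: normalizes case, accepts RNA (U) as well as DNA (T),
    and rejects lengths that are not a multiple of 3 or non-ACGT characters.
    """
    cds = cds.strip().upper().replace("U", "T")
    if len(cds) % 3 != 0:
        raise ValueError(f"length {len(cds)} is not a multiple of 3")
    invalid = set(cds) - set("ACGT")
    if invalid:
        raise ValueError(f"unexpected characters: {sorted(invalid)}")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


codons = split_into_codons("ATGTCCACCGGGCGGTGA")
print(codons)  # ['ATG', 'TCC', 'ACC', 'GGG', 'CGG', 'TGA']
```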

Codon Optimizing a Protein Sequence

Codon optimization has not yet been rigorously evaluated, although we are confident it generates 'natural-looking' coding DNA sequences.

from SynCodonLM import CodonOptimizer

optimizer = CodonOptimizer() #this loads the model & tokenizer using our built-in functions

result = optimizer.optimize(
    protein_sequence="MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKRHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITLGMDELYK", #GFP 
    species_token_type=30, #E. coli
    deterministic=True #true by default
)
codon_optimized_sequence = result.sequence
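A quick sanity check on any optimized sequence is to translate it back and compare it with the input protein. The sketch below uses the standard genetic code and is independent of the SynCodonLM package; `translate` and `encodes_protein` are hypothetical helper names, and the comparison ignores a trailing stop codon because it is not stated here whether the optimizer appends one.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate an in-frame DNA sequence; unknown codons become 'X'."""
    cds = cds.upper()
    codons = (cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3))
    return "".join(CODON_TO_AA.get(codon, "X") for codon in codons)


def encodes_protein(cds, protein):
    """True if cds is in frame and translates to protein (trailing stop ignored)."""
    if len(cds) % 3 != 0:
        return False
    return translate(cds).rstrip("*") == protein.upper().rstrip("*")


print(translate("ATGTCCACCGGGCGGTGA"))  # MSTGR*
```

With the objects from the example above, the check would read `encodes_protein(result.sequence, protein_sequence)`, where `protein_sequence` is the string passed to `optimizer.optimize`.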

Embedding a Coding DNA Sequence Using our Model Trained without Token Type ID

from SynCodonLM import CodonEmbeddings

model = CodonEmbeddings(model_name='jheuschkel/SynCodonLM-V2-NoTokenType') #this loads the model & tokenizer using our built-in functions

seq = 'ATGTCCACCGGGCGGTGA'

mean_pooled_embedding = model.get_mean_embedding(seq)
#returns --> tensor of shape [768]

raw_output = model.get_raw_embeddings(seq)
raw_embedding_final_layer = raw_output.hidden_states[-1] #treat this like a typical Hugging Face model dictionary based output!
#returns --> tensor of shape [batch size (1), sequence length, 768]

Citation

If you use this work, please cite:

@article{10.1093/nar/gkag166,
    author = {Heuschkel, James and Kingsley, Laura and Pefaur, Noah and Nixon, Andrew and Cramer, Steven},
    title = {Advancing codon language modeling with synonymous codon constrained masking},
    journal = {Nucleic Acids Research},
    volume = {54},
    number = {5},
    pages = {gkag166},
    year = {2026},
    month = {02},
    abstract = {Codon language models offer a promising framework for modeling protein-coding DNA sequences, yet current approaches often conflate codon usage with amino acid semantics, limiting their ability to capture DNA-level biology. We introduce SynCodonLM, a codon language model that enforces a biologically grounded constraint: masked codons are only predicted from synonymous options, guided by the known protein sequence. This design disentangles codon-level from protein-level semantics, enabling the model to learn nucleotide-specific patterns. The constraint is implemented by masking non-synonymous codons from the prediction space prior to softmax. Unlike existing models, which cluster codons by amino acid identity, SynCodonLM clusters by nucleotide properties, revealing structure aligned with DNA-level biology. Furthermore, SynCodonLM outperforms existing models on six of seven benchmarks sensitive to DNA-level features, including messenger RNA and protein expression. Our approach advances domain-specific representation learning and opens avenues for sequence design in synthetic biology, as well as deeper insights into diverse bioprocesses.},
    issn = {1362-4962},
    doi = {10.1093/nar/gkag166},
    url = {https://doi.org/10.1093/nar/gkag166},
    eprint = {https://academic.oup.com/nar/article-pdf/54/5/gkag166/67103471/gkag166.pdf},
}

Model Organisms Species Token Type IDs

Organism	Token-Type ID
E. coli	30
S. cerevisiae	118
C. elegans	212
D. melanogaster	190
D. rerio	428
M. musculus	368
A. thaliana	258
H. sapiens	373
C. griseus	345
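For scripting, the table above can be kept as a small lookup. This is only a convenience sketch: the dictionary copies the IDs listed in the table, and the names `MODEL_ORGANISM_TOKEN_TYPES` and `species_token_type_for` are hypothetical, not part of the SynCodonLM package.

```python
# Token-type IDs copied from the model organism table above.
MODEL_ORGANISM_TOKEN_TYPES = {
    "E. coli": 30,
    "S. cerevisiae": 118,
    "C. elegans": 212,
    "D. melanogaster": 190,
    "D. rerio": 428,
    "M. musculus": 368,
    "A. thaliana": 258,
    "H. sapiens": 373,
    "C. griseus": 345,
}


def species_token_type_for(organism):
    """Look up a model organism's token-type ID, with a readable error."""
    try:
        return MODEL_ORGANISM_TOKEN_TYPES[organism]
    except KeyError:
        known = ", ".join(sorted(MODEL_ORGANISM_TOKEN_TYPES))
        raise KeyError(f"{organism!r} is not in the table; known: {known}") from None


print(species_token_type_for("E. coli"))  # 30
```

The returned value can then be passed as `species_token_type=` to `get_mean_embedding`, `get_raw_embeddings`, or `optimize`, as in the usage examples.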
