**TL;DR:** State-of-the-art paired encoder and decoder models (17M-1B parameters), trained identically on open data for a fair comparison. The encoders beat ModernBERT; the decoders beat Llama 3.2 and SmolLM2.
Paper | Model Collection | Training Data
This repository contains the first collection of paired encoder-only and decoder-only models trained with identical data, architecture, and training recipes. Ettin enables fair comparisons between encoder and decoder architectures across multiple scales, providing state-of-the-art performance for open-data models in their respective size categories.
### Installation

```bash
pip install "torch>=1.9.0"

# Encoders work with transformers>=4.48.0.
# Until the next PyPI release, install transformers from main to use the decoders
# (transformers>=4.54.x will include decoder support).
pip install git+https://github.com/huggingface/transformers.git
```
### 30-Second Examples
**Encoder for Classification/Embeddings:**
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/ettin-encoder-150m")
model = AutoModel.from_pretrained("jhu-clsp/ettin-encoder-150m")
# Example: Get embeddings
inputs = tokenizer("Hello world!", return_tensors="pt")
outputs = model(**inputs)
embeddings = outputs.last_hidden_state.mean(dim=1)
```

**Decoder for Text Generation:**

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/ettin-decoder-150m")
model = AutoModelForCausalLM.from_pretrained("jhu-clsp/ettin-decoder-150m")

# Example: generate text
inputs = tokenizer("The future of AI is", return_tensors="pt")
outputs = model.generate(**inputs, max_length=50, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### Encoder Models

| Size | Model | Parameters | Best For | Download |
|:-----|:------|:-----------|:---------|:---------|
| XXS | ettin-encoder-17m | 17M | Mobile/edge devices | [HF Hub](https://huggingface.co/jhu-clsp/ettin-encoder-17m) |
| XS | ettin-encoder-32m | 32M | Fast inference | [HF Hub](https://huggingface.co/jhu-clsp/ettin-encoder-32m) |
| Small | ettin-encoder-68m | 68M | Balanced performance | [HF Hub](https://huggingface.co/jhu-clsp/ettin-encoder-68m) |
| Base | ettin-encoder-150m | 150M | Standard use cases | [HF Hub](https://huggingface.co/jhu-clsp/ettin-encoder-150m) |
| Large | ettin-encoder-400m | 400M | High accuracy needs | [HF Hub](https://huggingface.co/jhu-clsp/ettin-encoder-400m) |
| XL | ettin-encoder-1b | 1B | Best performance | [HF Hub](https://huggingface.co/jhu-clsp/ettin-encoder-1b) |
### Decoder Models

| Size | Model | Parameters | Best For | Download |
|:-----|:------|:-----------|:---------|:---------|
| XXS | ettin-decoder-17m | 17M | Lightweight generation | [HF Hub](https://huggingface.co/jhu-clsp/ettin-decoder-17m) |
| XS | ettin-decoder-32m | 32M | Quick prototyping | [HF Hub](https://huggingface.co/jhu-clsp/ettin-decoder-32m) |
| Small | ettin-decoder-68m | 68M | Efficient generation | [HF Hub](https://huggingface.co/jhu-clsp/ettin-decoder-68m) |
| Base | ettin-decoder-150m | 150M | Standard generation | [HF Hub](https://huggingface.co/jhu-clsp/ettin-decoder-150m) |
| Large | ettin-decoder-400m | 400M | Quality generation | [HF Hub](https://huggingface.co/jhu-clsp/ettin-decoder-400m) |
| XL | ettin-decoder-1b | 1B | Best generation | [HF Hub](https://huggingface.co/jhu-clsp/ettin-decoder-1b) |
### Cross-Objective Models

These models demonstrate what happens when you continue training encoders as decoders (and vice versa). **Important:** load these models using the architecture they were converted to, not their original architecture.

Load these as encoders using `AutoModel` or `AutoModelForMaskedLM`:
| Size | Model | Parameters | Description | Download |
|:-----|:------|:-----------|:------------|:---------|
| XXS | ettin-encoder-from-decoder-17m | 17M | Decoder → MLM continued training | [HF Hub](https://huggingface.co/jhu-clsp/ettin-encoder-from-decoder-17m) |
| XS | ettin-encoder-from-decoder-32m | 32M | Decoder → MLM continued training | [HF Hub](https://huggingface.co/jhu-clsp/ettin-encoder-from-decoder-32m) |
| Small | ettin-encoder-from-decoder-68m | 68M | Decoder → MLM continued training | [HF Hub](https://huggingface.co/jhu-clsp/ettin-encoder-from-decoder-68m) |
| Base | ettin-encoder-from-decoder-150m | 150M | Decoder → MLM continued training | [HF Hub](https://huggingface.co/jhu-clsp/ettin-encoder-from-decoder-150m) |
| Large | ettin-encoder-from-decoder-400m | 400M | Decoder → MLM continued training | [HF Hub](https://huggingface.co/jhu-clsp/ettin-encoder-from-decoder-400m) |
| XL | ettin-encoder-from-decoder-1b | 1B | Decoder → MLM continued training | [HF Hub](https://huggingface.co/jhu-clsp/ettin-encoder-from-decoder-1b) |
Load these as decoders using `AutoModelForCausalLM`:
| Size | Model | Parameters | Description | Download |
|:-----|:------|:-----------|:------------|:---------|
| XXS | ettin-decoder-from-encoder-17m | 17M | Encoder → CLM continued training | [HF Hub](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-17m) |
| XS | ettin-decoder-from-encoder-32m | 32M | Encoder → CLM continued training | [HF Hub](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-32m) |
| Small | ettin-decoder-from-encoder-68m | 68M | Encoder → CLM continued training | [HF Hub](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-68m) |
| Base | ettin-decoder-from-encoder-150m | 150M | Encoder → CLM continued training | [HF Hub](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-150m) |
| Large | ettin-decoder-from-encoder-400m | 400M | Encoder → CLM continued training | [HF Hub](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-400m) |
| XL | ettin-decoder-from-encoder-1b | 1B | Encoder → CLM continued training | [HF Hub](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-1b) |
Beyond the final models, we provide access to intermediate training checkpoints for research and analysis purposes. All raw training checkpoints are available in the jhu-clsp/ettin-checkpoints dataset.
Each model repository contains multiple tagged versions representing different training stages:
- `step{number}` - Pretraining phase checkpoints (e.g., `step599525`, `step596528`)
- `ext{number}` - Extension/mid-training phase checkpoints (e.g., `ext1000`, `ext2000`)
- `decay{number}` - Decay phase checkpoints (e.g., `decay100`, `decay500`)
```python
from transformers import AutoModelForCausalLM

# Load a specific pretraining checkpoint
model = AutoModelForCausalLM.from_pretrained(
    "jhu-clsp/ettin-decoder-400m",
    revision="step590532",  # specific checkpoint tag
)
```
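To see which checkpoint tags a repository actually exposes, one option is to list its git refs with `huggingface_hub`; a minimal sketch, assuming the tags follow the `step`/`ext`/`decay` scheme above:

```python
from huggingface_hub import list_repo_refs

# List the git tags on the model repo; checkpoint tags use the
# step{number}/ext{number}/decay{number} naming scheme described above.
refs = list_repo_refs("jhu-clsp/ettin-decoder-400m")
checkpoint_tags = sorted(tag.name for tag in refs.tags)
print(checkpoint_tags[:10])
```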
The complete training dataset is publicly available:
- Pre-training Data: jhu-clsp/ettin-pretraining-data - 1.7T tokens
- Mid-training Data: jhu-clsp/ettin-extension-data - 250B tokens
- Decay Phase Data: jhu-clsp/ettin-decay-data - 100B tokens
- Training Order: jhu-clsp/ettin-data-order - Batch-level training order
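For a quick look at the data itself, the `datasets` library can stream records without downloading a full split; a minimal sketch, assuming a standard `train` split (check the dataset card for the exact configuration and fields):

```python
from datasets import load_dataset

# Stream the pretraining mixture instead of materializing 1.7T tokens locally.
dataset = load_dataset("jhu-clsp/ettin-pretraining-data", split="train", streaming=True)

# Peek at the first record to see which fields are available.
first_example = next(iter(dataset))
print(first_example.keys())
```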
**Encoder: Masked Language Modeling**

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Load the MLM model
tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/ettin-encoder-150m")
model = AutoModelForMaskedLM.from_pretrained("jhu-clsp/ettin-encoder-150m")

def predict_masked_token(text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Get predictions for the [MASK] tokens
    mask_indices = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)
    predictions = outputs.logits[mask_indices]

    # Get the top 5 predictions for the first masked position
    top_tokens = torch.topk(predictions, 5, dim=-1)
    return [tokenizer.decode(token) for token in top_tokens.indices[0]]

# Example
masked_text = "The capital of France is [MASK]."
predictions = predict_masked_token(masked_text)
print(f"Predictions: {predictions}")
```
**Decoder: Text Generation**

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/ettin-decoder-150m")
model = AutoModelForCausalLM.from_pretrained("jhu-clsp/ettin-decoder-150m")

# Set the pad token if needed
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def generate_text(prompt, max_length=100, temperature=0.7):
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model.generate(
            inputs.input_ids,
            max_length=max_length,
            temperature=temperature,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
            num_return_sequences=1,
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example usage
prompt = "The future of artificial intelligence is"
generated = generate_text(prompt)
print(generated)
```
**Cross-Objective Models**

```python
from transformers import AutoTokenizer, AutoModel, AutoModelForCausalLM

# Encoder-from-decoder: load as an encoder
tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/ettin-encoder-from-decoder-150m")
model = AutoModel.from_pretrained("jhu-clsp/ettin-encoder-from-decoder-150m")

# Decoder-from-encoder: load as a decoder
tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/ettin-decoder-from-encoder-150m")
model = AutoModelForCausalLM.from_pretrained("jhu-clsp/ettin-decoder-from-encoder-150m")
```
For details on model pre-training, data preparation, and training recipes:
- Pre-training Guide - Complete training setup, data mixture, and ModernBERT recipe adaptation
- Encoder on Generative Tasks - Evaluating encoders on language modeling tasks using our lm-evaluation-harness fork
- Encoder Retrieval Training - Fine-tuning on MS MARCO and evaluation on MTEB v2 English
- GLUE Evaluation - Comprehensive GLUE benchmark evaluation with fine-tuning scripts
- Decoder on Generative Tasks - Comprehensive generative task evaluation using the EleutherAI evaluation harness (commit `867413f8677f00f6a817262727cbb041bf36192a`)
- Gender Bias Evaluation - Comprehensive gender bias testing using Winogender "gotcha" examples, which test how well models handle counter-stereotypical pronouns in occupational contexts. Supports both encoder (MLM) and decoder (perplexity) evaluation methods; a sketch of the decoder-side check follows the setup commands below.
```bash
# Clone the specific commit of lm-evaluation-harness
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout 867413f8677f00f6a817262727cbb041bf36192a
pip install -e .

# Run the evaluation on an Ettin decoder
lm_eval --model hf \
    --model_args pretrained=jhu-clsp/ettin-decoder-150m \
    --tasks hellaswag,arc_easy,arc_challenge,winogrande \
    --device cuda:0 \
    --batch_size 8
```
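For the decoder-side (perplexity) bias check mentioned above, a minimal sketch is to score two pronoun variants of an occupational sentence and compare their language-modeling losses. The sentences and helper below are illustrative only, not the actual Winogender data or evaluation script:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/ettin-decoder-150m")
model = AutoModelForCausalLM.from_pretrained("jhu-clsp/ettin-decoder-150m")
model.eval()

def sentence_loss(text):
    # Mean causal LM loss over the sentence; lower = more likely under the model.
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
    return outputs.loss.item()

# Illustrative pronoun variants (not taken from the Winogender dataset)
print(sentence_loss("The nurse said that she would be late."))
print(sentence_loss("The nurse said that he would be late."))
```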
Ettin provides the first controlled comparison of encoder vs. decoder architectures:
- Identical Training Data: Same 2T token mixture across all models
- Matched Architectures: Only attention patterns and objectives differ
- Open Everything: Training data, model weights, and batch-level training order
- Multiple Scales: Fair comparison from 17M to 1B parameters
- 250+ Checkpoints: Complete training trajectory analysis
| Parameter | 17M | 32M | 68M | 150M | 400M | 1B |
|:----------|:----|:----|:----|:-----|:-----|:---|
| Layers | 7 | 10 | 19 | 22 | 28 | 28 |
| Hidden Size | 256 | 384 | 512 | 768 | 1024 | 1792 |
| Intermediate Size | 384 | 576 | 768 | 1152 | 2624 | 3840 |
| Attention Heads | 4 | 6 | 8 | 12 | 16 | 28 |
Data: High-quality mixture including DCLM, Dolma v1.7, scientific papers, code, and curated sources totaling 2T+ tokens
Architecture Features:
- Transformer with RoPE, GLU activations, and prenorm layers
- Context length: Up to 8K tokens
- Vocabulary: 50,368 tokens (ModernBERT tokenizer)
- Deep but efficient architectures following MobileLLM principles
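These settings can be confirmed for any size by inspecting the config and tokenizer; a minimal sketch, assuming the standard `transformers` config field names apply to these models:

```python
from transformers import AutoConfig, AutoTokenizer

# Only the config and tokenizer files are fetched, not the model weights.
config = AutoConfig.from_pretrained("jhu-clsp/ettin-encoder-150m")
tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/ettin-encoder-150m")

print("layers:", config.num_hidden_layers)
print("hidden size:", config.hidden_size)
print("attention heads:", config.num_attention_heads)
print("vocab size:", len(tokenizer))
```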
Training Phases:
- Pre-training: 1.7T tokens with diverse data mixture
- Mid-training: 250B tokens with higher-quality filtered data and context extension to 8K
- Decay phase: 50B tokens with premium data sources
Q: I'm getting an error that ModernBERT-decoder isn't found. A: Make sure you have the latest version of transformers installed:
```bash
# For the latest version until the official PyPI release:
pip install git+https://github.com/huggingface/transformers.git
```
Q: Which model should I choose for my task? A:
- Classification/Retrieval/Understanding: Use encoder models
- Text Generation/Chat/Completion: Use decoder models
- Research on cross-training: Use cross-objective models
- Size selection: Start with 150M for experimentation, scale up to 400M or 1B for production
Q: How do I access training checkpoints?
A: Each model has multiple git tags for different training stages. Use the `revision` parameter:

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("jhu-clsp/ettin-encoder-150m", revision="step500000")
```
Q: Can I continue training these models? A: Yes! We provide raw checkpoints in the jhu-clsp/ettin-checkpoints dataset that can be loaded into training frameworks.
Q: What's the difference between cross-objective models and regular models? A: Cross-objective models started as one architecture (e.g., decoder) and were continued with a different objective (e.g., MLM). They demonstrate the limitations of cross-training and generally underperform native models.
Q: How do I reproduce the paper results? A: See the evaluation guides linked above (retrieval, GLUE, generative tasks, and bias).
If you use Ettin models in your research, please cite our work:
```bibtex
@misc{weller2025seqvsseqopen,
      title={Seq vs Seq: An Open Suite of Paired Encoders and Decoders},
      author={Orion Weller and Kathryn Ricci and Marc Marone and Antoine Chaffin and Dawn Lawrie and Benjamin Van Durme},
      year={2025},
      eprint={2507.11412},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.11412},
}
```
This project is licensed under the MIT License - see the LICENSE file for details.
Contact: For questions about the models or research, please open an issue or contact the authors.