**TL;DR:** State-of-the-art paired encoder and decoder models (17M-1B parameters), trained identically on open data for a fair comparison. The encoders beat ModernBERT; the decoders beat Llama 3.2 and SmolLM2.
Paper | Model Collection | Training Data
This repository contains the first collection of paired encoder-only and decoder-only models trained with identical data, architecture, and training recipes. Ettin enables fair comparisons between encoder and decoder architectures across multiple scales, providing state-of-the-art performance for open-data models in their respective size categories.
### Installation

```bash
pip install "torch>=1.9.0"

# Encoders work with transformers>=4.48.0.
# Until the next PyPI release, install transformers from main to use the decoders
# (transformers>=4.54.x will include decoder support).
pip install git+https://github.com/huggingface/transformers.git
```
### 30-Second Examples
**Encoder for Classification/Embeddings:**
```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/ettin-encoder-150m")
model = AutoModel.from_pretrained("jhu-clsp/ettin-encoder-150m")
# Example: Get embeddings
inputs = tokenizer("Hello world!", return_tensors="pt")
outputs = model(**inputs)
embeddings = outputs.last_hidden_state.mean(dim=1)
```

**Decoder for Text Generation:**

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/ettin-decoder-150m")
model = AutoModelForCausalLM.from_pretrained("jhu-clsp/ettin-decoder-150m")

# Example: generate text
inputs = tokenizer("The future of AI is", return_tensors="pt")
outputs = model.generate(**inputs, max_length=50, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### Encoder Models

| Size | Model | Parameters | Best For | Download |
|:-----|:------|:-----------|:---------|:---------|
| XXS | ettin-encoder-17m | 17M | Mobile/edge devices | [HF Hub](https://huggingface.co/jhu-clsp/ettin-encoder-17m) |
| XS | ettin-encoder-32m | 32M | Fast inference | [HF Hub](https://huggingface.co/jhu-clsp/ettin-encoder-32m) |
| Small | ettin-encoder-68m | 68M | Balanced performance | [HF Hub](https://huggingface.co/jhu-clsp/ettin-encoder-68m) |
| Base | ettin-encoder-150m | 150M | Standard use cases | [HF Hub](https://huggingface.co/jhu-clsp/ettin-encoder-150m) |
| Large | ettin-encoder-400m | 400M | High accuracy needs | [HF Hub](https://huggingface.co/jhu-clsp/ettin-encoder-400m) |
| XL | ettin-encoder-1b | 1B | Best performance | [HF Hub](https://huggingface.co/jhu-clsp/ettin-encoder-1b) |
### Decoder Models

| Size | Model | Parameters | Best For | Download |
|:-----|:------|:-----------|:---------|:---------|
| XXS | ettin-decoder-17m | 17M | Lightweight generation | [HF Hub](https://huggingface.co/jhu-clsp/ettin-decoder-17m) |
| XS | ettin-decoder-32m | 32M | Quick prototyping | [HF Hub](https://huggingface.co/jhu-clsp/ettin-decoder-32m) |
| Small | ettin-decoder-68m | 68M | Efficient generation | [HF Hub](https://huggingface.co/jhu-clsp/ettin-decoder-68m) |
| Base | ettin-decoder-150m | 150M | Standard generation | [HF Hub](https://huggingface.co/jhu-clsp/ettin-decoder-150m) |
| Large | ettin-decoder-400m | 400M | Quality generation | [HF Hub](https://huggingface.co/jhu-clsp/ettin-decoder-400m) |
| XL | ettin-decoder-1b | 1B | Best generation | [HF Hub](https://huggingface.co/jhu-clsp/ettin-decoder-1b) |
### Cross-Objective Models

These models demonstrate what happens when you continue training encoders as decoders (and vice versa). **Important:** load these models using the architecture they were converted to, not their original architecture.

Load these as encoders using `AutoModel` or `AutoModelForMaskedLM`:
| Size | Model | Parameters | Description | Download |
|:-----|:------|:-----------|:------------|:---------|
| XXS | ettin-encoder-from-decoder-17m | 17M | Decoder → MLM continued training | [HF Hub](https://huggingface.co/jhu-clsp/ettin-encoder-from-decoder-17m) |
| XS | ettin-encoder-from-decoder-32m | 32M | Decoder → MLM continued training | [HF Hub](https://huggingface.co/jhu-clsp/ettin-encoder-from-decoder-32m) |
| Small | ettin-encoder-from-decoder-68m | 68M | Decoder → MLM continued training | [HF Hub](https://huggingface.co/jhu-clsp/ettin-encoder-from-decoder-68m) |
| Base | ettin-encoder-from-decoder-150m | 150M | Decoder → MLM continued training | [HF Hub](https://huggingface.co/jhu-clsp/ettin-encoder-from-decoder-150m) |
| Large | ettin-encoder-from-decoder-400m | 400M | Decoder → MLM continued training | [HF Hub](https://huggingface.co/jhu-clsp/ettin-encoder-from-decoder-400m) |
| XL | ettin-encoder-from-decoder-1b | 1B | Decoder → MLM continued training | [HF Hub](https://huggingface.co/jhu-clsp/ettin-encoder-from-decoder-1b) |
Load these as decoders using `AutoModelForCausalLM`:
| Size | Model | Parameters | Description | Download |
|:-----|:------|:-----------|:------------|:---------|
| XXS | ettin-decoder-from-encoder-17m | 17M | Encoder → CLM continued training | [HF Hub](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-17m) |
| XS | ettin-decoder-from-encoder-32m | 32M | Encoder → CLM continued training | [HF Hub](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-32m) |
| Small | ettin-decoder-from-encoder-68m | 68M | Encoder → CLM continued training | [HF Hub](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-68m) |
| Base | ettin-decoder-from-encoder-150m | 150M | Encoder → CLM continued training | [HF Hub](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-150m) |
| Large | ettin-decoder-from-encoder-400m | 400M | Encoder → CLM continued training | [HF Hub](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-400m) |
| XL | ettin-decoder-from-encoder-1b | 1B | Encoder → CLM continued training | [HF Hub](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-1b) |
Beyond the final models, we provide access to intermediate training checkpoints for research and analysis purposes. All raw training checkpoints are available in the jhu-clsp/ettin-checkpoints dataset.
Each model repository contains multiple tagged versions representing different training stages:
- `step{number}` - Pretraining phase checkpoints (e.g., `step599525`, `step596528`)
- `ext{number}` - Extension/mid-training phase checkpoints (e.g., `ext1000`, `ext2000`)
- `decay{number}` - Decay phase checkpoints (e.g., `decay100`, `decay500`)
```python
from transformers import AutoModelForCausalLM

# Load a specific pretraining checkpoint
model = AutoModelForCausalLM.from_pretrained(
    "jhu-clsp/ettin-decoder-400m",
    revision="step590532",  # specific checkpoint tag
)
```
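To see which checkpoint tags a repository actually exposes, one option is to list its git refs with `huggingface_hub`; a minimal sketch, assuming the tags follow the `step`/`ext`/`decay` scheme above:

```python
from huggingface_hub import list_repo_refs

# List the git tags on the model repo; checkpoint tags use the
# step{number}/ext{number}/decay{number} naming scheme described above.
refs = list_repo_refs("jhu-clsp/ettin-decoder-400m")
checkpoint_tags = sorted(tag.name for tag in refs.tags)
print(checkpoint_tags[:10])
```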
The complete training dataset is publicly available:
- Pre-training Data: jhu-clsp/ettin-pretraining-data - 1.7T tokens
- Mid-training Data: jhu-clsp/ettin-extension-data - 250B tokens
- Decay Phase Data: jhu-clsp/ettin-decay-data - 100B tokens
- Training Order: jhu-clsp/ettin-data-order - Batch-level training order
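For a quick look at the data itself, the `datasets` library can stream records without downloading a full split; a minimal sketch, assuming a standard `train` split (check the dataset card for the exact configuration and fields):

```python
from datasets import load_dataset

# Stream the pretraining mixture instead of materializing 1.7T tokens locally.
dataset = load_dataset("jhu-clsp/ettin-pretraining-data", split="train", streaming=True)

# Peek at the first record to see which fields are available.
first_example = next(iter(dataset))
print(first_example.keys())
```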
**Encoder: Masked Language Modeling**

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Load the MLM model
tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/ettin-encoder-150m")
model = AutoModelForMaskedLM.from_pretrained("jhu-clsp/ettin-encoder-150m")

def predict_masked_token(text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Get predictions for the [MASK] tokens
    mask_indices = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)
    predictions = outputs.logits[mask_indices]

    # Get the top 5 predictions for the first masked position
    top_tokens = torch.topk(predictions, 5, dim=-1)
    return [tokenizer.decode(token) for token in top_tokens.indices[0]]

# Example
masked_text = "The capital of France is [MASK]."
predictions = predict_masked_token(masked_text)
print(f"Predictions: {predictions}")
```
**Decoder: Text Generation**

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/ettin-decoder-150m")
model = AutoModelForCausalLM.from_pretrained("jhu-clsp/ettin-decoder-150m")

# Set the pad token if needed
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def generate_text(prompt, max_length=100, temperature=0.7):
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model.generate(
            inputs.input_ids,
            max_length=max_length,
            temperature=temperature,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
            num_return_sequences=1,
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example usage
prompt = "The future of artificial intelligence is"
generated = generate_text(prompt)
print(generated)
```
**Cross-Objective Models**

```python
from transformers import AutoTokenizer, AutoModel, AutoModelForCausalLM

# Encoder-from-decoder: load as an encoder
tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/ettin-encoder-from-decoder-150m")
model = AutoModel.from_pretrained("jhu-clsp/ettin-encoder-from-decoder-150m")

# Decoder-from-encoder: load as a decoder
tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/ettin-decoder-from-encoder-150m")
model = AutoModelForCausalLM.from_pretrained("jhu-clsp/ettin-decoder-from-encoder-150m")
```
For details on model pre-training, data preparation, and training recipes:
- Pre-training Guide - Complete training setup, data mixture, and ModernBERT recipe adaptation
- Encoder on Generative Tasks - Evaluating encoders on language modeling tasks using our lm-evaluation-harness fork
- Encoder Retrieval Training - Fine-tuning on MS MARCO and evaluation on MTEB v2 English
- GLUE Evaluation - Comprehensive GLUE benchmark evaluation with fine-tuning scripts
- Decoder on Generative Tasks - Comprehensive generative task evaluation using the EleutherAI evaluation harness (commit `867413f8677f00f6a817262727cbb041bf36192a`)
- Gender Bias Evaluation - Comprehensive gender bias testing using Winogender "gotcha" examples, which test how well models handle counter-stereotypical pronouns in occupational contexts. Supports both encoder (MLM) and decoder (perplexity) evaluation methods; a sketch of the decoder-side check follows the setup commands below.
```bash
# Clone the specific commit of lm-evaluation-harness
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout 867413f8677f00f6a817262727cbb041bf36192a
pip install -e .

# Run the evaluation on an Ettin decoder
lm_eval --model hf \
    --model_args pretrained=jhu-clsp/ettin-decoder-150m \
    --tasks hellaswag,arc_easy,arc_challenge,winogrande \
    --device cuda:0 \
    --batch_size 8
```
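For the decoder-side (perplexity) bias check mentioned above, a minimal sketch is to score two pronoun variants of an occupational sentence and compare their language-modeling losses. The sentences and helper below are illustrative only, not the actual Winogender data or evaluation script:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/ettin-decoder-150m")
model = AutoModelForCausalLM.from_pretrained("jhu-clsp/ettin-decoder-150m")
model.eval()

def sentence_loss(text):
    # Mean causal LM loss over the sentence; lower = more likely under the model.
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
    return outputs.loss.item()

# Illustrative pronoun variants (not taken from the Winogender dataset)
print(sentence_loss("The nurse said that she would be late."))
print(sentence_loss("The nurse said that he would be late."))
```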
Ettin provides the first controlled comparison of encoder vs. decoder architectures:
- Identical Training Data: Same 2T token mixture across all models
- Matched Architectures: Only attention patterns and objectives differ
- Open Everything: Training data, model weights, and batch-level training order
- Multiple Scales: Fair comparison from 17M to 1B parameters
- 250+ Checkpoints: Complete training trajectory analysis
| Parameter | 17M | 32M | 68M | 150M | 400M | 1B |
|:----------|:----|:----|:----|:-----|:-----|:---|
| Layers | 7 | 10 | 19 | 22 | 28 | 28 |
| Hidden Size | 256 | 384 | 512 | 768 | 1024 | 1792 |
| Intermediate Size | 384 | 576 | 768 | 1152 | 2624 | 3840 |
| Attention Heads | 4 | 6 | 8 | 12 | 16 | 28 |
Data: High-quality mixture including DCLM, Dolma v1.7, scientific papers, code, and curated sources totaling 2T+ tokens
Architecture Features:
- Transformer with RoPE, GLU activations, and prenorm layers
- Context length: Up to 8K tokens
- Vocabulary: 50,368 tokens (ModernBERT tokenizer)
- Deep but efficient architectures following MobileLLM principles
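These settings can be confirmed for any size by inspecting the config and tokenizer; a minimal sketch, assuming the standard `transformers` config field names apply to these models:

```python
from transformers import AutoConfig, AutoTokenizer

# Only the config and tokenizer files are fetched, not the model weights.
config = AutoConfig.from_pretrained("jhu-clsp/ettin-encoder-150m")
tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/ettin-encoder-150m")

print("layers:", config.num_hidden_layers)
print("hidden size:", config.hidden_size)
print("attention heads:", config.num_attention_heads)
print("vocab size:", len(tokenizer))
```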
Training Phases:
- Pre-training: 1.7T tokens with diverse data mixture
- Mid-training: 250B tokens with higher-quality filtered data and context extension to 8K
- Decay phase: 50B tokens with premium data sources
Q: I'm getting an error that ModernBERT-decoder isn't found. A: Make sure you have the latest version of transformers installed:
```bash
# For the latest version until the official PyPI release:
pip install git+https://github.com/huggingface/transformers.git
```
Q: Which model should I choose for my task? A:
- Classification/Retrieval/Understanding: Use encoder models
- Text Generation/Chat/Completion: Use decoder models
- Research on cross-training: Use cross-objective models
- Size selection: Start with 150M for experimentation, scale up to 400M or 1B for production
Q: How do I access training checkpoints?
A: Each model has multiple git tags for different training stages. Use the `revision` parameter:

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("jhu-clsp/ettin-encoder-150m", revision="step500000")
```
Q: Can I continue training these models? A: Yes! We provide raw checkpoints in the jhu-clsp/ettin-checkpoints dataset that can be loaded into training frameworks.
Q: What's the difference between cross-objective models and regular models? A: Cross-objective models started as one architecture (e.g., decoder) and were continued with a different objective (e.g., MLM). They demonstrate the limitations of cross-training and generally underperform native models.
Q: How do I reproduce the paper results? A: See the evaluation guides linked above (retrieval, GLUE, generative tasks, and bias).
If you use Ettin models in your research, please cite our work:
```bibtex
@misc{weller2025seqvsseqopen,
      title={Seq vs Seq: An Open Suite of Paired Encoders and Decoders},
      author={Orion Weller and Kathryn Ricci and Marc Marone and Antoine Chaffin and Dawn Lawrie and Benjamin Van Durme},
      year={2025},
      eprint={2507.11412},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.11412},
}
```
This project is licensed under the MIT License - see the LICENSE file for details.
Contact: For questions about the models or research, please open an issue or contact the authors.