Ettin: an Open Suite of Paired Encoders and Decoders


🎯 TL;DR: State-of-the-art paired encoder and decoder models (17M-1B params) trained identically for fair comparison with open data. Encoders beat ModernBERT. Decoders beat Llama 3.2/SmolLM2.

📄 Paper | 🤗 Model Collection | 📊 Training Data

This repository contains the first collection of paired encoder-only and decoder-only models trained with identical data, architecture, and training recipes. Ettin enables fair comparisons between encoder and decoder architectures across multiple scales, providing state-of-the-art performance for open-data models in their respective size categories.

🚀 Quick Start

Installation

pip install "torch>=1.9.0"
# Encoders work with transformers>=4.48.0.
# Until the next PyPI release, install transformers from main to use the decoders
# (transformers>=4.54.x will include them):
pip install git+https://github.com/huggingface/transformers.git

30-Second Examples

Encoder for Classification/Embeddings:

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/ettin-encoder-150m")
model = AutoModel.from_pretrained("jhu-clsp/ettin-encoder-150m")

# Example: Get embeddings
inputs = tokenizer("Hello world!", return_tensors="pt")
outputs = model(**inputs)
embeddings = outputs.last_hidden_state.mean(dim=1)

Decoder for Text Generation:

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/ettin-decoder-150m")
model = AutoModelForCausalLM.from_pretrained("jhu-clsp/ettin-decoder-150m")

# Example: Generate text
inputs = tokenizer("The future of AI is", return_tensors="pt")
outputs = model.generate(**inputs, max_length=50, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

🤖 Model Family

Encoder Models

Size  | Model              | Parameters | Best For
------|--------------------|------------|----------------------
XXS   | ettin-encoder-17m  | 17M        | Mobile/edge devices
XS    | ettin-encoder-32m  | 32M        | Fast inference
Small | ettin-encoder-68m  | 68M        | Balanced performance
Base  | ettin-encoder-150m | 150M       | Standard use cases
Large | ettin-encoder-400m | 400M       | High accuracy needs
XL    | ettin-encoder-1b   | 1B         | Best performance

Decoder Models

Size  | Model              | Parameters | Best For
------|--------------------|------------|------------------------
XXS   | ettin-decoder-17m  | 17M        | Lightweight generation
XS    | ettin-decoder-32m  | 32M        | Quick prototyping
Small | ettin-decoder-68m  | 68M        | Efficient generation
Base  | ettin-decoder-150m | 150M       | Standard generation
Large | ettin-decoder-400m | 400M       | Quality generation
XL    | ettin-decoder-1b   | 1B         | Best generation
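
All sizes share the same loading code; only the checkpoint name changes. The helper below is an illustrative sketch (not part of this repository) showing how the naming scheme can be parameterized:

from transformers import AutoModelForCausalLM, AutoTokenizer

def load_ettin_decoder(size: str = "150m"):
    # Valid size suffixes: 17m, 32m, 68m, 150m, 400m, 1b
    name = f"jhu-clsp/ettin-decoder-{size}"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)
    return tokenizer, model

# Example: load the 68M decoder instead of the default 150M
tokenizer, model = load_ettin_decoder("68m")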

Cross-Objective Models

These models demonstrate what happens when you continue training encoders as decoders (and vice versa). Important: Load these models using the architecture they were converted to, not their original architecture.

Encoders Trained from Decoders (Decoder → MLM)

Load as encoders using AutoModel or AutoModelForMaskedLM:

Size  | Model                           | Parameters | Description
------|---------------------------------|------------|----------------------------------
XXS   | ettin-encoder-from-decoder-17m  | 17M        | Decoder → MLM continued training
XS    | ettin-encoder-from-decoder-32m  | 32M        | Decoder → MLM continued training
Small | ettin-encoder-from-decoder-68m  | 68M        | Decoder → MLM continued training
Base  | ettin-encoder-from-decoder-150m | 150M       | Decoder → MLM continued training
Large | ettin-encoder-from-decoder-400m | 400M       | Decoder → MLM continued training
XL    | ettin-encoder-from-decoder-1b   | 1B         | Decoder → MLM continued training

Decoders Trained from Encoders (Encoder → CLM)

Load as decoders using AutoModelForCausalLM:

Size  | Model                           | Parameters | Description
------|---------------------------------|------------|----------------------------------
XXS   | ettin-decoder-from-encoder-17m  | 17M        | Encoder → CLM continued training
XS    | ettin-decoder-from-encoder-32m  | 32M        | Encoder → CLM continued training
Small | ettin-decoder-from-encoder-68m  | 68M        | Encoder → CLM continued training
Base  | ettin-decoder-from-encoder-150m | 150M       | Encoder → CLM continued training
Large | ettin-decoder-from-encoder-400m | 400M       | Encoder → CLM continued training
XL    | ettin-decoder-from-encoder-1b   | 1B         | Encoder → CLM continued training

Accessing Training Checkpoints

Beyond the final models, we provide access to intermediate training checkpoints for research and analysis purposes. All raw training checkpoints are available in the jhu-clsp/ettin-checkpoints dataset.

Each model repository contains multiple tagged versions representing different training stages:

  • step{number} - Pretraining phase checkpoints (e.g., step599525, step596528)
  • ext{number} - Extension/mid-training phase checkpoints (e.g., ext1000, ext2000)
  • decay{number} - Decay phase checkpoints (e.g., decay100, decay500)
from transformers import AutoModelForCausalLM

# Load a specific pretraining checkpoint
model = AutoModelForCausalLM.from_pretrained(
    "jhu-clsp/ettin-decoder-400m", 
    revision="step590532"  # Specific checkpoint tag
)
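
To see which checkpoint tags a given model repository exposes, you can list its git refs with the huggingface_hub client (installed alongside transformers). This is a minimal sketch; the exact tag names differ per model and training phase:

from huggingface_hub import HfApi

api = HfApi()
refs = api.list_repo_refs("jhu-clsp/ettin-decoder-400m")

# Tags cover the pretraining (step*), extension (ext*), and decay (decay*) phases
checkpoint_tags = sorted(ref.name for ref in refs.tags)
print(checkpoint_tags[:10])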

Getting Started

Training Data

The complete training data is publicly available on the Hugging Face Hub; see the Training Data link at the top of this README.

Usage Examples

Encoder: Masked Language Modeling
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Load MLM model
tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/ettin-encoder-150m")
model = AutoModelForMaskedLM.from_pretrained("jhu-clsp/ettin-encoder-150m")

def predict_masked_token(text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    
    # Get predictions for [MASK] tokens
    mask_indices = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)
    predictions = outputs.logits[mask_indices]
    
    # Get top 5 predictions
    top_tokens = torch.topk(predictions, 5, dim=-1)
    return [tokenizer.decode(token) for token in top_tokens.indices[0]]

# Example
masked_text = "The capital of France is [MASK]."
predictions = predict_masked_token(masked_text)
print(f"Predictions: {predictions}")

Decoder: Text Generation
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer  
tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/ettin-decoder-150m")
model = AutoModelForCausalLM.from_pretrained("jhu-clsp/ettin-decoder-150m")

# Set pad token if needed
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def generate_text(prompt, max_length=100, temperature=0.7):
    inputs = tokenizer(prompt, return_tensors="pt")
    
    with torch.no_grad():
        outputs = model.generate(
            inputs.input_ids,
            max_length=max_length,
            temperature=temperature,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
            num_return_sequences=1
        )
    
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example usage
prompt = "The future of artificial intelligence is"
generated = generate_text(prompt)
print(generated)

Cross-Objective Models
# Encoder-from-decoder: Load as encoder
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/ettin-encoder-from-decoder-150m")
model = AutoModel.from_pretrained("jhu-clsp/ettin-encoder-from-decoder-150m")

# Decoder-from-encoder: Load as decoder  
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/ettin-decoder-from-encoder-150m")
model = AutoModelForCausalLM.from_pretrained("jhu-clsp/ettin-decoder-from-encoder-150m")

📋 Training and Evaluation

Pre-training

For details on model pre-training, data preparation, and training recipes:

  • 📖 Pre-training Guide - Complete training setup, data mixture, and ModernBERT recipe adaptation

Evaluation

Encoder Evaluation

Decoder Evaluation

  • 🎯 Decoder on Generative Tasks - Using EleutherAI evaluation harness (commit 867413f8677f00f6a817262727cbb041bf36192a) for comprehensive generative task evaluation

Bias Evaluation

  • βš–οΈ Gender Bias Evaluation - Comprehensive gender bias testing using Winogender dataset gotcha examples. Tests how well models handle counter-stereotypical pronouns in occupational contexts. Supports both encoder (MLM) and decoder (perplexity) evaluation methods.

Quick Decoder Evaluation Example

# Clone the specific commit of lm-evaluation-harness
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout 867413f8677f00f6a817262727cbb041bf36192a
pip install -e .

# Run evaluation on Ettin decoder
lm_eval --model hf \
    --model_args pretrained=jhu-clsp/ettin-decoder-150m \
    --tasks hellaswag,arc_easy,arc_challenge,winogrande \
    --device cuda:0 \
    --batch_size 8

🔬 Research Applications

What Makes Ettin Unique

Ettin provides the first controlled comparison of encoder vs. decoder architectures:

  • Identical Training Data: Same 2T token mixture across all models
  • Matched Architectures: Only attention patterns and objectives differ
  • Open Everything: Training data, model weights, and batch-level training order
  • Multiple Scales: Fair comparison from 17M to 1B parameters
  • 250+ Checkpoints: Complete training trajectory analysis

Training Details

Model Architecture

Parameter         | 17M | 32M | 68M | 150M | 400M | 1B
------------------|-----|-----|-----|------|------|------
Layers            | 7   | 10  | 19  | 22   | 28   | 28
Hidden Size       | 256 | 384 | 512 | 768  | 1024 | 1792
Intermediate Size | 384 | 576 | 768 | 1152 | 2624 | 3840
Attention Heads   | 4   | 6   | 8   | 12   | 16   | 28
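
These values can also be read directly from each model's released config. The attribute names below are the standard Hugging Face config fields used by ModernBERT-style configs; treat them as an assumption if the config class differs:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("jhu-clsp/ettin-encoder-150m")
print(config.num_hidden_layers)    # layers
print(config.hidden_size)          # hidden size
print(config.intermediate_size)    # intermediate size
print(config.num_attention_heads)  # attention heads
print(config.vocab_size)           # tokenizer vocabulary size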

Training Configuration

Data: High-quality mixture including DCLM, Dolma v1.7, scientific papers, code, and curated sources totaling 2T+ tokens

Architecture Features:

  • Transformer with RoPE, GLU activations, and prenorm layers
  • Context length: Up to 8K tokens
  • Vocabulary: 50,368 tokens (ModernBERT tokenizer)
  • Deep but efficient architectures following MobileLLM principles

Training Phases:

  • Pre-training: 1.7T tokens with diverse data mixture
  • Mid-training: 250B tokens with higher-quality filtered data and context extension to 8K
  • Decay phase: 50B tokens with premium data sources

❓ FAQ

Model Loading Issues

Q: I'm getting an error that ModernBERT-decoder isn't found. A: Make sure you have the latest version of transformers installed:

# Install transformers from main until the official PyPI release:
pip install git+https://github.com/huggingface/transformers.git

Q: Which model should I choose for my task? A:

  • Classification/Retrieval/Understanding: Use encoder models
  • Text Generation/Chat/Completion: Use decoder models
  • Research on cross-training: Use cross-objective models
  • Size selection: Start with 150M for experimentation, scale up to 400M or 1B for production

Q: How do I access training checkpoints? A: Each model has multiple git tags for different training stages. Use the revision parameter:

model = AutoModel.from_pretrained("jhu-clsp/ettin-encoder-150m", revision="step500000")

Q: Can I continue training these models? A: Yes! We provide raw checkpoints in the jhu-clsp/ettin-checkpoints dataset that can be loaded into training frameworks.

Q: What's the difference between cross-objective models and regular models? A: Cross-objective models started as one architecture (e.g., decoder) and were continued with a different objective (e.g., MLM). They demonstrate the limitations of cross-training and generally underperform native models.

Q: How do I reproduce the paper results? A: See the evaluation guides in the Training and Evaluation section above.

Citation

If you use Ettin models in your research, please cite our work:

@misc{weller2025seqvsseqopen,
      title={Seq vs Seq: An Open Suite of Paired Encoders and Decoders}, 
      author={Orion Weller and Kathryn Ricci and Marc Marone and Antoine Chaffin and Dawn Lawrie and Benjamin Van Durme},
      year={2025},
      eprint={2507.11412},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.11412}, 
}

License

This project is licensed under the MIT License - see the LICENSE file for details.


Contact: For questions about the models or research, please open an issue or contact the authors.
