mmBERT: A Modern Multilingual Encoder with Annealed Language Learning

License: MIT

🌍 TL;DR: State-of-the-art multilingual encoder models trained on 3T tokens across 1833 languages with novel annealed language learning. Outperforms XLM-R and can even beat OpenAI's o3 and Google's Gemini 2.5 Pro on low-resource languages.

📄 Paper | 🤗 Model Collection | 📊 Training Data

mmBERT introduces the first modern multilingual encoder trained with cascading annealed language learning (ALL), progressively incorporating 1833 languages during training. With novel inverse masking schedules and high-quality multilingual data, mmBERT significantly outperforms previous multilingual encoders while achieving remarkable efficiency improvements (up to 4x faster).

🚀 Quick Start

Installation

pip install "torch>=1.9.0"
pip install "transformers>=4.48.0"
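
To confirm the environment is ready, a quick version check (nothing mmBERT-specific) can be run:

import torch
import transformers

# mmBERT requires a recent transformers release (>=4.48.0 per the install step above)
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)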

30-Second Examples

Small Model for Fast Inference:

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmbert-small")
model = AutoModel.from_pretrained("jhu-clsp/mmbert-small")

# Example: Get multilingual embeddings
inputs = tokenizer("Hello world! 你好世界! Bonjour le monde!", return_tensors="pt")
outputs = model(**inputs)
embeddings = outputs.last_hidden_state.mean(dim=1)

Base Model for Masked Language Modeling:

from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmbert-base")
model = AutoModelForMaskedLM.from_pretrained("jhu-clsp/mmbert-base")

# Example: Multilingual masked language modeling
text = "The capital of [MASK] is Paris."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Get predictions for [MASK] tokens
mask_indices = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)
predictions = outputs.logits[mask_indices]
top_tokens = torch.topk(predictions, 5, dim=-1)
predicted_words = [tokenizer.decode(token) for token in top_tokens.indices[0]]
print(f"Predictions: {predicted_words}")

🌍 Model Family

Main Models

| Size  | Model        | Parameters | Languages | Context | Best For                         |
|-------|--------------|------------|-----------|---------|----------------------------------|
| Small | mmbert-small | 140M       | 1833      | 8192    | Fast inference, edge deployment  |
| Base  | mmbert-base  | 307M       | 1833      | 8192    | Best performance, production use |

Key Features

  • 1833 Languages: Covers more languages than any previous multilingual encoder
  • Extended Context: Up to 8192 tokens (vs 512 for XLM-R)
  • Efficiency: 2-4x faster inference than previous multilingual models
  • Modern Architecture: Based on ModernBERT with RoPE, GLU activations, and Flash Attention 2 (see the loading sketch after this list)
  • Open Training: Complete training data, recipes, and checkpoints available
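
If the optional flash-attn package is installed, the attention implementation can be requested explicitly when loading. This is a hedged sketch using the standard Hugging Face attn_implementation argument; the models also load fine with the default attention backend:

import torch
from transformers import AutoModel

# Request Flash Attention 2 explicitly (requires the flash-attn package and a supported GPU);
# drop the argument to fall back to the default attention implementation.
model = AutoModel.from_pretrained(
    "jhu-clsp/mmbert-base",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
).to("cuda")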

🔬 Getting Started

Training Data

The complete multilingual training dataset spans 3T tokens:

  • Pre-training Data: 2.0T tokens across 60 languages
  • Mid-training Data: 600B tokens across 110 languages
  • Decay Phase Data: 100B tokens across 1833 languages
  • Data Sources: FineWeb2, DCLM, Dolmino, Wikipedia, ArXiv, and curated multilingual corpora
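
As an illustration of inspecting one of the public sources listed above, the sketch below streams a FineWeb2 subset with the datasets library. The dataset ID is real, but the config name ("fra_Latn") and the "text" field are assumptions about how the per-language subsets are laid out, so check the dataset card before relying on them:

from datasets import load_dataset

# Stream a single (assumed) language subset of FineWeb2 without downloading everything;
# "fra_Latn" is an illustrative config name -- verify against the dataset card.
ds = load_dataset("HuggingFaceFW/fineweb-2", name="fra_Latn", split="train", streaming=True)
for i, example in enumerate(ds):
    print(example["text"][:200])  # field name assumed to follow the FineWeb schema
    if i == 2:
        break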

Usage Examples

Classification Task

from transformers import AutoTokenizer, AutoModel
import torch.nn as nn

# Load model for classification
tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmbert-base")
encoder = AutoModel.from_pretrained("jhu-clsp/mmbert-base")

# Add classification head
class MultilingualClassifier(nn.Module):
    def __init__(self, encoder, num_classes):
        super().__init__()
        self.encoder = encoder
        self.classifier = nn.Linear(encoder.config.hidden_size, num_classes)
        self.dropout = nn.Dropout(0.1)
    
    def forward(self, input_ids, attention_mask=None):
        outputs = self.encoder(input_ids, attention_mask=attention_mask)
        pooled_output = outputs.last_hidden_state[:, 0]  # Use [CLS] token
        pooled_output = self.dropout(pooled_output)
        return self.classifier(pooled_output)

# Initialize classifier
model = MultilingualClassifier(encoder, num_classes=3)

# Example multilingual inputs
texts = [
    "This is a positive review.",
    "Ceci est un avis négatif.",
    "这是一个中性评价。"
]
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
# Pass tensors explicitly, since the classifier's forward only accepts input_ids and attention_mask
predictions = model(inputs["input_ids"], attention_mask=inputs["attention_mask"])

Multilingual Retrieval

from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np

tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmbert-base")
model = AutoModel.from_pretrained("jhu-clsp/mmbert-base")

def get_embeddings(texts):
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mask-aware mean pooling so padding tokens don't dilute the embedding
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    summed = (outputs.last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return (summed / counts).numpy()

# Multilingual document retrieval
documents = [
    "Artificial intelligence is transforming healthcare.",
    "L'intelligence artificielle transforme les soins de santé.",
    "人工智能正在改变医疗保健。",
    "Climate change requires immediate action.",
    "El cambio climático requiere acción inmediata."
]

query = "AI in medicine"

# Get embeddings
doc_embeddings = get_embeddings(documents)
query_embedding = get_embeddings([query])

# Rank by cosine similarity (normalize, then dot product)
doc_norm = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
query_norm = query_embedding / np.linalg.norm(query_embedding, axis=1, keepdims=True)
similarities = (doc_norm @ query_norm.T).flatten()
ranked_docs = np.argsort(similarities)[::-1]

print("Most similar documents:")
for i, doc_idx in enumerate(ranked_docs[:3]):
    print(f"{i+1}. {documents[doc_idx]} (score: {similarities[doc_idx]:.3f})")

Long Context Processing

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmbert-base")
model = AutoModel.from_pretrained("jhu-clsp/mmbert-base")

# Process long multilingual document (up to 8192 tokens)
long_text = """
This is a very long multilingual document...
Ceci est un très long document multilingue...
这是一个非常长的多语言文档...
""" * 100  # Simulate long text

# Tokenize with extended context
inputs = tokenizer(
    long_text, 
    return_tensors="pt", 
    max_length=8192,
    truncation=True
)

# Process efficiently with Flash Attention
with torch.no_grad():
    outputs = model(**inputs)
    
print(f"Processed {inputs['input_ids'].shape[1]} tokens")
print(f"Output shape: {outputs.last_hidden_state.shape}")

📋 Training

Using 8xH100s, training took approximately 10 days for mmBERT-small and 40 days for mmBERT-base.

Training Recipe: Cascading Annealed Language Learning

mmBERT introduces novel training techniques:

  1. Inverse Masking Schedule: Start with 30% masking, gradually reduce to 5%
  2. Language Progression: 60 → 110 → 1833 languages across training phases
  3. Temperature Annealing: 0.7 → 0.5 → 0.3 for increasingly uniform language sampling (see the sketch after this list)
  4. High-Quality Data: Progressive upgrade from web crawl to filtered premium sources
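
To make the temperature annealing concrete, here is a minimal sketch of temperature-based language sampling under the usual formulation p_i ∝ c_i^τ (function and variable names are illustrative, not taken from the training code). Lowering τ from 0.7 to 0.3 flattens the distribution toward uniform, which is what gives low-resource languages more weight in the later phases:

import numpy as np

def language_sampling_probs(token_counts, tau):
    # p_i proportional to (corpus size)^tau; smaller tau -> closer to uniform
    counts = np.asarray(token_counts, dtype=np.float64)
    weights = counts ** tau
    return weights / weights.sum()

# Hypothetical corpus sizes (billions of tokens) for a high-, mid-, and low-resource language
counts = [1000.0, 50.0, 1.0]
for tau in (0.7, 0.5, 0.3):  # the three training phases
    print(f"tau={tau}: {np.round(language_sampling_probs(counts, tau), 3)}")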

Training Details

Architecture

| Component              | Small   | Base    |
|------------------------|---------|---------|
| Layers                 | 22      | 22      |
| Hidden Size            | 384     | 768     |
| Intermediate Size      | 1152    | 1152    |
| Attention Heads        | 6       | 12      |
| Parameters (Total)     | 140M    | 307M    |
| Parameters (Non-Embed) | 42M     | 110M    |
| Max Sequence Length    | 8192    | 8192    |
| Vocabulary Size        | 256,000 | 256,000 |
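
These values can be checked directly against the released checkpoints; a small verification snippet (attribute names are the standard Hugging Face config fields for ModernBERT-style models):

from transformers import AutoConfig

for name in ("jhu-clsp/mmbert-small", "jhu-clsp/mmbert-base"):
    cfg = AutoConfig.from_pretrained(name)
    # Print layers, hidden size, intermediate size, heads, vocab size, and max context
    print(name, cfg.num_hidden_layers, cfg.hidden_size, cfg.intermediate_size,
          cfg.num_attention_heads, cfg.vocab_size, cfg.max_position_embeddings)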

Training Configuration

Data Mixture:

  • Pre-training (2.0T tokens): Web crawl, code, scientific papers, reference materials
  • Mid-training (600B tokens): Higher quality filtered data with context extension
  • Decay phase (100B tokens): Premium sources including textbooks and curated content

Architecture Features:

  • ModernBERT-based transformer with RoPE positional embeddings
  • GLU activations and prenorm layer normalization
  • Flash Attention 2 for efficient long-context processing
  • Gemma 2 tokenizer for multilingual coverage

Training Phases (summarized in the schedule sketch after this list):

  1. Base Pre-training: 60 languages, 30% masking, learning rate warmup
  2. Context Extension: 110 languages, 15% masking, extended context to 8K
  3. Decay Phase: 1833 languages, 5% masking, high-quality data focus
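
The same schedule, written out as data for reference (a summary of the numbers in this README, not a configuration format used by the training code):

# Phase schedule as described above; purely illustrative constants.
TRAINING_PHASES = [
    {"phase": "base pre-training", "languages": 60,   "mask_rate": 0.30, "temperature": 0.7, "tokens": "2.0T"},
    {"phase": "context extension", "languages": 110,  "mask_rate": 0.15, "temperature": 0.5, "tokens": "600B"},
    {"phase": "decay",             "languages": 1833, "mask_rate": 0.05, "temperature": 0.3, "tokens": "100B"},
]

for p in TRAINING_PHASES:
    print(f"{p['phase']}: {p['languages']} languages, {p['mask_rate']:.0%} masking, {p['tokens']} tokens")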

Evaluation

Evaluation code for retrieval tasks is the same as that used for Ettin.

Evaluation code for efficiency is taken from the ModernBERT repo.

Evaluation code for NLU tasks is based on the mGTE codebase; our fork will be uploaded soon. Please raise an issue or message us if this would be helpful for you.

❓ FAQ

Q: How does mmBERT compare to XLM-R? A: mmBERT significantly outperforms XLM-R across all benchmarks:

  • +2.4 points average on XTREME
  • +3.0 points on GLUE
  • ~18x more languages (1833 vs 100)
  • 16x longer context (8K vs 512 tokens)
  • 2-4x faster inference

Q: Which languages does mmBERT support? A: mmBERT supports 1833 languages and scripts from FineWeb2, including:

  • All major world languages (English, Chinese, Spanish, etc.)
  • European languages (including low-resource ones like Faroese)
  • African languages (Swahili, Amharic, etc.)
  • Asian languages (Hindi, Bengali, Thai, etc.)
  • Many low-resource and indigenous languages

Q: How does the annealed language learning work? A: We progressively add languages in three phases:

  1. Start with 60 high-resource languages (pre-training)
  2. Add 50 mid-resource languages (mid-training)
  3. Add 1723 low-resource languages (decay phase)

This allows efficient learning without overfitting on low-resource data.

Q: Can I fine-tune mmBERT for my specific task? A: Yes! mmBERT works as a drop-in replacement for XLM-R:

from transformers import AutoModel, AutoTokenizer

# Load for fine-tuning
model = AutoModel.from_pretrained("jhu-clsp/mmbert-base")
tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmbert-base")

# Add task-specific head and fine-tune normally
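
For standard classification fine-tuning, the generic Hugging Face head classes should also work, since mmBERT follows the ModernBERT architecture. A hedged sketch (the label count is a placeholder for your task):

from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmbert-base")
# num_labels is a placeholder; the randomly initialized head is trained during fine-tuning
model = AutoModelForSequenceClassification.from_pretrained("jhu-clsp/mmbert-base", num_labels=3)

# From here, train with the Trainer API or a custom PyTorch loop, as you would for XLM-R.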

Q: What about efficiency and memory requirements? A: mmBERT is significantly more efficient:

  • 2-4x faster inference than XLM-R
  • Flash Attention 2 reduces memory usage for long sequences
  • Support for variable-length batching
  • Optimized for both CPU and GPU deployment

Q: How do I access the training data and checkpoints? A: All data and checkpoints are publicly available; see the Model Collection and Training Data links at the top of this README.

Limitations

  • Structured prediction tasks (NER, POS) show slightly lower scores due to tokenizer prefix space handling
  • Very low-resource languages still have limited training data
  • High-quality educational content filtering could benefit from more languages

Citation

If you use mmBERT models in your research, please cite our work:

@misc{marone2025mmbertmodernmultilingualencoder,
      title={mmBERT: A Modern Multilingual Encoder with Annealed Language Learning}, 
      author={Marc Marone and Orion Weller and William Fleshman and Eugene Yang and Dawn Lawrie and Benjamin Van Durme},
      year={2025},
      eprint={2509.06888},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.06888}, 
}
