SPLADE sparse retrieval model trained for Brazilian Portuguese
Model Card • Usage Guide • Training • Results
SPLADE-PT-BR is a sparse neural retrieval model optimized for Brazilian Portuguese text search. Built on BERTimbau and trained on Portuguese question-answering datasets, it produces interpretable sparse vectors well suited to RAG systems and semantic search.
- Native Portuguese: Trained on BERTimbau with Portuguese-specific vocabulary
- Fast & Efficient: ~99.5% sparse vectors enable inverted-index search
- Semantic Expansion: Automatically expands queries with related terms
- Easy Integration: Works with any vector database or custom retrieval system
- High Quality: 150K training iterations, final loss of 0.000047
# Install system dependencies
sudo apt-get update && sudo apt-get install -y python3.11-dev build-essential
# Install Python dependencies
uv sync

from transformers import AutoTokenizer
from splade.models.transformer_rep import Splade
import torch
# Initialize SPLADE model with the trained BERT-MLM from HuggingFace
model = Splade(
model_type_or_dir="AxelPCG/splade-pt-br", # HF repo with trained BERT-MLM weights
agg="max" # Aggregation method used during training
)
tokenizer = AutoTokenizer.from_pretrained("neuralmind/bert-base-portuguese-cased")
# Set to evaluation mode and move to device
model.eval()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
⚠️ Important: SPLADE is a custom architecture and cannot be loaded with AutoModel.from_pretrained(). You must instantiate the Splade class directly, as shown above.
# Encode query
query = "Qual Γ© a capital do Brasil?"
query_tokens = tokenizer(query, return_tensors="pt", max_length=256, truncation=True)
# Move tokens to device
query_tokens = {k: v.to(device) for k, v in query_tokens.items()}
with torch.no_grad():
query_vec = model(q_kwargs=query_tokens)["q_rep"].squeeze()
# Get sparse representation
indices = torch.nonzero(query_vec).squeeze()
if indices.dim() == 0: # Handle single element case
indices = indices.unsqueeze(0)
indices = indices.cpu().tolist()
values = query_vec[indices].cpu().tolist()
print(f"Active dimensions: {len(indices)} / {query_vec.shape[0]}")
print(f"Sparsity: {(1 - len(indices) / query_vec.shape[0]) * 100:.1f}%")
# Output: ~120 / 29794 dimensions (~99.6% sparse)

For complete examples including retrieval, see USAGE.md.
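Because the vectors are sparse, scoring a query against a document reduces to a dot product over the shared active dimensions. The following is a minimal, illustrative sketch (the sparse_dot and rank helpers are hypothetical, not part of the SPLADE package; the vectors would come from the indices/values pairs extracted above):

```python
# Hypothetical helpers: score and rank documents by sparse dot product.
# Vectors are {dimension_index: weight} dicts built from indices/values.

def sparse_dot(query_vec: dict[int, float], doc_vec: dict[int, float]) -> float:
    """Sum of weight products over dimensions active in both vectors."""
    if len(doc_vec) < len(query_vec):        # iterate over the smaller dict
        query_vec, doc_vec = doc_vec, query_vec
    return sum(w * doc_vec[i] for i, w in query_vec.items() if i in doc_vec)

def rank(query_vec: dict[int, float],
         docs: dict[str, dict[int, float]]) -> list[tuple[str, float]]:
    """Return (doc_id, score) pairs sorted by descending score."""
    scores = [(doc_id, sparse_dot(query_vec, vec)) for doc_id, vec in docs.items()]
    return sorted(scores, key=lambda x: -x[1])

# Toy example with made-up dimension indices and weights
query = {10: 1.2, 42: 0.8}
docs = {
    "d1": {10: 0.5, 99: 2.0},          # shares one dimension with the query
    "d2": {10: 1.0, 42: 1.0, 7: 0.3},  # shares both query dimensions
}
print(rank(query, docs))  # d2 scores 2.0, d1 scores 0.6
```

In production the same computation is what an inverted index performs implicitly: only documents sharing at least one active dimension with the query are ever touched.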
| Metric | Value |
|---|---|
| Base Model | BERTimbau (neuralmind/bert-base-portuguese-cased) |
| Training Dataset | mMARCO Portuguese (unicamp-dl/mmarco) |
| Validation Dataset | mRobust (unicamp-dl/mrobust) |
| Iterations | 150,000 |
| Final Loss | 0.000047 |
| Vocabulary Size | 29,794 |
| Sparsity | ~99.5% (100-150 active dims) |
Dataset: mRobust (TREC Robust04 Portuguese)
- 528,032 documents
- 250 queries
- Evaluation date: 2025-12-02
| Metric | Score | Description |
|---|---|---|
| MRR@10 | 0.453 | Mean Reciprocal Rank - First relevant doc at position ~2.2 |
Performance on Portuguese dataset (mRobust - 528k docs, 250 queries):
| Model | Language | Base Model | MRR@10 | Performance |
|---|---|---|---|---|
| SPLADE-PT-BR | Portuguese | BERTimbau | 0.453 | +18.3% better |
| SPLADE-EN | English | BERT-EN | 0.383 | Baseline |
Key Findings:
- SPLADE-PT-BR is 18.3% better than SPLADE-EN on Portuguese queries
- Native Portuguese training (BERTimbau + mMARCO-PT) significantly improves retrieval quality
- An MRR@10 of 0.453 means the first relevant document appears at position ~2.2 on average
Interpretation: The Portuguese-adapted model demonstrates clear superiority over the English model for Portuguese IR tasks, validating the importance of language-specific training.
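As a quick check of the interpretation above, MRR@10 can be computed directly. This is a minimal sketch of the standard metric definition, not this project's evaluation script:

```python
# Sketch of MRR@10: for each query, take 1/rank of the first relevant
# document within the top 10 results, then average over all queries.

def mrr_at_10(rankings: list[list[str]], relevant: list[set[str]]) -> float:
    total = 0.0
    for ranked_docs, rel in zip(rankings, relevant):
        for rank, doc_id in enumerate(ranked_docs[:10], start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings)

# Two toy queries: first relevant doc at rank 2 and rank 3
rankings = [["a", "b", "c"], ["x", "y", "z"]]
relevant = [{"b"}, {"z"}]
print(mrr_at_10(rankings, relevant))  # (1/2 + 1/3) / 2 ≈ 0.417
```

The "position ~2.2" reading follows from inverting the score: 1 / 0.453 ≈ 2.2.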
For detailed evaluation metrics and comparison results, see scripts/evaluation/README.md or evaluation_results/comparison_report_*.json.
Training Phase:
- Dataset: mMARCO Portuguese (unicamp-dl/mmarco)
  - Corpus: portuguese_collection.tsv (~8.8M documents)
  - Training Queries: portuguese_queries.train.tsv
  - Triplets: triples.train.ids.small.tsv (query / positive doc / negative doc triplets)
- Base Model: BERTimbau (neuralmind/bert-base-portuguese-cased)
- Purpose: Learn Portuguese-specific semantic expansion and term weighting
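The triplet file referenced above is a tab-separated list of IDs; the sketch below shows how such a file can be parsed (the sample data is made up, and the query / positive / negative column order is an assumption based on the description above):

```python
# Illustrative parser for a triplet TSV: one line per training example,
# with query-id, positive-doc-id, and negative-doc-id separated by tabs.
import csv
import io

sample = "100\t2001\t2002\n101\t2003\t2004\n"  # stand-in for triples.train.ids.small.tsv
triplets = [tuple(row) for row in csv.reader(io.StringIO(sample), delimiter="\t")]

for qid, pos_id, neg_id in triplets:
    # each training step contrasts one positive against one negative document
    pass

print(triplets)  # [('100', '2001', '2002'), ('101', '2003', '2004')]
```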
Validation Phase (during training):
- Dataset: mRobust (unicamp-dl/mrobust)
- Used for validation checkpoints during training
- Ensures the model generalizes to unseen Portuguese data
Test/Evaluation Phase:
- Dataset: mRobust (TREC Robust04 Portuguese translation)
- Documents: 528,032 Portuguese documents
- Queries: 250 test queries in Portuguese
- QRELs: Relevance judgments (which docs are relevant for each query)
- Purpose: Final evaluation on completely unseen data
No Data Leakage: Training (mMARCO) and testing (mRobust) are completely different datasets with different documents and queries, ensuring valid evaluation.
Training Phase:
- Dataset: MS MARCO (English)
- Corpus: ~8.8M English documents
- Training Queries: ~500k English queries
- Triplets: Query-positive-negative triplets in English
- Base Model: BERT-base-uncased (English vocabulary)
- Model: naver/splade-cocondenser-ensembledistil
Original Test Phase:
- Datasets: BEIR, MS MARCO Dev, TREC (all in English)
- Evaluated on standard English IR benchmarks
Cross-Lingual Test (in this project):
- Dataset: mRobust Portuguese (same as SPLADE-PT-BR test)
- Purpose: Compare English model performance on Portuguese data
- Result: MRR@10 = 0.383 (significantly lower than PT-BR's 0.453)
| Aspect | SPLADE-EN | SPLADE-PT-BR |
|---|---|---|
| Training Dataset | MS MARCO (English) | mMARCO (Portuguese) |
| Training Corpus Size | ~8.8M docs (EN) | ~8.8M docs (PT) |
| Validation Dataset | MS MARCO Dev (EN) | mRobust (PT) |
| Test Dataset | BEIR/TREC (EN) | mRobust (PT) |
| Base Model | BERT-base-uncased | BERTimbau |
| Vocabulary | ~30k tokens (EN) | 29,794 tokens (PT) |
| Cross-Lingual Test | mRobust (PT) | mRobust (PT) |
| MRR@10 on PT data | 0.383 | 0.453 (+18.3%) |
Key Insight: The +18.3% performance improvement demonstrates that native language training is crucial for optimal retrieval quality. Cross-lingual models (ENβPT) significantly underperform compared to models trained directly on Portuguese data.
splade/data/pt/
├── triplets/                  # Training Data (mMARCO)
│   ├── corpus.tsv             # 8.8M Portuguese documents
│   ├── queries_train.tsv      # Training queries
│   └── raw.tsv                # Query-doc-doc triplets
│
└── val_retrieval/             # Test Data (mRobust)
    ├── collection/
    │   └── raw.tsv            # 528k test documents
    ├── queries/
    │   └── raw.tsv            # 250 test queries
    └── qrel.json              # Relevance judgments
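For illustration, the relevance judgments in qrel.json could be loaded as shown below. The {query_id: {doc_id: label}} schema and the sample IDs are assumptions for the sketch, not verified against the actual file:

```python
# Load qrel.json (assumed schema: {query_id: {doc_id: relevance_label}})
# and collect the set of relevant doc IDs per query.
import json

sample = '{"301": {"FBIS3-10082": 1, "FBIS3-10169": 0}}'  # made-up excerpt
qrels = json.loads(sample)

relevant = {qid: {doc_id for doc_id, label in docs.items() if label > 0}
            for qid, docs in qrels.items()}
print(relevant)  # {'301': {'FBIS3-10082'}}
```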
Download & Setup: All datasets are automatically downloaded by scripts/training/train_splade_pt.py from HuggingFace Hub.
Base Model: neuralmind/bert-base-portuguese-cased
Training Data: mMARCO Portuguese (unicamp-dl/mmarco)
Validation Data: mRobust (unicamp-dl/mrobust)
Iterations: 150,000
Batch Size: 8 (effective: 32 with gradient accumulation)
Learning Rate: 2e-5
Regularization: FLOPS (Ξ»_q=0.0003, Ξ»_d=0.0001)
Mixed Precision: FP16

Using Training Script (Recommended):
# Full training pipeline
python scripts/training/train_splade_pt.py
# Skip completed steps
python scripts/training/train_splade_pt.py --skip-setup --skip-download

Using Jupyter Notebook:
The training notebook is available in notebooks/SPLADE_v2_PTBR_treinamento.ipynb.
Manual Training:
cd splade
SPLADE_CONFIG_NAME=config_splade_pt python3 -m splade.train_from_triplets_ids

- The splade/ directory is not included in this repository
- It is automatically cloned from https://github.com/leobavila/splade.git during training
- Necessary patches (AdamW, lazy loading, memory optimizations) are applied automatically
- This keeps the repository clean and ensures you get the latest SPLADE code with Portuguese-specific patches
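The FLOPS regularization listed in the training configuration (λ_q=0.0003, λ_d=0.0001) is what pushes the learned vectors toward sparsity. A plain-Python sketch of the standard FLOPS loss, for clarity only (the actual training code lives in the SPLADE package):

```python
# FLOPS loss: for each vocabulary dimension, average the activation
# across the batch, square it, and sum over dimensions. Dimensions that
# stay near zero across the batch contribute nothing, so minimizing this
# term encourages sparse vectors.

def flops_loss(batch_vecs: list[list[float]]) -> float:
    n = len(batch_vecs)
    dims = len(batch_vecs[0])
    return sum((sum(vec[j] for vec in batch_vecs) / n) ** 2 for j in range(dims))

# Toy batch of two 3-dimensional "sparse" vectors
batch = [[0.0, 2.0, 0.0],
         [0.0, 0.0, 4.0]]
print(flops_loss(batch))  # means are (0, 1, 2) -> 0 + 1 + 4 = 5.0
```

In training, the query and document losses are weighted separately (λ_q and λ_d above) and added to the ranking loss.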
# Run evaluation
python scripts/evaluation/run_evaluation_comparator.py --pt-only

Results are saved to evaluation_results/ with detailed metrics and an execution summary.
SPLADE-PT-BR/
├── docs/USAGE.md              # Usage guide
├── notebooks/                 # Training notebook
├── scripts/
│   ├── training/train_splade_pt.py
│   └── evaluation/run_evaluation_comparator.py
├── evaluation_results/        # Evaluation metrics
└── splade/                    # Auto-cloned during training
- SPLADE by NAVER Labs (naver/splade) and leobavila/splade fork
- BERTimbau by Neuralmind
- mMARCO & mRobust Portuguese by UNICAMP-DL
- Quati Dataset research (Bueno et al., 2024) - Inspiration for native Portuguese IR
@misc{splade-pt-br-2025,
author = {Axel Chepanski},
title = {SPLADE-PT-BR: Sparse Retrieval for Portuguese},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/AxelPCG/splade-pt-br}
}

Apache 2.0 License