SPLADE-PT-BR

HuggingFace License Python 3.11+

SPLADE sparse retrieval model trained for Brazilian Portuguese

Model Card • Usage Guide • Training • Results


📌 Overview

SPLADE-PT-BR is a sparse neural retrieval model optimized for Brazilian Portuguese text search. Built on BERTimbau and trained on Portuguese question-answering data, it produces interpretable sparse vectors well suited to RAG systems and semantic search.

Why SPLADE-PT-BR?

  • 🎯 Native Portuguese: Trained on BERTimbau with Portuguese-specific vocabulary
  • ⚡ Fast & Efficient: ~99.5% sparse vectors enable inverted-index search
  • 🔍 Semantic Expansion: Automatically expands queries with related terms
  • 🛠️ Easy Integration: Works with any vector database or custom retrieval system
  • 📊 High Quality: 150K training iterations, final loss 0.000047
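The inverted-index claim is easy to see with a toy example: because only ~100-150 of ~30k dimensions are non-zero, each document contributes postings for just its active terms, and scoring a query touches only the posting lists for the query's terms. A minimal pure-Python sketch (toy term IDs and weights, not real model output):

```python
from collections import defaultdict

# Toy sparse vectors: {term_id: weight}. In SPLADE these come from the
# model's ~30k-dimensional output with only ~100-150 non-zero entries.
docs = {
    "d1": {12: 1.4, 87: 0.6},
    "d2": {12: 0.3, 501: 2.1},
    "d3": {87: 1.1, 900: 0.9},
}

# Build an inverted index: term_id -> list of (doc_id, weight) postings.
index = defaultdict(list)
for doc_id, vec in docs.items():
    for term, weight in vec.items():
        index[term].append((doc_id, weight))

def search(query_vec):
    """Score documents by sparse dot product, visiting only shared terms."""
    scores = defaultdict(float)
    for term, q_weight in query_vec.items():
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(search({12: 1.0, 87: 0.5}))
```

Documents sharing no terms with the query are never touched, which is what makes sparse retrieval fast at scale.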

🚀 Quick Start

Installation

# Install system dependencies
sudo apt-get update && sudo apt-get install -y python3.11-dev build-essential

# Install Python dependencies
uv sync

Load Model

from transformers import AutoTokenizer
from splade.models.transformer_rep import Splade
import torch

# Initialize SPLADE model with the trained BERT-MLM from HuggingFace
model = Splade(
    model_type_or_dir="AxelPCG/splade-pt-br",  # HF repo with trained BERT-MLM weights
    agg="max"  # Aggregation method used during training
)
tokenizer = AutoTokenizer.from_pretrained("neuralmind/bert-base-portuguese-cased")

# Set to evaluation mode and move to device
model.eval()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

⚠️ Important: SPLADE is a custom architecture and cannot be loaded with AutoModel.from_pretrained(). You must instantiate the Splade class directly as shown above.

Encode Text

# Encode query
query = "Qual Γ© a capital do Brasil?"
query_tokens = tokenizer(query, return_tensors="pt", max_length=256, truncation=True)

# Move tokens to device
query_tokens = {k: v.to(device) for k, v in query_tokens.items()}

with torch.no_grad():
    query_vec = model(q_kwargs=query_tokens)["q_rep"].squeeze()

# Get sparse representation
indices = torch.nonzero(query_vec).squeeze()
if indices.dim() == 0:  # Handle single element case
    indices = indices.unsqueeze(0)
indices = indices.cpu().tolist()
values = query_vec[indices].cpu().tolist()

print(f"Active dimensions: {len(indices)} / {query_vec.shape[0]}")
print(f"Sparsity: {(1 - len(indices) / query_vec.shape[0]) * 100:.1f}%")
# Output: ~120 / 29794 dimensions (~99.6% sparse)

For complete examples including retrieval, see USAGE.md.
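Documents can be encoded the same way as queries, and ranking then reduces to a sparse dot product over the (indices, values) pairs extracted above. A minimal pure-Python sketch with hypothetical index/value lists:

```python
def sparse_dot(indices_a, values_a, indices_b, values_b):
    """Dot product of two sparse vectors given as parallel index/value lists."""
    weights_b = dict(zip(indices_b, values_b))
    return sum(v * weights_b[i] for i, v in zip(indices_a, values_a) if i in weights_b)

# Hypothetical (indices, values) pairs, as produced by the snippet above.
q_idx, q_val = [3, 17, 42], [0.8, 1.2, 0.5]
d_idx, d_val = [3, 42, 99], [1.0, 2.0, 0.7]

score = sparse_dot(q_idx, q_val, d_idx, d_val)
print(score)  # 0.8*1.0 + 0.5*2.0
```

Only dimensions active in both vectors contribute to the score, so this is exactly the ranking function an inverted index computes.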


📊 Model Details & Results

| Metric | Value |
|---|---|
| Base Model | BERTimbau (neuralmind/bert-base-portuguese-cased) |
| Training Dataset | mMARCO Portuguese (unicamp-dl/mmarco) |
| Validation Dataset | mRobust (unicamp-dl/mrobust) |
| Iterations | 150,000 |
| Final Loss | 0.000047 |
| Vocabulary Size | 29,794 |
| Sparsity | ~99.5% (100-150 active dims) |

Evaluation Results

Dataset: mRobust (TREC Robust04 Portuguese)

  • 528,032 documents
  • 250 queries
  • Evaluation date: 2025-12-02

SPLADE-PT-BR Metrics

| Metric | Score | Description |
|---|---|---|
| MRR@10 | 0.453 | Mean Reciprocal Rank; first relevant document at position ~2.2 on average |

Comparison: SPLADE-PT-BR vs SPLADE-EN

Performance on Portuguese dataset (mRobust - 528k docs, 250 queries):

| Model | Language | Base Model | MRR@10 | Performance |
|---|---|---|---|---|
| SPLADE-PT-BR | Portuguese | BERTimbau | 0.453 | +18.3% over baseline |
| SPLADE-EN | English | BERT-base-uncased | 0.383 | Baseline |

Key Findings:

  • ✅ SPLADE-PT-BR is 18.3% better than SPLADE-EN on Portuguese queries
  • ✅ Native Portuguese training (BERTimbau + mMARCO-PT) significantly improves retrieval quality
  • ✅ An MRR@10 of 0.453 means the first relevant document appears at position ~2.2 on average (1/0.453 ≈ 2.2)

Interpretation: The Portuguese-adapted model demonstrates clear superiority over the English model for Portuguese IR tasks, validating the importance of language-specific training.
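For reference, MRR@10 can be computed from any ranked run in a few lines; a minimal sketch (the function and sample data are illustrative, not the project's evaluation script):

```python
def mrr_at_k(rankings, relevant, k=10):
    """Mean Reciprocal Rank at cutoff k.

    rankings: {query_id: [doc_id, ...]} ranked result lists
    relevant: {query_id: set of relevant doc_ids} (i.e. the qrels)
    """
    total = 0.0
    for qid, ranked in rankings.items():
        for pos, doc in enumerate(ranked[:k], start=1):
            if doc in relevant.get(qid, set()):
                total += 1.0 / pos  # reciprocal rank of first relevant hit
                break
    return total / len(rankings)

rankings = {"q1": ["d9", "d2", "d7"], "q2": ["d1", "d5"]}
relevant = {"q1": {"d2"}, "q2": {"d1"}}
print(mrr_at_k(rankings, relevant))  # (1/2 + 1/1) / 2 = 0.75
```

Queries with no relevant document in the top k contribute 0, which is why MRR rewards models that surface a relevant hit early.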

📊 For detailed evaluation metrics and comparison results, see scripts/evaluation/README.md or evaluation_results/comparison_report_*.json


📚 Dataset Split Methodology

SPLADE-PT-BR (This Model)

Training Phase:

  • Dataset: mMARCO Portuguese (unicamp-dl/mmarco)
    • Corpus: portuguese_collection.tsv (~8.8M documents)
    • Training Queries: portuguese_queries.train.tsv (training queries)
    • Triplets: triples.train.ids.small.tsv (query-positive doc-negative doc triplets)
  • Base Model: BERTimbau (neuralmind/bert-base-portuguese-cased)
  • Purpose: Learn Portuguese-specific semantic expansion and term weighting

Validation Phase (during training):

  • Dataset: mRobust (unicamp-dl/mrobust)
    • Used for validation checkpoints during training
    • Ensures the model generalizes to unseen Portuguese data

Test/Evaluation Phase:

  • Dataset: mRobust (TREC Robust04 Portuguese translation)
    • Documents: 528,032 Portuguese documents
    • Queries: 250 test queries in Portuguese
    • QRELs: Relevance judgments (which docs are relevant for each query)
  • Purpose: Final evaluation on completely unseen data

✅ No Data Leakage: Training (mMARCO) and testing (mRobust) are completely different datasets with different documents and queries, ensuring valid evaluation.


SPLADE-EN (Original NAVER Model)

Training Phase:

  • Dataset: MS MARCO (English)
    • Corpus: ~8.8M English documents
    • Training Queries: ~500k English queries
    • Triplets: Query-positive-negative triplets in English
  • Base Model: BERT-base-uncased (English vocabulary)
  • Model: naver/splade-cocondenser-ensembledistil

Original Test Phase:

  • Datasets: BEIR, MS MARCO Dev, TREC (all in English)
  • Evaluated on standard English IR benchmarks

Cross-Lingual Test (in this project):

  • Dataset: mRobust Portuguese (same as SPLADE-PT-BR test)
  • Purpose: Compare English model performance on Portuguese data
  • Result: MRR@10 = 0.383 (significantly lower than PT-BR's 0.453)

Dataset Comparison Table

| Aspect | SPLADE-EN | SPLADE-PT-BR |
|---|---|---|
| Training Dataset | MS MARCO (English) | mMARCO (Portuguese) |
| Training Corpus Size | ~8.8M docs (EN) | ~8.8M docs (PT) |
| Validation Dataset | MS MARCO Dev (EN) | mRobust (PT) |
| Test Dataset | BEIR/TREC (EN) | mRobust (PT) |
| Base Model | BERT-base-uncased | BERTimbau |
| Vocabulary | ~30k tokens (EN) | 29,794 tokens (PT) |
| Cross-Lingual Test | mRobust (PT) ⚠️ | mRobust (PT) ✅ |
| MRR@10 on PT data | 0.383 | 0.453 (+18.3%) |

Key Insight: The +18.3% performance improvement demonstrates that native language training is crucial for optimal retrieval quality. Cross-lingual models (EN→PT) significantly underperform compared to models trained directly on Portuguese data.


Data Directory Structure

splade/data/pt/
├── triplets/              # Training Data (mMARCO)
│   ├── corpus.tsv         # 8.8M Portuguese documents
│   ├── queries_train.tsv  # Training queries
│   └── raw.tsv            # Query-doc-doc triplets
│
└── val_retrieval/         # Test Data (mRobust)
    ├── collection/
    │   └── raw.tsv        # 528k test documents
    ├── queries/
    │   └── raw.tsv        # 250 test queries
    └── qrel.json          # Relevance judgments

Download & Setup: All datasets are automatically downloaded by scripts/training/train_splade_pt.py from HuggingFace Hub.
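If you need to inspect the triplets file by hand, it is plain TSV; the sketch below assumes the usual MS MARCO layout of one (query_id, positive_doc_id, negative_doc_id) triple per line, and the sample IDs are made up:

```python
import csv
import io

# Assumed layout of triples.train.ids.small.tsv:
# query_id <TAB> positive_doc_id <TAB> negative_doc_id
sample = "1185869\t59187\t3752951\n1185868\t59190\t4361162\n"

triples = [
    (int(qid), int(pos), int(neg))
    for qid, pos, neg in csv.reader(io.StringIO(sample), delimiter="\t")
]
print(triples[0])
```

For the real file, replace the io.StringIO buffer with open(...) on the downloaded path.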


🔬 Training

Configuration

Base Model: neuralmind/bert-base-portuguese-cased
Training Data: mMARCO Portuguese (unicamp-dl/mmarco)
Validation Data: mRobust (unicamp-dl/mrobust)
Iterations: 150,000
Batch Size: 8 (effective: 32 with gradient accumulation)
Learning Rate: 2e-5
Regularization: FLOPS (λ_q=0.0003, λ_d=0.0001)
Mixed Precision: FP16
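The FLOPS regularizer penalizes the squared mean absolute activation of each vocabulary dimension across the batch, pushing entire dimensions toward zero and producing the sparsity reported above; during training it is weighted by λ_q for queries and λ_d for documents. A minimal pure-Python sketch of the loss (real training computes this on GPU tensors):

```python
def flops_loss(batch):
    """FLOPS regularizer for a batch of activation vectors.

    batch: list of equal-length activation vectors (lists of floats).
    Returns sum over vocab dimensions of (mean absolute activation)^2.
    """
    n, dim = len(batch), len(batch[0])
    loss = 0.0
    for j in range(dim):
        mean_abs = sum(abs(row[j]) for row in batch) / n
        loss += mean_abs ** 2
    return loss

batch = [[0.0, 2.0, 0.0], [0.0, 1.0, 1.0]]
print(flops_loss(batch))  # 0^2 + 1.5^2 + 0.5^2 = 2.5
```

Because the penalty is on the per-dimension mean, it is cheapest for the model to zero out rarely useful dimensions entirely, which is exactly the behavior an inverted index benefits from.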

Run Training

Using Training Script (Recommended):

# Full training pipeline
python scripts/training/train_splade_pt.py

# Skip completed steps
python scripts/training/train_splade_pt.py --skip-setup --skip-download

Using Jupyter Notebook:

The training notebook is available in notebooks/SPLADE_v2_PTBR_treinamento.ipynb.

Manual Training:

cd splade
SPLADE_CONFIG_NAME=config_splade_pt python3 -m splade.train_from_triplets_ids

Important Notes

  • The splade/ directory is not included in this repository
  • It is automatically cloned from https://github.com/leobavila/splade.git during training
  • Necessary patches (AdamW, lazy loading, memory optimizations) are applied automatically
  • This keeps the repository clean and ensures you get the latest SPLADE code with Portuguese-specific patches

📈 Evaluation

# Run evaluation
python scripts/evaluation/run_evaluation_comparator.py --pt-only

Results saved to evaluation_results/ with detailed metrics and execution summary.


πŸ“ Project Structure

SPLADE-PT-BR/
├── docs/USAGE.md                # Usage guide
├── notebooks/                   # Training notebook
├── scripts/
│   ├── training/train_splade_pt.py
│   └── evaluation/run_evaluation_comparator.py
├── evaluation_results/          # Evaluation metrics
└── splade/                      # Auto-cloned during training

πŸ™ Acknowledgments


📚 Citation

@misc{splade-pt-br-2025,
  author = {Axel Chepanski},
  title = {SPLADE-PT-BR: Sparse Retrieval for Portuguese},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/AxelPCG/splade-pt-br}
}

📄 License

Apache 2.0 License

