XLM-SWCM (Cross-lingual Language Model with Shared Weights Cross-lingual Modeling) is a sequence-to-sequence model designed to address the challenges of extremely low-resource languages. The framework introduces a weight-sharing mechanism between the encoder and decoder components, enabling effective knowledge transfer from multilingual encoders to generation tasks.

This repository contains the implementation of a multilingual sequence-to-sequence model that leverages shared-weights pretraining for extremely low-resource languages. The model combines a CINO-v2-base encoder with a custom interleaved transformer decoder architecture.
Primary focus on Chinese minority languages:
- Tibetan (bo)
- Uyghur (ug)
- Kazakh (kk)
- Mongolian (mn)
- Chinese (zh)
The model features:
- Encoder: CINO-v2-base for multilingual understanding
- Decoder: Custom interleaved transformer with dual FFN layers
- Hybrid Design: Combines normal and custom decoder layers
- Initialization: Leverages pre-trained encoder weights for decoder initialization
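The dual-FFN decoder layer and its encoder-weight initialization can be sketched roughly as follows. This is a minimal illustration in plain PyTorch; the class and method names here are hypothetical, not the repository's actual API (see `model.py` for the real implementation):

```python
import torch
import torch.nn as nn

class DualFFNDecoderLayer(nn.Module):
    """Illustrative decoder layer with two feed-forward sublayers:
    one initialized from a pre-trained encoder layer's FFN (knowledge
    transfer), the other freshly initialized for generation."""

    def __init__(self, d_model=768, n_heads=12, d_ff=3072):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Two FFN sublayers: ffn_shared receives encoder weights, ffn_new is random
        self.ffn_shared = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.ffn_new = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(4))

    def init_from_encoder_ffn(self, enc_linear1, enc_linear2):
        # Copy pre-trained encoder FFN weights into the shared FFN sublayer
        self.ffn_shared[0].load_state_dict(enc_linear1.state_dict())
        self.ffn_shared[2].load_state_dict(enc_linear2.state_dict())

    def forward(self, x, memory):
        # Post-norm residual blocks: self-attn -> shared FFN -> cross-attn -> new FFN
        x = self.norms[0](x + self.self_attn(x, x, x)[0])
        x = self.norms[1](x + self.ffn_shared(x))
        x = self.norms[2](x + self.cross_attn(x, memory, memory)[0])
        x = self.norms[3](x + self.ffn_new(x))
        return x
```

The shared FFN lets the decoder start from the multilingual representations the encoder already learned, while the fresh FFN gives it capacity to specialize for generation.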
- NormalDecoderLayer: Standard transformer decoder layer
- CustomDecoderLayer: Modified decoder with interleaved FFN architecture
- InterleavedTransformerDecoder: Hybrid decoder combining both layer types
- Seq2SeqModel: Complete encoder-decoder architecture
Create a conda environment:

```bash
conda create -n seq2seq python=3.8
conda activate seq2seq
```

Install a PyTorch build compatible with your GPU. Visit the PyTorch official website to get the appropriate command for your system.

For CUDA 11.8:

```bash
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
```

For CUDA 12.1:

```bash
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
```

For CPU only:

```bash
conda install pytorch torchvision torchaudio cpuonly -c pytorch
```

Install the remaining dependencies (torchaudio is already covered by the conda commands above):

```bash
pip install "transformers>=4.21.0"
pip install "tokenizers>=0.13.0"
pip install sentencepiece
```

Download the CINO v2 base model from Hugging Face:
Option 1: Using huggingface_hub (recommended)

```bash
# Install huggingface_hub
pip install huggingface_hub

# Download the model files
python -c "
from huggingface_hub import snapshot_download
snapshot_download(repo_id='hfl/cino-base-v2', local_dir='./base')
"
```

Option 2: Manual download

```bash
# Create the base directory
mkdir -p base
```

- Visit: https://huggingface.co/hfl/cino-base-v2
- Download all model files (config.json, pytorch_model.bin, tokenizer files, etc.)
- Place all downloaded files in the ./base/ directory

Option 3: Direct loading in code

```python
# The model can also be loaded directly without a local download
from transformers import XLMRobertaModel
model = XLMRobertaModel.from_pretrained('hfl/cino-base-v2')
```

Required files in the base/ directory:

- config.json
- pytorch_model.bin
- tokenizer.json
- tokenizer_config.json
- vocab.txt
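As a quick sanity check that the download completed, a small helper script (not part of the repository; the file list mirrors the one above) can report anything missing from `./base/`:

```python
from pathlib import Path

# Required CINO v2 base model files, as listed above
REQUIRED = [
    "config.json",
    "pytorch_model.bin",
    "tokenizer.json",
    "tokenizer_config.json",
    "vocab.txt",
]

def missing_files(base_dir="./base"):
    """Return the required model files not yet present in base_dir."""
    base = Path(base_dir)
    return [name for name in REQUIRED if not (base / name).is_file()]

if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All required files present in ./base/")
```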
Download the XLM-SWCM model weights:

```bash
# Create the pretrained_model directory
mkdir -p pretrained_model
```

- Download the XLM-SWCM weights from Hugging Face (coming soon): https://huggingface.co/KEVVVV/xlm-swcm
- Place the downloaded xlm-swcm.bin file in ./pretrained_model/

Note: The XLM-SWCM weights will be available on Hugging Face soon. Check back for updates.
```python
import torch
from transformers import XLMRobertaTokenizer, XLMRobertaConfig
from model import Seq2SeqModel

# Configuration
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_path = "./pretrained_model/xlm-swcm.bin"
xlm_model_path = "./base"

# Load configuration and model
config = XLMRobertaConfig.from_pretrained(xlm_model_path)
model = Seq2SeqModel(
    model_name_or_path=xlm_model_path,
    decoder_config=config,
    device=device,
    tgtlen=256,
    batchsize=1,
    teacher_forcing=0.0
)

# Load pre-trained weights
checkpoint = torch.load(model_path, map_location=device)
model.load_state_dict(checkpoint, strict=False)
model.eval()

# Load tokenizer
tokenizer = XLMRobertaTokenizer.from_pretrained(xlm_model_path)

# Example inference (greedy decoding)
sample_text = "Your input text here"
inputs = tokenizer(sample_text, return_tensors='pt', max_length=256, truncation=True)
with torch.no_grad():
    outputs = model.greedy_decode(inputs['input_ids'], inputs['attention_mask'])

# Beam search decoding
beam_size = 5
n_best = 3
with torch.no_grad():
    batch_hyp, batch_scores = model.beam_decode(
        src_seq=inputs['input_ids'],
        src_mask=inputs['attention_mask'],
        beam_size=beam_size,
        n_best=n_best
    )

# Process results
for hyp, scores in zip(batch_hyp, batch_scores):
    for h, s in zip(hyp, scores):
        decoded = tokenizer.decode(h, skip_special_tokens=True)
        print(f"Score: {s:.4f} | Text: {decoded}")
```

Run the example script:

```bash
python inference_example.py
```

Project layout:

```
your-project/
├── model.py                  # Main model implementation
├── inference_example.py      # Example inference script
├── base/                     # CINO v2 base model
│   ├── config.json
│   ├── pytorch_model.bin
│   ├── tokenizer.json
│   ├── tokenizer_config.json
│   └── vocab.txt
├── pretrained_model/         # Pre-trained weights
│   └── xlm-swcm.bin
├── transformer/              # Additional transformer utilities
│   ├── Constants.py
│   └── Beam.py
└── README.md
```
- tgtlen: Maximum target sequence length (default: 256)
- batchsize: Batch size for inference (default: 1)
- teacher_forcing: Teacher forcing ratio during training (0.0 for inference)
- beam_size: Number of beams for beam search (default: 5)
- n_best: Number of best hypotheses to return (default: 3)
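The teacher_forcing ratio follows the usual convention: at each decoding step during training, the ground-truth token is fed to the decoder with probability equal to the ratio, and the model's own previous prediction otherwise. A generic sketch of that convention (not this repository's actual training loop):

```python
import random

def next_decoder_input(gold_token, predicted_token, teacher_forcing_ratio):
    """Pick the next decoder input: the gold token with probability
    `teacher_forcing_ratio`, otherwise the model's own prediction.
    With a ratio of 0.0 (inference) the prediction is always used;
    with 1.0 the gold token is always used."""
    if random.random() < teacher_forcing_ratio:
        return gold_token
    return predicted_token
```

This is why the example above passes `teacher_forcing=0.0`: at inference time there is no gold target, so the decoder must always consume its own output.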
- Custom decoder layers with dual FFN structure
- Regular insertion of normal decoder layers every 3 custom layers
- Encoder weight initialization for improved convergence
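One plausible reading of the interleaving pattern described above, assuming a normal decoder layer follows every third custom layer (the authoritative construction lives in `InterleavedTransformerDecoder`; this helper is purely illustrative):

```python
def build_layer_pattern(n_custom, insert_every=3):
    """Return the layer-type sequence for the hybrid decoder:
    one 'normal' layer inserted after every `insert_every`
    'custom' (dual-FFN) layers."""
    pattern = []
    for i in range(1, n_custom + 1):
        pattern.append("custom")
        if i % insert_every == 0:
            pattern.append("normal")
    return pattern
```

For example, `build_layer_pattern(6)` yields three custom layers, a normal layer, three more custom layers, and a final normal layer.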
If you use this model in your research, please cite:
```bibtex
@article{su2025multilingualencoderknowsrealize,
  author     = {Zeli Su and Ziyin Zhang and Guixian Xu and Jianing Liu and Xu Han and Ting Zhang and Yushuang Dong},
  title      = {Multilingual Encoder Knows more than You Realize: Shared Weights Pretraining for Extremely Low-Resource Languages},
  journal    = {CoRR},
  volume     = {abs/2502.10852},
  year       = {2025},
  url        = {https://doi.org/10.48550/arXiv.2502.10852},
  doi        = {10.48550/ARXIV.2502.10852},
  eprinttype = {arXiv},
  eprint     = {2502.10852}
}
```

This project is released under the MIT License. See the LICENSE file for details.