This repository contains end-to-end code for:
- multilingual data preparation and sampling,
- tokenizer extension and tokenization pipelines,
- parameter-efficient adapter training (LoRA / xLoRA),
- LM Harness-based evaluation and experiment logging.
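As a rough illustration of what the LoRA-style adapters here train, the following is a dependency-free toy of the low-rank update, not code from this repository; the matrix names and sizes are illustrative:

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha=4.0, r=2):
    """Toy LoRA forward pass: y = W x + (alpha / r) * B A x.

    W is the frozen base weight; only the low-rank factors
    A (r x d_in) and B (d_out x r) are trained.
    """
    base = matvec(W, x)                 # frozen base projection
    update = matvec(B, matvec(A, x))    # rank-r correction B (A x)
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]
```

With B initialized to zeros (the standard LoRA recipe), the adapter starts as a no-op and the output equals W x, so training begins from the base model's behavior.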
The codebase is organized to support reproducible training workflows for large language models, with scripts for both local and cluster environments.
- data_preparation/: dataset download and preprocessing utilities
- data_sampler/: FineWeb2 sampling and dataset sizing tools
- training/: baseline training scripts
- language_adapters/: adapter-focused training, tokenizer extension, and analysis
- language_adapters/xlora/: xLoRA training and inference utilities
- evaluation/: LM Harness evaluation scripts and result export tools
- calm_adapter_training/: experimental CALM adapter approach
- docs/: setup notes and supporting documentation
- Create an environment and install dependencies:

  ```bash
  python -m venv .venv
  source .venv/bin/activate
  pip install --upgrade pip
  pip install -r requirements.txt
  ```

- Prepare data:

  ```bash
  python data_preparation/download_datasets.py --output_dir ./data/raw
  python data_preparation/preprocess_data.py --input_dir ./data/raw --output_dir ./data/processed
  ```

- (Optional) Sample FineWeb2:

  ```bash
  bash data_sampler/run_fineweb2_sampler.sh
  ```

- Train an adapter:

  ```bash
  python language_adapters/train_language_adapter.py --help
  ```

- Evaluate checkpoints:

  ```bash
  python evaluation/lm_harness_single.py --help
  python evaluation/lm_harness_single_model.py --help
  ```

- Training/evaluation scripts set deterministic seeds where applicable.
- Checkpoint artifacts and logs are intentionally excluded from version control.
- Many run scripts are starter templates and should be adapted via CLI args or environment variables for your infrastructure.
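The determinism note above can be sketched as follows; this is a minimal stdlib-only version (the function name is illustrative, and real training scripts would also seed numpy and torch when those libraries are in use):

```python
import random

def set_seed(seed: int) -> None:
    """Seed the stdlib RNG so repeated runs draw identical samples.

    A minimal sketch: training code would additionally call e.g.
    numpy.random.seed(seed) and torch.manual_seed(seed).
    """
    random.seed(seed)
```

Calling this once at the top of a script makes shuffles and random splits reproducible across runs with the same seed.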
This repository tracks source code, configs, and lightweight metadata only. Large model checkpoints, optimizer states, and run logs are excluded via .gitignore and should be stored in artifact storage (for example: W&B artifacts, object storage, or external model registries).
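Since the run scripts are starter templates (see the notes above), one common way to adapt them is with environment-variable defaults; the variable names below are hypothetical, not taken from the actual scripts:

```shell
# Hypothetical template pattern: each setting falls back to a default
# unless the caller exports an override (variable names are illustrative).
DATA_DIR="${DATA_DIR:-./data/processed}"
OUTPUT_DIR="${OUTPUT_DIR:-./checkpoints}"
NUM_GPUS="${NUM_GPUS:-1}"

echo "data=${DATA_DIR} output=${OUTPUT_DIR} gpus=${NUM_GPUS}"
```

A caller can then override a single setting without editing the script, e.g. `NUM_GPUS=4 bash run_script.sh`.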