This repository contains end-to-end code for:
- multilingual data preparation and sampling,
- tokenizer extension and tokenization pipelines,
- parameter-efficient adapter training (LoRA / xLoRA),
- LM Harness-based evaluation and experiment logging.
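As a rough illustration of what the LoRA-style adapters here train, the following is a dependency-free toy of the low-rank update, not code from this repository; the matrix names and sizes are illustrative:

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha=4.0, r=2):
    """Toy LoRA forward pass: y = W x + (alpha / r) * B A x.

    W is the frozen base weight; only the low-rank factors
    A (r x d_in) and B (d_out x r) are trained.
    """
    base = matvec(W, x)                 # frozen base projection
    update = matvec(B, matvec(A, x))    # rank-r correction B (A x)
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]
```

With B initialized to zeros (the standard LoRA recipe), the adapter starts as a no-op and the output equals W x, so training begins from the base model's behavior.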
The codebase is organized to support reproducible training workflows for large language models, with scripts for both local and cluster environments.
- data_preparation/: dataset download and preprocessing utilities
- data_sampler/: FineWeb2 sampling and dataset sizing tools
- training/: baseline training scripts
- language_adapters/: adapter-focused training, tokenizer extension, and analysis
- language_adapters/xlora/: xLoRA training and inference utilities
- evaluation/: LM Harness evaluation scripts and result export tools
- calm_adapter_training/: experimental CALM adapter approach
- docs/: setup notes and supporting documentation
- Create an environment and install dependencies:

  ```bash
  python -m venv .venv
  source .venv/bin/activate
  pip install --upgrade pip
  pip install -r requirements.txt
  ```

- Prepare data:

  ```bash
  python data_preparation/download_datasets.py --output_dir ./data/raw
  python data_preparation/preprocess_data.py --input_dir ./data/raw --output_dir ./data/processed
  ```

- (Optional) Sample FineWeb2:

  ```bash
  bash data_sampler/run_fineweb2_sampler.sh
  ```

- Train an adapter:

  ```bash
  python language_adapters/train_language_adapter.py --help
  ```

- Evaluate checkpoints:

  ```bash
  python evaluation/lm_harness_single.py --help
  python evaluation/lm_harness_single_model.py --help
  ```

- Training/evaluation scripts set deterministic seeds where applicable.
- Checkpoint artifacts and logs are intentionally excluded from version control.
- Many run scripts are starter templates and should be adapted via CLI args or environment variables for your infrastructure.
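The determinism note above can be sketched as follows; this is a minimal stdlib-only version (the function name is illustrative, and real training scripts would also seed numpy and torch when those libraries are in use):

```python
import random

def set_seed(seed: int) -> None:
    """Seed the stdlib RNG so repeated runs draw identical samples.

    A minimal sketch: training code would additionally call e.g.
    numpy.random.seed(seed) and torch.manual_seed(seed).
    """
    random.seed(seed)
```

Calling this once at the top of a script makes shuffles and random splits reproducible across runs with the same seed.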
This repository tracks source code, configs, and lightweight metadata only. Large model checkpoints, optimizer states, and run logs are excluded via .gitignore and should be stored in artifact storage (for example: W&B artifacts, object storage, or external model registries).
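Since the run scripts are starter templates (see the notes above), one common way to adapt them is with environment-variable defaults; the variable names below are hypothetical, not taken from the actual scripts:

```shell
# Hypothetical template pattern: each setting falls back to a default
# unless the caller exports an override (variable names are illustrative).
DATA_DIR="${DATA_DIR:-./data/processed}"
OUTPUT_DIR="${OUTPUT_DIR:-./checkpoints}"
NUM_GPUS="${NUM_GPUS:-1}"

echo "data=${DATA_DIR} output=${OUTPUT_DIR} gpus=${NUM_GPUS}"
```

A caller can then override a single setting without editing the script, e.g. `NUM_GPUS=4 bash run_script.sh`.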