This repository fine-tunes a lightweight question answering model on CUAD (Contract Understanding Atticus Dataset) using LoRA adapters. It focuses on simplicity and speed while keeping the workflow clear and reproducible.
- Preprocessing of CUAD-style QA data into PyTorch `DataLoader`s
- LoRA setup on a QA head (default: `distilbert-base-uncased`)
- Training loop with best-checkpoint saving and history logging
- Simple evaluation loader and a LoRA sanity-check utility
Tested with Python 3.10+.
```bash
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt
```

The training loop auto-selects the device: CUDA > MPS (Apple Silicon) > CPU.
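The device-selection order can be sketched as follows. This is a minimal illustration, assuming the helper name `select_device`; the actual implementation lives in `TrainingConfig` in `train.py` and may differ.

```python
import torch

def select_device() -> torch.device:
    """Pick the fastest available device: CUDA > MPS (Apple Silicon) > CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    mps = getattr(torch.backends, "mps", None)  # absent on older PyTorch builds
    if mps is not None and mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

device = select_device()
print(f"Training on: {device}")
```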
Expected files under `data/`:

- `train_separate_questions.json` (train)
- `test.json` (test)

Both follow the SQuAD-style CUAD format, with fields nested under `data -> paragraphs -> qas`.
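The nesting can be walked with a few loops. A minimal sketch on a toy record (the record and the `iter_examples` helper are illustrative, not taken from `preprocess_data.py`):

```python
# A minimal CUAD/SQuAD-style record (illustrative, not a real contract).
sample = {
    "data": [
        {
            "paragraphs": [
                {
                    "context": "This Agreement is governed by the laws of Delaware.",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "Which state's law governs the agreement?",
                            "answers": [{"text": "Delaware", "answer_start": 42}],
                        }
                    ],
                }
            ]
        }
    ]
}

def iter_examples(squad_json):
    """Yield (question, context, answer_text, answer_start) tuples."""
    for entry in squad_json["data"]:
        for para in entry["paragraphs"]:
            context = para["context"]
            for qa in para["qas"]:
                for ans in qa["answers"]:
                    yield qa["question"], context, ans["text"], ans["answer_start"]

examples = list(iter_examples(sample))
print(examples[0][2])  # "Delaware"
```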
```bash
python run_training.py
```

Artifacts are saved to `checkpoints/`:

- `best_model.pt` – best validation-loss checkpoint
- `training_history.json` – train/val loss curves and best epoch
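The history file can be inspected afterwards, e.g. to recover the best epoch. The dictionary below shows a hypothetical schema; the actual keys written to `training_history.json` may differ:

```python
# Hypothetical history shape; the real keys in training_history.json may differ.
history = {
    "train_loss": [1.92, 1.31, 1.05, 0.98],
    "val_loss": [1.75, 1.28, 1.12, 1.19],
}

# Best epoch = the one with the lowest validation loss (0-indexed here).
best_epoch = min(range(len(history["val_loss"])), key=lambda i: history["val_loss"][i])
print(f"Best epoch: {best_epoch + 1} (val loss {history['val_loss'][best_epoch]:.2f})")
```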
```bash
python test_model.py
```

This loads `checkpoints/best_model.pt` and reports a simple loss/accuracy proxy over the test set. The accuracy is a coarse token-span match and is mainly for sanity checking, not a CUAD leaderboard metric.
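As a rough illustration of what such a token-span proxy measures, the sketch below counts exact (start, end) matches. The function name and exact criterion are assumptions; the real metric in `test_model.py` may be computed differently:

```python
def span_exact_match(pred_spans, gold_spans):
    """Fraction of examples where the predicted (start, end) token indices
    exactly match the gold span -- a coarse proxy, not an official CUAD metric."""
    assert len(pred_spans) == len(gold_spans)
    matches = sum(p == g for p, g in zip(pred_spans, gold_spans))
    return matches / len(pred_spans)

preds = [(3, 7), (0, 0), (12, 15)]
golds = [(3, 7), (1, 2), (12, 15)]
print(span_exact_match(preds, golds))  # 2 of 3 spans match
```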
```bash
python test_preprocessing.py
```

Prints basic tensor shapes and confirms `DataLoader` creation.
```bash
python test_lora.py
```

Shows total vs. trainable parameters and the reduction achieved by LoRA.
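The size of that reduction follows from the LoRA parameterization: instead of updating a full `d_out x d_in` weight, LoRA trains two low-rank factors of rank `r`. A back-of-the-envelope sketch (768 is DistilBERT's hidden size; `r = 8` is an assumed, typical rank):

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    # LoRA replaces updates to a d_out x d_in weight with two low-rank
    # factors: B (d_out x r) and A (r x d_in).
    return d_out * r + r * d_in

# Rough numbers for one DistilBERT attention projection (768 x 768), rank r = 8.
full = 768 * 768
lora = lora_param_count(768, 768, 8)
print(f"full: {full}, lora: {lora}, ratio: {lora / full:.1%}")
```

With these numbers each adapted projection trains roughly 2% of the original weight's parameters.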
Primary knobs live in `run_training.py` via `TrainingConfig`:

- `learning_rate`, `num_epochs`, `batch_size`, `warmup_steps`, `max_length`, `weight_decay`, `save_dir`
The full `TrainingConfig` implementation (device auto-selection, etc.) is in `train.py`.
The default backbone is DistilBERT for QA:

- Update `model_name` in `model_setup.load_base_model(model_name)` if you want a different backbone (e.g., `bert-base-uncased`).
LoRA is configured in `model_setup.configure_lora(...)`:

- `r`, `lora_alpha`, `lora_dropout`, and `target_modules` (defaults target common attention projections).
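For orientation, such a configuration with the `peft` library looks roughly like the fragment below. The values and module names are illustrative assumptions, not the repo's actual defaults; check `model_setup.configure_lora(...)` for those:

```python
from peft import LoraConfig, TaskType

# Illustrative values only -- the defaults in model_setup.configure_lora may differ.
lora_config = LoraConfig(
    task_type=TaskType.QUESTION_ANS,
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    # DistilBERT names its attention projections q_lin / v_lin; other
    # backbones (e.g. BERT's "query"/"value") use different module names.
    target_modules=["q_lin", "v_lin"],
)
```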
Paths are set in `run_training.py` and `test_model.py`:

- `data/train_separate_questions.json`
- `data/test.json`

Adjust these if your files live elsewhere.
- `preprocess_data.py` – Load CUAD JSON, parse examples, create `Dataset`/`DataLoader`s
- `model_setup.py` – Load base QA model/tokenizer and apply LoRA
- `train.py` – Training loop, evaluation, checkpointing, and config
- `run_training.py` – End-to-end trainer entry point
- `test_model.py` – Load best checkpoint and run a simple evaluation
- `test_lora.py` – Print LoRA parameter counts and reduction
- `test_preprocessing.py` – Preprocessing pipeline smoke test
- `explore_dataset.ipynb` – Optional dataset exploration
- `checkpoints/` – Saved artifacts (`best_model.pt`, `training_history.json`)
- `data/` – CUAD-format JSON files
- Apple Silicon (MPS): detected automatically; make sure your PyTorch build is recent enough to support MPS.
- CUDA out of memory: reduce `batch_size` and/or `max_length` in `TrainingConfig`.
- File not found: verify the JSON paths under `data/` and update them in the scripts if needed.
- Different model: change `model_name` in `load_base_model` and confirm `target_modules` matches the new model's attention layer names.
- A `training_history.json` is written alongside the checkpoint for later plotting/analysis.
- The random seed for the train/val split is set to `42` in `prepare_train_val_split`.
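A fixed seed makes the split deterministic across runs. A stdlib-only sketch of the idea, assuming a hypothetical helper signature (the real `prepare_train_val_split` in the repo may differ):

```python
import random

def prepare_split(examples, val_fraction=0.1, seed=42):
    # Shuffle a copy with a fixed seed so the split is reproducible,
    # then carve off the validation slice. (Hypothetical helper; the
    # repo's prepare_train_val_split may differ in signature.)
    items = list(examples)
    random.Random(seed).shuffle(items)
    n_val = max(1, int(len(items) * val_fraction))
    return items[n_val:], items[:n_val]

train_a, val_a = prepare_split(range(100))
train_b, val_b = prepare_split(range(100))
print(train_a == train_b and val_a == val_b)  # same seed -> identical split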
- CUAD: Contract Understanding Atticus Dataset
- Hugging Face: `transformers`, `datasets`, `peft`
- PyTorch: core training stack