HetanshWaghela/fine-tune-CUAD

Fine-tuning CUAD with LoRA (DistilBERT QA)

This repository fine-tunes a lightweight question answering model on CUAD (Contract Understanding Atticus Dataset) using LoRA adapters. It focuses on simplicity and speed while keeping the workflow clear and reproducible.

What you get

  • Preprocessing of CUAD-style QA data into PyTorch DataLoaders
  • LoRA setup on a QA head (default: distilbert-base-uncased)
  • Training loop with best-checkpoint saving and history logging
  • Simple evaluation loader and a LoRA sanity check utility

Quickstart

1) Environment

Tested with Python 3.10+.

python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt

The training loop auto-selects device: CUDA > MPS (Apple Silicon) > CPU.
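The selection order above can be sketched as a small helper. This is a minimal illustration of the CUDA > MPS > CPU fallback, not the project's actual code (which lives in `train.py`'s `TrainingConfig`):

```python
import torch

def pick_device() -> torch.device:
    """Prefer CUDA, then Apple-Silicon MPS, then fall back to CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    mps = getattr(torch.backends, "mps", None)  # guard for older PyTorch builds
    if mps is not None and mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

print(pick_device())
```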

2) Data

Expected files under data/:

  • train_separate_questions.json (train)
  • test.json (test)

Both follow the SQuAD-style CUAD format with fields under data -> paragraphs -> qas.
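For reference, a hypothetical minimal record in this layout looks like the following (the contract text, question, and IDs are made up for illustration; only the nesting matters):

```python
# Hypothetical minimal CUAD-style record (SQuAD layout: data -> paragraphs -> qas).
sample = {
    "data": [{
        "title": "ExampleContract",
        "paragraphs": [{
            "context": "This Agreement is governed by the laws of Delaware.",
            "qas": [{
                "id": "ExampleContract__Governing Law",
                "question": "Which state's law governs the Agreement?",
                "answers": [{"text": "Delaware", "answer_start": 42}],
                "is_impossible": False,
            }],
        }],
    }]
}

# Walk the nesting and sanity-check that answer_start points at the answer text.
for doc in sample["data"]:
    for para in doc["paragraphs"]:
        for qa in para["qas"]:
            ans = qa["answers"][0]
            assert para["context"][ans["answer_start"]:].startswith(ans["text"])
```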

3) Train

python run_training.py

Artifacts are saved to checkpoints/:

  • best_model.pt – best validation loss checkpoint
  • training_history.json – train/val loss curves and best epoch
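The history file can be inspected after training to find the best epoch. The exact keys in `training_history.json` are not documented here, so the shape below is an assumption (per-epoch train/val loss lists):

```python
# In practice: history = json.load(open("checkpoints/training_history.json"))
# Hypothetical shape — the exact keys may differ from the project's output.
history = {
    "train_loss": [2.10, 1.45, 1.12, 1.05],
    "val_loss": [1.90, 1.40, 1.20, 1.25],
}

# Best epoch = the one with the lowest validation loss.
best_epoch = min(range(len(history["val_loss"])), key=history["val_loss"].__getitem__)
print(f"best epoch (0-indexed): {best_epoch}, val loss: {history['val_loss'][best_epoch]}")
```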

4) Evaluate (quick check)

python test_model.py

This loads checkpoints/best_model.pt and reports a simple loss/accuracy proxy over the test set. The accuracy is a coarse token-span match and is mainly for sanity checking, not a CUAD leaderboard metric.
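One plausible reading of "coarse token-span match" is exact agreement of the predicted start/end token indices with the gold ones; the sketch below assumes that interpretation and is not the project's actual metric code:

```python
def span_exact_match(pred_starts, pred_ends, gold_starts, gold_ends):
    """Fraction of examples where both predicted token indices equal the gold ones."""
    hits = sum(
        ps == gs and pe == ge
        for ps, pe, gs, ge in zip(pred_starts, pred_ends, gold_starts, gold_ends)
    )
    return hits / len(gold_starts)

print(span_exact_match([3, 7, 0], [5, 9, 0], [3, 8, 0], [5, 9, 0]))  # 2 of 3 spans match
```

Note that real CUAD scoring uses AUPR and precision at recall thresholds, which is why this proxy is not leaderboard-comparable.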

5) Verify preprocessing

python test_preprocessing.py

Prints basic shapes and confirms DataLoader creation.

6) LoRA sanity check

python test_lora.py

Shows total vs trainable parameters and reduction achieved by LoRA.
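Counting total vs trainable parameters works the same way for any `nn.Module`, including a PEFT-wrapped model. A minimal sketch with a toy frozen "backbone" standing in for the base model:

```python
import torch.nn as nn

def param_counts(model: nn.Module):
    """Return (total, trainable) parameter counts for any module."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable

# Toy demo: freeze a "backbone" and leave a small head trainable,
# mimicking what LoRA does at much larger scale.
backbone = nn.Linear(10, 10)   # 110 params, frozen below
for p in backbone.parameters():
    p.requires_grad = False
head = nn.Linear(10, 2)        # 22 params, trainable
model = nn.Sequential(backbone, head)

total, trainable = param_counts(model)
print(f"total={total}, trainable={trainable} ({100 * trainable / total:.1f}%)")
```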

Configuration

Training hyperparameters

Primary knobs live in run_training.py via TrainingConfig:

  • learning_rate, num_epochs, batch_size, warmup_steps, max_length, weight_decay, save_dir

The full TrainingConfig implementation (device auto-selection, etc.) is in train.py.
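The knobs listed above suggest a config object roughly like the dataclass below. The field names follow this README, but the default values are illustrative assumptions — check `train.py` for the real ones:

```python
from dataclasses import dataclass

@dataclass
class TrainingConfig:
    # Field names follow the knobs listed above; defaults here are illustrative only.
    learning_rate: float = 3e-4
    num_epochs: int = 3
    batch_size: int = 8
    warmup_steps: int = 100
    max_length: int = 512
    weight_decay: float = 0.01
    save_dir: str = "checkpoints"

cfg = TrainingConfig(batch_size=4)  # override a single knob, keep the rest
print(cfg)
```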

Base model

Default is DistilBERT QA:

  • Update in model_setup.load_base_model(model_name) if you want a different backbone (e.g., bert-base-uncased).

LoRA parameters

Configured in model_setup.configure_lora(...):

  • r, lora_alpha, lora_dropout, and target_modules (defaults target common attention projections).
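With Hugging Face `peft`, such a setup typically looks like the fragment below. The values and target module names are assumptions, not the project's defaults (those live in `model_setup.configure_lora`); `q_lin`/`v_lin` are the names DistilBERT uses for its query/value attention projections:

```python
from peft import LoraConfig, TaskType

# Illustrative values — the project's actual defaults live in model_setup.configure_lora.
lora_config = LoraConfig(
    task_type=TaskType.QUESTION_ANS,
    r=8,                      # rank of the low-rank update matrices
    lora_alpha=16,            # scaling factor (effective scale = lora_alpha / r)
    lora_dropout=0.05,
    # DistilBERT names its attention projections q_lin / k_lin / v_lin / out_lin.
    target_modules=["q_lin", "v_lin"],
)
```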

Data paths

Paths are set in run_training.py and test_model.py:

  • data/train_separate_questions.json
  • data/test.json

Adjust these paths if your files live elsewhere.

Project structure

  • preprocess_data.py – Load CUAD JSON, parse examples, create Dataset/DataLoaders
  • model_setup.py – Load base QA model/tokenizer and apply LoRA
  • train.py – Training loop, evaluation, checkpointing, and config
  • run_training.py – End-to-end trainer entrypoint
  • test_model.py – Load best checkpoint and run a simple evaluation
  • test_lora.py – Print LoRA parameter counts and reduction
  • test_preprocessing.py – Preprocessing pipeline smoke test
  • explore_dataset.ipynb – Optional dataset exploration
  • checkpoints/ – Saved artifacts (best_model.pt, training_history.json)
  • data/ – CUAD-format JSON files

Tips and troubleshooting

  • Apple Silicon (MPS): Detected automatically; make sure your PyTorch build is recent enough to support MPS.
  • CUDA memory: Reduce batch_size and/or max_length in TrainingConfig.
  • File not found: Verify JSON paths under data/ and update in the scripts if needed.
  • Different model: Change model_name in load_base_model and confirm target_modules match the new model's attention layer names.
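One way to confirm the attention layer names for a new backbone is to list its `nn.Linear` submodules and pick the attention projections from the result. The demo module below is a stand-in with made-up names; run the helper against your actual loaded transformer:

```python
import torch.nn as nn

def linear_leaf_names(model: nn.Module):
    """Names of all nn.Linear submodules — candidates for LoRA target_modules."""
    return sorted({name.rsplit(".", 1)[-1]
                   for name, mod in model.named_modules()
                   if isinstance(mod, nn.Linear)})

# Stand-in with attention-style names; use your real model here instead.
demo = nn.ModuleDict({"q_lin": nn.Linear(4, 4), "v_lin": nn.Linear(4, 4)})
print(linear_leaf_names(demo))  # ['q_lin', 'v_lin']
```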

Reproducibility notes

  • A training_history.json is written alongside the checkpoint for later plotting/analysis.
  • Random seed for the train/val split is set to 42 in prepare_train_val_split.
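A fixed-seed split like the one in `prepare_train_val_split` can be sketched as below. This is an assumed implementation (shuffle indices with seed 42, then slice); the validation fraction is illustrative:

```python
import random

def prepare_split(examples, val_fraction=0.1, seed=42):
    """Deterministic shuffle-then-slice split (seed fixed for reproducibility)."""
    idx = list(range(len(examples)))
    random.Random(seed).shuffle(idx)           # same seed -> same permutation
    n_val = max(1, int(len(idx) * val_fraction))
    train = [examples[i] for i in idx[n_val:]]
    val = [examples[i] for i in idx[:n_val]]
    return train, val

train, val = prepare_split(list(range(100)))
```

Because the seed is fixed, repeated runs produce identical train/val partitions, which keeps `best_model.pt` checkpoints comparable across experiments.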

Acknowledgments

  • CUAD: Contract Understanding Atticus Dataset
  • Hugging Face: transformers, datasets, peft
  • PyTorch: Core training stack
