HetanshWaghela/fine-tune-CUAD

Fine-tuning CUAD with LoRA (DistilBERT QA)

This repository fine-tunes a lightweight question answering model on CUAD (Contract Understanding Atticus Dataset) using LoRA adapters. It focuses on simplicity and speed while keeping the workflow clear and reproducible.

What you get

  • Preprocessing of CUAD-style QA data into PyTorch DataLoaders
  • LoRA setup on a QA head (default: distilbert-base-uncased)
  • Training loop with best-checkpoint saving and history logging
  • Simple evaluation loader and a LoRA sanity check utility

Quickstart

1) Environment

Tested with Python 3.10+.

python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt

The training loop auto-selects device: CUDA > MPS (Apple Silicon) > CPU.
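The selection order above can be sketched as a small helper. This is a minimal illustration of the CUDA > MPS > CPU fallback, not the project's actual code (which lives in `train.py`'s `TrainingConfig`):

```python
import torch

def pick_device() -> torch.device:
    """Prefer CUDA, then Apple-Silicon MPS, then fall back to CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    mps = getattr(torch.backends, "mps", None)  # guard for older PyTorch builds
    if mps is not None and mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

print(pick_device())
```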

2) Data

Expected files under data/:

  • train_separate_questions.json (train)
  • test.json (test)

Both follow the SQuAD-style CUAD format with fields under data -> paragraphs -> qas.
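For reference, a hypothetical minimal record in this layout looks like the following (the contract text, question, and IDs are made up for illustration; only the nesting matters):

```python
# Hypothetical minimal CUAD-style record (SQuAD layout: data -> paragraphs -> qas).
sample = {
    "data": [{
        "title": "ExampleContract",
        "paragraphs": [{
            "context": "This Agreement is governed by the laws of Delaware.",
            "qas": [{
                "id": "ExampleContract__Governing Law",
                "question": "Which state's law governs the Agreement?",
                "answers": [{"text": "Delaware", "answer_start": 42}],
                "is_impossible": False,
            }],
        }],
    }]
}

# Walk the nesting and sanity-check that answer_start points at the answer text.
for doc in sample["data"]:
    for para in doc["paragraphs"]:
        for qa in para["qas"]:
            ans = qa["answers"][0]
            assert para["context"][ans["answer_start"]:].startswith(ans["text"])
```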

3) Train

python run_training.py

Artifacts are saved to checkpoints/:

  • best_model.pt – best validation loss checkpoint
  • training_history.json – train/val loss curves and best epoch
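The history file can be inspected after training to find the best epoch. The exact keys in `training_history.json` are not documented here, so the shape below is an assumption (per-epoch train/val loss lists):

```python
# In practice: history = json.load(open("checkpoints/training_history.json"))
# Hypothetical shape — the exact keys may differ from the project's output.
history = {
    "train_loss": [2.10, 1.45, 1.12, 1.05],
    "val_loss": [1.90, 1.40, 1.20, 1.25],
}

# Best epoch = the one with the lowest validation loss.
best_epoch = min(range(len(history["val_loss"])), key=history["val_loss"].__getitem__)
print(f"best epoch (0-indexed): {best_epoch}, val loss: {history['val_loss'][best_epoch]}")
```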

4) Evaluate (quick check)

python test_model.py

This loads checkpoints/best_model.pt and reports a simple loss/accuracy proxy over the test set. The accuracy is a coarse token-span match and is mainly for sanity checking, not a CUAD leaderboard metric.
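One plausible reading of "coarse token-span match" is exact agreement of the predicted start/end token indices with the gold ones; the sketch below assumes that interpretation and is not the project's actual metric code:

```python
def span_exact_match(pred_starts, pred_ends, gold_starts, gold_ends):
    """Fraction of examples where both predicted token indices equal the gold ones."""
    hits = sum(
        ps == gs and pe == ge
        for ps, pe, gs, ge in zip(pred_starts, pred_ends, gold_starts, gold_ends)
    )
    return hits / len(gold_starts)

print(span_exact_match([3, 7, 0], [5, 9, 0], [3, 8, 0], [5, 9, 0]))  # 2 of 3 spans match
```

Note that real CUAD scoring uses AUPR and precision at recall thresholds, which is why this proxy is not leaderboard-comparable.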

5) Verify preprocessing

python test_preprocessing.py

Prints basic shapes and confirms DataLoader creation.

6) LoRA sanity check

python test_lora.py

Shows total vs trainable parameters and reduction achieved by LoRA.
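Counting total vs trainable parameters works the same way for any `nn.Module`, including a PEFT-wrapped model. A minimal sketch with a toy frozen "backbone" standing in for the base model:

```python
import torch.nn as nn

def param_counts(model: nn.Module):
    """Return (total, trainable) parameter counts for any module."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable

# Toy demo: freeze a "backbone" and leave a small head trainable,
# mimicking what LoRA does at much larger scale.
backbone = nn.Linear(10, 10)   # 110 params, frozen below
for p in backbone.parameters():
    p.requires_grad = False
head = nn.Linear(10, 2)        # 22 params, trainable
model = nn.Sequential(backbone, head)

total, trainable = param_counts(model)
print(f"total={total}, trainable={trainable} ({100 * trainable / total:.1f}%)")
```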

Configuration

Training hyperparameters

Primary knobs live in run_training.py via TrainingConfig:

  • learning_rate, num_epochs, batch_size, warmup_steps, max_length, weight_decay, save_dir

The full TrainingConfig implementation (device auto-selection, etc.) is in train.py.
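The knobs listed above suggest a config object roughly like the dataclass below. The field names follow this README, but the default values are illustrative assumptions — check `train.py` for the real ones:

```python
from dataclasses import dataclass

@dataclass
class TrainingConfig:
    # Field names follow the knobs listed above; defaults here are illustrative only.
    learning_rate: float = 3e-4
    num_epochs: int = 3
    batch_size: int = 8
    warmup_steps: int = 100
    max_length: int = 512
    weight_decay: float = 0.01
    save_dir: str = "checkpoints"

cfg = TrainingConfig(batch_size=4)  # override a single knob, keep the rest
print(cfg)
```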

Base model

Default is DistilBERT QA:

  • Update in model_setup.load_base_model(model_name) if you want a different backbone (e.g., bert-base-uncased).

LoRA parameters

Configured in model_setup.configure_lora(...):

  • r, lora_alpha, lora_dropout, and target_modules (defaults target common attention projections).
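With Hugging Face `peft`, such a setup typically looks like the fragment below. The values and target module names are assumptions, not the project's defaults (those live in `model_setup.configure_lora`); `q_lin`/`v_lin` are the names DistilBERT uses for its query/value attention projections:

```python
from peft import LoraConfig, TaskType

# Illustrative values — the project's actual defaults live in model_setup.configure_lora.
lora_config = LoraConfig(
    task_type=TaskType.QUESTION_ANS,
    r=8,                      # rank of the low-rank update matrices
    lora_alpha=16,            # scaling factor (effective scale = lora_alpha / r)
    lora_dropout=0.05,
    # DistilBERT names its attention projections q_lin / k_lin / v_lin / out_lin.
    target_modules=["q_lin", "v_lin"],
)
```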

Data paths

Paths are set in run_training.py and test_model.py:

  • data/train_separate_questions.json
  • data/test.json

Adjust these paths if your files live elsewhere.

Project structure

  • preprocess_data.py – Load CUAD JSON, parse examples, create Dataset/DataLoaders
  • model_setup.py – Load base QA model/tokenizer and apply LoRA
  • train.py – Training loop, evaluation, checkpointing, and config
  • run_training.py – End-to-end trainer entrypoint
  • test_model.py – Load best checkpoint and run a simple evaluation
  • test_lora.py – Print LoRA parameter counts and reduction
  • test_preprocessing.py – Preprocessing pipeline smoke test
  • explore_dataset.ipynb – Optional dataset exploration
  • checkpoints/ – Saved artifacts (best_model.pt, training_history.json)
  • data/ – CUAD-format JSON files

Tips and troubleshooting

  • Apple Silicon (MPS): Detected automatically; make sure your PyTorch build is recent enough to support MPS.
  • CUDA memory: Reduce batch_size and/or max_length in TrainingConfig.
  • File not found: Verify JSON paths under data/ and update in the scripts if needed.
  • Different model: Change model_name in load_base_model and confirm target_modules match the new model's attention layer names.
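One way to confirm the attention layer names for a new backbone is to list its `nn.Linear` submodules and pick the attention projections from the result. The demo module below is a stand-in with made-up names; run the helper against your actual loaded transformer:

```python
import torch.nn as nn

def linear_leaf_names(model: nn.Module):
    """Names of all nn.Linear submodules — candidates for LoRA target_modules."""
    return sorted({name.rsplit(".", 1)[-1]
                   for name, mod in model.named_modules()
                   if isinstance(mod, nn.Linear)})

# Stand-in with attention-style names; use your real model here instead.
demo = nn.ModuleDict({"q_lin": nn.Linear(4, 4), "v_lin": nn.Linear(4, 4)})
print(linear_leaf_names(demo))  # ['q_lin', 'v_lin']
```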

Reproducibility notes

  • A training_history.json is written alongside the checkpoint for later plotting/analysis.
  • Random seed for the train/val split is set to 42 in prepare_train_val_split.
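A fixed-seed split like the one in `prepare_train_val_split` can be sketched as below. This is an assumed implementation (shuffle indices with seed 42, then slice); the validation fraction is illustrative:

```python
import random

def prepare_split(examples, val_fraction=0.1, seed=42):
    """Deterministic shuffle-then-slice split (seed fixed for reproducibility)."""
    idx = list(range(len(examples)))
    random.Random(seed).shuffle(idx)           # same seed -> same permutation
    n_val = max(1, int(len(idx) * val_fraction))
    train = [examples[i] for i in idx[n_val:]]
    val = [examples[i] for i in idx[:n_val]]
    return train, val

train, val = prepare_split(list(range(100)))
```

Because the seed is fixed, repeated runs produce identical train/val partitions, which keeps `best_model.pt` checkpoints comparable across experiments.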

Acknowledgments

  • CUAD: Contract Understanding Atticus Dataset
  • Hugging Face: transformers, datasets, peft
  • PyTorch: Core training stack
