jashinspires/llm-lora-finetuning
🚀 LLM Fine-Tuning with QLoRA

Turn a general-purpose AI into your specialized assistant using minimal resources!

This project demonstrates how to fine-tune a large language model (LLM) to follow specific instructions better—all while running on a regular gaming GPU like an RTX 3050.


📖 What is This Project About?

The Problem

Imagine you have a smart assistant (like ChatGPT) that knows a lot about everything, but you want it to be really good at one specific thing—like classifying support tickets, doing math in a specific format, or answering questions your way.

The Solution

Instead of training a whole new AI from scratch (which would cost millions of dollars), we fine-tune an existing model. Think of it like this:

🎓 Analogy: A general practitioner doctor goes to medical school (pre-training). Later, they specialize in cardiology through additional focused training (fine-tuning). We're doing the same with AI!

Why QLoRA?

Fine-tuning normally requires expensive hardware (like $10,000+ GPUs). QLoRA is a clever technique that:

  • Compresses the model's weights to 4-bit precision, cutting their memory footprint several-fold (quantization)
  • Only trains a tiny portion of the model (LoRA adapters)
  • Makes fine-tuning possible on a regular gaming laptop!

🎯 What Does This Project Do?

| Step | What Happens | Output |
|---|---|---|
| 1. Preprocess | Creates a custom dataset with 1,000 examples | data/processed/*.jsonl |
| 2. Train | Teaches the model our specific tasks | outputs/lora_adapter/ |
| 3. Evaluate | Compares "before vs. after" performance | evaluation_report.md |

Result: The fine-tuned model improved from 24% accuracy to 52% (+28 percentage points)!


🧠 Key Concepts Explained

What is an LLM?

A Large Language Model is an AI that understands and generates human language. Examples: ChatGPT, Claude, Llama. They're trained on billions of words from the internet.

What is Fine-Tuning?

Teaching a pre-trained model new skills by showing it examples:

```
Input: "Add these numbers: a=5, b=3"
Expected Output: "8"
```

After seeing enough examples, the model learns the pattern!

What is LoRA (Low-Rank Adaptation)?

Instead of updating all 3 billion parameters in our model (expensive!), LoRA adds small "adapter" layers:

```
Full Fine-Tuning:     Update 3,000,000,000 parameters 😰
LoRA Fine-Tuning:     Update only ~18,000,000 parameters 😊
                      (99.4% fewer!)
```

🎓 Analogy: Instead of remodeling your entire house, you just add a new room. Much cheaper, same result!
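To make the savings concrete, here is the arithmetic for a single weight matrix. The dimensions below are toy values for illustration; real per-layer shapes vary across the model:

```python
# LoRA replaces the update to a full (d_out x d_in) weight matrix with two
# small factors: A (r x d_in) and B (d_out x r). Toy dimensions below.
d_in, d_out, r = 4096, 4096, 16

full_update = d_in * d_out        # parameters touched by full fine-tuning
lora_update = r * (d_in + d_out)  # parameters in the LoRA factors A and B

print(f"full: {full_update:,}")   # 16,777,216
print(f"lora: {lora_update:,}")   # 131,072
print(f"reduction: {1 - lora_update / full_update:.1%}")  # 99.2%
```

Summed over every adapted matrix in every layer, this is how a 3-billion-parameter model ends up with only ~18 million trainable parameters.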

What is Quantization (the "Q" in QLoRA)?

Normally, each number in the model uses 32 bits of memory. Quantization compresses this to just 4 bits:

```
Original:   32 bits per parameter → 12 GB just for the weights
Quantized:   4 bits per parameter → 1.5 GB for the weights (~3 GB in practice)
```

This is how we fit a 3-billion parameter model on a regular GPU!
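The back-of-the-envelope arithmetic looks like this. Note it covers the raw weights only; actual usage is higher because some layers stay in 16-bit precision and quantization stores per-block scaling constants:

```python
params = 3_000_000_000  # a 3B-parameter model

for bits, label in [(32, "fp32"), (16, "fp16"), (4, "4-bit NF4")]:
    gb = params * bits / 8 / 1e9  # bits -> bytes -> GB
    print(f"{label:>9}: {gb:5.1f} GB for weights")

# fp32 needs 12 GB just for weights; 4-bit weights alone are 1.5 GB, and in
# practice the quantized model fits in roughly 3 GB once overhead is included.
```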


📁 Project Structure

llm-lora-finetuning/
│
├── 📂 configs/
│   └── training.yaml        # All settings in one place
│
├── 📂 data/
│   ├── raw/                  # Original generated dataset
│   ├── processed/            # Train/validation/test splits
│   └── dataset_sample.json   # Example of what the data looks like
│
├── 📂 src/
│   ├── preprocess.py         # Step 1: Prepare the data
│   ├── train.py              # Step 2: Fine-tune the model
│   └── evaluate.py           # Step 3: Measure improvement
│
├── 📂 outputs/
│   ├── lora_adapter/         # The trained adapter (your fine-tuned model!)
│   └── checkpoints/          # Saved progress during training
│
├── 📂 runs/
│   └── tensorboard/          # Training graphs and metrics
│
├── 📂 reports/
│   ├── metrics.json          # Numbers: accuracy, scores
│   └── evaluation_report.md  # Detailed analysis
│
├── evaluation_report.md      # Same report, at root for easy access
├── requirements.txt          # Python packages needed
└── README.md                 # You are here! 👋

🛠️ Technical Specifications

| Component | What We're Using | Why |
|---|---|---|
| Base Model | meta-llama/Llama-3.2-3B-Instruct | 3 billion parameters, instruction-tuned, open-source |
| Quantization | 4-bit NF4 | Reduces memory from 12 GB to ~3 GB |
| LoRA Rank (r) | 16 | Balance between quality and efficiency |
| LoRA Alpha | 32 | Scaling factor (usually 2× the rank) |
| Target Modules | Attention + FFN layers | Where the "thinking" happens |
| Training Steps | 50 | One full pass through the data |
| Batch Size | 1 (with gradient accumulation 16) | Simulates a batch of 16 on low VRAM |
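The last two rows work together: with gradient accumulation, the effective batch size times the step count covers the training split exactly once. A quick sanity check using the numbers above:

```python
per_device_batch = 1
grad_accum_steps = 16
max_steps = 50

effective_batch = per_device_batch * grad_accum_steps
examples_seen = effective_batch * max_steps

print(effective_batch)  # 16
print(examples_seen)    # 800 -> exactly one pass over the 800 training examples
```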

📊 The Dataset

We created a synthetic dataset with 1,000 instruction-following examples across various tasks:

| Task Type | Example | What It Tests |
|---|---|---|
| Math | "Add a=5, b=3" → "8" | Numerical reasoning |
| Sentiment | "I love this!" → "positive" | Classification |
| String Ops | "Reverse 'hello'" → "olleh" | Pattern following |
| Ticket Routing | "TK-0123" → "NET" | Domain-specific knowledge |
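A minimal sketch of how such examples could be generated (function names are illustrative; the project's actual generator lives in src/preprocess.py):

```python
import random

def make_math_example(rng):
    # Addition task: the label is computed, so it's always correct.
    a, b = rng.randint(0, 99), rng.randint(0, 99)
    return {"instruction": "Add the two integers.",
            "input": f"a={a}, b={b}",
            "output": str(a + b)}

def make_string_example(rng):
    # String-reversal task from a small word pool.
    word = rng.choice(["hello", "world", "llama"])
    return {"instruction": "Reverse the string.",
            "input": word,
            "output": word[::-1]}

rng = random.Random(42)  # fixed seed so the dataset is reproducible
makers = [make_math_example, make_string_example]
dataset = [rng.choice(makers)(rng) for _ in range(1000)]

print(len(dataset))  # 1000
```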

The data follows the Alpaca format:

```
### Instruction:
Add the two integers.

### Input:
a=17, b=25

### Response:
42
```
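Rendering a record into that layout is a small string template. A sketch, assuming records carry the instruction/input/output keys used by this dataset:

```python
def to_alpaca(example):
    # Render one record in the Alpaca prompt format shown above.
    return (f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            f"### Response:\n{example['output']}")

record = {"instruction": "Add the two integers.",
          "input": "a=17, b=25",
          "output": "42"}
print(to_alpaca(record))
```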

Split: 800 training / 100 validation / 100 test (80/10/10)
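The 80/10/10 split can be produced with a seeded shuffle so it is identical on every run. A sketch of the idea, not necessarily how src/preprocess.py implements it:

```python
import random

examples = list(range(1000))        # stand-ins for the 1,000 records
random.Random(0).shuffle(examples)  # seeded -> same split every run

train = examples[:800]
val   = examples[800:900]
test  = examples[900:]

print(len(train), len(val), len(test))  # 800 100 100
```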


🚀 Getting Started

Prerequisites

  • Python 3.10+ installed
  • NVIDIA GPU with at least 6GB VRAM (RTX 3050, 3060, etc.)
  • CUDA drivers installed
  • Hugging Face account (free) with Llama access approved

Step 1: Clone and Setup Environment

```powershell
# Navigate to the project
cd llm-lora-finetuning

# Create a virtual environment (an isolated Python installation)
python -m venv .venv

# Activate it (you'll see (.venv) in your terminal)
# On macOS/Linux use: source .venv/bin/activate
.\.venv\Scripts\Activate.ps1

# Upgrade pip (Python's package installer)
python -m pip install --upgrade pip
```

Step 2: Install PyTorch with CUDA Support

PyTorch is the AI framework. We need the CUDA version to use your GPU:

```powershell
# For CUDA 12.1 (check your version with: nvidia-smi)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
```

💡 Tip: Run nvidia-smi in terminal to check your CUDA version

Step 3: Install Project Dependencies

```powershell
pip install -r requirements.txt
```

This installs:

  • transformers - Hugging Face's model library
  • peft - Parameter-Efficient Fine-Tuning (LoRA)
  • trl - Training utilities
  • bitsandbytes - 4-bit quantization
  • tensorboard - Training visualization

Step 4: Login to Hugging Face

Llama models require authentication:

```powershell
huggingface-cli login
```

You'll need a token from: https://huggingface.co/settings/tokens


▶️ Running the Pipeline

Step 1: Preprocess the Data

```powershell
python src\preprocess.py --config configs\training.yaml
```

What it does:

  • Generates 1,000 synthetic training examples
  • Formats them in Alpaca style
  • Splits into train (800) / val (100) / test (100)
  • Saves to data/processed/

Output:

```
Wrote processed dataset:
- data/processed/train.jsonl
- data/processed/val.jsonl
- data/processed/test.jsonl
```

Step 2: Train the Model

```powershell
python src\train.py --config configs\training.yaml
```

What it does:

  • Loads Llama-3.2-3B in 4-bit precision
  • Attaches LoRA adapters
  • Trains for 50 steps (~5-10 minutes on RTX 3050)
  • Saves the adapter to outputs/lora_adapter/

Monitor training (open in new terminal):

```powershell
tensorboard --logdir runs\tensorboard
```

Then open http://localhost:6006 in your browser to see live graphs!

Step 3: Evaluate the Results

```powershell
python src\evaluate.py --config configs\training.yaml
```

What it does:

  • Runs the same test questions on both:
    • Base model (before fine-tuning)
    • Fine-tuned model (after)
  • Calculates accuracy metrics
  • Generates a detailed report
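Exact-match accuracy, the headline metric, is simple to compute (a sketch; the real script also reports ROUGE-L):

```python
def exact_match_accuracy(predictions, references):
    # Fraction of predictions that equal the reference after trimming whitespace.
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)

preds = ["8", "positive", "The answer is 42"]  # last one fails the strict format
refs  = ["8", "positive", "42"]
print(f"{exact_match_accuracy(preds, refs):.2f}")  # 0.67
```

This strictness is deliberate: it rewards the terse output format the model is fine-tuned to produce.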

Output:

```
Evaluation complete.
Base model accuracy: 24%
Fine-tuned accuracy: 52%
Improvement: +28 percentage points 🎉
```

📈 Results

Performance Improvement

| Metric | Before | After | Change |
|---|---|---|---|
| Exact Match Accuracy | 24.00% | 52.00% | +28 points |
| ROUGE-L Score | 0.315 | 0.520 | +0.205 |

✅ This exceeds the 10-point improvement threshold by a large margin!

What Improved?

  1. Format Compliance: The model learned to output just "42" instead of "The answer is 42"
  2. Task Understanding: Better at sentiment analysis, math, string operations
  3. Domain Knowledge: Learned our custom ticket→category mappings

See evaluation_report.md for detailed examples and analysis.


✅ Submission Artifacts (What We Commit)

This repository includes everything required to reproduce the pipeline end-to-end:

  • Source code: src/preprocess.py, src/train.py, src/evaluate.py
  • Config / reproducibility: configs/training.yaml
  • Custom dataset:
    • Full synthetic dataset (small): data/raw/dataset_raw.jsonl and data/processed/*.jsonl
    • Representative sample: data/dataset_sample.json
  • Training logs: TensorBoard event files under runs/tensorboard/
  • Evaluation report: evaluation_report.md at the repository root (also copied to reports/evaluation_report.md)
  • Final LoRA adapter weights: outputs/lora_adapter/

outputs/checkpoints/ is intentionally not committed (it’s not required for grading and is large/noisy).


⚙️ Configuration

All hyperparameters are in configs/training.yaml. Key settings:

```yaml
model:
  name_or_path: meta-llama/Llama-3.2-3B-Instruct

lora:
  r: 16              # Rank - higher = more capacity, more memory
  alpha: 32          # Scaling factor
  target_modules:    # Which layers to adapt
    - q_proj
    - k_proj
    - v_proj
    - o_proj
    - gate_proj
    - up_proj
    - down_proj

training:
  learning_rate: 0.0002
  per_device_train_batch_size: 1
  gradient_accumulation_steps: 16
  max_steps: 50      # Used for the included training run
```

🔧 Troubleshooting

"CUDA out of memory"

  • Reduce sft.max_seq_length to 256 in the config
  • Increase gradient_accumulation_steps to 32
  • Close other GPU applications

"Model not found" or authentication error

  • Run huggingface-cli login with a valid token
  • Confirm your request for access to meta-llama/Llama-3.2-3B-Instruct has been approved on Hugging Face

Training is very slow

  • Check that you're using GPU: look for "cuda" in the output
  • Make sure you installed the CUDA version of PyTorch


📝 Files Reference

| File | Purpose |
|---|---|
| src/preprocess.py | Generates and processes the training data |
| src/train.py | Fine-tunes the model using QLoRA |
| src/evaluate.py | Compares base vs. fine-tuned performance |
| configs/training.yaml | All hyperparameters and paths |
| evaluation_report.md | Detailed results with examples |
| requirements.txt | Python dependencies |

🙏 Acknowledgments

Built using the amazing open-source ecosystem:

  • Hugging Face Transformers, PEFT, TRL
  • Meta's Llama 3.2 model
  • BitsAndBytes for quantization
