Turn a general-purpose AI into your specialized assistant using minimal resources!
This project demonstrates how to fine-tune a large language model (LLM) to follow specific instructions better—all while running on a regular gaming GPU like an RTX 3050.
Imagine you have a smart assistant (like ChatGPT) that knows a lot about everything, but you want it to be really good at one specific thing—like classifying support tickets, doing math in a specific format, or answering questions your way.
Instead of training a whole new AI from scratch (which would cost millions of dollars), we fine-tune an existing model. Think of it like this:
🎓 Analogy: A general practitioner doctor goes to medical school (pre-training). Later, they specialize in cardiology through additional focused training (fine-tuning). We're doing the same with AI!
Fine-tuning normally requires expensive hardware (like $10,000+ GPUs). QLoRA is a clever technique that:
- Compresses the model to use 4x less memory (4-bit quantization)
- Only trains a tiny portion of the model (LoRA adapters)
- Makes fine-tuning possible on a regular gaming laptop!
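In code, those two ideas map onto two small config objects. Here is a sketch using Hugging Face `transformers` and `peft` (the project's actual settings live in `configs/training.yaml`; the dropout value below is illustrative):

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization for the frozen base model weights
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Small trainable LoRA adapter layers on top
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,  # illustrative value, not from the project config
    task_type="CAUSAL_LM",
)
```

The base model is loaded with `bnb_config` and then wrapped with `lora_config`, so only the adapter weights receive gradients.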
| Step | What Happens | Output |
|---|---|---|
| 1. Preprocess | Creates a custom dataset with 1,000 examples | data/processed/*.jsonl |
| 2. Train | Teaches the model our specific tasks | outputs/lora_adapter/ |
| 3. Evaluate | Compares "before vs after" performance | evaluation_report.md |
Result: The fine-tuned model improved from 24% accuracy to 52% (a gain of 28 percentage points)!
A Large Language Model is an AI that understands and generates human language. Examples: ChatGPT, Claude, Llama. They're trained on billions of words from the internet.
Teaching a pre-trained model new skills by showing it examples:
Input: "Add these numbers: a=5, b=3"
Expected Output: "8"
After seeing enough examples, the model learns the pattern!
Instead of updating all 3 billion parameters in our model (expensive!), LoRA adds small "adapter" layers:
Full Fine-Tuning: Update 3,000,000,000 parameters 😰
LoRA Fine-Tuning: Update only ~18,000,000 parameters 😊
(99.4% less!)
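Where does that ~99% reduction come from? Each adapted weight matrix of shape (d_out, d_in) stays frozen, and LoRA trains two small factors instead: A (r × d_in) and B (d_out × r). A toy calculation (the matrix shapes here are illustrative, not Llama's actual dimensions):

```python
def lora_param_count(shapes, r):
    """Trainable parameters added by LoRA: each adapted (d_in, d_out)
    matrix gets two low-rank factors, r * (d_in + d_out) params total."""
    return sum(r * (d_in + d_out) for d_in, d_out in shapes)

# Toy example: adapt two 1024x1024 matrices with rank 16
full = 2 * 1024 * 1024                                       # 2,097,152 frozen params
lora = lora_param_count([(1024, 1024), (1024, 1024)], r=16)  # 65,536 trainable
print(f"{lora:,} trainable ({lora / full:.1%} of the original weights)")
```

The exact 18M figure for this project depends on Llama-3.2-3B's layer dimensions and the seven target modules listed in the config.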
🎓 Analogy: Instead of remodeling your entire house, you just add a new room. Much cheaper, same result!
Model weights are commonly stored using 16 or 32 bits per number. Quantization compresses each weight to just 4 bits:

```
Original:  32 bits per parameter → 12 GB for the weights alone
Quantized:  4 bits per parameter → 1.5 GB for the weights (~3 GB needed in practice, with overhead)
```
This is how we fit a 3-billion parameter model on a regular GPU!
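The arithmetic behind those numbers is simple (weights only; runtime overhead such as activations and optimizer state comes on top):

```python
def weight_memory_gb(n_params, bits_per_param):
    # bytes = params * bits / 8; report in GB (1e9 bytes)
    return n_params * bits_per_param / 8 / 1e9

n = 3_000_000_000               # a 3-billion-parameter model
print(weight_memory_gb(n, 32))  # 12.0 -> full 32-bit precision
print(weight_memory_gb(n, 4))   # 1.5  -> 4-bit quantized
```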
```
llm-lora-finetuning/
│
├── 📂 configs/
│   └── training.yaml            # All settings in one place
│
├── 📂 data/
│   ├── raw/                     # Original generated dataset
│   ├── processed/               # Train/validation/test splits
│   └── dataset_sample.json      # Example of what the data looks like
│
├── 📂 src/
│   ├── preprocess.py            # Step 1: Prepare the data
│   ├── train.py                 # Step 2: Fine-tune the model
│   └── evaluate.py              # Step 3: Measure improvement
│
├── 📂 outputs/
│   ├── lora_adapter/            # The trained adapter (your fine-tuned model!)
│   └── checkpoints/             # Saved progress during training
│
├── 📂 runs/
│   └── tensorboard/             # Training graphs and metrics
│
├── 📂 reports/
│   ├── metrics.json             # Numbers: accuracy, scores
│   └── evaluation_report.md     # Detailed analysis
│
├── evaluation_report.md         # Same report, at root for easy access
├── requirements.txt             # Python packages needed
└── README.md                    # You are here! 👋
```
| Component | What We're Using | Why |
|---|---|---|
| Base Model | meta-llama/Llama-3.2-3B-Instruct | 3 billion parameters, instruction-tuned, open-source |
| Quantization | 4-bit NF4 | Reduces memory from 12GB to ~3GB |
| LoRA Rank (r) | 16 | Balance between quality and efficiency |
| LoRA Alpha | 32 | Scaling factor (usually 2× the rank) |
| Target Modules | Attention + FFN layers | Where the "thinking" happens |
| Training Steps | 50 | One full pass through the data |
| Batch Size | 1 (with gradient accumulation 16) | Simulates batch of 16 on low VRAM |
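A quick sanity check on how these settings fit together: with gradient accumulation, 50 optimizer steps cover the 800 training examples exactly once, which is why 50 steps equals one full pass.

```python
# Effective batch size and data coverage for the settings in the table above
per_device_batch = 1
grad_accum = 16
max_steps = 50
train_examples = 800

effective_batch = per_device_batch * grad_accum  # 16 examples per optimizer step
examples_seen = effective_batch * max_steps      # 800 examples total
print(examples_seen / train_examples)            # 1.0 -> one full epoch
```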
We created a synthetic dataset with 1,000 instruction-following examples across various tasks:
| Task Type | Example | What It Tests |
|---|---|---|
| Math | "Add a=5, b=3" → "8" | Numerical reasoning |
| Sentiment | "I love this!" → "positive" | Classification |
| String Ops | "Reverse 'hello'" → "olleh" | Pattern following |
| Ticket Routing | "TK-0123" → "NET" | Domain-specific knowledge |
The data follows the Alpaca format:
```
### Instruction:
Add the two integers.

### Input:
a=17, b=25

### Response:
42
```
Split: 800 training / 100 validation / 100 test (80/10/10)
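A formatter for this template can be sketched in a few lines (a hypothetical helper; the exact whitespace used by `src/preprocess.py` may differ):

```python
def to_alpaca(instruction: str, inp: str, response: str = "") -> str:
    """Render one example in the Alpaca prompt format shown above."""
    return (f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{inp}\n\n"
            f"### Response:\n{response}")

print(to_alpaca("Add the two integers.", "a=17, b=25", "42"))
```

At training time the response is included so the model learns to produce it; at inference time it is left empty and the model fills it in.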
- Python 3.10+ installed
- NVIDIA GPU with at least 6GB VRAM (RTX 3050, 3060, etc.)
- CUDA drivers installed
- Hugging Face account (free) with Llama access approved
```powershell
# Navigate to the project
cd llm-lora-finetuning

# Create a virtual environment (isolated Python installation)
python -m venv .venv

# Activate it (you'll see (.venv) in your terminal)
.\.venv\Scripts\Activate.ps1

# Upgrade pip (Python's package installer)
python -m pip install --upgrade pip
```

PyTorch is the AI framework. We need the CUDA version to use your GPU:
```powershell
# For CUDA 12.1 (check your version with: nvidia-smi)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
```

💡 Tip: Run `nvidia-smi` in a terminal to check your CUDA version.
```powershell
pip install -r requirements.txt
```

This installs:

- `transformers` - Hugging Face's model library
- `peft` - Parameter-Efficient Fine-Tuning (LoRA)
- `trl` - Training utilities
- `bitsandbytes` - 4-bit quantization
- `tensorboard` - Training visualization
Llama models require authentication:
```powershell
huggingface-cli login
```

You'll need a token from: https://huggingface.co/settings/tokens
```powershell
python src\preprocess.py --config configs\training.yaml
```

What it does:
- Generates 1,000 synthetic training examples
- Formats them in Alpaca style
- Splits into train (800) / val (100) / test (100)
- Saves to `data/processed/`
Output:

```
Wrote processed dataset:
  - data/processed/train.jsonl
  - data/processed/val.jsonl
  - data/processed/test.jsonl
```
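The 80/10/10 split itself reduces to a few lines of Python. A sketch with a fixed seed for reproducibility (the actual seed and shuffling in `src/preprocess.py` may differ):

```python
import random

def split_80_10_10(examples, seed=42):
    """Shuffle and split a list into 80% train / 10% val / 10% test."""
    rng = random.Random(seed)  # fixed seed -> the same split every run
    items = list(examples)
    rng.shuffle(items)
    n = len(items)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

train, val, test = split_80_10_10(range(1000))
print(len(train), len(val), len(test))  # 800 100 100
```

Shuffling before splitting matters: it keeps each split balanced across the four task types.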
```powershell
python src\train.py --config configs\training.yaml
```

What it does:
- Loads Llama-3.2-3B in 4-bit precision
- Attaches LoRA adapters
- Trains for 50 steps (~5-10 minutes on RTX 3050)
- Saves the adapter to `outputs/lora_adapter/`
Monitor training (open in new terminal):
```powershell
tensorboard --logdir runs\tensorboard
```

Then open http://localhost:6006 in your browser to see live graphs!
```powershell
python src\evaluate.py --config configs\training.yaml
```

What it does:
- Runs the same test questions on both:
- Base model (before fine-tuning)
- Fine-tuned model (after)
- Calculates accuracy metrics
- Generates a detailed report
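The exact-match metric boils down to a small comparison loop. This sketch normalizes whitespace and case, which is one reasonable choice (the real `src/evaluate.py` may normalize differently):

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions equal to the reference after trimming
    whitespace and lowercasing."""
    norm = lambda s: s.strip().lower()
    hits = sum(norm(p) == norm(r) for p, r in zip(predictions, references))
    return hits / len(references)

preds = ["8", "Positive", "The answer is 42"]
refs  = ["8", "positive", "42"]
print(exact_match_accuracy(preds, refs))  # 2 of 3 match
```

Note how the third example fails even though "42" appears in the answer: exact match rewards format compliance, which is precisely what fine-tuning teaches.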
Output:

```
Evaluation complete.
Base model accuracy: 24%
Fine-tuned accuracy: 52%
Improvement: +28%! 🎉
```
| Metric | Before | After | Change |
|---|---|---|---|
| Exact Match Accuracy | 24.00% | 52.00% | +28% |
| ROUGE-L Score | 0.315 | 0.520 | +0.21 |
✅ This exceeds the 10% improvement threshold by a large margin!
- Format Compliance: The model learned to output just "42" instead of "The answer is 42"
- Task Understanding: Better at sentiment analysis, math, string operations
- Domain Knowledge: Learned our custom ticket→category mappings
See evaluation_report.md for detailed examples and analysis.
This repository includes everything required to reproduce the pipeline end-to-end:
- Source code: `src/preprocess.py`, `src/train.py`, `src/evaluate.py`
- Config / reproducibility: `configs/training.yaml`
- Custom dataset:
  - Full synthetic dataset (small): `data/raw/dataset_raw.jsonl` and `data/processed/*.jsonl`
  - Representative sample: `data/dataset_sample.json`
- Training logs: TensorBoard event files under `runs/tensorboard/`
- Evaluation report: `evaluation_report.md` at the repository root (also copied to `reports/evaluation_report.md`)
- Final LoRA adapter weights: `outputs/lora_adapter/`
outputs/checkpoints/ is intentionally not committed (it’s not required for grading and is large/noisy).
All hyperparameters are in configs/training.yaml. Key settings:
```yaml
model:
  name_or_path: meta-llama/Llama-3.2-3B-Instruct

lora:
  r: 16                 # Rank - higher = more capacity, more memory
  alpha: 32             # Scaling factor
  target_modules:       # Which layers to adapt
    - q_proj
    - k_proj
    - v_proj
    - o_proj
    - gate_proj
    - up_proj
    - down_proj

training:
  learning_rate: 0.0002
  per_device_train_batch_size: 1
  gradient_accumulation_steps: 16
  max_steps: 50         # Used for the included training run
```

If you run out of GPU memory:

- Reduce `sft.max_seq_length` to 256 in the config
- Increase `gradient_accumulation_steps` to 32
- Close other GPU applications
If model download or authentication fails:

- Run `huggingface-cli login` again
- Make sure you've accepted Llama's license at: https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct
If training is unexpectedly slow:

- Check that you're using the GPU: look for "cuda" in the output
- Make sure you installed the CUDA version of PyTorch
- LoRA Paper - The original research
- QLoRA Paper - 4-bit fine-tuning breakthrough
- Hugging Face PEFT Docs - Official documentation
- TRL Library - Training utilities
| File | Purpose |
|---|---|
| `src/preprocess.py` | Generates and processes the training data |
| `src/train.py` | Fine-tunes the model using QLoRA |
| `src/evaluate.py` | Compares base vs fine-tuned performance |
| `configs/training.yaml` | All hyperparameters and paths |
| `evaluation_report.md` | Detailed results with examples |
| `requirements.txt` | Python dependencies |
Built using the amazing open-source ecosystem:
- Hugging Face Transformers, PEFT, TRL
- Meta's Llama 3.2 model
- BitsAndBytes for quantization