Turn a general-purpose AI into your specialized assistant using minimal resources!
This project demonstrates how to fine-tune a large language model (LLM) to follow specific instructions better—all while running on a regular gaming GPU like an RTX 3050.
Imagine you have a smart assistant (like ChatGPT) that knows a lot about everything, but you want it to be really good at one specific thing—like classifying support tickets, doing math in a specific format, or answering questions your way.
Instead of training a whole new AI from scratch (which would cost millions of dollars), we fine-tune an existing model. Think of it like this:
🎓 Analogy: A general practitioner doctor goes to medical school (pre-training). Later, they specialize in cardiology through additional focused training (fine-tuning). We're doing the same with AI!
Fine-tuning normally requires expensive hardware (like $10,000+ GPUs). QLoRA is a clever technique that:
- Compresses the model to use 4x less memory (4-bit quantization)
- Only trains a tiny portion of the model (LoRA adapters)
- Makes fine-tuning possible on a regular gaming laptop!
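In code, those two ideas map onto two small config objects. Here is a sketch using Hugging Face `transformers` and `peft` (the project's actual settings live in `configs/training.yaml`; the dropout value below is illustrative):

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization for the frozen base model weights
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Small trainable LoRA adapter layers on top
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,  # illustrative value, not from the project config
    task_type="CAUSAL_LM",
)
```

The base model is loaded with `bnb_config` and then wrapped with `lora_config`, so only the adapter weights receive gradients.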
| Step | What Happens | Output |
|---|---|---|
| 1. Preprocess | Creates a custom dataset with 1,000 examples | data/processed/*.jsonl |
| 2. Train | Teaches the model our specific tasks | outputs/lora_adapter/ |
| 3. Evaluate | Compares "before vs after" performance | evaluation_report.md |
Result: The fine-tuned model improved from 24% accuracy to 52% (a gain of 28 percentage points)!
A Large Language Model is an AI that understands and generates human language. Examples: ChatGPT, Claude, Llama. They're trained on billions of words from the internet.
Teaching a pre-trained model new skills by showing it examples:
Input: "Add these numbers: a=5, b=3"
Expected Output: "8"
After seeing enough examples, the model learns the pattern!
Instead of updating all 3 billion parameters in our model (expensive!), LoRA adds small "adapter" layers:
Full Fine-Tuning: Update 3,000,000,000 parameters 😰
LoRA Fine-Tuning: Update only ~18,000,000 parameters 😊
(99.4% less!)
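Where does that ~99% reduction come from? Each adapted weight matrix of shape (d_out, d_in) stays frozen, and LoRA trains two small factors instead: A (r × d_in) and B (d_out × r). A toy calculation (the matrix shapes here are illustrative, not Llama's actual dimensions):

```python
def lora_param_count(shapes, r):
    """Trainable parameters added by LoRA: each adapted (d_in, d_out)
    matrix gets two low-rank factors, r * (d_in + d_out) params total."""
    return sum(r * (d_in + d_out) for d_in, d_out in shapes)

# Toy example: adapt two 1024x1024 matrices with rank 16
full = 2 * 1024 * 1024                                       # 2,097,152 frozen params
lora = lora_param_count([(1024, 1024), (1024, 1024)], r=16)  # 65,536 trainable
print(f"{lora:,} trainable ({lora / full:.1%} of the original weights)")
```

The exact 18M figure for this project depends on Llama-3.2-3B's layer dimensions and the seven target modules listed in the config.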
🎓 Analogy: Instead of remodeling your entire house, you just add a new room. Much cheaper, same result!
Model weights are commonly stored using 16 or 32 bits per number. Quantization compresses each weight to just 4 bits:

```
Original:  32 bits per parameter → 12 GB for the weights alone
Quantized:  4 bits per parameter → 1.5 GB for the weights (~3 GB needed in practice, with overhead)
```
This is how we fit a 3-billion parameter model on a regular GPU!
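The arithmetic behind those numbers is simple (weights only; runtime overhead such as activations and optimizer state comes on top):

```python
def weight_memory_gb(n_params, bits_per_param):
    # bytes = params * bits / 8; report in GB (1e9 bytes)
    return n_params * bits_per_param / 8 / 1e9

n = 3_000_000_000               # a 3-billion-parameter model
print(weight_memory_gb(n, 32))  # 12.0 -> full 32-bit precision
print(weight_memory_gb(n, 4))   # 1.5  -> 4-bit quantized
```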
```
llm-lora-finetuning/
│
├── 📂 configs/
│   └── training.yaml            # All settings in one place
│
├── 📂 data/
│   ├── raw/                     # Original generated dataset
│   ├── processed/               # Train/validation/test splits
│   └── dataset_sample.json      # Example of what the data looks like
│
├── 📂 src/
│   ├── preprocess.py            # Step 1: Prepare the data
│   ├── train.py                 # Step 2: Fine-tune the model
│   └── evaluate.py              # Step 3: Measure improvement
│
├── 📂 outputs/
│   ├── lora_adapter/            # The trained adapter (your fine-tuned model!)
│   └── checkpoints/             # Saved progress during training
│
├── 📂 runs/
│   └── tensorboard/             # Training graphs and metrics
│
├── 📂 reports/
│   ├── metrics.json             # Numbers: accuracy, scores
│   └── evaluation_report.md     # Detailed analysis
│
├── evaluation_report.md         # Same report, at root for easy access
├── requirements.txt             # Python packages needed
└── README.md                    # You are here! 👋
```
| Component | What We're Using | Why |
|---|---|---|
| Base Model | meta-llama/Llama-3.2-3B-Instruct | 3 billion parameters, instruction-tuned, open-source |
| Quantization | 4-bit NF4 | Reduces memory from 12GB to ~3GB |
| LoRA Rank (r) | 16 | Balance between quality and efficiency |
| LoRA Alpha | 32 | Scaling factor (usually 2× the rank) |
| Target Modules | Attention + FFN layers | Where the "thinking" happens |
| Training Steps | 50 | One full pass through the data |
| Batch Size | 1 (with gradient accumulation 16) | Simulates batch of 16 on low VRAM |
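A quick sanity check on how these settings fit together: with gradient accumulation, 50 optimizer steps cover the 800 training examples exactly once, which is why 50 steps equals one full pass.

```python
# Effective batch size and data coverage for the settings in the table above
per_device_batch = 1
grad_accum = 16
max_steps = 50
train_examples = 800

effective_batch = per_device_batch * grad_accum  # 16 examples per optimizer step
examples_seen = effective_batch * max_steps      # 800 examples total
print(examples_seen / train_examples)            # 1.0 -> one full epoch
```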
We created a synthetic dataset with 1,000 instruction-following examples across various tasks:
| Task Type | Example | What It Tests |
|---|---|---|
| Math | "Add a=5, b=3" → "8" | Numerical reasoning |
| Sentiment | "I love this!" → "positive" | Classification |
| String Ops | "Reverse 'hello'" → "olleh" | Pattern following |
| Ticket Routing | "TK-0123" → "NET" | Domain-specific knowledge |
The data follows the Alpaca format:
```
### Instruction:
Add the two integers.

### Input:
a=17, b=25

### Response:
42
```
Split: 800 training / 100 validation / 100 test (80/10/10)
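A formatter for this template can be sketched in a few lines (a hypothetical helper; the exact whitespace used by `src/preprocess.py` may differ):

```python
def to_alpaca(instruction: str, inp: str, response: str = "") -> str:
    """Render one example in the Alpaca prompt format shown above."""
    return (f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{inp}\n\n"
            f"### Response:\n{response}")

print(to_alpaca("Add the two integers.", "a=17, b=25", "42"))
```

At training time the response is included so the model learns to produce it; at inference time it is left empty and the model fills it in.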
- Python 3.10+ installed
- NVIDIA GPU with at least 6GB VRAM (RTX 3050, 3060, etc.)
- CUDA drivers installed
- Hugging Face account (free) with Llama access approved
```powershell
# Navigate to the project
cd llm-lora-finetuning

# Create a virtual environment (isolated Python installation)
python -m venv .venv

# Activate it (you'll see (.venv) in your terminal)
.\.venv\Scripts\Activate.ps1

# Upgrade pip (Python's package installer)
python -m pip install --upgrade pip
```

PyTorch is the AI framework. We need the CUDA version to use your GPU:
```powershell
# For CUDA 12.1 (check your version with: nvidia-smi)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
```

💡 Tip: Run `nvidia-smi` in a terminal to check your CUDA version.
```powershell
pip install -r requirements.txt
```

This installs:

- `transformers` - Hugging Face's model library
- `peft` - Parameter-Efficient Fine-Tuning (LoRA)
- `trl` - Training utilities
- `bitsandbytes` - 4-bit quantization
- `tensorboard` - Training visualization
Llama models require authentication:
```powershell
huggingface-cli login
```

You'll need a token from: https://huggingface.co/settings/tokens
```powershell
python src\preprocess.py --config configs\training.yaml
```

What it does:
- Generates 1,000 synthetic training examples
- Formats them in Alpaca style
- Splits into train (800) / val (100) / test (100)
- Saves to `data/processed/`
Output:

```
Wrote processed dataset:
  - data/processed/train.jsonl
  - data/processed/val.jsonl
  - data/processed/test.jsonl
```
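The 80/10/10 split itself reduces to a few lines of Python. A sketch with a fixed seed for reproducibility (the actual seed and shuffling in `src/preprocess.py` may differ):

```python
import random

def split_80_10_10(examples, seed=42):
    """Shuffle and split a list into 80% train / 10% val / 10% test."""
    rng = random.Random(seed)  # fixed seed -> the same split every run
    items = list(examples)
    rng.shuffle(items)
    n = len(items)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

train, val, test = split_80_10_10(range(1000))
print(len(train), len(val), len(test))  # 800 100 100
```

Shuffling before splitting matters: it keeps each split balanced across the four task types.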
```powershell
python src\train.py --config configs\training.yaml
```

What it does:
- Loads Llama-3.2-3B in 4-bit precision
- Attaches LoRA adapters
- Trains for 50 steps (~5-10 minutes on RTX 3050)
- Saves the adapter to `outputs/lora_adapter/`
Monitor training (open in new terminal):
```powershell
tensorboard --logdir runs\tensorboard
```

Then open http://localhost:6006 in your browser to see live graphs!
```powershell
python src\evaluate.py --config configs\training.yaml
```

What it does:
- Runs the same test questions on both:
- Base model (before fine-tuning)
- Fine-tuned model (after)
- Calculates accuracy metrics
- Generates a detailed report
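The exact-match metric boils down to a small comparison loop. This sketch normalizes whitespace and case, which is one reasonable choice (the real `src/evaluate.py` may normalize differently):

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions equal to the reference after trimming
    whitespace and lowercasing."""
    norm = lambda s: s.strip().lower()
    hits = sum(norm(p) == norm(r) for p, r in zip(predictions, references))
    return hits / len(references)

preds = ["8", "Positive", "The answer is 42"]
refs  = ["8", "positive", "42"]
print(exact_match_accuracy(preds, refs))  # 2 of 3 match
```

Note how the third example fails even though "42" appears in the answer: exact match rewards format compliance, which is precisely what fine-tuning teaches.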
Output:

```
Evaluation complete.
Base model accuracy: 24%
Fine-tuned accuracy: 52%
Improvement: +28%! 🎉
```
| Metric | Before | After | Change |
|---|---|---|---|
| Exact Match Accuracy | 24.00% | 52.00% | +28% |
| ROUGE-L Score | 0.315 | 0.520 | +0.21 |
✅ This exceeds the 10% improvement threshold by a large margin!
- Format Compliance: The model learned to output just "42" instead of "The answer is 42"
- Task Understanding: Better at sentiment analysis, math, string operations
- Domain Knowledge: Learned our custom ticket→category mappings
See evaluation_report.md for detailed examples and analysis.
This repository includes everything required to reproduce the pipeline end-to-end:
- Source code: `src/preprocess.py`, `src/train.py`, `src/evaluate.py`
- Config / reproducibility: `configs/training.yaml`
- Custom dataset:
  - Full synthetic dataset (small): `data/raw/dataset_raw.jsonl` and `data/processed/*.jsonl`
  - Representative sample: `data/dataset_sample.json`
- Training logs: TensorBoard event files under `runs/tensorboard/`
- Evaluation report: `evaluation_report.md` at the repository root (also copied to `reports/evaluation_report.md`)
- Final LoRA adapter weights: `outputs/lora_adapter/`
outputs/checkpoints/ is intentionally not committed (it’s not required for grading and is large/noisy).
All hyperparameters are in configs/training.yaml. Key settings:
```yaml
model:
  name_or_path: meta-llama/Llama-3.2-3B-Instruct

lora:
  r: 16                 # Rank - higher = more capacity, more memory
  alpha: 32             # Scaling factor
  target_modules:       # Which layers to adapt
    - q_proj
    - k_proj
    - v_proj
    - o_proj
    - gate_proj
    - up_proj
    - down_proj

training:
  learning_rate: 0.0002
  per_device_train_batch_size: 1
  gradient_accumulation_steps: 16
  max_steps: 50         # Used for the included training run
```

If you run out of GPU memory:

- Reduce `sft.max_seq_length` to 256 in the config
- Increase `gradient_accumulation_steps` to 32
- Close other GPU applications
If model download or authentication fails:

- Run `huggingface-cli login` again
- Make sure you've accepted Llama's license at: https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct
If training is unexpectedly slow:

- Check that you're using the GPU: look for "cuda" in the output
- Make sure you installed the CUDA version of PyTorch
- LoRA Paper - The original research
- QLoRA Paper - 4-bit fine-tuning breakthrough
- Hugging Face PEFT Docs - Official documentation
- TRL Library - Training utilities
| File | Purpose |
|---|---|
| `src/preprocess.py` | Generates and processes the training data |
| `src/train.py` | Fine-tunes the model using QLoRA |
| `src/evaluate.py` | Compares base vs fine-tuned performance |
| `configs/training.yaml` | All hyperparameters and paths |
| `evaluation_report.md` | Detailed results with examples |
| `requirements.txt` | Python dependencies |
Built using the amazing open-source ecosystem:
- Hugging Face Transformers, PEFT, TRL
- Meta's Llama 3.2 model
- BitsAndBytes for quantization