A comprehensive pipeline for fine-tuning language models using LoRA (Low-Rank Adaptation) with advanced checkpoint management and resource optimization for MacBook GPU training.
- LoRA Fine-tuning: Parameter-efficient fine-tuning with low-rank adapters (see the sketch after this list)
- Checkpoint Management: Automatic checkpoint saving and resumption
- Resource Optimization: Optimized for MacBook GPU (MPS) training
- Memory Management: Gradient checkpointing, mixed precision, and batch size optimization
- System Monitoring: Real-time resource usage monitoring
- Flexible Training: Resume training from any checkpoint
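As a rough illustration of what the LoRA setup involves, here is a minimal sketch using the `peft` library. The values shown (model name, rank, alpha, dropout) are illustrative; the ones this pipeline actually uses live in `config.py` and `model_setup.py`.

```python
# Minimal LoRA setup sketch (illustrative values; see config.py for the real ones).
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("gpt2")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,  # causal language modeling
    r=8,                           # rank of the low-rank update matrices
    lora_alpha=16,                 # scaling factor applied to the update
    lora_dropout=0.05,             # dropout on the adapter inputs
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```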
The pipeline includes several optimizations specifically designed for MacBook GPU training; a sketch after this list shows roughly how they map onto `transformers.TrainingArguments`:
- Reduced Batch Size: `per_device_train_batch_size=4` (reduced from 16)
- Gradient Accumulation: `gradient_accumulation_steps=8` to simulate larger batches
- Mixed Precision: FP16 training enabled for memory efficiency
- Gradient Checkpointing: Trades compute for memory
- Frequent Checkpoints: Save every 500 steps instead of every epoch
- Cosine Learning Rate Scheduling: Smooth learning rate decay
- Gradient Clipping: Prevents gradient explosion
- Warmup Steps: Gradual learning rate warmup
- Multi-worker Data Loading: Parallel data loading
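The sketch below shows roughly how these settings map onto `transformers.TrainingArguments`. It is illustrative rather than the pipeline's actual code, and `fp16` is guarded because some Transformers versions only accept it on CUDA devices.

```python
import torch
from transformers import TrainingArguments

# Illustrative mapping of the optimizations above onto TrainingArguments.
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=4,   # small per-device batch for limited GPU memory
    gradient_accumulation_steps=8,   # 4 * 8 = effective batch size of 32
    gradient_checkpointing=True,     # trade compute for memory
    fp16=torch.cuda.is_available(),  # the pipeline enables fp16 in config.py
    save_steps=500,                  # frequent checkpoints
    eval_steps=500,
    warmup_steps=100,                # gradual learning rate warmup
    lr_scheduler_type="cosine",      # smooth learning rate decay
    max_grad_norm=1.0,               # gradient clipping
    dataloader_num_workers=2,        # parallel data loading
)
```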
- Clone the repository:

  ```bash
  git clone <repository-url>
  cd finetuning-llm
  ```

- Create a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
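After installing the dependencies, a quick check (not part of the pipeline itself) confirms that PyTorch can see the Apple Silicon GPU:

```python
import torch

# Sanity check that this PyTorch build supports the MPS (Apple Silicon GPU) backend.
print("PyTorch version:", torch.__version__)
print("MPS built:", torch.backends.mps.is_built())
print("MPS available:", torch.backends.mps.is_available())
```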
Run the full training pipeline:
```bash
python main.py
```
The script will:
- Check system resources
- Load and preprocess data
- Set up the model with LoRA
- Train with automatic checkpointing
- Evaluate and compare results
If training was interrupted, resume from the latest checkpoint:
```bash
python main.py --resume
```
Run evaluation on a previously trained model:
```bash
python main.py --evaluate
```
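A minimal sketch of the command-line interface described above (hypothetical parsing code; the real flag handling lives in `main.py`):

```python
import argparse

# Hypothetical sketch of main.py's flag handling.
parser = argparse.ArgumentParser(description="LoRA fine-tuning pipeline")
parser.add_argument("--resume", action="store_true", help="resume from the latest checkpoint")
parser.add_argument("--evaluate", action="store_true", help="evaluate a previously trained model")
args = parser.parse_args()

if args.evaluate:
    print("Evaluation-only mode")
elif args.resume:
    print("Resuming from the latest checkpoint")
else:
    print("Running the full training pipeline")
```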
Analyze system resources and get recommendations:
```bash
python system_check.py
```
List all available checkpoints:

```bash
python checkpoint_manager.py list
```

Test the latest checkpoint against the base model:

```bash
python checkpoint_manager.py test
```

Test a specific checkpoint:

```bash
python checkpoint_manager.py test ./results/checkpoint-1000
```

Clean up old checkpoints (keep only the latest 3):

```bash
python checkpoint_manager.py cleanup 3
```

Delete a specific checkpoint:

```bash
python checkpoint_manager.py delete ./results/checkpoint-1000
```
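Under the hood, listing checkpoints is essentially a scan of the output directory for `checkpoint-<step>` folders. A minimal sketch of that idea (not the actual `checkpoint_manager.py` code):

```python
from pathlib import Path

# Find checkpoint directories under ./results and sort them by global step.
results_dir = Path("./results")
checkpoints = sorted(
    (p for p in results_dir.glob("checkpoint-*") if p.is_dir()),
    key=lambda p: int(p.name.split("-")[-1]),
)

for ckpt in checkpoints:
    print(ckpt)
if checkpoints:
    print("Latest checkpoint:", checkpoints[-1])
```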
All settings are centralized in `config.py`:

```python
TRAINING_ARGS = {
    "per_device_train_batch_size": 4,  # Reduced for memory efficiency
    "gradient_accumulation_steps": 8,  # Simulate larger batch size
    "fp16": True,                      # Mixed precision
    "save_steps": 500,                 # Save every 500 steps
    "eval_steps": 500,                 # Evaluate every 500 steps
    "warmup_steps": 100,               # Learning rate warmup
    "lr_scheduler_type": "cosine",     # Cosine scheduling
    "max_grad_norm": 1.0,              # Gradient clipping
    "resume_from_checkpoint": True,    # Enable checkpoint resumption
}

MEMORY_OPTIMIZATION = {
    "use_gradient_checkpointing": True,
    "use_8bit_optimizer": False,       # Enable if you have bitsandbytes
    "use_4bit_quantization": False,    # Enable if you have bitsandbytes
    "max_memory_MB": 8000,             # Adjust based on your MacBook
}
```
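With these defaults, each optimizer step sees an effective batch of 4 × 8 = 32 examples, so the smaller per-device batch mainly reduces peak memory rather than changing the optimization dynamics.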
- Checkpoints are saved every 500 steps
- Only the latest 3 checkpoints are kept to save disk space
- Training automatically resumes from the latest checkpoint if interrupted
Each checkpoint contains:
- Model weights
- Optimizer state
- Learning rate scheduler state
- Training history
- Evaluation metrics
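With the Hugging Face `Trainer`, the training history and metrics live in the checkpoint's `trainer_state.json`. A small sketch of how you might inspect it (assuming a checkpoint exists at the path shown):

```python
import json
from pathlib import Path

# Peek at the training history the Trainer stores inside a checkpoint.
ckpt = Path("./results/checkpoint-1000")
state = json.loads((ckpt / "trainer_state.json").read_text())

print("Global step:", state["global_step"])
print("Epoch:", state["epoch"])
for entry in state["log_history"][-3:]:  # most recent logged metrics
    print(entry)
```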
Use the checkpoint manager for advanced operations:
```bash
# List all checkpoints with details
python checkpoint_manager.py list

# Test latest checkpoint against base model
python checkpoint_manager.py test

# Test specific checkpoint
python checkpoint_manager.py test ./results/checkpoint-1000

# Clean up old checkpoints
python checkpoint_manager.py cleanup 3

# Check if resuming is possible
python checkpoint_manager.py resume
```
Minimum requirements:
- macOS with Apple Silicon (M1/M2/M3) or Intel Mac
- 8GB RAM (16GB recommended)
- 10GB free disk space

Recommended:
- 16GB+ RAM
- 20GB+ free disk space
- Fast SSD storage
If you encounter memory issues, try the following adjustments (sketched after this list):
- Reduce `per_device_train_batch_size` to 2
- Increase `gradient_accumulation_steps` to 16
- Enable `use_8bit_optimizer` if you have bitsandbytes installed
- Reduce `max_length` in the configuration
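For example, the adjusted values in `config.py` might look like this (illustrative; only the changed keys matter):

```python
# Lower-memory variant of the settings shown earlier (illustrative values).
TRAINING_ARGS = {
    "per_device_train_batch_size": 2,   # halved again for tight memory
    "gradient_accumulation_steps": 16,  # keeps the effective batch size at 32
    # ... other settings unchanged ...
}

MEMORY_OPTIMIZATION = {
    "use_gradient_checkpointing": True,
    "use_8bit_optimizer": True,         # requires bitsandbytes
    "use_4bit_quantization": False,
    "max_memory_MB": 8000,
}
```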
If training is interrupted (a resumption sketch follows this list):
- The system will automatically resume from the latest checkpoint
- Use `python main.py --resume` to resume manually
- Check available checkpoints with `python checkpoint_manager.py list`
- Test checkpoint performance with `python checkpoint_manager.py test`
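A minimal sketch of how resumption typically works with the Hugging Face `Trainer` (the pipeline's actual logic lives in `trainer.py`):

```python
import os
from transformers.trainer_utils import get_last_checkpoint

# Locate the most recent checkpoint-<step> directory and hand it to Trainer.train(),
# which restores model, optimizer, and scheduler state before continuing.
output_dir = "./results"
last_checkpoint = get_last_checkpoint(output_dir) if os.path.isdir(output_dir) else None

if last_checkpoint is not None:
    print("Resuming from", last_checkpoint)
    # trainer.train(resume_from_checkpoint=last_checkpoint)
else:
    print("No checkpoint found; starting a fresh run")
    # trainer.train()
```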
For better performance:
- Ensure you're using MPS (Apple Silicon GPU)
- Close other applications to free up memory
- Use an external SSD for faster I/O
- Consider using 8-bit optimization if available
Run system check to get recommendations:
```bash
python system_check.py
```
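A minimal sketch of the kind of check `system_check.py` performs, using `psutil` and `torch` (the real script may report more detail):

```python
import psutil
import torch

# Report memory, disk, and GPU backend availability.
mem = psutil.virtual_memory()
disk = psutil.disk_usage(".")

print(f"RAM: {mem.available / 1e9:.1f} GB free of {mem.total / 1e9:.1f} GB")
print(f"Disk: {disk.free / 1e9:.1f} GB free")
print("MPS available:", torch.backends.mps.is_available())

if mem.total < 16e9:
    print("Tip: keep per_device_train_batch_size at 4 or lower")
```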
```
finetuning-llm/
├── config.py              # Configuration settings
├── data_loader.py         # Data loading and preprocessing
├── model_setup.py         # Model and LoRA setup
├── trainer.py             # Training with checkpoint support
├── generation.py          # Text generation and evaluation
├── checkpoint_manager.py  # Checkpoint management utilities
├── system_check.py        # System resource analysis
├── main.py                # Main training pipeline
├── requirements.txt       # Dependencies
└── README.md              # This file
```
After training, you'll find:
- `./results/` - Training checkpoints and logs
- `./fine_tuned_model/` - Final trained model
- `./logs/` - Training logs
- `generated_outputs.csv` - Comparison results
- `checkpoint_XXXX_comparison.csv` - Checkpoint test results
Modify `config.py` to adjust:
- Model checkpoint
- Dataset
- LoRA parameters
- Training hyperparameters
- Memory optimization settings
Replace the dataset in `config.py`:

```python
DATASET_NAME = "your-dataset-name"
```

Change the base model:

```python
MODEL_CHECKPOINT = "gpt2"  # or any other model
```
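A sketch of how those values typically feed into the loading code (illustrative; the actual wiring is in `data_loader.py` and `model_setup.py`):

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# Values that would normally be imported from config.py.
MODEL_CHECKPOINT = "gpt2"
DATASET_NAME = "your-dataset-name"  # replace with a real Hugging Face Hub dataset id

tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)
model = AutoModelForCausalLM.from_pretrained(MODEL_CHECKPOINT)
dataset = load_dataset(DATASET_NAME)
```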
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
This project is licensed under the MIT License.