A comprehensive pipeline for fine-tuning language models using LoRA (Low-Rank Adaptation) with advanced checkpoint management and resource optimization for MacBook GPU training.
- LoRA Fine-tuning: Parameter-efficient fine-tuning with low-rank adapters (see the sketch after this list)
- Checkpoint Management: Automatic checkpoint saving and resumption
- Resource Optimization: Optimized for MacBook GPU (MPS) training
- Memory Management: Gradient checkpointing, mixed precision, and batch size optimization
- System Monitoring: Real-time resource usage monitoring
- Flexible Training: Resume training from any checkpoint
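As a rough illustration of what the LoRA setup involves, here is a minimal sketch using the `peft` library. The values shown (model name, rank, alpha, dropout) are illustrative; the ones this pipeline actually uses live in `config.py` and `model_setup.py`.

```python
# Minimal LoRA setup sketch (illustrative values; see config.py for the real ones).
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("gpt2")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,  # causal language modeling
    r=8,                           # rank of the low-rank update matrices
    lora_alpha=16,                 # scaling factor applied to the update
    lora_dropout=0.05,             # dropout on the adapter inputs
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```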
The pipeline includes several optimizations specifically designed for MacBook GPU training; a sketch after this list shows roughly how they map onto `transformers.TrainingArguments`:
- Reduced Batch Size: `per_device_train_batch_size=4` (reduced from 16)
- Gradient Accumulation: `gradient_accumulation_steps=8` to simulate larger batches
- Mixed Precision: FP16 training enabled for memory efficiency
- Gradient Checkpointing: Trades compute for memory
- Frequent Checkpoints: Save every 500 steps instead of every epoch
- Cosine Learning Rate Scheduling: Smooth learning rate decay
- Gradient Clipping: Prevents gradient explosion
- Warmup Steps: Gradual learning rate warmup
- Multi-worker Data Loading: Parallel data loading
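The sketch below shows roughly how these settings map onto `transformers.TrainingArguments`. It is illustrative rather than the pipeline's actual code, and `fp16` is guarded because some Transformers versions only accept it on CUDA devices.

```python
import torch
from transformers import TrainingArguments

# Illustrative mapping of the optimizations above onto TrainingArguments.
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=4,   # small per-device batch for limited GPU memory
    gradient_accumulation_steps=8,   # 4 * 8 = effective batch size of 32
    gradient_checkpointing=True,     # trade compute for memory
    fp16=torch.cuda.is_available(),  # the pipeline enables fp16 in config.py
    save_steps=500,                  # frequent checkpoints
    eval_steps=500,
    warmup_steps=100,                # gradual learning rate warmup
    lr_scheduler_type="cosine",      # smooth learning rate decay
    max_grad_norm=1.0,               # gradient clipping
    dataloader_num_workers=2,        # parallel data loading
)
```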
- Clone the repository:

  ```bash
  git clone <repository-url>
  cd finetuning-llm
  ```

- Create a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
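After installing the dependencies, a quick check (not part of the pipeline itself) confirms that PyTorch can see the Apple Silicon GPU:

```python
import torch

# Sanity check that this PyTorch build supports the MPS (Apple Silicon GPU) backend.
print("PyTorch version:", torch.__version__)
print("MPS built:", torch.backends.mps.is_built())
print("MPS available:", torch.backends.mps.is_available())
```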
Run the full training pipeline:
```bash
python main.py
```
The script will:
- Check system resources
- Load and preprocess data
- Set up the model with LoRA
- Train with automatic checkpointing
- Evaluate and compare results
If training was interrupted, resume from the latest checkpoint:
```bash
python main.py --resume
```
Run evaluation on a previously trained model:
```bash
python main.py --evaluate
```
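A minimal sketch of the command-line interface described above (hypothetical parsing code; the real flag handling lives in `main.py`):

```python
import argparse

# Hypothetical sketch of main.py's flag handling.
parser = argparse.ArgumentParser(description="LoRA fine-tuning pipeline")
parser.add_argument("--resume", action="store_true", help="resume from the latest checkpoint")
parser.add_argument("--evaluate", action="store_true", help="evaluate a previously trained model")
args = parser.parse_args()

if args.evaluate:
    print("Evaluation-only mode")
elif args.resume:
    print("Resuming from the latest checkpoint")
else:
    print("Running the full training pipeline")
```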
Analyze system resources and get recommendations:
```bash
python system_check.py
```
List all available checkpoints:

```bash
python checkpoint_manager.py list
```

Test the latest checkpoint against the base model:

```bash
python checkpoint_manager.py test
```

Test a specific checkpoint:

```bash
python checkpoint_manager.py test ./results/checkpoint-1000
```

Clean up old checkpoints (keep only the latest 3):

```bash
python checkpoint_manager.py cleanup 3
```

Delete a specific checkpoint:

```bash
python checkpoint_manager.py delete ./results/checkpoint-1000
```
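Under the hood, listing checkpoints is essentially a scan of the output directory for `checkpoint-<step>` folders. A minimal sketch of that idea (not the actual `checkpoint_manager.py` code):

```python
from pathlib import Path

# Find checkpoint directories under ./results and sort them by global step.
results_dir = Path("./results")
checkpoints = sorted(
    (p for p in results_dir.glob("checkpoint-*") if p.is_dir()),
    key=lambda p: int(p.name.split("-")[-1]),
)

for ckpt in checkpoints:
    print(ckpt)
if checkpoints:
    print("Latest checkpoint:", checkpoints[-1])
```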
All settings are centralized in `config.py`:

```python
TRAINING_ARGS = {
    "per_device_train_batch_size": 4,  # Reduced for memory efficiency
    "gradient_accumulation_steps": 8,  # Simulate larger batch size
    "fp16": True,                      # Mixed precision
    "save_steps": 500,                 # Save every 500 steps
    "eval_steps": 500,                 # Evaluate every 500 steps
    "warmup_steps": 100,               # Learning rate warmup
    "lr_scheduler_type": "cosine",     # Cosine scheduling
    "max_grad_norm": 1.0,              # Gradient clipping
    "resume_from_checkpoint": True,    # Enable checkpoint resumption
}

MEMORY_OPTIMIZATION = {
    "use_gradient_checkpointing": True,
    "use_8bit_optimizer": False,       # Enable if you have bitsandbytes
    "use_4bit_quantization": False,    # Enable if you have bitsandbytes
    "max_memory_MB": 8000,             # Adjust based on your MacBook
}
```
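With these defaults, each optimizer step sees an effective batch of 4 × 8 = 32 examples, so the smaller per-device batch mainly reduces peak memory rather than changing the optimization dynamics.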
- Checkpoints are saved every 500 steps
- Only the latest 3 checkpoints are kept to save disk space
- Training automatically resumes from the latest checkpoint if interrupted
Each checkpoint contains:
- Model weights
- Optimizer state
- Learning rate scheduler state
- Training history
- Evaluation metrics
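With the Hugging Face `Trainer`, the training history and metrics live in the checkpoint's `trainer_state.json`. A small sketch of how you might inspect it (assuming a checkpoint exists at the path shown):

```python
import json
from pathlib import Path

# Peek at the training history the Trainer stores inside a checkpoint.
ckpt = Path("./results/checkpoint-1000")
state = json.loads((ckpt / "trainer_state.json").read_text())

print("Global step:", state["global_step"])
print("Epoch:", state["epoch"])
for entry in state["log_history"][-3:]:  # most recent logged metrics
    print(entry)
```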
Use the checkpoint manager for advanced operations:
```bash
# List all checkpoints with details
python checkpoint_manager.py list

# Test latest checkpoint against base model
python checkpoint_manager.py test

# Test specific checkpoint
python checkpoint_manager.py test ./results/checkpoint-1000

# Clean up old checkpoints
python checkpoint_manager.py cleanup 3

# Check if resuming is possible
python checkpoint_manager.py resume
```
Minimum requirements:
- macOS with Apple Silicon (M1/M2/M3) or Intel Mac
- 8GB RAM (16GB recommended)
- 10GB free disk space

Recommended:
- 16GB+ RAM
- 20GB+ free disk space
- Fast SSD storage
If you encounter memory issues, try the following adjustments (sketched after this list):
- Reduce `per_device_train_batch_size` to 2
- Increase `gradient_accumulation_steps` to 16
- Enable `use_8bit_optimizer` if you have bitsandbytes installed
- Reduce `max_length` in the configuration
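For example, the adjusted values in `config.py` might look like this (illustrative; only the changed keys matter):

```python
# Lower-memory variant of the settings shown earlier (illustrative values).
TRAINING_ARGS = {
    "per_device_train_batch_size": 2,   # halved again for tight memory
    "gradient_accumulation_steps": 16,  # keeps the effective batch size at 32
    # ... other settings unchanged ...
}

MEMORY_OPTIMIZATION = {
    "use_gradient_checkpointing": True,
    "use_8bit_optimizer": True,         # requires bitsandbytes
    "use_4bit_quantization": False,
    "max_memory_MB": 8000,
}
```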
If training is interrupted (a resumption sketch follows this list):
- The system will automatically resume from the latest checkpoint
- Use `python main.py --resume` to resume manually
- Check available checkpoints with `python checkpoint_manager.py list`
- Test checkpoint performance with `python checkpoint_manager.py test`
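A minimal sketch of how resumption typically works with the Hugging Face `Trainer` (the pipeline's actual logic lives in `trainer.py`):

```python
import os
from transformers.trainer_utils import get_last_checkpoint

# Locate the most recent checkpoint-<step> directory and hand it to Trainer.train(),
# which restores model, optimizer, and scheduler state before continuing.
output_dir = "./results"
last_checkpoint = get_last_checkpoint(output_dir) if os.path.isdir(output_dir) else None

if last_checkpoint is not None:
    print("Resuming from", last_checkpoint)
    # trainer.train(resume_from_checkpoint=last_checkpoint)
else:
    print("No checkpoint found; starting a fresh run")
    # trainer.train()
```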
For better performance:
- Ensure you're using MPS (Apple Silicon GPU)
- Close other applications to free up memory
- Use an external SSD for faster I/O
- Consider using 8-bit optimization if available
Run system check to get recommendations:
```bash
python system_check.py
```
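A minimal sketch of the kind of check `system_check.py` performs, using `psutil` and `torch` (the real script may report more detail):

```python
import psutil
import torch

# Report memory, disk, and GPU backend availability.
mem = psutil.virtual_memory()
disk = psutil.disk_usage(".")

print(f"RAM: {mem.available / 1e9:.1f} GB free of {mem.total / 1e9:.1f} GB")
print(f"Disk: {disk.free / 1e9:.1f} GB free")
print("MPS available:", torch.backends.mps.is_available())

if mem.total < 16e9:
    print("Tip: keep per_device_train_batch_size at 4 or lower")
```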
```
finetuning-llm/
├── config.py              # Configuration settings
├── data_loader.py         # Data loading and preprocessing
├── model_setup.py         # Model and LoRA setup
├── trainer.py             # Training with checkpoint support
├── generation.py          # Text generation and evaluation
├── checkpoint_manager.py  # Checkpoint management utilities
├── system_check.py        # System resource analysis
├── main.py                # Main training pipeline
├── requirements.txt       # Dependencies
└── README.md              # This file
```
After training, you'll find:
- `./results/` - Training checkpoints and logs
- `./fine_tuned_model/` - Final trained model
- `./logs/` - Training logs
- `generated_outputs.csv` - Comparison results
- `checkpoint_XXXX_comparison.csv` - Checkpoint test results
Modify `config.py` to adjust:
- Model checkpoint
- Dataset
- LoRA parameters
- Training hyperparameters
- Memory optimization settings
Replace the dataset in `config.py`:

```python
DATASET_NAME = "your-dataset-name"
```

Change the base model:

```python
MODEL_CHECKPOINT = "gpt2"  # or any other model
```
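A sketch of how those values typically feed into the loading code (illustrative; the actual wiring is in `data_loader.py` and `model_setup.py`):

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# Values that would normally be imported from config.py.
MODEL_CHECKPOINT = "gpt2"
DATASET_NAME = "your-dataset-name"  # replace with a real Hugging Face Hub dataset id

tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)
model = AutoModelForCausalLM.from_pretrained(MODEL_CHECKPOINT)
dataset = load_dataset(DATASET_NAME)
```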
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
This project is licensed under the MIT License.