
🚀 Claude-to-CodeLlama Knowledge Distillation

A production-ready system for knowledge distillation from Claude Opus 4 to CodeLlama 7B, optimized for Google Colab A100 training.

🎯 Overview

This project implements state-of-the-art knowledge distillation techniques to transfer Claude Opus 4's superior code generation capabilities to the more accessible CodeLlama 7B model. The system is specifically optimized for cloud training with automatic GPU detection and dynamic configuration.

✨ Key Features

  • 🧠 Advanced Knowledge Distillation: Transfer learning from Claude Opus 4 to CodeLlama 7B
  • ⚡ A100 Optimization: Dynamic configuration based on GPU detection
  • 💾 Memory Efficient: QLoRA quantization with 4-bit precision
  • 🔄 Adaptive Training: Automatic fallback for different model architectures
  • 📊 Production Ready: Comprehensive evaluation and deployment pipeline
  • 🎮 Colab Optimized: One-click training on Google Colab

🚀 Quick Start

Option 1: Google Colab (Recommended)

  1. Open in Colab: launch the training notebook via the *Open In Colab* badge

  2. Set Runtime: Runtime → Change runtime type → Hardware accelerator: A100 GPU

  3. Run All Cells: The notebook will automatically:

    • Install dependencies
    • Clone repository
    • Configure for your GPU
    • Generate/load dataset
    • Train CodeLlama model
    • Evaluate performance

Option 2: Local Setup

```bash
git clone https://github.com/yalcindemir/Claude-to-Codellama-Distillation.git
cd Claude-to-Codellama-Distillation
pip install -r requirements-colab.txt
```

📁 Project Structure

```
claude_to_codellama_distillation/
├── src/                          # Core modules
│   ├── claude_client.py          # Claude API integration
│   ├── dataset_generator.py      # Dataset creation
│   ├── distillation_trainer.py   # Training system
│   ├── evaluation_system.py      # Model evaluation
│   └── advanced_loss.py          # Custom loss functions
├── configs/
│   └── config.yml                # Configuration settings
├── notebooks/
│   └── Claude_Code_Model_Colab_Clean.ipynb  # Main training notebook
├── requirements-colab.txt        # Minimal dependencies
└── README.md                     # This file
```

⚙️ Configuration

The system automatically configures based on your GPU:

A100 Configuration (40GB)

  • Dataset Size: 5,000 examples
  • Batch Size: 4 (effective: 16 with gradient accumulation)
  • Sequence Length: 2,048 tokens
  • LoRA Rank: 16
  • Training Duration: ~3-4 hours

Standard GPU Configuration (6-15GB)

  • Dataset Size: 1,000 examples
  • Batch Size: 1 (effective: 8 with gradient accumulation)
  • Sequence Length: 512 tokens
  • LoRA Rank: 8
  • Training Duration: ~1-2 hours
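The GPU-based switch between the two profiles above can be sketched as a small lookup function. This is a minimal illustration (the function name `select_config` is hypothetical, not the repository's actual API); in practice the memory figure would come from `torch.cuda.get_device_properties(0).total_memory`:

```python
def select_config(gpu_memory_gb: float) -> dict:
    """Pick training hyperparameters from detected GPU memory (illustrative)."""
    if gpu_memory_gb >= 40:
        # A100-class GPU
        return {
            "dataset_size": 5000,
            "batch_size": 4,
            "grad_accum_steps": 4,   # effective batch size 16
            "max_seq_len": 2048,
            "lora_rank": 16,
        }
    # Standard 6-15 GB GPUs (T4, V100, ...)
    return {
        "dataset_size": 1000,
        "batch_size": 1,
        "grad_accum_steps": 8,       # effective batch size 8
        "max_seq_len": 512,
        "lora_rank": 8,
    }
```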

🎯 Expected Results

| Metric | Baseline CodeLlama | Distilled Model | Improvement |
|---|---|---|---|
| HumanEval Pass@1 | 33.5% | 70-75% | +110% |
| MBPP Pass@1 | 41.4% | 65-70% | +60% |
| Code Quality | Good | Excellent | +25% |

🔧 Key Components

Knowledge Distillation System

  • Temperature Scaling: Softmax temperature of 4.0 for smoother distributions
  • Loss Weighting: 70% distillation loss + 30% task loss
  • Dynamic Target Modules: Automatic detection of model-specific LoRA targets
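The temperature and weighting above combine into the standard distillation objective. Here is a minimal pure-Python sketch of the math for a single token position (the training code itself works on batched logit tensors); the `T²` factor follows Hinton et al.'s convention of rescaling soft-target gradients:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, target_idx,
                      temperature=4.0, alpha=0.7):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * cross-entropy."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s))
    ce = -math.log(softmax(student_logits)[target_idx])
    return alpha * (temperature ** 2) * kl + (1 - alpha) * ce
```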

Memory Optimization

  • QLoRA: 4-bit quantization with NF4 and double quantization
  • Gradient Checkpointing: Reduced memory usage during backpropagation
  • Mixed Precision: FP16 training for faster computation
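These three techniques map directly onto HuggingFace `transformers` configuration. A sketch of the loading step, assuming the standard `codellama/CodeLlama-7b-hf` checkpoint:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# QLoRA: 4-bit NF4 quantization with double quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NormalFloat4 data type
    bnb_4bit_use_double_quant=True,       # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.float16, # FP16 mixed-precision compute
)

model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
model.gradient_checkpointing_enable()  # trade recomputation for memory in backprop
```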

Adaptive Architecture

  • Model Detection: Automatic identification of model types (GPT, LLaMA, BERT, T5)
  • Target Module Selection: Dynamic LoRA target selection based on architecture
  • Fallback Strategies: Graceful degradation when LoRA fails
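The target-module selection can be pictured as a lookup keyed by the detected architecture. The function name and table below are illustrative (the real detection inspects `model.config.model_type`); the fallback covers the graceful-degradation case:

```python
def lora_target_modules(model_type: str) -> list:
    """Map a model family to the projection layers LoRA should adapt (illustrative)."""
    targets = {
        "llama": ["q_proj", "k_proj", "v_proj", "o_proj"],
        "gpt2":  ["c_attn", "c_proj"],
        "bert":  ["query", "key", "value"],
        "t5":    ["q", "k", "v", "o"],
    }
    # Fallback: adapt only the most common attention projections
    return targets.get(model_type, ["q_proj", "v_proj"])
```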

📊 Training Process

  1. Environment Setup: Automatic dependency installation and GPU detection
  2. Dataset Generation: Claude API integration or sample data creation
  3. Model Loading: CodeLlama 7B with quantization and LoRA adaptation
  4. Training: Knowledge distillation with monitoring and checkpointing
  5. Evaluation: Performance assessment and model comparison
  6. Export: Model saving for deployment

🔍 Monitoring

The system includes comprehensive monitoring:

  • Real-time Loss Tracking: Task loss, distillation loss, and total loss
  • Memory Usage: GPU memory monitoring and optimization suggestions
  • Training Progress: Step-by-step progress with ETA
  • Performance Metrics: Automatic evaluation on validation set
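The loss-tracking side of the monitor reduces to keeping running averages of the three loss components. A minimal sketch (class name `LossTracker` is hypothetical):

```python
class LossTracker:
    """Running averages of task, distillation, and total loss (illustrative)."""

    def __init__(self):
        self.sums = {"task": 0.0, "distill": 0.0, "total": 0.0}
        self.steps = 0

    def update(self, task_loss: float, distill_loss: float):
        """Record one training step's loss components."""
        self.sums["task"] += task_loss
        self.sums["distill"] += distill_loss
        self.sums["total"] += task_loss + distill_loss
        self.steps += 1

    def averages(self) -> dict:
        """Per-component mean loss over the steps seen so far."""
        return {k: v / max(self.steps, 1) for k, v in self.sums.items()}
```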

💰 Cost Estimation

| Component | Cost (USD) |
|---|---|
| Claude API (5K examples) | $30-50 |
| Colab A100 (4 hours) | $15-20 |
| **Total** | **$45-70** |

🚀 Deployment

After training, your model is ready for:

  • Local Inference: Download and run locally
  • HuggingFace Hub: Upload for easy sharing
  • API Deployment: Deploy with FastAPI or similar
  • Production Integration: Use in applications
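For the local-inference path, loading the trained LoRA adapter on top of the base model follows the standard `peft` pattern. A sketch, assuming the adapter was saved to a directory such as `./distilled-codellama` (the path is hypothetical):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-hf", torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "./distilled-codellama")  # hypothetical adapter dir
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")

prompt = "# Write a function that reverses a string\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```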

🧪 Technical Details

Advanced Features

  • Progressive Distillation: Adaptive weight scheduling
  • Attention Transfer: Pattern-based knowledge transfer
  • Multi-GPU Support: Distributed training capability
  • Cost Optimization: Prompt caching and batch processing
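Adaptive weight scheduling for progressive distillation can be as simple as annealing the distillation weight over training, so the student leans on the teacher's soft targets early and on the hard task loss later. A linear schedule sketch (function name and endpoints are hypothetical):

```python
def distill_weight(step: int, total_steps: int,
                   start: float = 0.9, end: float = 0.5) -> float:
    """Linearly anneal the distillation-loss weight from `start` to `end`."""
    frac = min(step / max(total_steps, 1), 1.0)
    return start + (end - start) * frac
```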

Quality Assurance

  • Automated Testing: Comprehensive test suite
  • Code Validation: Syntax and execution testing
  • Performance Benchmarking: Standard evaluation metrics
  • Continuous Integration: Automated deployment pipeline

🤝 Contributing

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Anthropic for Claude Opus 4 API
  • Meta for CodeLlama models
  • HuggingFace for transformers and PEFT
  • Google Colab for accessible GPU computing

📞 Support


Happy Coding! 🎉