@@ -1,7 +1,7 @@
# MLX LoRA Fine-tuning Optimization Configuration
# Target: Real LoRA fine-tuning efficiency improvements while maintaining convergence

-max_iterations: 40
+max_iterations: 60  # More iterations for breakthrough discoveries
checkpoint_interval: 5
log_level: "INFO"

@@ -12,8 +12,8 @@
  secondary_model: "gemini-2.5-pro-preview-06-05"
  secondary_model_weight: 0.3
  api_base: "https://generativelanguage.googleapis.com/v1beta/openai/"
-  temperature: 0.8
-  top_p: 0.9
+  temperature: 0.9  # Higher creativity for breakthrough optimizations
+  top_p: 0.95
  max_tokens: 32000
  timeout: 600

@@ -86,13 +86,43 @@ prompt:
# Reduce memory footprint during loss calculation
```

-# 🚀 PROVEN LORA OPTIMIZATION TECHNIQUES
+**6. UNSLOTH-STYLE MLX KERNEL FUSION** 🎯 PRIMARY SPEED TARGET
+```python
+# Standard: Separate operations
+x = mx.add(input, lora_out)
+x = activation_fn(x)
+x = mx.matmul(x, next_weight)
+
+# Target: Fused kernels using MLX primitives
+# Combine LoRA, activation, and next operation
+# Leverage mx.compile and mx.eval strategically
+```
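True kernel fusion needs `mx.compile` on-device, but one piece of the win is purely algebraic and framework-agnostic: evaluate the LoRA branch as `(x @ A) @ B` instead of materializing the full-size delta `A @ B` on every call. A minimal NumPy sketch of that ordering (the helper names are hypothetical, not MLX API):

```python
import numpy as np

def lora_forward(x, W, A, B, scale):
    """Low-rank ordering: two thin matmuls, no d-by-k delta allocated.

    Cost is O(n*d*r + n*r*k) versus O(d*r*k + n*d*k) for building
    A @ B first -- a big win when rank r is much smaller than d, k.
    """
    return x @ W + scale * ((x @ A) @ B)

# Reference path that materializes the full delta on each call:
def lora_forward_naive(x, W, A, B, scale):
    return x @ (W + scale * (A @ B))

x = np.ones((2, 6), dtype=np.float32)
W = np.eye(6, 4, dtype=np.float32)
A = np.ones((6, 2), dtype=np.float32)
B = np.ones((2, 4), dtype=np.float32)
# Both orderings are mathematically identical.
```

The low-rank ordering also avoids allocating a d×k temporary per forward pass, which is exactly the kind of intermediate a fused kernel would eliminate.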
+
+**7. Smart Gradient Accumulation**
+```python
+# Standard: Individual gradient updates
+for batch in batches:
+    loss = forward(batch)
+    grads = backward(loss)
+    optimizer.update(grads)
+
+# Target: Accumulated updates with reduced sync points
+# Batch multiple LoRA layer updates together
+```
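The accumulation pattern above can be sketched framework-agnostically: sum gradients over k micro-batches and apply a single averaged update, so the optimizer step (and any device synchronization) runs k times less often. A toy pure-Python sketch with a scalar parameter (all names hypothetical):

```python
def accumulated_updates(grads_per_batch, param, lr, accum_steps):
    """Apply one averaged SGD update per `accum_steps` micro-batch gradients."""
    accum = 0.0
    for i, g in enumerate(grads_per_batch, start=1):
        accum += g                            # cheap local accumulation
        if i % accum_steps == 0:              # single sync/update point
            param -= lr * (accum / accum_steps)
            accum = 0.0
    return param

# Four micro-batch gradients, accumulated in pairs:
p = accumulated_updates([1.0, 3.0, 2.0, 6.0], param=10.0, lr=0.1, accum_steps=2)
# p == 9.4 (two updates of 0.2 and 0.4 instead of four small ones)
```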
+
+# 🚀 UNSLOTH-INSPIRED OPTIMIZATION TECHNIQUES (Target 2x+ Speed Improvements)

-**Weight Fusion**: Pre-compute LoRA deltas when weights don't change
-**Gradient Reuse**: Optimize gradient computation patterns for LoRA structure
-**Memory Access Optimization**: Better cache utilization during LoRA computations
-**Selective Computation**: Skip unnecessary computations based on LoRA rank
-**Training-Specific Optimizations**: Leverage LoRA's low-rank structure
+**🔥 Flash Attention Equivalents for MLX**: Fused attention computation patterns
+**⚡ Kernel Fusion**: Combine LoRA operations with activation functions
+**🧠 Smart Gradient Accumulation**: Batch gradient updates efficiently
+**⭐ Optimized MLX Operations**: Leverage mx.fast for critical paths
+**🚀 Parameter-Efficient Updates**: Minimize optimizer state overhead
+**💾 Memory Mapping**: Efficient tensor reuse and allocation patterns
+**🎯 Selective Computation**: Skip unnecessary ops based on LoRA rank/scale
+**🔧 Mixed Precision**: Smart FP16/FP32 usage for speed without loss
+
+Current baseline shows 1.57x memory improvement but only 1.01x speed.
+FOCUS: Discover speed optimizations like unsloth's 2-5x improvements!
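The Mixed Precision item above is the easiest to demonstrate off-device: compute in FP16 but keep the accumulator (the "master weights") in FP32, otherwise small updates vanish once the running value grows past FP16's resolution. A NumPy sketch of the failure mode and the fix (illustrative only, no MLX required):

```python
import numpy as np

updates = np.full(10_000, 1e-4, dtype=np.float16)

# Naive: accumulate directly in fp16 -- once the running value is large
# relative to fp16 spacing, each tiny add rounds away to nothing.
acc16 = np.float16(0.0)
for u in updates:
    acc16 = np.float16(acc16 + u)

# Master-weight pattern: keep the accumulator in fp32; fp16 is used
# only for the (cheap) per-step values.
acc32 = np.float32(0.0)
for u in updates:
    acc32 += np.float32(u)

# acc32 lands near the true sum (~1.0); acc16 stalls well below it.
```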

# 📊 SUCCESS METRICS

@@ -114,41 +114,48 @@ prompt:

Your optimizations should target similar patterns adapted for MLX.

-# 🚫 CONSTRAINTS
-- Keep the same function signatures and class interfaces
-- Maintain numerical correctness (final loss must match baseline within 1%)
-- Support all LoRA configurations (different ranks, scales, etc.)
-- No external dependencies beyond MLX
-- Focus on PRACTICAL optimizations that maintain convergence
-- 🚨 CRITICAL: Keep code changes MINIMAL and FOCUSED (under 40,000 chars)
-- NO verbose comments, examples, or redundant code
-- Use concise variable names and efficient implementations
-
-# 🔍 WHAT TO EVOLVE
-
-Focus on the `evolved_lora_kernels` function. The key operations to optimize:
-
-1. **OptimizedLoRALinear**: Improved LoRA linear layer implementation
-2. **optimized_lora_training_step**: More efficient training loop
-3. **optimized_multi_layer_lora_application**: Batch LoRA operations
-4. **memory_efficient_lora_loss**: Reduced memory loss computation
-5. **optimized_gradient_checkpointing_lora**: Memory-efficient checkpointing
-
-Evolve towards optimizations that provide real efficiency gains while maintaining
-the exact same training loss convergence as the baseline implementation.
+# 🚫 CONSTRAINTS
+- Keep exact function signatures from initial_program.py
+- Maintain numerical correctness (loss must match baseline within 0.01)
+- Support all LoRA configs (ranks 8-64, any scale/dropout)
+- MLX-only dependencies (mlx.core, mlx.nn, mlx.optimizers)
+- 🚨 CRITICAL: Concise evolution changes (under 35,000 chars total)
+- NO verbose comments - focus on algorithmic improvements
+- Prioritize SPEED over memory (we already have a 1.57x memory gain)
+- Test mx.compile, mx.eval, kernel fusion, and gradient accumulation patterns
+
+# 🔍 WHAT TO EVOLVE - TARGET UNSLOTH-STYLE 2x+ SPEED GAINS
+
+Focus on the `evolved_lora_kernels` function. Prioritize SPEED optimizations:
+
+1. **optimized_lora_fine_tuning**: Main training pipeline with kernel fusion
+2. **optimized_training_loop**: Batch gradient accumulation like unsloth
+3. **optimized_train_step**: Fused forward/backward with mx.compile
+4. **optimized_linear_to_lora_layers**: Batched multi-layer LoRA application
+5. **optimized_evaluate**: Fast inference with weight pre-computation
+
+🎯 PRIMARY TARGETS FOR SPEED BREAKTHROUGH:
+- Leverage `mx.compile()` for hot paths (like unsloth's kernel compilation)
+- Use `mx.eval()` strategically to minimize sync points
+- Batch operations across multiple LoRA layers simultaneously
+- Pre-compute weights when beneficial (inference-mode optimization)
+- Implement gradient accumulation patterns that reduce memory allocations
+
+Current results: 1.57x memory ✅, 1.01x speed ❌
+Target: Discover 2-5x speed improvements while maintaining perfect convergence!
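The weight pre-computation target for `optimized_evaluate` (a function name from this config's initial_program.py, not a public API) amounts to folding the LoRA delta into the base weight once, so every subsequent inference call pays a single matmul. A NumPy sketch, assuming standard LoRA shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, scale = 16, 8, 4, 2.0
W = rng.standard_normal((d, k)).astype(np.float32)   # frozen base weight
A = rng.standard_normal((d, r)).astype(np.float32)   # LoRA down-projection
B = rng.standard_normal((r, k)).astype(np.float32)   # LoRA up-projection

# Training-time path: keep the factors separate (cheap to update).
def forward_unmerged(x):
    return x @ W + scale * ((x @ A) @ B)

# Inference path: pay the O(d*r*k) merge once, then one matmul per call.
W_merged = W + scale * (A @ B)
def forward_merged(x):
    return x @ W_merged

x = rng.standard_normal((3, d)).astype(np.float32)
# Both paths agree to float32 tolerance.
```

The trade-off: merging loses the ability to update A and B cheaply, so it only pays off once weights are frozen for evaluation.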

  num_top_programs: 6
  num_diverse_programs: 4

# Database configuration for LoRA optimization
database:
  db_path: "./openevolve_output/program_db"
-  population_size: 60
-  archive_size: 30
+  population_size: 80  # Larger population for more diverse exploration
+  archive_size: 40
  num_islands: 4
-  elite_selection_ratio: 0.25
-  exploitation_ratio: 0.7
-  exploration_ratio: 0.3
+  elite_selection_ratio: 0.20  # Less elite pressure, more exploration
+  exploitation_ratio: 0.6  # Balanced exploration for breakthroughs
+  exploration_ratio: 0.4

@@ -158,4 +195,4 @@ evaluator:
# Evolution settings
diff_based_evolution: true
allow_full_rewrites: false
-max_code_length: 50000
+max_code_length: 45000  # Encourage concise, focused optimizations