
Commit 42e760c

Update config.yaml
1 parent b00f3cf commit 42e760c

File tree

1 file changed: +74 −37 lines changed


examples/mlx_fine_tuning_kernels/config.yaml

Lines changed: 74 additions & 37 deletions
@@ -1,7 +1,7 @@
 # MLX LoRA Fine-tuning Optimization Configuration
 # Target: Real LoRA fine-tuning efficiency improvements while maintaining convergence
 
-max_iterations: 40
+max_iterations: 60  # More iterations for breakthrough discoveries
 checkpoint_interval: 5
 log_level: "INFO"
 
@@ -12,8 +12,8 @@ llm:
   secondary_model: "gemini-2.5-pro-preview-06-05"
   secondary_model_weight: 0.3
   api_base: "https://generativelanguage.googleapis.com/v1beta/openai/"
-  temperature: 0.8
-  top_p: 0.9
+  temperature: 0.9  # Higher creativity for breakthrough optimizations
+  top_p: 0.95
   max_tokens: 32000
   timeout: 600
 
@@ -86,13 +86,43 @@ prompt:
 # Reduce memory footprint during loss calculation
 ```
 
-# 🚀 PROVEN LORA OPTIMIZATION TECHNIQUES
+**6. UNSLOTH-STYLE MLX KERNEL FUSION** 🎯 PRIMARY SPEED TARGET
+```python
+# Standard: Separate operations
+x = mx.add(input, lora_out)
+x = activation_fn(x)
+x = mx.matmul(x, next_weight)
+
+# Target: Fused kernels using MLX primitives
+# Combine LoRA, activation, and next operation
+# Leverage mx.compile and mx.eval strategically
+```
+
+**7. Smart Gradient Accumulation**
+```python
+# Standard: Individual gradient updates
+for batch in batches:
+    loss = forward(batch)
+    grads = backward(loss)
+    optimizer.update(grads)
+
+# Target: Accumulated updates with reduced sync points
+# Batch multiple LoRA layer updates together
+```
+
+# 🚀 UNSLOTH-INSPIRED OPTIMIZATION TECHNIQUES (Target 2x+ Speed Improvements)
 
-**Weight Fusion**: Pre-compute LoRA deltas when weights don't change
-**Gradient Reuse**: Optimize gradient computation patterns for LoRA structure
-**Memory Access Optimization**: Better cache utilization during LoRA computations
-**Selective Computation**: Skip unnecessary computations based on LoRA rank
-**Training-Specific Optimizations**: Leverage LoRA's low-rank structure
+**🔥 Flash Attention Equivalents for MLX**: Fused attention computation patterns
+**⚡ Kernel Fusion**: Combine LoRA operations with activation functions
+**🧠 Smart Gradient Accumulation**: Batch gradient updates efficiently
+**⭐ Optimized MLX Operations**: Leverage mx.fast for critical paths
+**🚀 Parameter-Efficient Updates**: Minimize optimizer state overhead
+**💾 Memory Mapping**: Efficient tensor reuse and allocation patterns
+**🎯 Selective Computation**: Skip unnecessary ops based on LoRA rank/scale
+**🔧 Mixed Precision**: Smart FP16/FP32 usage for speed without loss
+
+Current baseline shows 1.57x memory improvement but only 1.01x speed.
+FOCUS: Discover speed optimizations like unsloth's 2-5x improvements!
 
 # 📊 SUCCESS METRICS
 
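The gradient-accumulation pattern this hunk asks the evolver to discover rests on a simple identity: one update computed from the full batch equals the average of gradients over equal-sized micro-batches, so the optimizer (and any host/device sync point) runs once instead of once per micro-batch. A minimal framework-agnostic sketch with numpy, using an illustrative linear model and MSE loss (none of these names come from the repository):

```python
import numpy as np

# Hypothetical linear model y ≈ X @ w with MSE loss; purely illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))
y = rng.normal(size=(8,))
w = rng.normal(size=(4,))

def grad(Xb, yb, w):
    # d/dw mean((Xb @ w - yb)**2) = (2/n) * Xb.T @ (Xb @ w - yb)
    n = len(yb)
    return (2.0 / n) * Xb.T @ (Xb @ w - yb)

# One gradient on the full batch ...
g_full = grad(X, y, w)

# ... equals the mean of gradients over equal-sized micro-batches,
# so a single optimizer step can replace four per-batch steps.
micro = [grad(X[i:i + 2], y[i:i + 2], w) for i in range(0, 8, 2)]
g_accum = np.mean(micro, axis=0)

assert np.allclose(g_full, g_accum)
```

The same algebra carries over to MLX arrays; the speed win comes from fewer optimizer calls and sync points, not from a different result.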
@@ -114,41 +144,48 @@ prompt:
 
 Your optimizations should target similar patterns adapted for MLX.
 
-# 🚫 CONSTRAINTS
-- Keep the same function signatures and class interfaces
-- Maintain numerical correctness (final loss must match baseline within 1%)
-- Support all LoRA configurations (different ranks, scales, etc.)
-- No external dependencies beyond MLX
-- Focus on PRACTICAL optimizations that maintain convergence
-- 🚨 CRITICAL: Keep code changes MINIMAL and FOCUSED (under 40,000 chars)
-- NO verbose comments, examples, or redundant code
-- Use concise variable names and efficient implementations
-
-# 🔍 WHAT TO EVOLVE
-
-Focus on the `evolved_lora_kernels` function. The key operations to optimize:
-
-1. **OptimizedLoRALinear**: Improved LoRA linear layer implementation
-2. **optimized_lora_training_step**: More efficient training loop
-3. **optimized_multi_layer_lora_application**: Batch LoRA operations
-4. **memory_efficient_lora_loss**: Reduced memory loss computation
-5. **optimized_gradient_checkpointing_lora**: Memory-efficient checkpointing
-
-Evolve towards optimizations that provide real efficiency gains while maintaining
-the exact same training loss convergence as the baseline implementation.
+# 🚫 CONSTRAINTS
+- Keep exact function signatures from initial_program.py
+- Maintain numerical correctness (loss must match baseline within 0.01)
+- Support all LoRA configs (ranks 8-64, any scale/dropout)
+- MLX-only dependencies (mx.core, mx.nn, mx.optimizers)
+- 🚨 CRITICAL: Concise evolution changes (under 35,000 chars total)
+- NO verbose comments - focus on algorithmic improvements
+- Prioritize SPEED over memory (we already have 1.57x memory gain)
+- Test mx.compile, mx.eval, kernel fusion, gradient accumulation patterns
+
+# 🔍 WHAT TO EVOLVE - TARGET UNSLOTH-STYLE 2x+ SPEED GAINS
+
+Focus on `evolved_lora_kernels` function. Prioritize SPEED optimizations:
+
+1. **optimized_lora_fine_tuning**: Main training pipeline with kernel fusion
+2. **optimized_training_loop**: Batch gradient accumulation like unsloth
+3. **optimized_train_step**: Fused forward/backward with mx.compile
+4. **optimized_linear_to_lora_layers**: Batched multi-layer LoRA application
+5. **optimized_evaluate**: Fast inference with weight pre-computation
+
+🎯 PRIMARY TARGETS FOR SPEED BREAKTHROUGH:
+- Leverage `mx.compile()` for hot paths (like unsloth's kernel compilation)
+- Use `mx.eval()` strategically to minimize sync points
+- Batch operations across multiple LoRA layers simultaneously
+- Pre-compute weights when beneficial (inference mode optimization)
+- Implement gradient accumulation patterns that reduce memory allocations
+
+Current Results: 1.57x memory ✅, 1.01x speed ❌
+Target: Discover 2-5x speed improvements while maintaining perfect convergence!
 
   num_top_programs: 6
   num_diverse_programs: 4
 
 # Database configuration for LoRA optimization
 database:
   db_path: "./openevolve_output/program_db"
-  population_size: 60
-  archive_size: 30
+  population_size: 80  # Larger population for more diverse explorations
+  archive_size: 40
   num_islands: 4
-  elite_selection_ratio: 0.25
-  exploitation_ratio: 0.7
-  exploration_ratio: 0.3
+  elite_selection_ratio: 0.20  # Less elite pressure, more exploration
+  exploitation_ratio: 0.6  # Balanced exploration for breakthroughs
+  exploration_ratio: 0.4
 
 # Evaluator configuration
 evaluator:
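The "weight pre-computation" target above (item 5, `optimized_evaluate`) relies on the fact that a LoRA delta can be folded into the base weight once, turning three matmuls per call into one for inference. A standalone numpy sketch of the identity, using the usual LoRA convention (W: out×in, A: r×in, B: out×r) rather than the repository's actual API:

```python
import numpy as np

# Illustrative shapes and names; not taken from the repo.
rng = np.random.default_rng(1)
d_in, d_out, r, scale = 16, 8, 4, 2.0
W = rng.normal(size=(d_out, d_in))   # frozen base weight
A = rng.normal(size=(r, d_in))       # LoRA down-projection
B = rng.normal(size=(d_out, r))      # LoRA up-projection
x = rng.normal(size=(3, d_in))       # batch of inputs

# Training-time path: base matmul plus two small LoRA matmuls per call
y_lora = x @ W.T + scale * (x @ A.T) @ B.T

# Inference-time path: fold the low-rank delta into W once,
# then every forward pass is a single matmul
W_eff = W + scale * B @ A
y_fused = x @ W_eff.T

assert np.allclose(y_lora, y_fused)
```

Folding only pays off when the weights stop changing (evaluation/inference), which is why the prompt scopes it to `optimized_evaluate` rather than the training step.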
@@ -158,4 +195,4 @@ evaluator:
 # Evolution settings
 diff_based_evolution: true
 allow_full_rewrites: false
-max_code_length: 50000
+max_code_length: 45000  # Encourage concise, focused optimizations
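Pieced together from the hunks above, the post-commit values of the touched settings look roughly like the fragment below (only keys visible in this diff are shown; everything else in config.yaml is elided):

```yaml
max_iterations: 60
checkpoint_interval: 5
log_level: "INFO"

llm:
  temperature: 0.9
  top_p: 0.95
  max_tokens: 32000
  timeout: 600

database:
  population_size: 80
  archive_size: 40
  num_islands: 4
  elite_selection_ratio: 0.20
  exploitation_ratio: 0.6
  exploration_ratio: 0.4

diff_based_evolution: true
allow_full_rewrites: false
max_code_length: 45000
```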
