# MLX Training Performance Optimization with OpenEvolve

This example demonstrates using OpenEvolve to optimize MLX training performance on Apple Silicon, focusing exclusively on accelerating neural network training workloads.

## The Training-Focused Approach: Real-World MLX Training Optimization

Rather than end-to-end inference, this example optimizes **MLX training performance**:

✅ **Training Workloads**: Forward + backward passes with gradient computation
✅ **Realistic Models**: Transformer architectures with substantial matrix operations
✅ **Training Patterns**: Batch processing, MLP layers, attention computation
✅ **Clear Signal**: Consistent evaluation without inference noise
✅ **Practical Value**: Accelerate model development and research workflows

## Why Training-Only Optimization?

### 1. **Cleaner Evaluation Signal**

Training provides much more consistent evaluation than inference:

```python
import mlx.core as mx
import mlx.nn as nn

# Training: deterministic, substantial computation
def training_step():
    inputs = mx.random.randint(0, vocab_size, (batch_size, seq_len))    # Fixed batch shape
    targets = mx.random.randint(0, vocab_size, (batch_size, seq_len))   # Matching targets
    loss, grads = nn.value_and_grad(model, loss_fn)(model, inputs, targets)  # Forward + backward
    optimizer.update(model, grads)   # Parameter updates
    mx.eval(model.parameters())      # Force MLX's lazy computation to run
```

**Benefits:**
- No model loading overhead (1-2 second penalty eliminated)
- No text generation variability
- Deterministic computation graphs
- Consistent matrix dimensions across runs
- More matrix operations per evaluation (see the timing sketch below)
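
For illustration, here is one way those repeated steps can be timed reproducibly. This is a minimal sketch, not part of the example's evaluator: it assumes the `training_step()`, `model`, and size constants from the snippet above, fixes the random seed so every run sees the same batches, and uses `mx.eval` to force MLX's lazy computation before the clock stops.

```python
import time
import mlx.core as mx

def benchmark_training(num_steps: int = 10) -> float:
    """Return the average wall-clock time per training step (seconds)."""
    mx.random.seed(0)             # fixed seed -> identical batch sequence on every run
    training_step()               # warm-up step: first-call allocation happens here
    mx.eval(model.parameters())   # make sure warm-up work has finished

    start = time.perf_counter()
    for _ in range(num_steps):
        training_step()
        mx.eval(model.parameters())  # execute the lazy graph inside the timed region
    return (time.perf_counter() - start) / num_steps
```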

### 2. **Training-Specific Matrix Patterns**

Training has unique characteristics that benefit from specialized optimization:

🧠 **Training Workload Patterns**:
- **Larger Batch Sizes**: 16-32 vs 1-4 for inference
- **Forward + Backward**: Double the matrix operations
- **Gradient Computation**: Requires transpose operations (see the shape sketch below)
- **Memory Pressure**: Activations + gradients + parameters
- **Repeated Patterns**: Same operations across many training steps

🎯 **Optimization Opportunities**:
- **Batch-Aware Tiling**: Different strategies for larger batch dimensions
- **Gradient-Friendly Patterns**: Consider transpose operations in backward pass
- **Memory Hierarchy**: Balance cache usage with gradient storage
- **Training Consistency**: Optimize for repeated execution patterns
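
To make those shapes concrete, the sketch below enumerates the GEMM dimensions produced by a single MLP layer during training. The sizes (batch 32, sequence 512, hidden 1024) are illustrative assumptions matching the configuration later in this document, not hard requirements.

```python
# Illustrative training dimensions
batch_size, seq_len, hidden_dim = 32, 512, 1024
tokens = batch_size * seq_len  # every activation matrix has this many rows

# (M, N, K) for C = A @ B, where A is M x K and B is K x N
mlp_gemms = {
    "up_proj   (forward)":     (tokens, 4 * hidden_dim, hidden_dim),
    "down_proj (forward)":     (tokens, hidden_dim, 4 * hidden_dim),
    "up_proj   (grad wrt X)":  (tokens, hidden_dim, 4 * hidden_dim),  # dX = dY @ W^T
    "up_proj   (grad wrt W)":  (hidden_dim, 4 * hidden_dim, tokens),  # dW = X^T @ dY
}

for name, (M, N, K) in mlp_gemms.items():
    print(f"{name}: M={M:6d}  N={N:5d}  K={K:6d}")
```

Note how the backward pass doubles the GEMM count and changes which dimension is largest, which is exactly what a batch-aware, gradient-friendly tiling strategy has to handle.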

### 3. **Substantial Practical Value**

Training optimization provides real benefits:
- **Faster Research Iteration**: Quicker model development cycles
- **Cost Reduction**: Lower compute costs for training runs
- **Better Hardware Utilization**: More efficient use of Apple Silicon
- **Scalability**: Benefits increase with larger models and datasets

## Technical Implementation

### Matrix Operation Focus

The evolution targets the key functions used in training:

```python
def choose_tile_size(M, N, K, device_info):
    """
    Choose tile sizes for C = A @ B (A is M×K, B is K×N), optimized for
    training-specific patterns:
    - Batch-heavy matrices (large M dimension)
    - MLP expansion/projection (4x hidden dimension scaling)
    - Attention computation (square-ish matrices)
    - Gradient computation (consider transpose patterns)
    """
    # This function is evolved by OpenEvolve; it returns (tile_M, tile_N, tile_K).

def optimized_matmul(A, B, tile_M, tile_N, tile_K):
    """
    Implement tiled multiplication optimized for:
    - Training memory access patterns
    - Apple Silicon architecture
    - Cache efficiency with gradient storage

    Must remain numerically equivalent to A @ B.
    """
    # This function implements the actual tiled computation and is also evolved.
```
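
For reference, a deliberately simple tiled implementation consistent with those signatures is sketched below. It is a correctness baseline only (assuming 2-D `mx.array` inputs), not the evolved kernel: any evolved `optimized_matmul` must produce the same result as `A @ B` while beating this kind of naive tiling.

```python
import mlx.core as mx

def naive_tiled_matmul(A, B, tile_M=64, tile_N=64, tile_K=64):
    """Baseline tiled multiply: numerically equivalent to A @ B, with no tuning."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"

    row_blocks = []
    for i in range(0, M, tile_M):
        col_blocks = []
        for j in range(0, N, tile_N):
            # Accumulate partial products over the K dimension one tile at a time
            acc = mx.zeros((min(tile_M, M - i), min(tile_N, N - j)), dtype=A.dtype)
            for k in range(0, K, tile_K):
                acc = acc + A[i:i + tile_M, k:k + tile_K] @ B[k:k + tile_K, j:j + tile_N]
            col_blocks.append(acc)
        row_blocks.append(mx.concatenate(col_blocks, axis=1))
    return mx.concatenate(row_blocks, axis=0)
```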

### Enhanced Training Evaluation

The evaluator creates realistic training scenarios:

```python
class EnhancedTrainingModel(nn.Module):
    """
    Transformer-like model with substantial matrix operations:
    - Multiple MLP layers (4x expansion/projection)
    - Attention-like operations
    - Large output projections
    - Forward + backward passes
    """

# Training configuration
batch_size = 32     # Realistic training batch
seq_len = 512       # Longer sequences
hidden_dim = 1024   # Large hidden dimension
vocab_size = 6000   # Substantial vocabulary
```
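
The class body is elided above; a minimal module along these lines (an assumption for illustration, not the evaluator's exact code) shows where the large matrix multiplications come from:

```python
import mlx.nn as nn

class EnhancedTrainingModel(nn.Module):
    """Illustrative sketch of a training benchmark model."""

    def __init__(self, vocab_size=6000, hidden_dim=1024, num_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.attn = nn.MultiHeadAttention(hidden_dim, num_heads)  # attention-like operations
        self.mlp_up = nn.Linear(hidden_dim, 4 * hidden_dim)       # 4x expansion
        self.mlp_down = nn.Linear(4 * hidden_dim, hidden_dim)     # 4x projection
        self.out = nn.Linear(hidden_dim, vocab_size)              # large output projection

    def __call__(self, tokens):
        h = self.embed(tokens)
        h = h + self.attn(h, h, h)                       # self-attention block
        h = h + self.mlp_down(nn.gelu(self.mlp_up(h)))   # MLP block
        return self.out(h)
```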

## Quick Start

### Install Dependencies
```bash
pip install -r requirements.txt
```

### Run Training-Focused Optimization
```bash
python ../../openevolve-run.py initial_program.py evaluator.py --config config.yaml --iterations 200
```

### Resume from Checkpoint
```bash
# If interrupted, resume with:
python ../../openevolve-run.py initial_program.py evaluator.py --config config.yaml --checkpoint ./openevolve_output/mlx_training_optimization_db/checkpoints/checkpoint_XX --iterations 100
```

## Expected Results

The training-focused approach should discover optimizations providing:

📈 **Training Speedup**: 10-25% faster training steps
🎯 **Consistent Optimization**: Better signal-to-noise ratio for evolution
🔧 **Architecture-Aware**: M1/M2/M3/M4 specific optimizations
⚡ **Memory Efficient**: Optimized for training's memory pressure
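
Speedup here is the ratio of baseline to optimized time for the same training steps (a 1.20x ratio corresponds to the "20% faster" phrasing above); the numbers below are hypothetical:

```python
baseline_ms, optimized_ms = 48.0, 40.0   # hypothetical mean step times
speedup = baseline_ms / optimized_ms     # 1.20x
print(f"{(speedup - 1) * 100:.0f}% faster training steps")
```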

## Evolution Discoveries

Based on training characteristics and Apple Silicon architecture, expect OpenEvolve to discover:

🧠 **Training Workload Classification**:
```python
is_batch_heavy = (M > 256)                              # Large batch dimension
is_mlp = (aspect_ratio_K > 1.5)                         # MLP 4x expansion patterns
is_gradient_computation = (transpose_pattern_detected)  # Backward pass
```

🔧 **Apple Silicon Training Optimization**:
```python
if "M4" in chip and is_batch_heavy:
    base_tile = 128; vector_align = 32  # Large tiles for AMX units
    memory_scale = 1.5                  # Training can use more memory
elif is_mlp and training_workload:
    k_bias = 1.3                        # Favor K dimension for MLP patterns
```

⚡ **Training Memory Patterns**:
```python
# Optimize for training's repeated execution
if total_elements > 1_000_000 and is_training:
    scale = 1.1                                # Larger tiles for substantial computation
    cache_optimization = "training_friendly"   # Consider gradient storage
```
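
Pieced together, heuristic fragments like these might combine into a single `choose_tile_size` of roughly the following shape. This is an illustration of what a discovered solution could look like, not an actual evolved program:

```python
def choose_tile_size(M, N, K, device_info):
    """Illustrative combination of the heuristic fragments above."""
    chip = device_info.get("chip", "")
    is_batch_heavy = M > 256               # large batch*seq dimension
    is_mlp = max(N, K) >= 4 * min(N, K)    # 4x expansion/projection shape

    # Chip-dependent starting point (assumed values, for illustration only)
    if "M3" in chip or "M4" in chip:
        base_tile, vector_align = 128, 32
    else:
        base_tile, vector_align = 64, 16

    tile_M = base_tile * 2 if is_batch_heavy else base_tile
    tile_N = base_tile
    tile_K = int(base_tile * (1.3 if is_mlp else 1.0))  # favor K for MLP shapes

    def fit(tile, dim):
        """Round down to the vector width and clamp to the problem size."""
        return max(vector_align, min(dim, (tile // vector_align) * vector_align))

    return fit(tile_M, M), fit(tile_N, N), fit(tile_K, K)
```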

## Integration with Training Workflows

Once optimized, integrate with any MLX training code:

```python
import mlx.core as mx
import mlx.nn as nn
from optimized_kernels import enable_training_optimizations

# Enable OpenEvolve training optimizations
enable_training_optimizations("./openevolve_output/best/best_program.py")

# Your existing training code gets automatic speedups!
loss_and_grad = nn.value_and_grad(model, loss_fn)
for epoch in range(num_epochs):
    for batch in dataloader:
        loss, grads = loss_and_grad(model, batch)
        optimizer.update(model, grads)  # Now faster!
```
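
A quick way to sanity-check an evolved kernel is to compare it against MLX's built-in matmul on a training-sized problem. This is a hypothetical check (it assumes the evolved `choose_tile_size` and `optimized_matmul` have been imported from the saved best program):

```python
import mlx.core as mx

A = mx.random.normal((16384, 1024))   # (batch*seq) x hidden
B = mx.random.normal((1024, 4096))    # hidden x 4*hidden (MLP up-projection)

M, K = A.shape
N = B.shape[1]
tile_M, tile_N, tile_K = choose_tile_size(M, N, K, device_info={})
C_opt = optimized_matmul(A, B, tile_M, tile_N, tile_K)

print(mx.allclose(C_opt, A @ B, rtol=1e-3, atol=1e-3))  # should print True
```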

## Comparison: Training vs Inference Optimization

| **Inference Optimization** | **Training Optimization** |
|----------------------------|---------------------------|
| ❌ Noisy evaluation (model loading, text generation) | ✅ Clean evaluation (deterministic computation) |
| ❌ Small matrices (batch=1-4) | ✅ Large matrices (batch=16-32) |
| ❌ Variable workloads | ✅ Consistent patterns |
| ❌ Complex pipeline overhead | ✅ Direct matrix operation focus |
| ❌ Difficult signal extraction | ✅ Clear optimization signal |

## Research Impact

This training-focused approach demonstrates:

1. **Practical AI Acceleration**: Directly optimizing the bottleneck of model development
2. **Hardware-Software Co-Design**: Training-specific optimizations for Apple Silicon
3. **Clear Evaluation Methodology**: Robust metrics for evolutionary optimization
4. **Real-World Application**: Immediate benefits for ML researchers and practitioners

This moves from proof-of-concept to **production-ready training acceleration** that ML practitioners can immediately benefit from.