Commit 7e9b419

1 parent 2f2ae23 · commit 7e9b419

2 files changed: +212 −176 lines changed
Lines changed in this file: 142 additions & 135 deletions

# MLX Training Performance Optimization with OpenEvolve

This example demonstrates using OpenEvolve to optimize MLX training performance on Apple Silicon, focusing exclusively on accelerating neural network training workloads.

## The Training-Focused Approach: Real-World MLX Training Optimization

We now focus exclusively on **MLX training performance** optimization:

- **Training Workloads**: Forward + backward passes with gradient computation
- **Realistic Models**: Transformer architectures with substantial matrix operations
- **Training Patterns**: Batch processing, MLP layers, attention computation
- **Clear Signal**: Consistent evaluation without inference noise
- **Practical Value**: Accelerate model development and research workflows

## Why Training-Only Optimization?

### 1. **Cleaner Evaluation Signal**

Training provides much more consistent evaluation than inference:

```python
# Training: deterministic, substantial computation
def training_step():
    inputs = mx.random.randint(0, vocab_size, (batch_size, seq_len))  # Fixed size
    logits = model(inputs)                                            # Deterministic forward pass
    loss, grads = mx.value_and_grad(loss_fn)(model, inputs, targets)  # Gradient computation
    optimizer.update(model, grads)                                    # Parameter updates
```

**Benefits:**
- No model loading overhead (1-2 second penalty eliminated)
- No text generation variability
- Deterministic computation graphs
- Consistent matrix dimensions across runs
- More matrix operations per evaluation

### 2. **Training-Specific Matrix Patterns**

Training has unique characteristics that benefit from specialized optimization:

🧠 **Training Workload Patterns**:
- **Larger Batch Sizes**: 16-32 for training vs. 1-4 for inference
- **Forward + Backward**: Double the matrix operations
- **Gradient Computation**: Requires transpose operations
- **Memory Pressure**: Activations + gradients + parameters
- **Repeated Patterns**: The same operations across many training steps

🎯 **Optimization Opportunities**:
- **Batch-Aware Tiling**: Different strategies for the larger batch dimension
- **Gradient-Friendly Patterns**: Account for transpose operations in the backward pass
- **Memory Hierarchy**: Balance cache usage against gradient storage
- **Training Consistency**: Optimize for repeated execution patterns

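
To make these shapes concrete, here is a small sketch of how one training batch maps onto the M, N, K dimensions that the tiling code sees. The numbers mirror the training configuration used later in this README; the exact dimensions in the evaluator may differ.

```python
# Illustrative only: mapping training tensor shapes to matmul dimensions.
batch_size, seq_len, hidden_dim = 32, 512, 1024

# Forward pass of one MLP expansion layer: (batch*seq, hidden) @ (hidden, 4*hidden)
M = batch_size * seq_len   # 16384 -> the "batch-heavy" dimension
K = hidden_dim             # 1024
N = 4 * hidden_dim         # 4096  -> MLP 4x expansion

# Weight gradient in the backward pass: (hidden, batch*seq) @ (batch*seq, 4*hidden),
# i.e. the activations appear transposed, which is why gradient computation
# rewards transpose-friendly tile choices.
M_grad, K_grad, N_grad = hidden_dim, batch_size * seq_len, 4 * hidden_dim
```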

### 3. **Substantial Practical Value**

Training optimization provides real benefits:
- **Faster Research Iteration**: Quicker model development cycles
- **Cost Reduction**: Lower compute costs for training runs
- **Better Hardware Utilization**: More efficient use of Apple Silicon
- **Scalability**: Benefits increase with larger models and datasets

## Technical Implementation

### Matrix Operation Focus

The evolution targets the key functions used in training:

```python
def choose_tile_size(M, N, K, device_info):
    """
    Optimize for training-specific patterns:
    - Batch-heavy matrices (large M dimension)
    - MLP expansion/projection (4x hidden-dimension scaling)
    - Attention computation (square-ish matrices)
    - Gradient computation (consider transpose patterns)
    """

def optimized_matmul(A, B, tile_M, tile_N, tile_K):
    """
    Implement tiled multiplication optimized for:
    - Training memory access patterns
    - Apple Silicon architecture
    - Cache efficiency alongside gradient storage
    """
```

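For reference, a minimal, unoptimized version of the tiled multiplication could look like the sketch below. This is an assumption about the general structure, not a copy of the code in `initial_program.py`; the evolution's gains come from better tile-size choices and memory-access patterns than this naive loop.

```python
import mlx.core as mx

def tiled_matmul(A, B, tile_M=128, tile_N=128, tile_K=128):
    """Compute C = A @ B block by block (reference sketch, not tuned)."""
    M, K = A.shape
    _, N = B.shape
    row_blocks = []
    for i in range(0, M, tile_M):
        col_blocks = []
        for j in range(0, N, tile_N):
            # Accumulate partial products over the K dimension for this output tile
            acc = mx.zeros((min(tile_M, M - i), min(tile_N, N - j)), dtype=A.dtype)
            for k in range(0, K, tile_K):
                acc = acc + A[i:i + tile_M, k:k + tile_K] @ B[k:k + tile_K, j:j + tile_N]
            col_blocks.append(acc)
        row_blocks.append(mx.concatenate(col_blocks, axis=1))
    return mx.concatenate(row_blocks, axis=0)

# Quick correctness check against MLX's built-in matmul
A = mx.random.normal((512, 256))
B = mx.random.normal((256, 384))
assert mx.allclose(tiled_matmul(A, B), A @ B, atol=1e-3)
```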

### Enhanced Training Evaluation

The evaluator creates realistic training scenarios:

```python
class EnhancedTrainingModel(nn.Module):
    """
    Transformer-like model with substantial matrix operations:
    - Multiple MLP layers (4x expansion/projection)
    - Attention-like operations
    - Large output projections
    - Forward + backward passes
    """

# Training configuration
batch_size = 32     # Realistic training batch
seq_len = 512       # Longer sequences
hidden_dim = 1024   # Large hidden dimension
vocab_size = 6000   # Substantial vocabulary
```

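The snippet above only outlines the model; the sketch below shows the kind of self-contained training loop such an evaluator can time. It is illustrative rather than a copy of `evaluator.py`: the `nn.Sequential` layer stack and the step count are assumptions, but the layer sizes match the configuration above.

```python
import time
import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim

def benchmark_training(num_steps=5, batch_size=32, seq_len=512,
                       hidden_dim=1024, vocab_size=6000):
    """Return the average wall-clock time of one forward+backward+update step."""
    model = nn.Sequential(
        nn.Embedding(vocab_size, hidden_dim),
        nn.Linear(hidden_dim, 4 * hidden_dim),   # MLP expansion
        nn.GELU(),
        nn.Linear(4 * hidden_dim, vocab_size),   # large output projection
    )
    optimizer = optim.Adam(learning_rate=1e-3)

    def loss_fn(model, inputs, targets):
        return nn.losses.cross_entropy(model(inputs), targets).mean()

    loss_and_grad = nn.value_and_grad(model, loss_fn)

    start = time.perf_counter()
    for _ in range(num_steps):
        inputs = mx.random.randint(0, vocab_size, (batch_size, seq_len))
        targets = mx.random.randint(0, vocab_size, (batch_size, seq_len))
        loss, grads = loss_and_grad(model, inputs, targets)
        optimizer.update(model, grads)
        mx.eval(model.parameters(), optimizer.state)   # force MLX's lazy evaluation
    return (time.perf_counter() - start) / num_steps

print(f"average training step: {benchmark_training():.3f}s")
```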

## Quick Start

### Install Dependencies
```bash
pip install -r requirements.txt
```

### Run Training-Focused Optimization
```bash
python ../../openevolve-run.py initial_program.py evaluator.py --config config.yaml --iterations 200
```

### Resume from Checkpoint
```bash
# If interrupted, resume with:
python ../../openevolve-run.py initial_program.py evaluator.py --config config.yaml --checkpoint ./openevolve_output/mlx_training_optimization_db/checkpoints/checkpoint_XX --iterations 100
```

## Expected Results

The training-focused approach should discover optimizations providing:

- 📈 **Training Speedup**: 10-25% faster training steps
- 🎯 **Consistent Optimization**: Better signal-to-noise ratio for evolution
- 🔧 **Architecture-Aware**: M1/M2/M3/M4-specific optimizations
- **Memory Efficient**: Optimized for training's memory pressure

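
To relate those percentages to what the evaluator measures, the arithmetic is simply a ratio of step times. The numbers below are made-up examples, not measured results:

```python
# Converting measured step times into the speedup figures quoted above (example values).
baseline_step_time = 0.500    # seconds per training step with stock MLX kernels
optimized_step_time = 0.425   # seconds per training step with evolved kernels

speedup = baseline_step_time / optimized_step_time   # ~1.18x
percent_faster = (speedup - 1.0) * 100               # ~17.6%, inside the 10-25% range
```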

## Evolution Discoveries

Based on training characteristics and Apple Silicon architecture, expect OpenEvolve to discover:

🧠 **Training Workload Classification**:
```python
is_batch_heavy = (M > 256)                            # Large batch dimension
is_mlp = (aspect_ratio_K > 1.5)                       # MLP 4x expansion patterns
is_gradient_computation = transpose_pattern_detected  # Backward pass
```

🔧 **Apple Silicon Training Optimization**:
```python
if "M4" in chip and is_batch_heavy:
    base_tile = 128; vector_align = 32   # Large tiles for AMX units
    memory_scale = 1.5                   # Training can use more memory
elif is_mlp and training_workload:
    k_bias = 1.3                         # Favor the K dimension for MLP patterns
```

**Training Memory Patterns**:
```python
# Optimize for training's repeated execution
if total_elements > 1_000_000 and is_training:
    scale = 1.1                                # Larger tiles for substantial computation
    cache_optimization = "training_friendly"   # Consider gradient storage
```

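Assembled into one function, a hand-written baseline combining these ideas might look like the sketch below. It is purely illustrative: the thresholds, tile sizes, and the `device_info` keys (such as `"chip"`) are assumptions, and the function OpenEvolve actually evolves can differ substantially.

```python
def choose_tile_size(M, N, K, device_info):
    """Baseline heuristic combining the patterns above (not the evolved result)."""
    chip = device_info.get("chip", "")
    newer_chip = ("M3" in chip) or ("M4" in chip)
    vector_align = 32 if newer_chip else 16
    base_tile = 128 if newer_chip else 64

    is_batch_heavy = M > 256                  # large batch*seq dimension
    is_mlp = max(N, K) >= 4 * min(N, K)       # 4x expansion/projection shapes

    tile_M = base_tile * 2 if is_batch_heavy else base_tile
    tile_N = base_tile
    tile_K = int(base_tile * 1.3) if is_mlp else base_tile

    def clamp(tile, dim):
        # Round down to the vector width, but stay within the matrix dimension
        aligned = max(vector_align, (tile // vector_align) * vector_align)
        return min(aligned, dim)

    return clamp(tile_M, M), clamp(tile_N, N), clamp(tile_K, K)

# Example: the MLP expansion matmul from a batch of 32 sequences of length 512
print(choose_tile_size(32 * 512, 4096, 1024, {"chip": "Apple M4"}))
```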

## Integration with Training Workflows

Once optimized, integrate with any MLX training code:

```python
import mlx.core as mx
from optimized_kernels import enable_training_optimizations

# Enable OpenEvolve training optimizations
enable_training_optimizations("./openevolve_output/best/best_program.py")

# Your existing training code gets automatic speedups!
for epoch in range(num_epochs):
    for batch in dataloader:
        loss, grads = mx.value_and_grad(loss_fn)(model, batch)
        optimizer.update(model, grads)  # Now faster!
```

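If you prefer to call the evolved functions explicitly rather than through the `enable_training_optimizations` helper, one minimal way to load them is sketched below. This assumes the evolved `best_program.py` exposes `choose_tile_size` and `optimized_matmul` at module level:

```python
import importlib.util

def load_evolved_kernels(path="./openevolve_output/best/best_program.py"):
    """Import the evolved program as a module and return its two key functions."""
    spec = importlib.util.spec_from_file_location("evolved_kernels", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.choose_tile_size, module.optimized_matmul

# Hypothetical usage in your own training code:
# choose_tile_size, optimized_matmul = load_evolved_kernels()
# tile_M, tile_N, tile_K = choose_tile_size(M, N, K, device_info)
# C = optimized_matmul(A, B, tile_M, tile_N, tile_K)
```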

## Comparison: Training vs Inference Optimization

| **Inference Optimization** | **Training Optimization** |
|----------------------------|---------------------------|
| ❌ Noisy evaluation (model loading, text generation) | ✅ Clean evaluation (deterministic computation) |
| ❌ Small matrices (batch=1-4) | ✅ Large matrices (batch=16-32) |
| ❌ Variable workloads | ✅ Consistent patterns |
| ❌ Complex pipeline overhead | ✅ Direct matrix operation focus |
| ❌ Difficult signal extraction | ✅ Clear optimization signal |

## Research Impact

This training-focused approach demonstrates:

1. **Practical AI Acceleration**: Directly optimizing a key bottleneck of model development
2. **Hardware-Software Co-Design**: Training-specific optimizations for Apple Silicon
3. **Clear Evaluation Methodology**: Robust metrics for evolutionary optimization
4. **Real-World Application**: Immediate benefits for ML researchers and practitioners

This moves from proof of concept to **production-ready training acceleration** that ML practitioners can benefit from immediately.
