Commit 1dd7f4e

1 parent 8c6aaf6 commit 1dd7f4e
7 files changed, +2061 -0 lines changed
Lines changed: 189 additions & 0 deletions
# 🎯 Qwen3-0.6B Custom GQA Attention Optimization

**Evolving custom Grouped Query Attention kernels using MLX primitives for Qwen3-0.6B on Apple M4**

This example demonstrates AlphaEvolve's kernel optimization approach by implementing and evolving custom GQA attention computation using MLX primitives, targeting the specific 40:8 query-to-KV head pattern in Qwen3-0.6B.

## 🔄 **Updated Approach: Custom Kernel Implementation**

### **Why We Changed Strategy:**

**Previous Approach (High-level orchestration):**
- ❌ Only optimized around `mx.fast.scaled_dot_product_attention`
- ❌ Limited optimization opportunities
- ❌ Multiple EVOLVE-BLOCKs (a violation of the OpenEvolve format)

**Current Approach (Custom kernel implementation):**
- ✅ **Custom GQA implementation** using MLX primitives
- ✅ **Real optimization opportunities** at the computation level
- ✅ **Single EVOLVE-BLOCK** containing the core attention computation
- ✅ **Follows the AlphaEvolve methodology** of optimizing actual kernels
## 🎯 **Optimization Target**

- **Model**: mlx-community/Qwen3-0.6B-bf16
- **Architecture**: 40 query heads : 8 key/value heads (5:1 GQA ratio)
- **Hardware**: Apple M4, 24 GB unified memory
- **Baseline Performance**: 70.3 tokens/sec average decode speed
- **Goal**: 80+ tokens/sec (14%+ improvement)
## 🔧 **Custom GQA Implementation**

### **Core Evolution Area (Single EVOLVE-BLOCK):**

```python
def __call__(self, x, mask=None, cache=None):
    # Standard preprocessing...
    queries = self.q_proj(x)  # [B, L, 40*128]
    keys = self.k_proj(x)     # [B, L, 8*128]
    values = self.v_proj(x)   # [B, L, 8*128]
    # (reshape/transpose to [B, n_heads, L, 128], RoPE, and KV-cache handling elided)

    # EVOLVE-BLOCK-START
    # Custom GQA attention implementation using MLX primitives.
    # This replaces mx.fast.scaled_dot_product_attention entirely.

    # Current baseline: manual broadcasting + standard computation
    keys_expanded = mx.repeat(keys, self.gqa_ratio, axis=1)      # [B, 40, L, 128]
    values_expanded = mx.repeat(values, self.gqa_ratio, axis=1)  # [B, 40, L, 128]

    scores = mx.matmul(queries, keys_expanded.transpose(0, 1, 3, 2)) * self.scale
    attn_weights = mx.softmax(scores, axis=-1, precise=True)
    output = mx.matmul(attn_weights, values_expanded)

    # EVOLUTION OPPORTUNITIES:
    # 1. Better GQA broadcasting strategies (chunked computation)
    # 2. Fused operations (combined matmul + softmax)
    # 3. Memory layout optimization for Apple Silicon
    # 4. Optimized causal masking
    # EVOLVE-BLOCK-END
```
## 🚀 **Key Optimization Opportunities**

### **1. GQA Broadcasting Strategies:**
```python
# Current: Explicit broadcasting with mx.repeat
keys_expanded = mx.repeat(keys, 5, axis=1)  # Creates 5x memory usage

# Evolution options:
# - Chunked computation (process 5 query heads per KV head)
# - On-demand broadcasting (avoid materialized copies)
# - Strided access patterns (direct indexing)
```
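As a concrete illustration of the "on-demand broadcasting" option, here is a minimal sketch that builds the expanded KV tensors from a broadcast view instead of `mx.repeat`. The helper name `expand_kv` is illustrative, and whether it actually saves bandwidth depends on how MLX materializes the final reshape:

```python
import mlx.core as mx

def expand_kv(kv: mx.array, gqa_ratio: int) -> mx.array:
    """Broadcast [B, n_kv, L, D] -> [B, n_kv * gqa_ratio, L, D] without mx.repeat.

    Keeps the same head grouping as the repeat-based baseline
    (query heads 5*i .. 5*i+4 share KV head i).
    """
    B, n_kv, L, D = kv.shape
    expanded = mx.broadcast_to(mx.expand_dims(kv, 2), (B, n_kv, gqa_ratio, L, D))
    return mx.reshape(expanded, (B, n_kv * gqa_ratio, L, D))
```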
### **2. Computation Fusion:**
```python
# Current: Separate operations
scores = mx.matmul(queries, keys_t) * scale
weights = mx.softmax(scores)
output = mx.matmul(weights, values)

# Evolution: Fused operations to reduce memory transfers
```
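One low-risk way to experiment with fusion, assuming the evolved block stays in pure MLX ops, is to wrap the attention math in `mx.compile` so MLX can merge the graph into fewer Metal kernel launches. This is a sketch of the idea (causal masking omitted for brevity), not the evolved kernel itself:

```python
import mlx.core as mx

@mx.compile
def fused_attention(queries, keys_expanded, values_expanded, scale):
    # Inside a compiled function, MLX can fuse the scaling, softmax, and
    # surrounding elementwise work into fewer kernel launches.
    # The Python float `scale` is treated as a compile-time constant.
    scores = mx.matmul(queries, mx.swapaxes(keys_expanded, -1, -2)) * scale
    weights = mx.softmax(scores, axis=-1, precise=True)
    return mx.matmul(weights, values_expanded)
```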
### **3. Apple Silicon Optimizations:**
- bfloat16 native operations
- Unified memory bandwidth optimization (see the measurement sketch below)
- Cache-friendly memory access patterns
- SIMD-friendly computation layouts
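The memory side of these optimizations can be observed directly while iterating. A minimal sketch using MLX's Metal memory counters; treat the exact `mx.metal.*` helper names as an assumption to verify against the installed MLX version:

```python
import mlx.core as mx

def peak_memory_mb(fn, *args):
    """Report peak Metal memory (MB) while evaluating fn(*args)."""
    mx.metal.reset_peak_memory()
    mx.eval(fn(*args))  # force the lazy computation
    return mx.metal.get_peak_memory() / 1024 ** 2
```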
## 📊 **Baseline vs Custom Implementation**

From the M4 benchmark runs:
```
Baseline Performance (mx.fast.scaled_dot_product_attention):
- Average decode: 70.3 tokens/sec
- Range: 65.0 - 80.7 tokens/sec
- Memory: 1.24-1.69 GB
- Context degradation: ~7%

Custom Implementation Target:
- Average decode: 80+ tokens/sec (14%+ improvement)
- Better memory efficiency
- Improved context scaling
- Maintained numerical accuracy
```
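The decode-speed figures above come from full generation runs; the attention kernel itself can also be timed in isolation before launching those. A minimal timing sketch on random bfloat16 tensors, with shapes assumed from the 40:8 configuration (these numbers will not match end-to-end tokens/sec):

```python
import time
import mlx.core as mx

B, N_Q, N_KV, L, D = 1, 40, 8, 512, 128
q = mx.random.normal((B, N_Q, L, D)).astype(mx.bfloat16)
k = mx.random.normal((B, N_KV, L, D)).astype(mx.bfloat16)
v = mx.random.normal((B, N_KV, L, D)).astype(mx.bfloat16)

def gqa_baseline(q, k, v, scale=D ** -0.5, gqa_ratio=N_Q // N_KV):
    # Baseline: explicit KV broadcast + standard attention (no mask, timing only)
    k_e = mx.repeat(k, gqa_ratio, axis=1)
    v_e = mx.repeat(v, gqa_ratio, axis=1)
    scores = mx.matmul(q, mx.swapaxes(k_e, -1, -2)) * scale
    return mx.matmul(mx.softmax(scores, axis=-1, precise=True), v_e)

mx.eval(gqa_baseline(q, k, v))  # warm-up (MLX evaluates lazily)
start = time.perf_counter()
for _ in range(20):
    mx.eval(gqa_baseline(q, k, v))
print(f"{(time.perf_counter() - start) / 20 * 1e3:.2f} ms per attention call")
```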
## 🧪 **Evaluation System**

### **Comprehensive Testing:**
1. **Correctness Verification**: Custom implementation produces results identical to the baseline (see the sketch below)
2. **Performance Benchmarking**: Real text generation on 5 key scenarios
3. **Memory Efficiency**: Track memory usage vs. the baseline
4. **Context Scaling**: Test performance across different sequence lengths
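A minimal sketch of the kind of correctness check the evaluator can run. The name `custom_gqa_attention` stands in for the evolved kernel (same [B, heads, L, 128] layout as above); `mx.fast.scaled_dot_product_attention` handles the 40:8 GQA broadcast natively and serves as the reference. Tolerances are loose because of bfloat16:

```python
import mlx.core as mx

def check_correctness(custom_gqa_attention, atol=1e-2, rtol=1e-2) -> bool:
    """Compare a candidate GQA kernel against MLX's fused reference."""
    B, N_Q, N_KV, L, D = 1, 40, 8, 256, 128
    q = mx.random.normal((B, N_Q, L, D)).astype(mx.bfloat16)
    k = mx.random.normal((B, N_KV, L, D)).astype(mx.bfloat16)
    v = mx.random.normal((B, N_KV, L, D)).astype(mx.bfloat16)
    scale = D ** -0.5

    ref = mx.fast.scaled_dot_product_attention(q, k, v, scale=scale)
    out = custom_gqa_attention(q, k, v, scale)
    return mx.allclose(out.astype(mx.float32), ref.astype(mx.float32),
                       atol=atol, rtol=rtol).item()
```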
### **Success Metrics:**
- **Primary**: Average decode speed improvement (70.3 → 80+ tokens/sec)
- **Secondary**: Memory efficiency, context scaling
- **Critical**: Numerical correctness maintained
## 🚀 **Usage**

### **1. Test Initial Custom Implementation**
```bash
cd /Users/asankhaya/Documents/GitHub/openevolve/examples/mlx_metal_kernel_opt
python initial_program.py  # Test custom GQA implementation
```

### **2. Run Evaluator Test**
```bash
python evaluator.py  # Test evaluation system
```

### **3. Start Evolution**
```bash
cd /Users/asankhaya/Documents/GitHub/openevolve
python main.py --config examples/mlx_metal_kernel_opt/config.yaml
```
## 📈 **Expected Evolution Trajectory**

### **Generations 1-10: Broadcasting Optimizations**
- Chunked GQA computation strategies
- Memory-efficient broadcasting alternatives
- Target: 70.3 → 73-75 tokens/sec

### **Generations 11-20: Computation Fusion**
- Fused matmul + softmax operations
- Optimized causal masking integration
- Target: 75 → 78-82 tokens/sec

### **Generations 21-30: Apple Silicon Specialization**
- bfloat16 optimization
- Unified memory access patterns
- Advanced tensor layout optimization
- Target: 80+ tokens/sec (14%+ improvement)
## 🔍 **Key Advantages of the Custom Implementation**

### **Real Optimization Potential:**
- **Kernel-level optimizations** using MLX primitives
- **GQA-specific strategies** for the 40:8 pattern
- **Apple Silicon specialization** for the M4 architecture
- **Measurable improvements** on real workloads

### **Realistic Scope:**
- Uses MLX's optimized primitives (not raw Metal)
- Maintains compatibility with the mlx-lm ecosystem
- Achievable 14% improvement target
- Working baseline implementation

### **Evolution-Friendly:**
- Single EVOLVE-BLOCK containing the core computation
- Clear optimization opportunities
- Concrete performance targets
- Systematic testing framework
## 💡 **Why This Approach Will Work**

1. **Real baseline**: 70.3 tokens/sec from actual M4 measurements
2. **Custom implementation**: Full control over the GQA computation
3. **MLX primitives**: Optimized building blocks, not raw Metal
4. **Specific target**: Qwen3's exact 40:8 pattern, not generic attention
5. **Proven methodology**: Following AlphaEvolve's kernel optimization approach

This approach should evolve meaningful, measurable improvements for Qwen3-0.6B's specific GQA pattern while maintaining compatibility and correctness.

---

**🎯 Ready for custom kernel evolution!**
Lines changed: 162 additions & 0 deletions
# Qwen3-0.6B Custom GQA Attention Optimization Configuration
# Target: Evolve custom GQA implementation using MLX primitives
# Baseline: 70.3 tokens/sec average decode speed
# Goal: 80+ tokens/sec through custom kernel evolution

max_iterations: 30
checkpoint_interval: 5
log_level: "INFO"

# LLM configuration - proven models for kernel optimization
llm:
  primary_model: "gemini-2.5-flash-preview-05-20"
  primary_model_weight: 0.7
  secondary_model: "gemini-2.5-pro-preview-06-05"
  secondary_model_weight: 0.3
  api_base: "https://generativelanguage.googleapis.com/v1beta/openai/"
  temperature: 0.7
  top_p: 0.9
  max_tokens: 32000
  timeout: 300
# Focused prompt for custom GQA kernel evolution
prompt:
  system_message: |
    You are an expert in optimizing attention kernels using MLX primitives for Apple Silicon.

    # SPECIFIC TARGET: Custom GQA Attention Kernel Evolution
    # CURRENT PERFORMANCE: 70.3 tokens/sec average decode speed
    # GOAL: 80+ tokens/sec (14%+ improvement) through kernel-level optimizations
    # HARDWARE: Apple M4 24GB unified memory

    # ARCHITECTURE DETAILS:
    - Qwen3-0.6B: 40 query heads : 8 key/value heads (5:1 GQA ratio)
    - Head dimension: 128, Hidden size: 5120
    - Sequence lengths: 128-2048 tokens, Precision: bfloat16

    # CURRENT CUSTOM IMPLEMENTATION (Baseline to Evolve):
    ```python
    # Manual GQA broadcasting approach (can be optimized)
    keys_expanded = mx.repeat(keys, self.gqa_ratio, axis=1)      # [B, 40, L, 128]
    values_expanded = mx.repeat(values, self.gqa_ratio, axis=1)  # [B, 40, L, 128]

    # Standard attention computation (room for optimization)
    scores = mx.matmul(queries, keys_expanded.transpose(0, 1, 3, 2)) * self.scale
    attn_weights = mx.softmax(scores, axis=-1, precise=True)
    output = mx.matmul(attn_weights, values_expanded)
    ```
    # KEY OPTIMIZATION OPPORTUNITIES:

    **1. GQA Broadcasting Strategies:**
    Current: `mx.repeat` creates explicit copies of the KV tensors
    Alternatives:
    - Chunked computation: Process the 5 query heads per KV head separately
    - On-demand broadcasting: Avoid materialized copies
    - Strided access patterns: Direct indexing instead of repeat
    - Memory-efficient reshaping: Better tensor layouts

    **2. Computation Fusion:**
    Current: Separate matmul → softmax → matmul operations
    Opportunities:
    - Fused attention kernels using mx.fast primitives
    - Combined operations to reduce memory transfers
    - Optimized scaling and masking integration

    **3. Memory Access Optimization:**
    Apple Silicon unified memory allows specific optimizations:
    - Coalesced memory access for the 40-head query tensor
    - Cache-friendly KV head access patterns
    - Reduced intermediate tensor allocations
    - Better transpose operation ordering

    **4. Apple Silicon Specific Optimizations:**
    - bfloat16 native operations
    - Metal Performance Shaders integration
    - Unified memory bandwidth optimization
    - SIMD-friendly computation patterns

    **5. Sequence Length Scaling:**
    Current performance degrades with longer contexts.
    Opportunities:
    - Better attention computation chunking
    - Optimized causal mask application
    - Memory-efficient large sequence handling

    # EVOLUTION CONSTRAINTS:
    1. ONLY modify code inside the single EVOLVE-BLOCK-START/END section
    2. Use MLX primitives: mx.matmul, mx.softmax, mx.repeat, mx.where, etc.
    3. Maintain numerical correctness (same output as the baseline)
    4. Keep tensor shapes compatible: input [B, 40, L, 128] → output [B, 40, L, 128]
    5. Support causal masking for autoregressive generation
    # SPECIFIC EVOLUTION STRATEGIES TO EXPLORE:

    **Strategy 1: Chunked GQA Computation**
    Instead of broadcasting the KV tensors, process the 5 query heads that share each KV head together
    (this preserves the baseline's head-to-KV-head mapping):
    ```python
    outputs = []
    for i in range(8):  # one iteration per KV head
        q_chunk = queries[:, i * self.gqa_ratio:(i + 1) * self.gqa_ratio]  # [B, 5, L, 128]
        k_chunk = keys[:, i:i + 1]    # [B, 1, L, 128], broadcasts over the 5 query heads
        v_chunk = values[:, i:i + 1]  # [B, 1, L, 128]
        scores = mx.matmul(q_chunk, k_chunk.transpose(0, 1, 3, 2)) * self.scale
        attn_weights = mx.softmax(scores, axis=-1)
        outputs.append(mx.matmul(attn_weights, v_chunk))
    output = mx.concatenate(outputs, axis=1)  # [B, 40, L, 128]
    ```
    **Strategy 2: Optimized Broadcasting**
    Use a broadcast view plus reshape instead of explicit `mx.repeat` copies:
    ```python
    # More memory-efficient broadcasting (same head grouping as mx.repeat)
    B, n_kv, L, D = keys.shape
    keys_expanded = mx.broadcast_to(
        mx.expand_dims(keys, 2), (B, n_kv, self.gqa_ratio, L, D)
    ).reshape(B, n_kv * self.gqa_ratio, L, D)
    ```
    **Strategy 3: Fused Operations**
    Combine multiple operations to reduce memory transfers:
    ```python
    # Fused scaled dot-product attention using mx.fast primitives
    # This might leverage optimized Metal kernels
    ```
    **Strategy 4: Memory Layout Optimization**
    Optimize tensor layouts for Apple Silicon:
    ```python
    # Ensure contiguous memory layouts
    # Optimize transpose operations
    # Reduce intermediate allocations
    ```
    # SUCCESS METRICS (from benchmark suite):
    - Average decode speed: 70.3 → 80+ tokens/sec (14%+ improvement)
    - Memory efficiency: maintain <2GB usage
    - Scaling: reduce performance drop with longer contexts
    - Correctness: identical outputs to baseline implementation

    Focus on CONCRETE kernel optimizations using MLX primitives.
    Test different GQA computation strategies systematically.
    Prioritize memory bandwidth efficiency and computation fusion.
  num_top_programs: 4
  num_diverse_programs: 2

# Database configuration
database:
  db_path: "./openevolve_output/qwen3_custom_gqa"
  population_size: 25
  archive_size: 12
  num_islands: 2
  elite_selection_ratio: 0.25
  exploitation_ratio: 0.7
  exploration_ratio: 0.3

# Evaluator configuration
evaluator:
  timeout: 300  # 5 minutes per evaluation
  parallel_evaluations: 1

# Evolution settings
diff_based_evolution: true
allow_full_rewrites: false
max_code_length: 50000
