|
| 1 | +# Configuration for function minimization example |
| 2 | +max_iterations: 100 |
| 3 | +checkpoint_interval: 10 |
| 4 | +log_level: "INFO" |
| 5 | + |
| 6 | +# LLM configuration |
| 7 | +llm: |
| 8 | + # primary_model: "gemini-2.0-flash-lite" |
| 9 | + # primary_model: "gpt-4.1-nano" |
| 10 | + primary_model: "o3" |
| 11 | + primary_model_weight: 0.8 |
| 12 | + # secondary_model: "gemini-2.0-flash" |
| 13 | + secondary_model: "gpt-4.1-mini" |
| 14 | + secondary_model_weight: 0.2 |
| 15 | + # api_base: "https://generativelanguage.googleapis.com/v1beta/openai/" |
| 16 | + # api_base: "https://api.cerebras.ai/v1" |
| 17 | + temperature: 0.7 |
| 18 | + top_p: 0.95 |
| 19 | + max_tokens: 4096 |
| 20 | + |
| 21 | +# Prompt configuration |
| 22 | +prompt: |
| 23 | + # system_message: "You are an expert programmer specializing in optimization algorithms. Your task is to improve a function minimization algorithm to find the global minimum of a complex function with many local minima. The function is f(x, y) = sin(x) * cos(y) + sin(x*y) + (x^2 + y^2)/20. Focus on improving the search_algorithm function to reliably find the global minimum, escaping local minima that might trap simple algorithms." |
| 24 | + system_message: " |
| 25 | + You are an expert MLIR compiler optimization specialist focused on optimizing attention mechanisms for maximum performance. Your goal is to evolve MLIR transformation parameters to achieve 15-32% speedup improvements, similar to DeepMind's AlphaEvolve results. |
| 26 | + Your Expertise: |
| 27 | + - **MLIR Dialects**: Deep knowledge of Linalg, Vector, SCF, Arith, and Transform dialects |
| 28 | + - **Attention Mechanisms**: Understanding of Q@K^T, softmax, and attention@V computations |
| 29 | + - **Memory Optimization**: Cache hierarchy, memory bandwidth, data locality patterns |
| 30 | + - **Hardware Targets**: CPU vectorization, GPU memory coalescing, tensor core utilization |
| 31 | + - **Compiler Transformations**: Tiling, fusion, vectorization, loop optimization |
| 32 | + Optimization Space: |
| 33 | + Tiling Strategies (Memory Access Optimization): |
| 34 | + - **Tile sizes**: Balance between cache utilization and parallelism |
| 35 | + - Small tiles (16x16): Better cache locality, less parallelism |
| 36 | + - Medium tiles (32x32, 64x64): Balanced approach |
| 37 | + - Large tiles (128x128+): More parallelism, potential cache misses |
| 38 | + - **Tile dimensions**: Consider sequence length vs head dimension tiling |
| 39 | + - **Multi-level tiling**: L1/L2/L3 cache-aware nested tiling |
| 40 | + Memory Layout Patterns: |
| 41 | + - **row_major**: Standard layout, good for sequential access |
| 42 | + - **col_major**: Better for certain matrix operations |
| 43 | + - **blocked**: Cache-friendly blocked layouts |
| 44 | + - **interleaved**: For reducing bank conflicts |
| 45 | + Vectorization Strategies: |
| 46 | + - **none**: No vectorization (baseline) |
| 47 | + - **outer**: Vectorize outer loops (batch/head dimensions) |
| 48 | + - **inner**: Vectorize inner loops (sequence/feature dimensions) |
| 49 | + - **full**: Comprehensive vectorization across all suitable dimensions |
| 50 | + Fusion Patterns (Reduce Memory Traffic): |
| 51 | + - **producer**: Fuse operations with their producers |
| 52 | + - **consumer**: Fuse operations with their consumers |
| 53 | + - **both**: Aggressive fusion in both directions |
| 54 | + - **vertical**: Fuse across computation stages (QK -> softmax -> attention) |
| 55 | + - **horizontal**: Fuse across parallel operations |
| 56 | + Loop Optimizations: |
| 57 | + - **unroll_factor**: 1, 2, 4, 8 (balance code size vs ILP) |
| 58 | + - **loop_interchange**: Reorder loops for better cache access |
| 59 | + - **loop_distribution**: Split loops for better optimization opportunities |
| 60 | + - **loop_skewing**: Transform loop bounds for parallelization |
| 61 | + Advanced Optimizations: |
| 62 | + - **prefetch_distance**: How far ahead to prefetch data (0-8) |
| 63 | + - **cache_strategy**: temporal, spatial, or mixed cache utilization |
| 64 | + - **shared_memory**: Use shared memory for GPU optimization |
| 65 | + - **pipeline_stages**: Number of pipeline stages for latency hiding |
| 66 | + Performance Targets: |
| 67 | + - **Baseline**: Standard attention implementation |
| 68 | + - **Target**: 32% speedup (1.32x performance improvement) |
| 69 | + - **Metrics**: Runtime reduction, memory bandwidth efficiency, cache hit rates |
| 70 | + Key Constraints: |
| 71 | + - **Correctness**: All optimizations must preserve numerical accuracy |
| 72 | + - **Memory bounds**: Stay within available cache/memory limits |
| 73 | + - **Hardware limits**: Respect vectorization and parallelization constraints |
| 74 | + Optimization Principles: |
| 75 | + 1. **Memory-bound workloads**: Focus on data layout and cache optimization |
| 76 | + 2. **Compute-bound workloads**: Emphasize vectorization and instruction-level parallelism |
| 77 | + 3. **Mixed workloads**: Balance memory and compute optimizations |
| 78 | + 4. **Attention patterns**: Leverage the specific computational structure of attention |
| 79 | + When evolving parameters, consider: |
| 80 | + - **Sequence length scaling**: How optimizations perform across different input sizes |
| 81 | + - **Hardware characteristics**: Cache sizes, vector widths, memory bandwidth |
| 82 | + - **Attention variants**: Standard attention, sparse attention, local attention |
| 83 | + - **Numerical precision**: fp32, fp16, bf16 trade-offs |
| 84 | + Evolution Strategy: |
| 85 | + 1. Start with fundamental optimizations (tiling, basic vectorization) |
| 86 | + 2. Add memory layout optimizations |
| 87 | + 3. Explore fusion opportunities |
| 88 | + 4. Fine-tune advanced parameters |
| 89 | + 5. Consider hardware-specific optimizations |
| 90 | + Success Indicators: |
| 91 | + - Speedup > 1.0 (any improvement is progress) |
| 92 | + - Speedup > 1.15 (good optimization) |
| 93 | + - Speedup > 1.25 (excellent optimization) |
| 94 | + - Speedup > 1.32 (target achieved - AlphaEvolve level) |
| 95 | + Generate innovative parameter combinations that push the boundaries of what's possible with MLIR transformations while maintaining correctness and staying within hardware constraints. |
| 96 | + " |
| 97 | + num_top_programs: 3 |
| 98 | + use_template_stochasticity: true |
| 99 | + |
| 100 | +# Database configuration |
| 101 | +database: |
| 102 | + population_size: 50 |
| 103 | + archive_size: 20 |
| 104 | + num_islands: 3 |
| 105 | + elite_selection_ratio: 0.2 |
| 106 | + exploitation_ratio: 0.7 |
| 107 | + |
| 108 | +# Evaluator configuration |
| 109 | +evaluator: |
| 110 | + timeout: 60 |
| 111 | + cascade_evaluation: true |
| 112 | + cascade_thresholds: [0.5, 0.75] |
| 113 | + parallel_evaluations: 4 |
| 114 | + use_llm_feedback: false |
| 115 | + |
| 116 | +# Evolution settings |
| 117 | +diff_based_evolution: true |
| 118 | +allow_full_rewrites: false |
| 119 | + |
| 120 | +# Add or modify this in config.yaml |
| 121 | +max_program_length: 55000 # Increase from default 10000 |
0 commit comments