
Commit bbb437b

Merge branch 'main' into pr/181
2 parents: 6bb16a1 + 5a5bd3b


58 files changed (+6758, -1132 lines)


examples/attention_optimization/README.md

Lines changed: 541 additions & 0 deletions
Large diffs are not rendered by default.

Lines changed: 121 additions & 0 deletions

@@ -0,0 +1,121 @@
# Configuration for the MLIR attention optimization example
max_iterations: 100
checkpoint_interval: 10
log_level: "INFO"

# LLM configuration
llm:
  # primary_model: "gemini-2.0-flash-lite"
  # primary_model: "gpt-4.1-nano"
  primary_model: "o3"
  primary_model_weight: 0.8
  # secondary_model: "gemini-2.0-flash"
  secondary_model: "gpt-4.1-mini"
  secondary_model_weight: 0.2
  # api_base: "https://generativelanguage.googleapis.com/v1beta/openai/"
  # api_base: "https://api.cerebras.ai/v1"
  temperature: 0.7
  top_p: 0.95
  max_tokens: 4096

# Prompt configuration
prompt:
  # system_message: "You are an expert programmer specializing in optimization algorithms. Your task is to improve a function minimization algorithm to find the global minimum of a complex function with many local minima. The function is f(x, y) = sin(x) * cos(y) + sin(x*y) + (x^2 + y^2)/20. Focus on improving the search_algorithm function to reliably find the global minimum, escaping local minima that might trap simple algorithms."
  system_message: "
    You are an expert MLIR compiler optimization specialist focused on optimizing attention mechanisms for maximum performance. Your goal is to evolve MLIR transformation parameters to achieve 15-32% speedup improvements, similar to DeepMind's AlphaEvolve results.
    Your Expertise:
    - **MLIR Dialects**: Deep knowledge of Linalg, Vector, SCF, Arith, and Transform dialects
    - **Attention Mechanisms**: Understanding of Q@K^T, softmax, and attention@V computations
    - **Memory Optimization**: Cache hierarchy, memory bandwidth, data locality patterns
    - **Hardware Targets**: CPU vectorization, GPU memory coalescing, tensor core utilization
    - **Compiler Transformations**: Tiling, fusion, vectorization, loop optimization
    Optimization Space:
    Tiling Strategies (Memory Access Optimization):
    - **Tile sizes**: Balance between cache utilization and parallelism
    - Small tiles (16x16): Better cache locality, less parallelism
    - Medium tiles (32x32, 64x64): Balanced approach
    - Large tiles (128x128+): More parallelism, potential cache misses
    - **Tile dimensions**: Consider sequence length vs head dimension tiling
    - **Multi-level tiling**: L1/L2/L3 cache-aware nested tiling
    Memory Layout Patterns:
    - **row_major**: Standard layout, good for sequential access
    - **col_major**: Better for certain matrix operations
    - **blocked**: Cache-friendly blocked layouts
    - **interleaved**: For reducing bank conflicts
    Vectorization Strategies:
    - **none**: No vectorization (baseline)
    - **outer**: Vectorize outer loops (batch/head dimensions)
    - **inner**: Vectorize inner loops (sequence/feature dimensions)
    - **full**: Comprehensive vectorization across all suitable dimensions
    Fusion Patterns (Reduce Memory Traffic):
    - **producer**: Fuse operations with their producers
    - **consumer**: Fuse operations with their consumers
    - **both**: Aggressive fusion in both directions
    - **vertical**: Fuse across computation stages (QK -> softmax -> attention)
    - **horizontal**: Fuse across parallel operations
    Loop Optimizations:
    - **unroll_factor**: 1, 2, 4, 8 (balance code size vs ILP)
    - **loop_interchange**: Reorder loops for better cache access
    - **loop_distribution**: Split loops for better optimization opportunities
    - **loop_skewing**: Transform loop bounds for parallelization
    Advanced Optimizations:
    - **prefetch_distance**: How far ahead to prefetch data (0-8)
    - **cache_strategy**: temporal, spatial, or mixed cache utilization
    - **shared_memory**: Use shared memory for GPU optimization
    - **pipeline_stages**: Number of pipeline stages for latency hiding
    Performance Targets:
    - **Baseline**: Standard attention implementation
    - **Target**: 32% speedup (1.32x performance improvement)
    - **Metrics**: Runtime reduction, memory bandwidth efficiency, cache hit rates
    Key Constraints:
    - **Correctness**: All optimizations must preserve numerical accuracy
    - **Memory bounds**: Stay within available cache/memory limits
    - **Hardware limits**: Respect vectorization and parallelization constraints
    Optimization Principles:
    1. **Memory-bound workloads**: Focus on data layout and cache optimization
    2. **Compute-bound workloads**: Emphasize vectorization and instruction-level parallelism
    3. **Mixed workloads**: Balance memory and compute optimizations
    4. **Attention patterns**: Leverage the specific computational structure of attention
    When evolving parameters, consider:
    - **Sequence length scaling**: How optimizations perform across different input sizes
    - **Hardware characteristics**: Cache sizes, vector widths, memory bandwidth
    - **Attention variants**: Standard attention, sparse attention, local attention
    - **Numerical precision**: fp32, fp16, bf16 trade-offs
    Evolution Strategy:
    1. Start with fundamental optimizations (tiling, basic vectorization)
    2. Add memory layout optimizations
    3. Explore fusion opportunities
    4. Fine-tune advanced parameters
    5. Consider hardware-specific optimizations
    Success Indicators:
    - Speedup > 1.0 (any improvement is progress)
    - Speedup > 1.15 (good optimization)
    - Speedup > 1.25 (excellent optimization)
    - Speedup > 1.32 (target achieved - AlphaEvolve level)
    Generate innovative parameter combinations that push the boundaries of what's possible with MLIR transformations while maintaining correctness and staying within hardware constraints.
    "
  num_top_programs: 3
  use_template_stochasticity: true

# Database configuration
database:
  population_size: 50
  archive_size: 20
  num_islands: 3
  elite_selection_ratio: 0.2
  exploitation_ratio: 0.7

# Evaluator configuration
evaluator:
  timeout: 60
  cascade_evaluation: true
  cascade_thresholds: [0.5, 0.75]
  parallel_evaluations: 4
  use_llm_feedback: false

# Evolution settings
diff_based_evolution: true
allow_full_rewrites: false

# Add or modify this in config.yaml
max_program_length: 55000  # Increase from default 10000
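
The optimization space that the system prompt above describes can be summarized in code. The following Python sketch is illustrative only: the dictionary keys and the classify_speedup helper are hypothetical names mirroring the prompt's parameter lists and its Success Indicators, not part of the committed files or of OpenEvolve's API.

# Illustrative sketch of the search space and success thresholds described in
# the system prompt above. All names here are hypothetical, not OpenEvolve API.
import random

SEARCH_SPACE = {
    "tile_size": [16, 32, 64, 128],            # small = locality, large = parallelism
    "memory_layout": ["row_major", "col_major", "blocked", "interleaved"],
    "vectorization": ["none", "outer", "inner", "full"],
    "fusion": ["producer", "consumer", "both", "vertical", "horizontal"],
    "unroll_factor": [1, 2, 4, 8],
    "prefetch_distance": list(range(0, 9)),    # 0-8 as stated in the prompt
}

def sample_candidate(rng=random):
    """Draw one random parameter combination from the space."""
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}

def classify_speedup(speedup):
    """Map a measured speedup to the prompt's success indicators."""
    if speedup > 1.32:
        return "target achieved (AlphaEvolve level)"
    if speedup > 1.25:
        return "excellent optimization"
    if speedup > 1.15:
        return "good optimization"
    if speedup > 1.0:
        return "improvement"
    return "no improvement"

print(sample_candidate())
print(classify_speedup(1.27))  # -> "excellent optimization"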

Lines changed: 50 additions & 0 deletions

@@ -0,0 +1,50 @@
# OpenEvolve configuration for MLIR attention optimization

# LLM configuration
llm:
  primary_model: "gpt-4.1-nano"
  # secondary_models: ["gpt-4.1-mini"]
  temperature: 0.7
  max_tokens: 2048

# Evolution parameters
evolution:
  max_iterations: 500
  population_size: 50
  mutation_rate: 0.15
  crossover_rate: 0.8
  selection_strategy: "tournament"
  tournament_size: 5

# Database configuration
database:
  population_size: 100
  num_islands: 3
  migration_rate: 0.1

# Evaluation settings
evaluation:
  timeout_seconds: 120
  max_retries: 3
  parallel_evaluations: 4

# Checkpoint settings
checkpoints:
  enabled: true
  interval: 10
  keep_best: true
  save_all_programs: false

# Optimization targets
optimization:
  target_metric: "speedup"
  target_value: 1.32  # 32% speedup like AlphaEvolve paper
  minimize: false
  convergence_threshold: 0.001
  early_stopping_patience: 50

# Logging
logging:
  level: "INFO"
  save_logs: true
  verbose: true
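
For quick inspection, the minimal sketch below loads a config like the ones above with PyYAML and prints a few fields. It is a hedged example: the config.yaml path is an assumption, and OpenEvolve's own config loader and schema validation may differ from this raw-YAML view.

# Minimal sketch, assuming one of the YAML files above is saved as config.yaml.
# This only inspects the raw YAML; OpenEvolve's own config loader may differ.
import yaml  # pip install pyyaml

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

llm = cfg.get("llm", {})
print("primary model:", llm.get("primary_model"))
print("temperature:", llm.get("temperature"))

# Report the speedup target, if present (e.g. 1.32 -> 32% improvement).
target = cfg.get("optimization", {}).get("target_value")
if target is not None:
    print(f"target speedup: {target:.2f}x ({(target - 1) * 100:.0f}% improvement)")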
