
Commit 77c2f26

Refine AlgoTune configs and optimization hints

Updated LLM model weights to include 'google/gemini-2.5-pro' with adjusted weights across all tasks. Reduced verbosity and streamlined optimization hints in config prompts for clarity. Adjusted parallel_evaluations to 4 for most tasks (except polynomial_real, set to 1 to avoid JAX conflicts) and increased evaluator timeout for polynomial_real. Updated initial_program.py files to clarify and reorganize optimization opportunities.
1 parent 2af2b34 commit 77c2f26

File tree

11 files changed: +55 -340 lines changed


examples/algotune/affine_transform_2d/config.yaml

Lines changed: 5 additions & 61 deletions
@@ -14,7 +14,9 @@ llm:
   api_base: "https://openrouter.ai/api/v1"
   models:
     - name: "google/gemini-2.5-flash"
-      weight: 1.0
+      weight: 0.8
+    - name: "google/gemini-2.5-pro"
+      weight: 0.2
 
   temperature: 0.4 # Optimal (better than 0.2, 0.6, 0.8)
   max_tokens: 128000 # Increased from 16000 for much richer context
@@ -68,16 +70,10 @@ prompt:
 
     Focus on improving the solve method to correctly handle the input format and produce valid solutions efficiently. Your solution will be compared against the reference AlgoTune baseline implementation to measure speedup and correctness.
 
-
-
-
-
     PERFORMANCE OPTIMIZATION OPPORTUNITIES:
     You have access to high-performance libraries that can provide significant speedups:
 
    • **JAX** - JIT compilation for numerical computations
-      Key insight: Functions should be defined outside classes for JIT compatibility
-      For jnp.roots(), consider using strip_zeros=False in JIT contexts
 
    • **Numba** - Alternative JIT compilation, often simpler to use
 
@@ -86,59 +82,7 @@ prompt:
 
    • **Vectorization** - Look for opportunities to replace loops with array operations
 
-    EXPLORATION STRATEGY:
-    1. Profile to identify bottlenecks first
-    2. Consider multiple optimization approaches for the same problem
-    3. Try both library-specific optimizations and algorithmic improvements
-    4. Test different numerical libraries to find the best fit
-
-
-    PROBLEM-SPECIFIC OPTIMIZATION HINTS:
-    2D affine transformations - PROVEN OPTIMIZATIONS (2.3x speedup achieved):
-
-    **INTERPOLATION ORDER REDUCTION** (Most Effective - 30-40% speedup):
-    • Use order=1 (linear) instead of order=3 (cubic) for scipy.ndimage.affine_transform
-    • Linear interpolation is often sufficient for most transformations
-    • Code: scipy.ndimage.affine_transform(image, matrix, order=1, mode="constant")
-    • The accuracy loss is minimal for most image transformations
-
-    **PRECISION OPTIMIZATION** (20-30% speedup):
-    • Convert images to float32 instead of float64
-    • Code: image_float32 = image.astype(np.float32)
-    • This leverages faster SIMD operations and reduces memory bandwidth
-    • Combine with order=1 for maximum benefit
-
-    **APPLE SILICON M4 OPTIMIZATIONS** (5-10% additional speedup):
-    • Use C-contiguous arrays for image processing
-    • Code: image = np.ascontiguousarray(image.astype(np.float32))
-    • Detect with: platform.processor() == 'arm' and platform.system() == 'Darwin'
-    • Apple's Accelerate framework optimizes spline interpolation for these layouts
-
-    **COMPLETE OPTIMIZED EXAMPLE**:
-    ```python
-    import platform
-    IS_APPLE_SILICON = (platform.processor() == 'arm' and platform.system() == 'Darwin')
-
-    # Convert to float32 for speed
-    image_float32 = image.astype(np.float32)
-    matrix_float32 = matrix.astype(np.float32)
-
-    if IS_APPLE_SILICON:
-        image_float32 = np.ascontiguousarray(image_float32)
-        matrix_float32 = np.ascontiguousarray(matrix_float32)
-
-    # Use order=1 (linear) instead of order=3 (cubic)
-    transformed = scipy.ndimage.affine_transform(
-        image_float32, matrix_float32, order=1, mode="constant"
-    )
-    ```
-
-    **AVOID**:
-    • Complex JIT compilation (JAX/Numba) - overhead exceeds benefits for this task
-    • OpenCV - adds dependency without consistent performance gain
-    • Order=3 (cubic) interpolation unless accuracy is critical
-
-  num_top_programs: 10 # Increased from 3-5 for richer learning context
+  num_top_programs: 5 # Increased from 3-5 for richer learning context
   num_diverse_programs: 5 # Increased from 2 for more diverse exploration
   include_artifacts: true # +20.7% improvement
 
@@ -170,7 +114,7 @@ evaluator:
   cascade_thresholds: [0.5, 0.8]
 
   # Parallel evaluations
-  parallel_evaluations: 1
+  parallel_evaluations: 4
 
 # AlgoTune task-specific configuration
 algotune:

examples/algotune/affine_transform_2d/initial_program.py

Lines changed: 5 additions & 4 deletions
@@ -39,14 +39,15 @@
 
 OPTIMIZATION OPPORTUNITIES:
 Consider these algorithmic improvements for significant performance gains:
+- Lower-order interpolation: Try order=0 (nearest) or order=1 (linear) vs default order=3 (cubic)
+  Linear interpolation (order=1) often provides best speed/quality balance with major speedups
+- Precision optimization: float32 often sufficient vs float64, especially with lower interpolation orders
 - Separable transforms: Check if the transformation can be decomposed into separate x and y operations
 - Cache-friendly memory access patterns: Process data in blocks to improve cache utilization
-- Pre-computed interpolation coefficients: For repeated similar transformations
-- Direct coordinate mapping: Avoid intermediate coordinate calculations for simple transforms
 - JIT compilation: Use JAX or Numba for numerical operations that are Python-bottlenecked
-- Batch processing: Process multiple images or regions simultaneously for amortized overhead
-- Alternative interpolation methods: Lower-order interpolation for speed vs quality tradeoffs
+- Direct coordinate mapping: Avoid intermediate coordinate calculations for simple transforms
 - Hardware optimizations: Leverage SIMD instructions through vectorized operations
+- Batch processing: Process multiple images or regions simultaneously for amortized overhead
 
 This is the initial implementation that will be evolved by OpenEvolve.
 The solve method will be improved through evolution.
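
Taken together, the interpolation-order and precision items above amount to a few lines in the solve path. A minimal sketch, assuming the task passes a problem dict with "image" and "matrix" entries and expects a dict result (both assumptions for illustration, not confirmed by this diff):

```python
import numpy as np
import scipy.ndimage


def solve(problem):
    # float32 halves memory traffic vs float64; C-contiguous layout per the hints
    image = np.ascontiguousarray(problem["image"], dtype=np.float32)
    matrix = np.asarray(problem["matrix"], dtype=np.float32)
    # order=1 (linear) skips the cubic spline prefilter that dominates order=3
    transformed = scipy.ndimage.affine_transform(
        image, matrix, order=1, mode="constant"
    )
    return {"transformed_image": transformed.tolist()}
```

order=1 trades a small amount of smoothness for a large cut in spline work, which is why the earlier config hints reported it as the single most effective change.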

examples/algotune/convolve2d_full_fill/config.yaml

Lines changed: 5 additions & 23 deletions
@@ -14,7 +14,9 @@ llm:
   api_base: "https://openrouter.ai/api/v1"
   models:
     - name: "google/gemini-2.5-flash"
-      weight: 1.0
+      weight: 0.8
+    - name: "google/gemini-2.5-pro"
+      weight: 0.2
 
   temperature: 0.4 # Optimal (better than 0.2, 0.6, 0.8)
   max_tokens: 128000 # Increased from 16000 for much richer context
@@ -70,17 +72,11 @@ prompt:
     The output is a 2D array representing the convolution result.
 
     Focus on improving the solve method to correctly handle the input format and produce valid solutions efficiently. Your solution will be compared against the reference AlgoTune baseline implementation to measure speedup and correctness.
-
-
-
-
 
     PERFORMANCE OPTIMIZATION OPPORTUNITIES:
     You have access to high-performance libraries that can provide significant speedups:
 
    • **JAX** - JIT compilation for numerical computations
-      Key insight: Functions should be defined outside classes for JIT compatibility
-      For jnp.roots(), consider using strip_zeros=False in JIT contexts
 
    • **Numba** - Alternative JIT compilation, often simpler to use
 
@@ -89,21 +85,7 @@ prompt:
 
    • **Vectorization** - Look for opportunities to replace loops with array operations
 
-    EXPLORATION STRATEGY:
-    1. Profile to identify bottlenecks first
-    2. Consider multiple optimization approaches for the same problem
-    3. Try both library-specific optimizations and algorithmic improvements
-    4. Test different numerical libraries to find the best fit
-
-
-    PROBLEM-SPECIFIC OPTIMIZATION HINTS:
-    This task involves 2D convolution in 'full' mode - consider:
-    • FFT-based convolution algorithms (O(n log n) vs O(n²))
-    • scipy.signal functions may have optimized implementations
-    • JAX also has FFT operations if JIT compilation benefits outweigh library optimizations
-    • Memory layout and padding strategies can impact performance
-
-  num_top_programs: 10 # Increased from 3-5 for richer learning context
+  num_top_programs: 5 # Increased from 3-5 for richer learning context
   num_diverse_programs: 5 # Increased from 2 for more diverse exploration
   include_artifacts: true # +20.7% improvement
 
@@ -135,7 +117,7 @@ evaluator:
   cascade_thresholds: [0.5, 0.8]
 
   # Parallel evaluations
-  parallel_evaluations: 1
+  parallel_evaluations: 4
 
 # AlgoTune task-specific configuration
 algotune:

examples/algotune/convolve2d_full_fill/initial_program.py

Lines changed: 1 addition & 1 deletion
@@ -37,7 +37,7 @@
 
 OPTIMIZATION OPPORTUNITIES:
 Consider these algorithmic improvements for massive performance gains:
-- FFT-based convolution: Use scipy.signal.fftconvolve for O(N²log N) complexity vs O(N⁴) direct convolution
+- Alternative convolution algorithms: Consider different approaches with varying computational complexity
 - Overlap-add/overlap-save methods: For extremely large inputs that don't fit in memory
 - Separable kernels: If the kernel can be decomposed into 1D convolutions (rank-1 factorization)
 - Winograd convolution: For small kernels (3x3, 5x5) with fewer multiplications

examples/algotune/eigenvectors_complex/config.yaml

Lines changed: 5 additions & 62 deletions
@@ -14,7 +14,9 @@ llm:
   api_base: "https://openrouter.ai/api/v1"
   models:
     - name: "google/gemini-2.5-flash"
-      weight: 1.0
+      weight: 0.8
+    - name: "google/gemini-2.5-pro"
+      weight: 0.2
 
   temperature: 0.4 # Optimal (better than 0.2, 0.6, 0.8)
   max_tokens: 128000 # Increased from 16000 for much richer context
@@ -76,17 +78,11 @@ prompt:
     - eigenvectors is an array of n eigenvectors, each of length n, representing the eigenvector corresponding to the eigenvalue at the same index.
 
     Focus on improving the solve method to correctly handle the input format and produce valid solutions efficiently. Your solution will be compared against the reference AlgoTune baseline implementation to measure speedup and correctness.
-
-
-
-
 
     PERFORMANCE OPTIMIZATION OPPORTUNITIES:
     You have access to high-performance libraries that can provide significant speedups:
 
    • **JAX** - JIT compilation for numerical computations
-      Key insight: Functions should be defined outside classes for JIT compatibility
-      For jnp.roots(), consider using strip_zeros=False in JIT contexts
 
    • **Numba** - Alternative JIT compilation, often simpler to use
 
@@ -95,60 +91,7 @@ prompt:
 
    • **Vectorization** - Look for opportunities to replace loops with array operations
 
-    EXPLORATION STRATEGY:
-    1. Profile to identify bottlenecks first
-    2. Consider multiple optimization approaches for the same problem
-    3. Try both library-specific optimizations and algorithmic improvements
-    4. Test different numerical libraries to find the best fit
-
-
-    PROBLEM-SPECIFIC OPTIMIZATION HINTS:
-    Computing eigenvectors of complex matrices - PROVEN OPTIMIZATIONS (1.4x speedup achieved):
-
-    **KEY INSIGHT**: The input matrix is REAL (not complex), but the original algorithm treats it as complex.
-    Post-processing (sorting/normalization) can be heavily optimized.
-
-    **VECTORIZED POST-PROCESSING** (Most Effective - 35% speedup):
-    • Use numpy.argsort instead of Python's sort for eigenvalue ordering
-    • Vectorize normalization using broadcasting instead of loops
-    • Use advanced indexing to avoid memory copies
-
-    **OPTIMIZED IMPLEMENTATION**:
-    ```python
-    # Use numpy.linalg.eig (faster than scipy for small/medium matrices)
-    eigenvalues, eigenvectors = np.linalg.eig(A)
-
-    # VECTORIZED SORTING: Use numpy.lexsort (much faster than Python sort)
-    sort_indices = np.lexsort((-eigenvalues.imag, -eigenvalues.real))
-    sorted_eigenvectors = eigenvectors[:, sort_indices]  # No copying
-
-    # VECTORIZED NORMALIZATION: All columns at once
-    norms = np.linalg.norm(sorted_eigenvectors, axis=0)
-    valid_mask = norms > 1e-12
-    sorted_eigenvectors[:, valid_mask] /= norms[valid_mask]
-
-    # EFFICIENT CONVERSION: Use .T.tolist() instead of Python loops
-    return sorted_eigenvectors.T.tolist()
-    ```
-
-    **MEMORY LAYOUT OPTIMIZATION** (5-10% additional on M4):
-    • Use C-contiguous arrays for numpy.linalg.eig
-    • Code: A = np.ascontiguousarray(A.astype(np.float64))
-    • Detect Apple Silicon: platform.processor() == 'arm' and platform.system() == 'Darwin'
-
-    **KEY OPTIMIZATIONS**:
-    • Replace Python loops with numpy vectorized operations
-    • Eliminate list() and zip() operations in sorting
-    • Use advanced indexing instead of creating copies
-    • Stay in numpy throughout, convert to list only at the end
-
-    **AVOID**:
-    • Python sorting with lambda functions - extremely slow
-    • eigenvectors.T - creates unnecessary matrix copy
-    • Loop-based normalization - vectorize instead
-    • scipy.linalg.eig for small matrices - has more overhead than numpy
-
-  num_top_programs: 10 # Increased from 3-5 for richer learning context
+  num_top_programs: 5 # Increased from 3-5 for richer learning context
   num_diverse_programs: 5 # Increased from 2 for more diverse exploration
   include_artifacts: true # +20.7% improvement
 
@@ -180,7 +123,7 @@ evaluator:
   cascade_thresholds: [0.5, 0.8]
 
   # Parallel evaluations
-  parallel_evaluations: 1
+  parallel_evaluations: 4
 
 # AlgoTune task-specific configuration
 algotune:
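
The deleted hint block above contained a complete vectorized post-processing recipe built on real numpy APIs (np.linalg.eig, np.lexsort). A condensed, self-contained version for reference; the function name and the list-of-lists return format are assumptions carried over from the hint text:

```python
import numpy as np


def solve_eigenvectors(A):
    # np.linalg.eig returns complex arrays even for a real input matrix
    eigenvalues, eigenvectors = np.linalg.eig(np.asarray(A))
    # lexsort treats the last key as primary: descending real, then imaginary
    order = np.lexsort((-eigenvalues.imag, -eigenvalues.real))
    vecs = eigenvectors[:, order]
    # Normalize every column in one broadcast step instead of a Python loop
    norms = np.linalg.norm(vecs, axis=0)
    vecs = vecs / np.where(norms > 1e-12, norms, 1.0)
    # Convert to Python lists only at the very end
    return vecs.T.tolist()
```

The whole pipeline stays in numpy until the final conversion, which is the point the removed hints kept emphasizing.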

examples/algotune/fft_cmplx_scipy_fftpack/config.yaml

Lines changed: 6 additions & 64 deletions
@@ -14,7 +14,9 @@ llm:
   api_base: "https://openrouter.ai/api/v1"
   models:
     - name: "google/gemini-2.5-flash"
-      weight: 1.0
+      weight: 0.8
+    - name: "google/gemini-2.5-pro"
+      weight: 0.2
 
   temperature: 0.4 # Optimal (better than 0.2, 0.6, 0.8)
   max_tokens: 128000 # Increased from 16000 for much richer context
@@ -66,22 +68,16 @@ prompt:
 
     This task requires computing the N-dimensional Fast Fourier Transform (FFT) of a complex-valued matrix.
     The FFT is a mathematical technique that converts data from the spatial (or time) domain into the frequency domain, revealing both the magnitude and phase of the frequency components.
-    The input is a square matrix of size n×n, where each element is a complex number containing both real and imaginary parts.
+    The input is a square matrix of size nxn, where each element is a complex number containing both real and imaginary parts.
     The output is a square matrix of the same size, where each entry is a complex number representing a specific frequency component of the input data, including its amplitude and phase.
     This transformation is crucial in analyzing signals and data with inherent complex properties.
 
     Focus on improving the solve method to correctly handle the input format and produce valid solutions efficiently. Your solution will be compared against the reference AlgoTune baseline implementation to measure speedup and correctness.
-
-
-
-
 
     PERFORMANCE OPTIMIZATION OPPORTUNITIES:
     You have access to high-performance libraries that can provide significant speedups:
 
    • **JAX** - JIT compilation for numerical computations
-      Key insight: Functions should be defined outside classes for JIT compatibility
-      For jnp.roots(), consider using strip_zeros=False in JIT contexts
 
    • **Numba** - Alternative JIT compilation, often simpler to use
 
@@ -90,61 +86,7 @@ prompt:
 
    • **Vectorization** - Look for opportunities to replace loops with array operations
 
-    EXPLORATION STRATEGY:
-    1. Profile to identify bottlenecks first
-    2. Consider multiple optimization approaches for the same problem
-    3. Try both library-specific optimizations and algorithmic improvements
-    4. Test different numerical libraries to find the best fit
-
-
-    PROBLEM-SPECIFIC OPTIMIZATION HINTS:
-    Complex 2D FFT operations - PROVEN OPTIMIZATIONS (1.2x speedup achieved):
-
-    **COMPLEX PRECISION REDUCTION** (Most Effective - 10-20% speedup):
-    • Use complex64 instead of complex128 for FFT computation
-    • Code: problem_64 = problem_array.astype(np.complex64)
-    • Then: result = scipy.fftpack.fftn(problem_64)
-    • Convert back to complex128 after computation for compatibility
-    • This reduces memory bandwidth and leverages faster SIMD operations
-
-    **MEMORY LAYOUT OPTIMIZATION FOR M4** (5-10% additional speedup):
-    • Use Fortran-ordered arrays for optimal FFTPACK performance
-    • Code: problem_opt = np.asfortranarray(problem.astype(np.complex64))
-    • Detect Apple Silicon: platform.processor() == 'arm' and platform.system() == 'Darwin'
-    • FFTPACK internally uses Fortran routines that benefit from this layout
-
-    **COMPLETE OPTIMIZED EXAMPLE**:
-    ```python
-    import platform
-    import scipy.fftpack as fftpack
-
-    IS_APPLE_SILICON = (platform.processor() == 'arm' and platform.system() == 'Darwin')
-
-    # Convert to complex64 for speed
-    problem_64 = np.array(problem, dtype=np.complex64)
-
-    if IS_APPLE_SILICON:
-        # Fortran layout for optimal FFTPACK performance
-        problem_64 = np.asfortranarray(problem_64)
-
-    # Perform FFT with reduced precision
-    result_64 = fftpack.fftn(problem_64)
-
-    # Convert back to complex128 for precision/compatibility
-    result = result_64.astype(np.complex128)
-    ```
-
-    **IMPORTANT NOTES**:
-    • scipy.fftpack.fftn is already highly optimized - focus on precision/layout
-    • numpy.fft.fftn is typically slower than scipy.fftpack for this task
-    • The tolerance in is_solution allows for complex64 precision (1e-5)
-
-    **AVOID**:
-    • JAX/Numba JIT - process overhead exceeds FFT benefits
-    • numpy.fft instead of scipy.fftpack - consistently slower
-    • Complex128 throughout - unnecessary precision for most FFT applications
-
-  num_top_programs: 10 # Increased from 3-5 for richer learning context
+  num_top_programs: 5 # Increased from 3-5 for richer learning context
   num_diverse_programs: 5 # Increased from 2 for more diverse exploration
   include_artifacts: true # +20.7% improvement
 
@@ -176,7 +118,7 @@ evaluator:
   cascade_thresholds: [0.5, 0.8]
 
   # Parallel evaluations
-  parallel_evaluations: 1
+  parallel_evaluations: 4
 
 # AlgoTune task-specific configuration
 algotune:
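
The deleted FFT hints centered on one concrete trick: run scipy.fftpack.fftn in complex64 and widen afterwards. A compact sketch of just that path, with the bare-array problem input assumed for illustration:

```python
import numpy as np
import scipy.fftpack as fftpack


def solve_fft(problem):
    # complex64 halves memory bandwidth; Fortran order suits FFTPACK's routines
    problem_64 = np.asfortranarray(np.asarray(problem, dtype=np.complex64))
    result_64 = fftpack.fftn(problem_64)
    # Widen back so downstream comparisons see the full-precision dtype
    return result_64.astype(np.complex128)
```

Per the removed notes, the task's tolerance (1e-5) is loose enough that the complex64 round trip stays within the accepted error.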
