Replies: 5 comments
-
Thanks for raising this interesting point! Actually, OpenEvolve already implements a similar mechanism through its artifacts side-channel system. When programs fail with errors, the system captures detailed error information and passes it to the LLM in the next iteration, effectively achieving what you're suggesting.

### How It Works

**1. Error Capture in Evaluators**

When evaluation fails, evaluators can return detailed error information as artifacts. Here's an example from the circle packing example's evaluator:

```python
except Exception as e:
    error_msg = f"Evaluation failed completely: {str(e)}"
    traceback.print_exc()
    return EvaluationResult(
        metrics={
            "sum_radii": 0.0,
            # ... other metrics set to 0
        },
        artifacts={
            "stderr": error_msg,
            "traceback": traceback.format_exc(),
            "failure_stage": "program_execution",
            "suggestion": "Check for syntax errors, import issues, or runtime exceptions",
        },
    )
```

**2. Artifacts in the Next Iteration's Prompt**

These artifacts are automatically included in the next generation's prompt:

```python
# Format artifacts section if enabled and available
artifacts_section = ""
if self.config.include_artifacts and program_artifacts:
    artifacts_section = self._render_artifacts(program_artifacts)
```

Each artifact key then becomes its own section in the prompt the LLM sees.
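As a rough illustration only (the section headings and error text below are approximate, not copied from the actual prompt templates), the rendered section might look like:

```text
## Last Execution Output

### stderr
Evaluation failed completely: name 'radius' is not defined

### traceback
Traceback (most recent call last):
  ...
NameError: name 'radius' is not defined

### suggestion
Check for syntax errors, import issues, or runtime exceptions
```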
**3. Real-World Example**

Generation N: the program fails because of overlapping circles:

```python
centers[5] = [0.3, 0.3]
centers[6] = [0.3, 0.3]  # Same position - error!
```

The overlap details are captured as artifacts. In Generation N+1, the LLM sees those details and fixes the issue:

```python
centers[5] = [0.3, 0.3]
centers[6] = [0.5, 0.3]  # Different position - fixed!
```

### Configuration

Artifacts are enabled by default. To configure them:

```yaml
# In config.yaml
include_artifacts: true
max_artifact_bytes: 10000
```

Or via environment variable:

```bash
export ENABLE_ARTIFACTS=true
```

### Benefits Over Additional Error-Fixing Iteration

The current approach has several advantages over a dedicated fix-it iteration; among other things, it maintains the evolution flow rather than adding a special-case loop.
### For Your Use Case

If you're seeing many syntax errors with Gemini 2.5 Pro and Claude Opus, you might want to enhance your evaluator to return richer error details through artifacts.
The system already provides the error feedback mechanism you're looking for - it just needs to be leveraged through proper evaluator implementation! Let me know if you'd like examples of how to enhance your evaluator to provide better error feedback through artifacts.
-
Thank you for your detailed response. I am already using artifacts in my evaluator. Is `include_artifacts` enabled by default? And is there any special logic/handling for the keys in the artifacts dict (`stderr`, `traceback`, etc.)?

I wouldn't say I'm seeing many errors - maybe 5% of iterations (rough estimate with high variance). But when I see an iteration fail because the generated code refers to a nonexistent variable or function, it feels like I would rather not include that version in the archive at all: either give the LLM a chance to fix the uninteresting mistake(s), or "skip" that version entirely. I can see how that would be bad if the system was struggling to generate any good versions; I can see the importance of error message feedback in that case. But that's not what I'm seeing.

Your "Maintain evolution flow" point feels off to me, but I'm not sure. Say I'm trying to optimize cars, and in one test a car gets a nail in its tire and completely fails. That doesn't seem like an interesting test to include. Or maybe a better analogy: the test driver shows up drunk and crashes, through no particular fault of that car design.
-
Yes, `include_artifacts` defaults to true, so it is enabled by default. The artifact keys (`stderr`, `traceback`, etc.) are just convention, with no special handling - the system simply renders whatever key-value pairs you return in the artifacts dict as markdown sections in the next prompt.

I understand your point about the 5% failure rate feeling wasteful. Your car analogy is apt; these are more like "assembly line errors" than meaningful design failures. A potential solution could be a config option like:

```yaml
retry_on_syntax_error: true
```

This would:

- Detect trivial errors (undefined variables, import errors, syntax errors)
- Make a single retry attempt with the error feedback before storing in the database
- Only store the fixed version if successful, otherwise store with a low score as usual

A rough sketch of that control flow is shown below. This approach would preserve the evolution flow while reducing noise from trivial mistakes. The key would be distinguishing uninteresting errors (typos, missing imports) from interesting ones (algorithm does not converge, wrong approach). Would something like that better match what you are looking for?
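As a rough illustration only - the helper names (`evaluate_with_retry`, `generate_fix`) and the use of `failure_stage` as the trivial-failure signal are assumptions for this sketch, not OpenEvolve's existing API:

```python
def evaluate_with_retry(program, evaluate, generate_fix, retry_on_trivial_error=True):
    """Evaluate a candidate; on a trivial failure, give the LLM one chance to
    repair it before anything is stored in the database.

    `evaluate(program)` is assumed to return an object with `.metrics` and
    `.artifacts` dicts; `generate_fix(program, error)` asks the LLM to repair
    the program given the captured error text.
    """
    result = evaluate(program)

    # Treat execution-stage failures (syntax errors, undefined names, bad
    # imports) as "trivial": the candidate design was never really tested.
    failed_trivially = result.artifacts.get("failure_stage") == "program_execution"

    if retry_on_trivial_error and failed_trivially:
        error_text = result.artifacts.get("traceback") or result.artifacts.get("stderr", "")
        fixed_program = generate_fix(program, error_text)
        fixed_result = evaluate(fixed_program)
        if fixed_result.artifacts.get("failure_stage") != "program_execution":
            # The repair worked: store the fixed version instead of the broken one.
            return fixed_program, fixed_result

    # Otherwise store the original result as usual (possibly with a low score).
    return program, result
```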
-
Your idea sounds promising. A suggestion: look for a special entry in the artifacts dict that flags the failure as worth retrying, so the evaluator can signal when a retry makes sense (see the sketch below).

It seems like there's actually a natural connection here to using LLMs as the mutator (in contrast to traditional evolutionary optimization): it's well known that LLMs just mysteriously fail every once in a while.
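A minimal sketch of that evaluator-side flag, assuming a hypothetical `retry_requested` key (the key name and the `run_and_score` helper are illustrative, not an existing convention):

```python
import traceback

def evaluate(program_path):
    try:
        return run_and_score(program_path)  # normal evaluation (assumed helper)
    except (SyntaxError, NameError, ImportError) as e:
        # An "assembly line" failure: the design never really got tested,
        # so ask the controller for one repair attempt before archiving.
        return EvaluationResult(
            metrics={"combined_score": 0.0},
            artifacts={
                "stderr": str(e),
                "traceback": traceback.format_exc(),
                "retry_requested": True,  # hypothetical key checked by the retry logic
            },
        )
```

The retry logic from the previous comment could then check `artifacts.get("retry_requested")` instead of a global `retry_on_syntax_error` flag.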
-
I implemented a basic version of this idea. If you're interested, I can add some docs and make a proper PR. Are there other code paths that should have this logic? In my experience it seems quite effective with just one retry. It saves a good handful of iterations from polluting the database with uninteresting failures.
-
I'm not sure how common this is, but in the experiments I've been doing (relatively big program, mostly using Gemini 2.5 Pro and Claude 4 Opus), in most iterations the LLM gives back a program that at least works. But non-functional programs are somewhat common too (using a nonexistent API, syntax errors). As I understand it, the common practice in this case is to give the iteration some made-up low `combined_score`. That's OK, but it seems a bit of a waste. I wonder if anyone else thinks it might be useful to build in an additional special-case iteration with the LLM, where a prompt is generated that says something like "You gave me this code XXX, but it failed with this error YYY. Try to fix the error without significant logic changes".
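For instance (purely illustrative - the template wording just follows the sentence above, and the function name is a placeholder), such a fix-it prompt could be built like this:

```python
FIX_PROMPT_TEMPLATE = """\
You gave me this code:

{code}

but it failed with this error:

{error}

Try to fix the error without significant logic changes.
"""

def build_fix_prompt(code: str, error: str) -> str:
    # Fill in the failing program and the captured error text.
    return FIX_PROMPT_TEMPLATE.format(code=code, error=error)
```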