Replies: 5 comments
-
Thanks for raising this interesting point! Actually, OpenEvolve already implements a similar mechanism through its artifacts side-channel system. When programs fail with errors, the system captures detailed error information and passes it to the LLM in the next iteration, effectively achieving what you're suggesting.

### How It Works

**1. Error Capture in Evaluators**

When evaluation fails, evaluators can return detailed error information as artifacts. Here's an example from the circle packing example's evaluator:

```python
except Exception as e:
    error_msg = f"Evaluation failed completely: {str(e)}"
    traceback.print_exc()
    return EvaluationResult(
        metrics={
            "sum_radii": 0.0,
            # ... other metrics set to 0
        },
        artifacts={
            "stderr": error_msg,
            "traceback": traceback.format_exc(),
            "failure_stage": "program_execution",
            "suggestion": "Check for syntax errors, import issues, or runtime exceptions",
        },
    )
```

**2. Artifacts in the Next Iteration's Prompt**

These artifacts are automatically included in the next generation's prompt:

```python
# Format artifacts section if enabled and available
artifacts_section = ""
if self.config.include_artifacts and program_artifacts:
    artifacts_section = self._render_artifacts(program_artifacts)
```

Each artifact key then becomes its own section in the prompt the LLM sees.
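As a rough illustration only (the section headings and error text below are approximate, not copied from the actual prompt templates), the rendered section might look like:

```text
## Last Execution Output

### stderr
Evaluation failed completely: name 'radius' is not defined

### traceback
Traceback (most recent call last):
  ...
NameError: name 'radius' is not defined

### suggestion
Check for syntax errors, import issues, or runtime exceptions
```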
**3. Real-World Example**

Generation N: the program fails because of overlapping circles:

```python
centers[5] = [0.3, 0.3]
centers[6] = [0.3, 0.3]  # Same position - error!
```

The overlap details are captured as artifacts. In Generation N+1, the LLM sees those details and fixes the issue:

```python
centers[5] = [0.3, 0.3]
centers[6] = [0.5, 0.3]  # Different position - fixed!
```

### Configuration

Artifacts are enabled by default. To configure them:

```yaml
# In config.yaml
include_artifacts: true
max_artifact_bytes: 10000
```

Or via environment variable:

```bash
export ENABLE_ARTIFACTS=true
```

### Benefits Over Additional Error-Fixing Iteration

The current approach has several advantages over a dedicated fix-it iteration; among other things, it maintains the evolution flow rather than adding a special-case loop.
### For Your Use Case

If you're seeing many syntax errors with Gemini 2.5 Pro and Claude Opus, you might want to enhance your evaluator to return richer error details through artifacts.
The system already provides the error feedback mechanism you're looking for - it just needs to be leveraged through proper evaluator implementation! Let me know if you'd like examples of how to enhance your evaluator to provide better error feedback through artifacts.
-
Thank you for your detailed response. I am already using artifacts in my evaluator. Is `include_artifacts` enabled by default? And is there any special logic/handling for the keys in the artifacts dict (`stderr`, `traceback`, etc.)?

I wouldn't say I'm seeing many errors - maybe 5% of iterations (rough estimate with high variance). But when I see an iteration fail because the generated code refers to a nonexistent variable or function, it feels like I would rather not include that version in the archive at all: either give the LLM a chance to fix the uninteresting mistake(s), or "skip" that version entirely. I can see how that would be bad if the system was struggling to generate any good versions; I can see the importance of error message feedback in that case. But that's not what I'm seeing.

Your "Maintain evolution flow" point feels off to me, but I'm not sure. Say I'm trying to optimize cars, and in one test a car gets a nail in its tire and completely fails. That doesn't seem like an interesting test to include. Or maybe a better analogy: the test driver shows up drunk and crashes, through no particular fault of that car design.
-
Yes, `include_artifacts` defaults to true, so it is enabled by default. The artifact keys (`stderr`, `traceback`, etc.) are just convention, with no special handling - the system simply renders whatever key-value pairs you return in the artifacts dict as markdown sections in the next prompt.

I understand your point about the 5% failure rate feeling wasteful. Your car analogy is apt; these are more like "assembly line errors" than meaningful design failures. A potential solution could be a config option like:

```yaml
retry_on_syntax_error: true
```

This would:

- Detect trivial errors (undefined variables, import errors, syntax errors)
- Make a single retry attempt with the error feedback before storing in the database
- Only store the fixed version if successful, otherwise store with a low score as usual

A rough sketch of that control flow is shown below. This approach would preserve the evolution flow while reducing noise from trivial mistakes. The key would be distinguishing uninteresting errors (typos, missing imports) from interesting ones (algorithm does not converge, wrong approach). Would something like that better match what you are looking for?
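As a rough illustration only - the helper names (`evaluate_with_retry`, `generate_fix`) and the use of `failure_stage` as the trivial-failure signal are assumptions for this sketch, not OpenEvolve's existing API:

```python
def evaluate_with_retry(program, evaluate, generate_fix, retry_on_trivial_error=True):
    """Evaluate a candidate; on a trivial failure, give the LLM one chance to
    repair it before anything is stored in the database.

    `evaluate(program)` is assumed to return an object with `.metrics` and
    `.artifacts` dicts; `generate_fix(program, error)` asks the LLM to repair
    the program given the captured error text.
    """
    result = evaluate(program)

    # Treat execution-stage failures (syntax errors, undefined names, bad
    # imports) as "trivial": the candidate design was never really tested.
    failed_trivially = result.artifacts.get("failure_stage") == "program_execution"

    if retry_on_trivial_error and failed_trivially:
        error_text = result.artifacts.get("traceback") or result.artifacts.get("stderr", "")
        fixed_program = generate_fix(program, error_text)
        fixed_result = evaluate(fixed_program)
        if fixed_result.artifacts.get("failure_stage") != "program_execution":
            # The repair worked: store the fixed version instead of the broken one.
            return fixed_program, fixed_result

    # Otherwise store the original result as usual (possibly with a low score).
    return program, result
```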
-
Your idea sounds promising. A suggestion: look for a special entry in the artifacts dict that flags the failure as worth retrying, so the evaluator can signal when a retry makes sense (see the sketch below).

It seems like there's actually a natural connection here to using LLMs as the mutator (in contrast to traditional evolutionary optimization): it's well known that LLMs just mysteriously fail every once in a while.
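A minimal sketch of that evaluator-side flag, assuming a hypothetical `retry_requested` key (the key name and the `run_and_score` helper are illustrative, not an existing convention):

```python
import traceback

def evaluate(program_path):
    try:
        return run_and_score(program_path)  # normal evaluation (assumed helper)
    except (SyntaxError, NameError, ImportError) as e:
        # An "assembly line" failure: the design never really got tested,
        # so ask the controller for one repair attempt before archiving.
        return EvaluationResult(
            metrics={"combined_score": 0.0},
            artifacts={
                "stderr": str(e),
                "traceback": traceback.format_exc(),
                "retry_requested": True,  # hypothetical key checked by the retry logic
            },
        )
```

The retry logic from the previous comment could then check `artifacts.get("retry_requested")` instead of a global `retry_on_syntax_error` flag.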
-
I implemented a basic version of this idea. If you're interested, I can add some docs and make a proper PR. Are there other code paths that should have this logic? In my experience it seems quite effective with just one retry. It saves a good handful of iterations from polluting the database with uninteresting failures.
-
I'm not sure how common this is, but in the experiments I've been doing (relatively big program, mostly using Gemini 2.5 Pro and Claude 4 Opus), in most iterations the LLM gives back a program that at least works. But non-functional programs are somewhat common too (using a nonexistent API, syntax errors). As I understand it, the common practice in this case is to give the iteration some made-up low `combined_score`. That's OK, but it seems a bit of a waste. I wonder if anyone else thinks it might be useful to build in an additional special-case iteration with the LLM, where a prompt is generated that says something like "You gave me this code XXX, but it failed with this error YYY. Try to fix the error without significant logic changes".
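For instance (purely illustrative - the template wording just follows the sentence above, and the function name is a placeholder), such a fix-it prompt could be built like this:

```python
FIX_PROMPT_TEMPLATE = """\
You gave me this code:

{code}

but it failed with this error:

{error}

Try to fix the error without significant logic changes.
"""

def build_fix_prompt(code: str, error: str) -> str:
    # Fill in the failing program and the captured error text.
    return FIX_PROMPT_TEMPLATE.format(code=code, error=error)
```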