
Conversation

@nerdsane
Contributor

@nerdsane nerdsane commented Jun 3, 2025

Why

AlphaEvolve injects “rendered evaluation results – usually a program, the result of executing that program, and the scores assigned by the evaluate function” into every prompt, arguing that rich execution feedback speeds convergence.
OpenEvolve currently forwards only the numeric metrics dictionary, so the LLM never sees build logs, failing-test traces, performance profiles, or any other output that might be useful.

In an example I'm currently experimenting with, I used this to pass formal-model (TLA+) check output and memory-profiling results to the LLM. (That work is still in progress and not included in this PR, so I added the circle-packing example with a slightly updated evaluator to show how artifacts work.)

What’s in this PR

  • EvaluationResult dataclass — retains the original metrics dict and adds an optional artifacts field for text / binary payloads (a sketch of the intended shape follows this list).
  • Two-tier storage — artifacts ≤ 32 KB are JSON-encoded in a new artifacts_json column; larger blobs are written under artifact_dir/ on disk and referenced from the DB.
  • Prompt support — templates now accept {artifacts}; the sampler injects sanitized, size-capped content so the LLM sees exact failure text without blowing context.
  • Config & env flags – ENABLE_ARTIFACTS, max_artifact_bytes, and base-path knobs let users toggle or tune the feature with zero code changes.
  • Examples & tests — updated circle-packing example shows compile-failure recovery; 26 new tests cover unit, integration, and perf to keep coverage green.
  • Backward compatibility — plain metrics dicts are auto-wrapped, so existing tasks run unmodified.
  • No impact on selection logic — best-program ranking is still based on pure floats; artifacts are for the LLM only.
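
For illustration, here is a minimal sketch of the dataclass and the auto-wrapping described above; the field names `metrics` and `artifacts` match the example in the next section, while the `normalize` helper and the exact type hints are assumptions for this sketch rather than OpenEvolve's actual code.

    # Illustrative sketch only; metrics/artifacts field names match the PR,
    # the normalize() helper and type hints are assumptions.
    from dataclasses import dataclass, field
    from typing import Dict, Union

    @dataclass
    class EvaluationResult:
        # Numeric scores used for selection, exactly as before
        metrics: Dict[str, float]
        # Optional side-channel payloads (text or bytes) shown only to the LLM
        artifacts: Dict[str, Union[str, bytes]] = field(default_factory=dict)

    def normalize(result: Union[Dict[str, float], EvaluationResult]) -> EvaluationResult:
        """Auto-wrap a plain metrics dict so existing evaluators keep working."""
        if isinstance(result, EvaluationResult):
            return result
        return EvaluationResult(metrics=dict(result))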

How it works

  1. Evaluator returns

    return EvaluationResult(
        metrics={"build_ok": 0.0},
        artifacts={"stderr": compile_log}
    )
  2. DB stores metrics + artifact blob.

  3. Prompt sampler tacks on a block like

    ### Last-run stderr
    ...undefined reference to `foo`...
    

    giving the LLM concrete tokens to fix next round while selection logic still ranks on pure floats.
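
To make step 3 concrete, here is a hedged sketch of how a sampler could sanitize, cap, and render artifacts into the {artifacts} template placeholder; the function name, cap value, and truncation policy are assumptions for illustration, not the exact sampler code in this PR.

    # Illustrative only: names and the truncation policy are assumptions.
    MAX_ARTIFACT_CHARS = 2000  # assumed per-artifact cap to avoid blowing context

    def render_artifacts(artifacts: dict) -> str:
        sections = []
        for name, payload in artifacts.items():
            # Decode binary payloads defensively, replacing undecodable bytes
            if isinstance(payload, bytes):
                text = payload.decode("utf-8", errors="replace")
            else:
                text = str(payload)
            # Truncate oversized artifacts so the head of the log still fits
            if len(text) > MAX_ARTIFACT_CHARS:
                text = text[:MAX_ARTIFACT_CHARS] + "\n...[truncated]"
            sections.append(f"### Last-run {name}\n{text}")
        return "\n\n".join(sections)

    # Fills the {artifacts} slot of a prompt template
    prompt_block = render_artifacts({"stderr": "undefined reference to `foo`"})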

Impact

  • Richer context – faster self-repair of broken candidates and the ability to steer the LLM with additional data.
  • Zero breaking changes – old evaluators are auto-wrapped, and the artifacts pipeline can be disabled entirely.

@CLAassistant

CLAassistant commented Jun 3, 2025

CLA assistant check
All committers have signed the CLA.

@nerdsane nerdsane changed the title from "Feat artifact side channel" to "Feature: Artifact side channel" on Jun 3, 2025
@codelion
Member

codelion commented Jun 4, 2025

This PR addresses #37 as well I believe.

@codelion
Member

codelion commented Jun 4, 2025

Can you please update from main? We merged a couple of PRs that were already in testing.

@nerdsane
Contributor Author

nerdsane commented Jun 4, 2025

I resolved the merge conflicts that came with the changes from PRs #47 and #54, and cleaned up a unit test that had a Python version-specific assertion, so tests should now pass.

> This PR addresses #37 as well I believe.

Yes, any evaluation errors can now be included in the prompt.

@codelion codelion merged commit c779ac9 into algorithmicsuperintelligence:main Jun 9, 2025
3 checks passed
@nileshtrivedi

It might be good to document the convention for artifact names, e.g. build_stdout, build_stderr, run_stdout, run_stderr.
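
For instance, something like this (purely illustrative; the keys are only a suggested convention, not something the PR defines):

    # One possible naming convention for artifact keys; not defined by this PR
    artifacts = {
        "build_stdout": "...",  # captured stdout of the build step
        "build_stderr": "...",  # captured stderr of the build step
        "run_stdout": "...",    # captured stdout of the program run
        "run_stderr": "...",    # captured stderr of the program run
    }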
