feat(ai): cloud code agent evals #62

Draft
bartolomej wants to merge 1 commit into bart/sdk-agent from bart/sdk-agent-evals

Collaborator

bartolomej commented Apr 24, 2026

Evaluation framework (experimental)

An eval suite of 14 Harbor tasks tests the agent for data quality awareness (survivorship bias, temporal leakage, stale data), cost transparency, and correct SDK usage patterns. An LLM-as-judge scores each response on a weighted rubric. A self-improvement loop runs evals, edits the agent prompt, and re-runs to measure impact.
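To make the weighted-rubric scoring concrete, here is a minimal sketch of how per-criterion judge scores could be combined into one number. The criterion names, weights, and the `weighted_score` helper are illustrative assumptions, not the rubric or code in this PR; it also assumes the judge returns a score in [0, 1] per criterion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One rubric line. Names and weights below are hypothetical."""
    name: str
    weight: float


def weighted_score(criteria, judge_scores):
    """Combine per-criterion judge scores (each in [0, 1]) into a
    single weighted average in [0, 1]."""
    total_weight = sum(c.weight for c in criteria)
    if total_weight <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    for c in criteria:
        score = judge_scores[c.name]  # KeyError if the judge skipped a criterion
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score for {c.name!r} is outside [0, 1]: {score}")
    weighted_sum = sum(c.weight * judge_scores[c.name] for c in criteria)
    return weighted_sum / total_weight


# Hypothetical rubric mirroring the three areas named above.
RUBRIC = [
    Criterion("data_quality_awareness", 0.5),
    Criterion("cost_transparency", 0.25),
    Criterion("sdk_usage", 0.25),
]

if __name__ == "__main__":
    scores = {"data_quality_awareness": 1.0, "cost_transparency": 0.0, "sdk_usage": 1.0}
    print(weighted_score(RUBRIC, scores))  # 0.75
```

Normalizing by the total weight keeps the result in [0, 1] even if the weights do not sum to 1.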

Note: The eval infrastructure is a work in progress and results are not yet stable.

make eval-all    # run all 14 tasks
make autoagent   # run the self-improvement loop
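The self-improvement loop described above (run evals, edit the prompt, re-run) can be sketched roughly as follows. This is not the `make autoagent` implementation: `run_evals` and `propose_edit` are hypothetical callables standing in for the eval runner and the prompt editor, and keeping an edit only when the score improves is an assumption about the acceptance rule.

```python
from typing import Callable, List, Tuple


def improve_prompt(
    prompt: str,
    run_evals: Callable[[str], float],          # hypothetical: prompt -> aggregate eval score
    propose_edit: Callable[[str, float], str],  # hypothetical: (prompt, score) -> edited prompt
    rounds: int = 3,
) -> Tuple[str, float, List[float]]:
    """Run evals, try an edited prompt, re-run, and keep the edit only
    if the score goes up (assumed acceptance rule)."""
    best_prompt = prompt
    best_score = run_evals(prompt)
    history = [best_score]  # baseline first, then one entry per candidate

    for _ in range(rounds):
        candidate = propose_edit(best_prompt, best_score)
        score = run_evals(candidate)
        history.append(score)
        if score > best_score:
            best_prompt, best_score = candidate, score

    return best_prompt, best_score, history
```

Because results are noted as not yet stable, a real loop would likely need repeated runs per candidate before trusting a score difference; the sketch compares single runs for brevity.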

Collaborator Author

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

This stack of pull requests is managed by Graphite. Learn more about stacking.

bartolomej changed the title from "Add agent evals (Harbor + AutoAgent)" to "feat(ai): cloud code agent evals" on Apr 24, 2026