feat(ai): cloud code agent evals by bartolomej · Pull Request #62 · lightning-rod-labs/lightningrod-python-sdk

bartolomej · 2026-04-24T17:19:17Z

Evaluation framework (experimental)

An eval suite of 14 Harbor tasks tests the agent for data quality awareness (survivorship bias, temporal leakage, stale data), cost transparency, and correct SDK usage patterns. An LLM-as-judge scores each response on a weighted rubric. A self-improvement loop runs evals, edits the agent prompt, and re-runs to measure impact.
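As a rough illustration of what scoring "on a weighted rubric" can mean, here is a minimal sketch that folds per-criterion judge scores into one number. The criterion names, the weights, and the `weighted_score` helper are all hypothetical and are not taken from this PR; the actual rubric and judge output format may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One rubric line: a name and its relative weight (hypothetical structure)."""
    name: str
    weight: float


def weighted_score(criteria, judge_scores):
    """Combine per-criterion judge scores (each assumed in [0, 1]) into a weighted mean."""
    total_weight = sum(c.weight for c in criteria)
    if total_weight <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    missing = [c.name for c in criteria if c.name not in judge_scores]
    if missing:
        raise KeyError(f"judge did not score: {missing}")
    return sum(c.weight * judge_scores[c.name] for c in criteria) / total_weight


# Illustrative rubric only; names and weights are assumptions, not the PR's rubric.
RUBRIC = [
    Criterion("data_quality_awareness", 5),
    Criterion("cost_transparency", 2),
    Criterion("sdk_usage", 3),
]

example = weighted_score(
    RUBRIC,
    {"data_quality_awareness": 1.0, "cost_transparency": 0.5, "sdk_usage": 0.0},
)
```

Normalizing by the total weight keeps the result in the same range as the individual scores, so tasks whose rubrics have different weight totals stay comparable.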

Note: The eval infrastructure is a work in progress and results are not yet stable.

make eval-all    # run all 14 tasks
make autoagent   # run the self-improvement loop
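The self-improvement loop described above (run evals, edit the prompt, re-run) might be sketched as follows. This is not the code behind `make autoagent`: `improve_prompt`, `run_evals`, and `propose_edit` are hypothetical names, and keeping an edit only when the score improves is an assumption about how "measure impact" is used.

```python
from typing import Callable, List, Tuple


def improve_prompt(
    prompt: str,
    run_evals: Callable[[str], float],
    propose_edit: Callable[[str, float], str],
    rounds: int = 3,
) -> Tuple[str, float, List[float]]:
    """Hypothetical eval -> edit -> re-eval loop.

    run_evals(prompt) returns an aggregate score for the eval suite.
    propose_edit(prompt, score) returns a candidate revised prompt.
    Both are supplied by the caller; nothing here calls a real service.
    """
    best_prompt, best_score = prompt, run_evals(prompt)  # baseline run
    history = [best_score]
    for _ in range(rounds):
        candidate = propose_edit(best_prompt, best_score)
        score = run_evals(candidate)  # re-run to measure the edit's impact
        history.append(score)
        if score > best_score:  # assumption: keep only edits that improve the score
            best_prompt, best_score = candidate, score
    return best_prompt, best_score, history
```

Returning the full score history alongside the best prompt makes regressions visible, which matters while results are still unstable.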

bartolomej · 2026-04-24T17:19:33Z

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

feat(ai): cloud code agent evals #62 👈 (View in Graphite)
feat(ai): claude code agent #44
feat(training): update fine tuning examples to our API, linting v1 #43
main

This stack of pull requests is managed by Graphite. Learn more about stacking.

Add agent evals (Harbor + AutoAgent)

43b44fb

This was referenced Apr 24, 2026

feat(ai): claude code agent #44

Merged

feat(training): update fine tuning examples to our API, linting v1 #43

Merged

bartolomej changed the title ~~Add agent evals (Harbor + AutoAgent)~~ feat(ai): cloud code agent evals Apr 24, 2026


bartolomej wants to merge 1 commit into bart/sdk-agent from bart/sdk-agent-evals
