Eval Framework? #20
Unanswered
stobias123 asked this question in Q&A
Replies: 1 comment
Doing some tinkering and absolutely love it so far. Trying to learn more, and I'm curious if you guys have any docs on how you evaluate new functions / agents.
-
I have been building GAIA and SWE-bench Lite benchmark runners to evaluate the agent's overall ability. Individual prompts have so far just been tweaked manually based on observations. Evaluation and meta-prompting is an area I'm looking into (DSPy, etc.) to get a feel for what exists and what would be most suitable to integrate or build on. I have a few ideas I'd like to play with, so building evaluation datasets will be an important part. SWE-bench includes an "oracle" dataset that lists the files that need to be edited, which gives us a dataset for evaluating the functionality in selectFilesToEdit.ts; a sketch of how that might look is below.
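Since the oracle dataset pairs each task with the files its gold patch touches, one way to turn that into an eval is to score the file-selection step directly. Here is a minimal sketch in TypeScript; the oracle JSON shape (`instance_id`, `problem_statement`, `gold_files`) and the `selectFilesToEdit` import path and signature are assumptions for illustration, not the repo's actual API:

```ts
// Minimal eval sketch: score selectFilesToEdit against SWE-bench oracle file lists.
// ASSUMPTIONS (not from the repo): the oracle data has been exported to a JSON
// array of { instance_id, problem_statement, gold_files } records, and
// selectFilesToEdit has the signature (task: string) => Promise<string[]>.
import { readFileSync } from "fs";
import { selectFilesToEdit } from "./selectFilesToEdit"; // assumed import path/export

interface OracleCase {
  instance_id: string;
  problem_statement: string;
  gold_files: string[]; // files touched by the gold patch
}

async function evaluate(oraclePath: string): Promise<void> {
  const cases: OracleCase[] = JSON.parse(readFileSync(oraclePath, "utf8"));
  let meanRecall = 0;
  let meanPrecision = 0;

  for (const c of cases) {
    const predicted = await selectFilesToEdit(c.problem_statement);
    const gold = new Set(c.gold_files);
    const hits = predicted.filter((f) => gold.has(f)).length;
    // Recall: how many of the oracle's files did we select?
    const recall = gold.size > 0 ? hits / gold.size : 1;
    // Precision: how much of what we selected was actually needed?
    const precision = predicted.length > 0 ? hits / predicted.length : 0;
    meanRecall += recall / cases.length;
    meanPrecision += precision / cases.length;
    console.log(`${c.instance_id}: recall=${recall.toFixed(2)} precision=${precision.toFixed(2)}`);
  }

  console.log(`mean recall=${meanRecall.toFixed(3)} mean precision=${meanPrecision.toFixed(3)}`);
}

evaluate("swebench_oracle.json").catch(console.error);
```

In practice you would want to normalize paths before comparing, and recall is probably the metric to optimize: missing a required file is fatal to the downstream edit, while selecting an extra file is usually recoverable.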