Replies: 1 comment
From my point of view, the taxonomy question is one of the most valuable parts of this proposal because it changes evals from a pile of scenarios into something maintainers can actually reason about over time. I would still keep the first expansion tied to real workflow categories like debugging, review, and multi-step editing before pushing too far into cross-agent benchmarking. Once the suite describes the product well, the cross-agent comparison story becomes much more credible.
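As a rough sketch of what I mean (all names here are hypothetical, not from the actual suite), a category tag on each eval definition is enough to make coverage queryable per workflow over time:

```typescript
// Hypothetical sketch of a workflow-category taxonomy for behavioral evals.
// None of these names come from the real suite; they only illustrate the idea
// that each scenario carries a category maintainers can aggregate over.

type WorkflowCategory = 'debugging' | 'review' | 'multi-step-editing';

interface BehavioralEval {
  name: string;
  category: WorkflowCategory;
  run: () => Promise<boolean>; // true = agent behaved as expected
}

const suite: BehavioralEval[] = [
  {
    name: 'locates failing assertion from stack trace',
    category: 'debugging',
    run: async () => true, // placeholder; a real eval would drive the agent
  },
];

// Coverage per category: this is what turns "a pile of scenarios" into
// something that can be reasoned about and tracked across releases.
function coverageByCategory(evals: BehavioralEval[]): Map<WorkflowCategory, number> {
  const counts = new Map<WorkflowCategory, number>();
  for (const e of evals) {
    counts.set(e.category, (counts.get(e.category) ?? 0) + 1);
  }
  return counts;
}
```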
Hi, I'm Parin, a student at CMU studying AI. I'm interested in the Behavioral Evaluation Test Framework GSoC project.
I've been reading through the existing `evals/` infrastructure and the README. The `evalTest` + policy system and the `/fix-behavioral-eval` self-healing loop are really well designed. I have a few questions about the GSoC project's intended scope.
For context, my research involves designing cross-lingual evaluation benchmarks for multimodal LLMs, and I've published on evaluation methodology, so I'm familiar with the broader landscape of benchmarking and evaluating LLMs. I'm particularly interested in adding long-tail tasks so we could also measure behavior changes like "How well does the agent navigate larger codebases?" or other more complex scenarios.
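To make that concrete, here is a minimal sketch of the kind of long-tail scenario I have in mind. `runAgentScenario`, `repoFixture`, and the assertion shape are placeholders I invented for illustration, not the repo's actual `evalTest` API:

```typescript
// Hypothetical long-tail behavioral eval: does the agent find the right file
// in a large codebase without thrashing? All names are invented placeholders;
// the real suite's evalTest/policy API will differ.

interface EvalResult {
  filesOpened: string[];
  editsApplied: number;
}

// Stand-in for whatever harness actually drives the agent; a real
// implementation would launch the agent against the fixture repo.
async function runAgentScenario(options: {
  repoFixture: string; // path to a large, checked-in test repository
  prompt: string;      // the user request the agent must handle
}): Promise<EvalResult> {
  void options;
  return { filesOpened: ['src/pagination.ts'], editsApplied: 1 }; // stubbed result
}

async function largeCodebaseNavigationEval(): Promise<boolean> {
  const result = await runAgentScenario({
    repoFixture: 'fixtures/large-monorepo',
    prompt: 'Fix the off-by-one bug in the pagination helper',
  });
  // Behavioral assertion: the agent should touch the relevant file
  // without opening an excessive number of unrelated files.
  return (
    result.filesOpened.some((f) => f.includes('pagination')) &&
    result.filesOpened.length <= 10
  );
}
```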
I'm planning to submit a PR adding behavioral evals for currently untested tools as a starting contribution. Happy to align with whatever areas are highest priority!
Best Regards,
Parinthapat Pengpun
cc: @gundermanc