Brainstorm Benchmark-Based Roo-Tuning #1821
KJ7LNW started this conversation in Feature Requests
Roo has benchmarks! See #1614 for implementation details.
The discussion below is related to #1614, but focuses on brainstorming how @cte's benchmark tooling can be used to make Roo Code even better:
Check out the initial results:
@cte wrote:
To get this conversation started, here are some ideas:
Per-model instruction tuning, and ultimately per-model tool instructions. For example, the instructions that Qwen needs in order to use each tool successfully will probably differ from those needed by smarter models like Sonnet. Even among the "smart" models, many still have trouble using the tools properly. Benchmarking may reveal which model-specific instructions each model requires for each tool (see the first sketch after this list).
Iterative tuning, for example: automatic system-instruction refinement through benchmark iteration. One candidate system instruction might be "before using a tool, describe your intention in the way that you would explain it to a 12-year-old". Multiple benchmark runs can then be made while the explanation age is adjusted. The idea is that as the model adds explanatory context to the conversation history while working toward its goal, the meaning of its own words will guide it toward successfully reaching that goal. It is entirely possible that language targeted at different human developmental ages performs better or worse than others (see the second sketch after this list).
Benchmark analysis: entire benchmark outputs can be loaded into large-context models like Gemini, and questions can then be posed against them (see the third sketch after this list).
Orchestrator-based problem-solving: @hannesrudolph provided me with the following orchestration mode, and it is quite useful in organizing long-term goals by punting individual tasks to `new_task` workers; it would be very interesting to find out whether orchestration-based benchmarking is more successful than single-task benchmarks.
Release performance regression tracking (like Phoronix does for the Linux kernel).
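To make the per-model idea concrete, here is a minimal sketch of what a per-model tool-instruction override table could look like. The model IDs, override text, and the `mergeToolInstructions()` helper are hypothetical placeholders for discussion, not anything that exists in Roo Code today:

```typescript
// Hypothetical sketch: a per-model override table for tool instructions.
// Model IDs, override text, and mergeToolInstructions() are illustrative
// placeholders, not part of the current Roo Code codebase.

interface ModelInstructionProfile {
  // Extra guidance prepended to the generic instructions for each tool.
  toolOverrides: Record<string, string>;
}

const profiles: Record<string, ModelInstructionProfile> = {
  "qwen-2.5-coder": {
    toolOverrides: {
      apply_diff:
        "Always restate the exact file path and show the full search block before the replace block.",
    },
  },
  "claude-3.7-sonnet": {
    toolOverrides: {}, // smarter models may need no extra guidance
  },
};

// Merge the generic tool instructions with any model-specific override.
function mergeToolInstructions(modelId: string, toolName: string, genericInstructions: string): string {
  const override = profiles[modelId]?.toolOverrides[toolName];
  return override ? `${override}\n\n${genericInstructions}` : genericInstructions;
}
```

Benchmarking each model with and without its overrides would then show whether the extra guidance actually moves the score.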
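And here is a rough sketch of what the explanation-age sweep could look like, assuming the #1614 tooling exposes (or grows) some `runBenchmark()` entry point that accepts a system instruction; that function and the scoring shape are placeholders:

```typescript
// Hypothetical sketch: sweep the "explanation age" in the system
// instruction and compare benchmark scores across runs.

interface BenchmarkResult {
  passRate: number; // fraction of exercises solved in the run
}

// Placeholder for whatever entry point the #1614 tooling exposes;
// it returns a fake score here so the sketch is self-contained.
async function runBenchmark(systemInstruction: string): Promise<BenchmarkResult> {
  console.log("Would run benchmark with:", systemInstruction);
  return { passRate: Math.random() };
}

async function sweepExplanationAge(ages: number[]): Promise<void> {
  const results: { age: number; passRate: number }[] = [];

  for (const age of ages) {
    const instruction =
      `Before using a tool, describe your intention in the way that ` +
      `you would explain it to a ${age}-year-old.`;
    const { passRate } = await runBenchmark(instruction);
    results.push({ age, passRate });
  }

  // Report the explanation age that scored best on this sweep.
  results.sort((a, b) => b.passRate - a.passRate);
  console.log(`Best explanation age: ${results[0].age} (pass rate ${results[0].passRate})`);
}

sweepExplanationAge([8, 12, 16, 20]).catch(console.error);
```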
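For the benchmark-analysis idea, a small script along these lines could concatenate a run's outputs and hand them to Gemini. This assumes the `@google/generative-ai` Node SDK; the results-directory layout and the example question are made up for illustration:

```typescript
// Hypothetical sketch: feed an entire benchmark run to a large-context
// model and ask questions about it. The directory layout and the prompt
// are placeholders; adjust to however the #1614 tooling stores outputs.
import { readFileSync, readdirSync } from "fs";
import { join } from "path";
import { GoogleGenerativeAI } from "@google/generative-ai";

async function analyzeRun(resultsDir: string, question: string): Promise<string> {
  // Concatenate every transcript/result file from the benchmark run.
  const runDump = readdirSync(resultsDir)
    .map((name) => `--- ${name} ---\n` + readFileSync(join(resultsDir, name), "utf8"))
    .join("\n\n");

  const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
  const model = genAI.getGenerativeModel({ model: "gemini-1.5-pro" });

  const result = await model.generateContent(
    `Here is the full output of a Roo Code benchmark run:\n\n${runDump}\n\nQuestion: ${question}`
  );
  return result.response.text();
}

analyzeRun("./benchmark-results/latest", "Which tool calls failed most often, and why?")
  .then(console.log)
  .catch(console.error);
```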
Please share your thoughts and ideas on how benchmarking can be used to make Roo Code an even better tool!