Brainstorm Benchmark-Based Roo-Tuning #1821
KJ7LNW started this conversation in Feature Requests
Roo has benchmarks! See #1614 for implementation details.
The discussion below is related to #1614, but focuses on brainstorming how @cte's benchmark tooling can be used to make Roo Code even better:
Check out the initial results:
@cte wrote:
To get this conversation started, here are some ideas:
Per-model instruction tuning, and ultimately per-model tool instructions. For example, the instructions that Qwen needs in order to use each tool successfully will probably differ from those needed by smarter models like Sonnet. Even among the "smart" models, many still have trouble using the tools properly. Benchmarking may reveal which model-specific instructions each model requires for each tool (see the first sketch after this list).
Iterative tuning, for example: automatic system-instruction refinement through benchmark iteration. One candidate system instruction might be "before using a tool, describe your intention in the way that you would explain it to a 12-year-old". Multiple benchmark runs can then be made while the explanation age is adjusted. The idea is that as the model adds explanatory context to the conversation history while working toward its goal, the meaning of its own words will guide it toward successfully reaching that goal. It is entirely possible that language targeted at different human developmental ages performs better or worse than others (see the second sketch after this list).
Benchmark analysis: entire benchmark outputs can be loaded into large-context models like Gemini, and questions can then be posed against them (see the third sketch after this list).
Orchestrator-based problem-solving: @hannesrudolph provided me with the following orchestration mode, and it is quite useful in organizing long-term goals by punting individual tasks to `new_task` workers; it would be very interesting to find out whether orchestration-based benchmarking is more successful than single-task benchmarks.
Release performance regression tracking (like Phoronix does for the Linux kernel).
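To make the per-model idea concrete, here is a minimal sketch of what a per-model tool-instruction override table could look like. The model IDs, override text, and the `mergeToolInstructions()` helper are hypothetical placeholders for discussion, not anything that exists in Roo Code today:

```typescript
// Hypothetical sketch: a per-model override table for tool instructions.
// Model IDs, override text, and mergeToolInstructions() are illustrative
// placeholders, not part of the current Roo Code codebase.

interface ModelInstructionProfile {
  // Extra guidance prepended to the generic instructions for each tool.
  toolOverrides: Record<string, string>;
}

const profiles: Record<string, ModelInstructionProfile> = {
  "qwen-2.5-coder": {
    toolOverrides: {
      apply_diff:
        "Always restate the exact file path and show the full search block before the replace block.",
    },
  },
  "claude-3.7-sonnet": {
    toolOverrides: {}, // smarter models may need no extra guidance
  },
};

// Merge the generic tool instructions with any model-specific override.
function mergeToolInstructions(modelId: string, toolName: string, genericInstructions: string): string {
  const override = profiles[modelId]?.toolOverrides[toolName];
  return override ? `${override}\n\n${genericInstructions}` : genericInstructions;
}
```

Benchmarking each model with and without its overrides would then show whether the extra guidance actually moves the score.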
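And here is a rough sketch of what the explanation-age sweep could look like, assuming the #1614 tooling exposes (or grows) some `runBenchmark()` entry point that accepts a system instruction; that function and the scoring shape are placeholders:

```typescript
// Hypothetical sketch: sweep the "explanation age" in the system
// instruction and compare benchmark scores across runs.

interface BenchmarkResult {
  passRate: number; // fraction of exercises solved in the run
}

// Placeholder for whatever entry point the #1614 tooling exposes;
// it returns a fake score here so the sketch is self-contained.
async function runBenchmark(systemInstruction: string): Promise<BenchmarkResult> {
  console.log("Would run benchmark with:", systemInstruction);
  return { passRate: Math.random() };
}

async function sweepExplanationAge(ages: number[]): Promise<void> {
  const results: { age: number; passRate: number }[] = [];

  for (const age of ages) {
    const instruction =
      `Before using a tool, describe your intention in the way that ` +
      `you would explain it to a ${age}-year-old.`;
    const { passRate } = await runBenchmark(instruction);
    results.push({ age, passRate });
  }

  // Report the explanation age that scored best on this sweep.
  results.sort((a, b) => b.passRate - a.passRate);
  console.log(`Best explanation age: ${results[0].age} (pass rate ${results[0].passRate})`);
}

sweepExplanationAge([8, 12, 16, 20]).catch(console.error);
```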
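For the benchmark-analysis idea, a small script along these lines could concatenate a run's outputs and hand them to Gemini. This assumes the `@google/generative-ai` Node SDK; the results-directory layout and the example question are made up for illustration:

```typescript
// Hypothetical sketch: feed an entire benchmark run to a large-context
// model and ask questions about it. The directory layout and the prompt
// are placeholders; adjust to however the #1614 tooling stores outputs.
import { readFileSync, readdirSync } from "fs";
import { join } from "path";
import { GoogleGenerativeAI } from "@google/generative-ai";

async function analyzeRun(resultsDir: string, question: string): Promise<string> {
  // Concatenate every transcript/result file from the benchmark run.
  const runDump = readdirSync(resultsDir)
    .map((name) => `--- ${name} ---\n` + readFileSync(join(resultsDir, name), "utf8"))
    .join("\n\n");

  const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
  const model = genAI.getGenerativeModel({ model: "gemini-1.5-pro" });

  const result = await model.generateContent(
    `Here is the full output of a Roo Code benchmark run:\n\n${runDump}\n\nQuestion: ${question}`
  );
  return result.response.text();
}

analyzeRun("./benchmark-results/latest", "Which tool calls failed most often, and why?")
  .then(console.log)
  .catch(console.error);
```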
Please share your thoughts and ideas on how benchmarking can be used to make Roo Code an even better tool!