# Roo Code Evals

A comprehensive framework for evaluating the performance of the Roo Code extension on programming exercises across multiple languages.

## Table of Contents

- [Overview](#overview)
- [Architecture](#architecture)
- [Setup Process](#setup-process)
    - [First-Time Setup](#first-time-setup)
    - [Already Installed](#already-installed)
- [Running Evaluations](#running-evaluations)
    - [Using the Web UI](#using-the-web-ui)
    - [Using the CLI](#using-the-cli)
- [Understanding Results](#understanding-results)
- [Project Structure](#project-structure)
- [Troubleshooting](#troubleshooting)
- [Advanced Usage](#advanced-usage)

## Overview

Roo Code Evals is a system for running automated evaluations of the Roo Code extension against programming exercises in various languages. It helps measure the performance, accuracy, and efficiency of the AI coding assistant across different programming tasks.

The system supports evaluations in:

- Go
- Java
- JavaScript
- Python
- Rust

## Architecture

The evals system consists of several interconnected components:

1. **CLI Application** (`apps/cli`):

    - Launches VS Code instances with the Roo Code extension
    - Sends prompts to the extension
    - Collects results and metrics
    - Runs unit tests to verify solutions

2. **Web UI** (`apps/web`):

    - Next.js application for viewing and managing evaluation runs
    - Displays detailed metrics and results
    - Allows creating new evaluation runs with custom settings

3. **Database** (`packages/db`):

    - SQLite database for storing evaluation data
    - Tracks runs, tasks, metrics, and errors
    - Located at `/tmp/evals.db` by default (see the inspection sketch below)

4. **IPC System** (`packages/ipc`):

    - Facilitates communication between processes
    - Uses Unix sockets for local communication

5. **Exercises Repository**:
    - Separate Git repository containing exercise prompts and test cases
    - Cloned during setup to a location outside the main Roo Code repository
    - Located at `../../evals` relative to this directory (a sibling of the Roo-Code checkout)

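Because the database is plain SQLite, you can inspect it directly with the stock `sqlite3` CLI while a run is in progress. A minimal sketch; the table names depend on the schema in `packages/db`, so list them first (the `runs` query below uses a hypothetical table name):

```sh
# discover the tables defined by packages/db
sqlite3 /tmp/evals.db '.tables'

# then query whichever table holds runs; "runs" here is a hypothetical name
sqlite3 /tmp/evals.db 'SELECT * FROM runs ORDER BY id DESC LIMIT 5;'
```
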
## Setup Process

### First-Time Setup

If you're setting up the evals system for the first time:

1. **Clone the Roo Code repository**:

    ```sh
    git clone https://github.com/RooVetGit/Roo-Code.git
    cd Roo-Code
    ```

2. **Run the setup script**:

    ```sh
    cd evals
    ./scripts/setup.sh
    ```

    This script will:

    - Check for Node.js v20.18.1+ (and install/configure it if needed)
    - Launch the interactive `setup.mjs` script, which will:
        - Check for and install required dependencies (Python, Go, Rust, Java)
        - Install package managers (pnpm, asdf, and Homebrew on macOS)
        - Clone the exercises repository to `../../evals`
        - Set up the `.env` file with your OpenRouter API key (sketched below)
        - Create and sync the database
        - Build and install the Roo Code extension if needed
        - Install required VS Code extensions
        - Offer to start the web UI immediately

3. **If you chose to start the web UI during setup**:
    - The web server will be running at http://localhost:3000
    - You can access it in your browser to create and view evaluation runs

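For reference, the `.env` file the setup script writes is a plain key/value file. A minimal sketch, assuming the conventional variable name (verify against the file the script actually generates):

```sh
# evals/.env — variable name is an assumption; check the generated file
OPENROUTER_API_KEY=sk-or-...
```
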
### Already Installed

If you've already completed the setup process and want to run the evals system again:

1. **Navigate to the evals directory**:

    ```sh
    cd /path/to/Roo-Code/evals
    ```

2. **Start the web UI**:

    ```sh
    pnpm web
    ```

3. **Access the web interface**:
    - Open http://localhost:3000 in your browser

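If `pnpm web` fails on a machine that was set up earlier, a quick check of the setup artifacts described above can save time:

```sh
ls -l /tmp/evals.db    # database created during setup still exists
ls ../../evals         # exercises repository is still in place
grep -c . .env         # .env is present and non-empty
pnpm install           # refresh dependencies if the workspace has changed
```
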
## Running Evaluations

### Using the Web UI

1. **Start the web UI** (if not already running):

    ```sh
    cd /path/to/Roo-Code/evals
    pnpm web
    ```

2. **Create a new evaluation run**:

    - Click the "New Evaluation Run" button (rocket icon)
    - Select a model from the OpenRouter models list
    - Choose evaluation settings:
        - **All**: Run all exercises for all languages
        - **Some**: Select specific language/exercise combinations
    - Set the concurrency level (how many evaluations to run in parallel)
    - Add an optional description
    - Click "Launch" to start the evaluation

3. **Monitor progress**:

    - The run details page shows real-time progress for each task
    - Metrics are updated as tasks complete
    - You can navigate away and come back later; the evaluation continues in the background

4. **View results**:

    - When tasks complete, you'll see their pass/fail status
    - Detailed metrics include:
        - Token usage (input, output, context)
        - Cost
        - Duration
        - Tool usage statistics

5. **Export results**:
    - Use the "Export CSV" button to download results for further analysis

### Using the CLI

For more advanced usage or automation, you can use the CLI directly:

1. **Run a specific exercise**:

    ```sh
    cd /path/to/Roo-Code/evals
    pnpm cli run javascript fibonacci
    ```

2. **Run all exercises for a language**:

    ```sh
    pnpm cli run python all
    ```

3. **Run all exercises for all languages**:

    ```sh
    pnpm cli run all
    ```

4. **Resume a previous run**:

    ```sh
    pnpm cli run --runId=123
    ```

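Because these are ordinary shell commands, they compose cleanly into scripts. The sketch below is equivalent to `pnpm cli run all`, but a loop like this lets you insert per-language steps (logging, cleanup, and so on):

```sh
# run every language's exercise set sequentially
for lang in go java javascript python rust; do
    echo "=== $lang ==="
    pnpm cli run "$lang" all
done
```
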
## Understanding Results

The evaluation results provide several key metrics:

- **Pass/Fail Status**: Whether the solution passed the unit tests
- **Token Usage**:
    - Tokens In: Prompt size in tokens
    - Tokens Out: Response size in tokens
    - Context Tokens: Size of the context window used
- **Cost**: Estimated cost of the API calls
- **Duration**: Time taken to complete the task
- **Tool Usage**: Statistics on tool usage (e.g., `apply_diff` success rate)

These metrics help you understand:

- How effectively the model solves different types of problems
- The efficiency of the solution process
- Cost implications of different models and settings

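As a sanity check on the reported cost, you can reproduce the estimate by hand: cost ≈ tokens in × input rate + tokens out × output rate. A back-of-the-envelope sketch with illustrative (not real) per-million-token rates:

```sh
# hypothetical rates: $3 per million input tokens, $15 per million output tokens
tokens_in=12000
tokens_out=3500
echo "scale=4; ($tokens_in * 3 + $tokens_out * 15) / 1000000" | bc
# prints .0885 (dollars)
```
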
## Project Structure

```
evals/
├── apps/
│   ├── cli/        # Command-line interface for running evaluations
│   │   └── src/    # CLI source code
│   └── web/        # Web interface for viewing results
│       └── src/    # Next.js web application
├── packages/
│   ├── db/         # Database schema and queries
│   ├── ipc/        # Inter-process communication
│   ├── lib/        # Shared utilities
│   └── types/      # TypeScript type definitions
├── scripts/        # Setup and utility scripts
│   ├── setup.sh    # Main setup script (entry point)
│   └── setup.mjs   # Node.js setup script (called by setup.sh)
└── README.md       # This file
```

The exercises repository (cloned during setup) is located at `../../evals` relative to this directory.

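The repository is a pnpm workspace, so scripts in individual packages can be run from this directory with `--filter`; the database package's `db:push` script (also used under Troubleshooting below) is one example:

```sh
# run a single workspace package's script from the evals/ root
pnpm --filter @evals/db db:push
```
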
## Troubleshooting

### Common Issues

1. **Web UI not starting**:

    - Check if another process is using port 3000
    - Ensure you're in the correct directory (`/path/to/Roo-Code/evals`)
    - Try running `pnpm install` to ensure dependencies are installed

2. **VS Code not launching during evaluation**:

    - Ensure VS Code is installed and the `code` command is on your PATH
    - Check if VS Code is already running with too many windows
    - Try restarting VS Code

3. **Database errors**:

    - Check that the database file exists at `/tmp/evals.db`
    - Ensure it has the correct permissions
    - Try running `pnpm --filter @evals/db db:push` to recreate the schema

4. **API key issues**:

    - Verify your OpenRouter API key is correctly set in the `.env` file
    - Check that the key has sufficient credits
    - Try validating the key with a direct API call (see the snippet after this list)

5. **Missing exercises**:
    - Ensure the exercises repository was cloned correctly to `../../evals`
    - Check that you have the latest version with `git -C ../../evals pull`

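Most of the checks above can be run straight from the shell. A combined sketch; the OpenRouter endpoint is assumed from their public API documentation, so verify it before relying on the result:

```sh
lsof -i :3000                          # what is holding port 3000?
command -v code                        # is the VS Code CLI on the PATH?
ls -l /tmp/evals.db                    # does the database exist with sane permissions?
git -C ../../evals log -1 --oneline    # is the exercises repo present and current?

# validate the API key with a direct call (assumed endpoint)
curl -s https://openrouter.ai/api/v1/models \
    -H "Authorization: Bearer $OPENROUTER_API_KEY" | head -c 200
```
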
### Logs

When troubleshooting, check these logs:

- **CLI logs**: Output in the terminal where you ran the CLI
- **Web logs**:
    - Server logs in the terminal where you ran `pnpm web`
    - Browser console logs (F12 in most browsers)
- **VS Code logs**: Help > Toggle Developer Tools in VS Code

## Advanced Usage

### Custom Settings

You can import custom Roo Code settings when creating a new evaluation run:

1. Export your settings from VS Code
2. Click "Import Settings" on the new run page
3. Select your exported settings file

### Running Specific Exercises

To run only specific exercises:

1. In the web UI, select "Some" instead of "All" in the exercises dropdown
2. Select the specific language/exercise combinations you want to evaluate

### Comparing Models

To compare different models:

1. Create separate runs with different models
2. View the results side by side
3. Export the results to CSV for detailed comparison (see the sketch below)

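The CSV exports also lend themselves to quick shell-level comparison. A sketch, assuming two exported files named `run-a.csv` and `run-b.csv`; both the file names and the column layout are illustrative, so inspect the real header before adapting the `awk` line:

```sh
# check the actual column layout first
head -1 run-a.csv

# hypothetical: count passing tasks, assuming a "passed" flag in column 5
for f in run-a.csv run-b.csv; do
    awk -F, 'NR > 1 && $5 == "true" { n++ } END { printf "%s: %d passed\n", FILENAME, n }' "$f"
done
```
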
### Modifying Exercises

If you want to create or modify exercises:

1. Navigate to the exercises repository (`../../evals`)
2. Add or modify exercises following the existing structure
3. Commit and push your changes if you want to share them

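Since the exercises live in an ordinary Git repository, the usual branch/commit/push workflow applies. A sketch with an illustrative branch name; mirror an existing exercise's layout rather than inventing one:

```sh
cd ../../evals
ls javascript/                    # study an existing exercise's structure first
git checkout -b add-my-exercise   # hypothetical branch name
# ...add prompt and test files mirroring an existing exercise...
git add .
git commit -m "Add a new exercise"
git push origin add-my-exercise
```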