
Commit 3f2c7a8

Incorporate evals changes from fixing-evals
1 parent 1504beb commit 3f2c7a8

27 files changed: +3900 −295 lines

evals/README.md

Lines changed: 295 additions, 13 deletions
# Roo Code Evals

A comprehensive framework for evaluating the performance of the Roo Code extension on programming exercises across multiple languages.

## Table of Contents

- [Overview](#overview)
- [Architecture](#architecture)
- [Setup Process](#setup-process)
  - [First-Time Setup](#first-time-setup)
  - [Already Installed](#already-installed)
- [Running Evaluations](#running-evaluations)
  - [Using the Web UI](#using-the-web-ui)
  - [Using the CLI](#using-the-cli)
- [Understanding Results](#understanding-results)
- [Project Structure](#project-structure)
- [Troubleshooting](#troubleshooting)
- [Advanced Usage](#advanced-usage)

## Overview

Roo Code Evals is a system for running automated evaluations of the Roo Code extension against programming exercises in various languages. It helps measure the performance, accuracy, and efficiency of the AI coding assistant across different programming tasks.

The system supports evaluations in:

- Go
- Java
- JavaScript
- Python
- Rust

## Architecture

The evals system consists of several interconnected components:

1. **CLI Application** (`apps/cli`):

   - Launches VS Code instances with the Roo Code extension
   - Sends prompts to the extension
   - Collects results and metrics
   - Runs unit tests to verify solutions

2. **Web UI** (`apps/web`):

   - Next.js application for viewing and managing evaluation runs
   - Displays detailed metrics and results
   - Allows creating new evaluation runs with custom settings

3. **Database** (`packages/db`):

   - SQLite database for storing evaluation data
   - Tracks runs, tasks, metrics, and errors
   - Located at `/tmp/evals.db` by default (see the inspection sketch after this list)

4. **IPC System** (`packages/ipc`):

   - Facilitates communication between processes
   - Uses Unix sockets for local communication

5. **Exercises Repository**:
   - Separate Git repository containing exercise prompts and test cases
   - Cloned during setup to a location outside the main Roo Code repository
   - Located at `../../evals` relative to the Roo Code repository
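
Since the database is plain SQLite, you can inspect it directly when debugging. A minimal sketch, assuming the default `/tmp/evals.db` location from above and that the `sqlite3` command-line tool is installed (the tool itself is not part of the evals setup):

```sh
# List the tables the evals schema created (names vary by schema version).
sqlite3 /tmp/evals.db ".tables"

# Dump the schema to see which columns are available before querying.
sqlite3 /tmp/evals.db ".schema"
```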

## Setup Process

### First-Time Setup

If you're setting up the evals system for the first time:

1. **Clone the Roo Code repository**:

   ```sh
   git clone https://github.com/RooVetGit/Roo-Code.git
   cd Roo-Code
   ```

2. **Run the setup script**:

   ```sh
   cd evals
   ./scripts/setup.sh
   ```

   This script will:

   - Check for Node.js v20.18.1+ (and install/configure it if needed)
   - Launch the interactive setup.mjs script, which will:
     - Check for and install required dependencies (Python, Go, Rust, Java)
     - Install package managers (pnpm, asdf, and Homebrew if on macOS)
     - Clone the exercises repository to `../../evals`
     - Set up the `.env` file with your OpenRouter API key
     - Create and sync the database
     - Build and install the Roo Code extension if needed
     - Install required VS Code extensions
     - Offer to start the web UI immediately

3. **If you chose to start the web UI during setup**:
   - The web server will be running at http://localhost:3000
   - You can access it in your browser to create and view evaluation runs
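
After setup finishes, a quick sanity check of the prerequisites can save debugging time later. A minimal sketch using only standard commands; the `.env` filename and the `../../evals` clone location come from the setup steps above, and the version expectation mirrors the Node.js requirement mentioned there:

```sh
node --version                        # expect v20.18.1 or newer
code --version                        # confirms the `code` CLI is on your PATH
ls ../../evals                        # the exercises repository cloned by setup
test -f .env && echo ".env present"   # run this from the evals directory
```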

### Already Installed

If you've already completed the setup process and want to run the evals system again:

1. **Navigate to the evals directory**:

   ```sh
   cd /path/to/Roo-Code/evals
   ```

2. **Start the web UI**:

   ```sh
   pnpm web
   ```

3. **Access the web interface**:
   - Open http://localhost:3000 in your browser
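
If some time has passed since you last ran the system, it may help to refresh everything first. A minimal sketch, assuming a standard pnpm workspace checkout (the exercises-repository pull reuses the `git -C ../../evals pull` command from the Troubleshooting section below):

```sh
cd /path/to/Roo-Code/evals
git pull                  # update the Roo Code checkout
git -C ../../evals pull   # update the exercises repository
pnpm install              # refresh workspace dependencies
pnpm web                  # start the web UI on http://localhost:3000
```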

## Running Evaluations

### Using the Web UI

1. **Start the web UI** (if not already running):

   ```sh
   cd /path/to/Roo-Code/evals
   pnpm web
   ```

2. **Create a new evaluation run**:

   - Click the "New Evaluation Run" button (rocket icon)
   - Select a model from the OpenRouter models list
   - Choose evaluation settings:
     - **All**: Run all exercises for all languages
     - **Some**: Select specific language/exercise combinations
   - Set the concurrency level (how many evaluations to run in parallel)
   - Add an optional description
   - Click "Launch" to start the evaluation

3. **Monitor progress**:

   - The run details page shows real-time progress for each task
   - Metrics are updated as tasks complete
   - You can navigate away and come back later; the evaluation continues in the background

4. **View results**:

   - When tasks complete, you'll see their pass/fail status
   - Detailed metrics include:
     - Token usage (input, output, context)
     - Cost
     - Duration
     - Tool usage statistics

5. **Export results**:
   - Use the "Export CSV" button to download results for further analysis
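
To script a quick check that the web UI is actually up before opening a browser, a minimal sketch with `curl` (nothing evals-specific here; it only confirms something is answering on port 3000):

```sh
# Print the HTTP status line returned by the local web UI.
curl -sI http://localhost:3000 | head -1
```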

### Using the CLI

For more advanced usage or automation, you can use the CLI directly:

1. **Run a specific exercise**:

   ```sh
   cd /path/to/Roo-Code/evals
   pnpm cli run javascript fibonacci
   ```

2. **Run all exercises for a language**:

   ```sh
   pnpm cli run python all
   ```

3. **Run all exercises for all languages**:

   ```sh
   pnpm cli run all
   ```

4. **Resume a previous run**:

   ```sh
   pnpm cli run --runId=123
   ```
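
Because the CLI is scriptable, it lends itself to simple batch automation. A minimal sketch that runs the full exercise suite for several languages in sequence, using only the `pnpm cli run <language> all` form shown above (the language list is just an example):

```sh
#!/usr/bin/env sh
cd /path/to/Roo-Code/evals

# Run every exercise for each language, one language at a time.
for lang in go python rust; do
    pnpm cli run "$lang" all
done
```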

## Understanding Results

The evaluation results provide several key metrics:

- **Pass/Fail Status**: Whether the solution passed the unit tests
- **Token Usage**:
  - Tokens In: Prompt size in tokens
  - Tokens Out: Response size in tokens
  - Context Tokens: Size of the context window used
- **Cost**: Estimated cost of the API calls
- **Duration**: Time taken to complete the task
- **Tool Usage**: Statistics on tool usage (e.g., `apply_diff` success rate)

These metrics help you understand:

- How effectively the model solves different types of problems
- The efficiency of the solution process
- The cost implications of different models and settings
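
Once a run is exported to CSV from the web UI, ordinary command-line tools are enough for quick aggregate checks. A minimal sketch; the column names `passed` and `cost` and their positions are hypothetical placeholders, so inspect the header row of your actual export and adjust the field numbers:

```sh
# Inspect the header row to learn the real column names and positions.
head -1 results.csv

# Hypothetical example: count passes and sum cost, assuming column 3 is
# `passed` and column 4 is `cost` in the exported CSV.
awk -F, 'NR > 1 { if ($3 == "true") pass++; cost += $4 } END { print pass, "passed, total cost", cost }' results.csv
```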

## Project Structure

```
evals/
├── apps/
│   ├── cli/          # Command-line interface for running evaluations
│   │   └── src/      # CLI source code
│   └── web/          # Web interface for viewing results
│       └── src/      # Next.js web application
├── packages/
│   ├── db/           # Database schema and queries
│   ├── ipc/          # Inter-process communication
│   ├── lib/          # Shared utilities
│   └── types/        # TypeScript type definitions
├── scripts/          # Setup and utility scripts
│   ├── setup.sh      # Main setup script (entry point)
│   └── setup.mjs     # Node.js setup script (called by setup.sh)
└── README.md         # This file
```

The exercises repository (cloned during setup) is located at `../../evals` relative to this directory.

## Troubleshooting

### Common Issues

1. **Web UI not starting**:

   - Check whether another process is using port 3000 (see the sketch after this list)
   - Ensure you're in the correct directory (`/path/to/Roo-Code/evals`)
   - Try running `pnpm install` to ensure dependencies are installed

2. **VS Code not launching during evaluation**:

   - Ensure VS Code is installed and the `code` command is in your PATH
   - Check whether VS Code is already running with too many windows
   - Try restarting VS Code

3. **Database errors**:

   - Check that the database file exists at `/tmp/evals.db`
   - Ensure it has the correct permissions
   - Try running `pnpm --filter @evals/db db:push` to recreate the schema

4. **API key issues**:

   - Verify your OpenRouter API key is correctly set in the `.env` file
   - Check whether the key has sufficient credits
   - Try validating the key with a direct API call

5. **Missing exercises**:
   - Ensure the exercises repository was cloned correctly to `../../evals`
   - Make sure you have the latest version with `git -C ../../evals pull`
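
To track down the port-3000 conflict from issue 1, a minimal sketch using standard tools (`PORT=3001 pnpm web` assumes the web app honors the conventional Next.js `PORT` environment variable, which is an assumption rather than a documented evals feature):

```sh
# Find the process currently bound to port 3000.
lsof -i :3000

# Either stop that process, or start the web UI on another port.
PORT=3001 pnpm web   # assumes the standard Next.js PORT convention
```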

### Logs

When troubleshooting, check these logs:

- **CLI logs**: Output in the terminal where you ran the CLI
- **Web logs**:
  - Server logs in the terminal where you ran `pnpm web`
  - Browser console logs (F12 in most browsers)
- **VS Code logs**: Help > Toggle Developer Tools in VS Code
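
Terminal output is easy to lose, so it can help to capture it to a file while reproducing a problem. A minimal sketch (the log filenames are arbitrary):

```sh
# Capture CLI output to a file while still watching it live.
pnpm cli run javascript fibonacci 2>&1 | tee cli.log

# Same idea for the web server logs.
pnpm web 2>&1 | tee web.log
```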

## Advanced Usage

### Custom Settings

You can import custom Roo Code settings when creating a new evaluation run:

1. Export your settings from VS Code
2. Click "Import Settings" on the new run page
3. Select your exported settings file

### Running Specific Exercises

To run only specific exercises:

1. In the web UI, select "Some" instead of "All" in the exercises dropdown
2. Select the specific language/exercise combinations you want to evaluate

### Comparing Models

To compare different models:

1. Create separate runs with different models
2. View the results side by side
3. Export the results to CSV for detailed comparison

### Modifying Exercises

If you want to create or modify exercises:

1. Navigate to the exercises repository (`../../evals`)
2. Add or modify exercises following the existing structure
3. Commit and push your changes if you want to share them
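
A minimal sketch of that workflow using plain Git commands (the branch name and commit message are hypothetical; copy the directory layout of a neighboring exercise rather than inventing your own):

```sh
cd ../../evals
git checkout -b my-new-exercise    # hypothetical branch name

# ...add or edit exercise prompts and test cases here,
# mirroring the structure of an existing exercise...

git add .
git commit -m "Add new exercise"
git push -u origin my-new-exercise
```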
