Commit 2bf4347

Document how to add evals (RooCodeInc#4470)
1 parent 73ed9f2 commit 2bf4347

1 file changed: +305 -0 lines changed

packages/evals/ADDING-EVALS.md

Lines changed: 305 additions & 0 deletions
@@ -0,0 +1,305 @@
# Adding Additional Evals Exercises

This guide explains how to add new coding exercises to the Roo Code evals system. The evals system is a distributed evaluation platform that runs AI coding tasks in isolated VS Code environments to test AI coding capabilities across multiple programming languages.

## Table of Contents

1. [What is an "Eval"?](#what-is-an-eval)
2. [System Overview](#system-overview)
3. [Adding Exercises to Existing Languages](#adding-exercises-to-existing-languages)
4. [Adding Support for New Programming Languages](#adding-support-for-new-programming-languages)

## What is an "Eval"?

An **eval** (evaluation) is a coding exercise whose known solution is expressed as a set of unit tests: a submission counts as correct only if all of those tests pass. Each eval consists of:

- **Problem Description**: Clear instructions explaining what needs to be implemented
- **Implementation Stub**: A skeleton file with function signatures but no implementation
- **Unit Tests**: Comprehensive test suite that validates the correctness of the solution
- **Success Criteria**: The AI must implement the solution such that all unit tests pass

The key principle is that the tests define the contract - if all tests pass, the solution is considered correct. This provides an objective, automated way to measure AI coding performance across different programming languages and problem domains.

**Example Flow**:

1. AI receives a problem description (e.g., "implement a function that reverses a string")
2. AI examines the stub implementation and test file
3. AI writes code to make all tests pass
4. System runs tests to verify correctness
5. Success is measured by test pass/fail rate

## System Overview

The evals system consists of several key components:

- **Exercises Repository**: [`Roo-Code-Evals`](https://github.com/RooCodeInc/Roo-Code-Evals) - Contains all exercise definitions
- **Web Interface**: [`apps/web-evals`](../../apps/web-evals) - Management interface for creating and monitoring evaluation runs
- **Evals Package**: [`packages/evals`](./) - Contains both controller logic for orchestrating evaluation runs and runner container code for executing individual tasks
- **Docker Configuration**: Container definitions for the `controller` and `runner`, as well as a Docker Compose file that provisions the Postgres and Redis instances required for eval runs

### Current Language Support

The system currently supports these programming languages:

- **Go** - `go test` for testing
- **Java** - Maven/Gradle for testing
- **JavaScript** - Node.js with Jest/Mocha
- **Python** - pytest for testing
- **Rust** - `cargo test` for testing

## Adding Exercises to Existing Languages

TL;DR - Here's a pull request that adds a new JavaScript eval: https://github.com/RooCodeInc/Roo-Code-Evals/pull/3

### Step 1: Understand the Exercise Structure

Each exercise follows a standardized directory structure:

```
/evals/{language}/{exercise-name}/
├── docs/
│   ├── instructions.md          # Main exercise description
│   └── instructions.append.md   # Additional instructions (optional)
├── {exercise-name}.{ext}        # Implementation stub
├── {exercise-name}_test.{ext}   # Test file
└── {language-specific-files}    # go.mod, package.json, etc.
```
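
Before moving on, it can help to confirm that a new exercise directory actually contains the pieces listed above. The following is a minimal sketch, not part of the evals tooling: the helper name `missing_exercise_files` is hypothetical, and the stub and test file names are passed in explicitly because naming conventions differ between languages.

```python
from pathlib import Path


def missing_exercise_files(exercise_dir, stub_name, test_name):
    """Return the required paths (relative to exercise_dir) that do not exist.

    Hypothetical helper for illustration; the evals system does not ship it.
    """
    required = ["docs/instructions.md", stub_name, test_name]
    root = Path(exercise_dir)
    # Keep the order of `required` so the report reads top-down like the tree.
    return [rel for rel in required if not (root / rel).is_file()]
```

For example, `missing_exercise_files("python/reverse-string", "reverse_string.py", "reverse_string_test.py")` returns an empty list once the instructions, stub, and test file are all in place.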
67+
68+
### Step 2: Create Exercise Directory
69+
70+
1. **Clone the evals repository**:
71+
72+
```bash
73+
git clone https://github.com/RooCodeInc/Roo-Code-Evals.git evals
74+
cd evals
75+
```
76+
77+
2. **Create exercise directory**:
78+
```bash
79+
mkdir {language}/{exercise-name}
80+
cd {language}/{exercise-name}
81+
```
82+
83+
### Step 3: Write Exercise Instructions
84+
85+
Create `docs/instructions.md` with a clear problem description:
86+
87+
```markdown
88+
# Instructions
89+
90+
Create an implementation of [problem description].
91+
92+
## Problem Description
93+
94+
[Detailed explanation of what needs to be implemented]
95+
96+
## Examples
97+
98+
- Input: [example input]
99+
- Output: [expected output]
100+
101+
## Constraints
102+
103+
- [Any constraints or requirements]
104+
```
105+
106+
**Example from a simple reverse-string exercise**:
107+
108+
```markdown
109+
# Instructions
110+
111+
Create a function that reverses a string.
112+
113+
## Problem Description
114+
115+
Write a function called `reverse` that takes a string as input and returns the string with its characters in reverse order.
116+
117+
## Examples
118+
119+
- Input: `reverse("hello")` → Output: `"olleh"`
120+
- Input: `reverse("world")` → Output: `"dlrow"`
121+
- Input: `reverse("")` → Output: `""`
122+
- Input: `reverse("a")` → Output: `"a"`
123+
124+
## Constraints
125+
126+
- Input will always be a valid string
127+
- Empty strings should return empty strings
128+
```
129+
130+
### Step 4: Create Implementation Stub
131+
132+
Create the main implementation file with function signatures but no implementation:
133+
134+
**Python example** (`reverse_string.py`):
135+
136+
```python
137+
def reverse(text):
138+
pass
139+
```
140+
141+
**Go example** (`reverse_string.go`):
142+
143+
```go
144+
package reversestring
145+
146+
// Reverse returns the input string with its characters in reverse order
147+
func Reverse(s string) string {
148+
// TODO: implement
149+
return ""
150+
}
151+
```

### Step 5: Write Comprehensive Tests

Create test files that validate the implementation:

**Python example** (`reverse_string_test.py`):

```python
import unittest
from reverse_string import reverse

class ReverseStringTest(unittest.TestCase):
    def test_reverse_hello(self):
        self.assertEqual(reverse("hello"), "olleh")

    def test_reverse_world(self):
        self.assertEqual(reverse("world"), "dlrow")

    def test_reverse_empty_string(self):
        self.assertEqual(reverse(""), "")

    def test_reverse_single_character(self):
        self.assertEqual(reverse("a"), "a")
```
176+
177+
**Go example** (`reverse_string_test.go`):
178+
179+
```go
180+
package reversestring
181+
182+
import "testing"
183+
184+
func TestReverse(t *testing.T) {
185+
tests := []struct {
186+
input string
187+
expected string
188+
}{
189+
{"hello", "olleh"},
190+
{"world", "dlrow"},
191+
{"", ""},
192+
{"a", "a"},
193+
}
194+
195+
for _, test := range tests {
196+
result := Reverse(test.input)
197+
if result != test.expected {
198+
t.Errorf("Reverse(%q) = %q, expected %q", test.input, result, test.expected)
199+
}
200+
}
201+
}
202+
```
203+
204+
### Step 6: Add Language-Specific Configuration
205+
206+
**For Go exercises**, create `go.mod`:
207+
208+
```go
209+
module reverse-string
210+
211+
go 1.18
212+
```
213+
214+
**For Python exercises**, ensure the parent directory has `pyproject.toml`:
215+
216+
```toml
217+
[project]
218+
name = "python-exercises"
219+
version = "0.1.0"
220+
description = "Python exercises for Roo Code evals"
221+
requires-python = ">=3.9"
222+
dependencies = [
223+
"pytest>=8.3.5",
224+
]
225+
```
226+
227+
### Step 7: Test Locally
228+
229+
Before committing, test your exercise locally:
230+
231+
**Python**:
232+
233+
```bash
234+
cd python/reverse-string
235+
uv run python3 -m pytest -o markers=task reverse_string_test.py
236+
```
237+
238+
**Go**:
239+
240+
```bash
241+
cd go/reverse-string
242+
go test
243+
```
244+
245+
The tests should **fail** with the stub implementation and **pass** when properly implemented.
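
The fail-then-pass check can also be demonstrated in a single file. This is a minimal sketch under a few assumptions: `stub_reverse`, `solved_reverse`, `make_suite`, and `all_pass` are illustrative names, only two of the Step 5 test cases are repeated, and the slicing solution is just one possible implementation, not an official reference answer.

```python
import unittest


def stub_reverse(text):
    """Same shape as the Step 4 stub: returns None, so assertions fail."""
    pass


def solved_reverse(text):
    """One possible solution: slice the string with a step of -1."""
    return text[::-1]


def make_suite(reverse):
    """Build a small test suite bound to the given implementation."""

    class ReverseStringTest(unittest.TestCase):
        def test_reverse_hello(self):
            self.assertEqual(reverse("hello"), "olleh")

        def test_reverse_empty_string(self):
            self.assertEqual(reverse(""), "")

    return unittest.defaultTestLoader.loadTestsFromTestCase(ReverseStringTest)


def all_pass(reverse):
    """Run the suite and report whether every test succeeded."""
    result = unittest.TestResult()
    make_suite(reverse).run(result)
    return result.wasSuccessful()
```

Here `all_pass(stub_reverse)` is false and `all_pass(solved_reverse)` is true, which is the behavior a well-formed exercise should show: a stub that passes its own tests gives the eval nothing to measure.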

## Adding Support for New Programming Languages

Adding a new programming language requires changes to both the evals repository and the main Roo Code repository.

### Step 1: Update Language Configuration

1. **Add language to supported list** in [`packages/evals/src/exercises/index.ts`](./src/exercises/index.ts):

   ```typescript
   export const exerciseLanguages = [
       "go",
       "java",
       "javascript",
       "python",
       "rust",
       "your-new-language", // Add here
   ] as const
   ```

### Step 2: Create Language-Specific Prompt

Create `prompts/{language}.md` in the evals repository:

```markdown
Your job is to complete a coding exercise described in the markdown files inside the `docs` directory.

A file with the implementation stubbed out has been created for you, along with a test file (the tests should be failing initially).

To successfully complete the exercise, you must pass all the tests in the test file.

To confirm that your solution is correct, run the tests with `{test-command}`. Do not alter the test file; it should be run as-is.

Do not use the "ask_followup_question" tool. Your job isn't done until the tests pass. Don't attempt completion until you run the tests and they pass.

You should start by reading the files in the `docs` directory so that you understand the exercise, and then examine the stubbed out implementation and the test file.
```

Replace `{test-command}` with the appropriate testing command for your language.
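
If you maintain prompts for several languages, the substitution can be scripted rather than edited by hand. The sketch below is an assumption-laden illustration: `TEST_COMMANDS` and `render_prompt` are hypothetical names, and the mapping lists only the two commands named earlier in this guide (`go test` and `cargo test`).

```python
# Hypothetical mapping for illustration; extend it with your own language.
TEST_COMMANDS = {
    "go": "go test",
    "rust": "cargo test",
}


def render_prompt(template, language):
    """Fill the {test-command} placeholder for the given language."""
    try:
        command = TEST_COMMANDS[language]
    except KeyError:
        # Fail loudly rather than leave the placeholder in the prompt.
        raise ValueError(f"no test command registered for {language!r}") from None
    return template.replace("{test-command}", command)
```

Raising on an unknown language is a deliberate choice here: a prompt that still contains the literal `{test-command}` placeholder would tell the AI to run a command that does not exist.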

### Step 3: Update Docker Configuration

Modify [`packages/evals/Dockerfile.runner`](./Dockerfile.runner) to install the new language runtime:

```dockerfile
# Install your new language runtime
RUN apt update && apt install -y your-language-runtime

# Or for languages that need special installation:
ARG YOUR_LANGUAGE_VERSION=1.0.0
RUN curl -sSL https://install-your-language.sh | sh -s -- --version ${YOUR_LANGUAGE_VERSION}
```

### Step 4: Update Test Runner Integration

If your language requires special test execution, update [`packages/evals/src/cli/runUnitTest.ts`](./src/cli/runUnitTest.ts) to handle the new language's testing framework.
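
The contents of `runUnitTest.ts` are not shown in this guide, so the following is not its implementation. It is a minimal Python sketch of the general idea a test runner has to cover for any language: run the test command in the exercise directory, enforce a timeout, and treat a zero exit code as a pass. The function name `run_unit_tests` and the 120-second default are assumptions made for illustration.

```python
import subprocess


def run_unit_tests(command, cwd, timeout_seconds=120):
    """Run a test command (list of arguments) in cwd.

    Returns True only if the command exits with code 0 before the timeout.
    """
    try:
        completed = subprocess.run(
            command,
            cwd=cwd,
            capture_output=True,  # keep test output out of the caller's console
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # A hung test run counts as a failure.
        return False
    return completed.returncode == 0
```

A call such as `run_unit_tests(["go", "test"], cwd="go/reverse-string")` would then report whether the exercise's tests pass. A new language needs special handling mainly when its test tool does not signal failure through the exit code.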

### Step 5: Create Initial Exercises

Create at least two or three exercises for the new language, following the structure described in [Adding Exercises to Existing Languages](#adding-exercises-to-existing-languages).
