# Adding Additional Evals Exercises

This guide explains how to add new coding exercises to the Roo Code evals system. The evals system is a distributed evaluation platform that runs AI coding tasks in isolated VS Code environments to test AI coding capabilities across multiple programming languages.

## Table of Contents

1. [What is an "Eval"?](#what-is-an-eval)
2. [System Overview](#system-overview)
3. [Adding Exercises to Existing Languages](#adding-exercises-to-existing-languages)
4. [Adding Support for New Programming Languages](#adding-support-for-new-programming-languages)

## What is an "Eval"?

An **eval** (evaluation) is, at its core, a coding exercise whose correct solution is defined by a set of unit tests: a submission is correct exactly when all of the tests pass. Each eval consists of:

- **Problem Description**: Clear instructions explaining what needs to be implemented
- **Implementation Stub**: A skeleton file with function signatures but no implementation
- **Unit Tests**: Comprehensive test suite that validates the correctness of the solution
- **Success Criteria**: The AI must implement the solution such that all unit tests pass

The key principle is that the tests define the contract: if all tests pass, the solution is considered correct. This provides an objective, automated way to measure AI coding performance across different programming languages and problem domains.

**Example Flow**:

1. AI receives a problem description (e.g., "implement a function that reverses a string")
2. AI examines the stub implementation and test file
3. AI writes code to make all tests pass
4. System runs tests to verify correctness
5. Success is measured by test pass/fail rate
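
In code terms, this flow reduces to a small contract. Here is a minimal TypeScript sketch of that loop; both helper function types are hypothetical stand-ins for the real harness internals, which run tasks in isolated VS Code containers:

```typescript
// Hypothetical sketch of the eval contract, not the actual harness API.
type RunAiTask = (exercise: string) => Promise<void>
type RunUnitTests = (exercise: string) => Promise<boolean>

// Steps 1-3: the AI reads the docs and fills in the stub.
// Steps 4-5: the test suite's pass/fail result is the score.
async function evaluate(
	exercise: string,
	runAiTask: RunAiTask,
	runUnitTests: RunUnitTests,
): Promise<boolean> {
	await runAiTask(exercise)
	return runUnitTests(exercise)
}
```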

## System Overview

The evals system consists of several key components:

- **Exercises Repository**: [`Roo-Code-Evals`](https://github.com/RooCodeInc/Roo-Code-Evals) - Contains all exercise definitions
- **Web Interface**: [`apps/web-evals`](../apps/web-evals) - Management interface for creating and monitoring evaluation runs
- **Evals Package**: [`packages/evals`](../packages/evals) - Contains both controller logic for orchestrating evaluation runs and runner container code for executing individual tasks
- **Docker Configuration**: Container definitions for the `controller` and `runner` as well as a Docker Compose file that provisions Postgres and Redis instances required for eval runs.

### Current Language Support

The system currently supports these programming languages:

- **Go** - `go test` for testing
- **Java** - Maven/Gradle for testing
- **JavaScript** - Node.js with Jest/Mocha
- **Python** - pytest for testing
- **Rust** - `cargo test` for testing

## Adding Exercises to Existing Languages

TL;DR - Here's a pull request that adds a new JavaScript eval: https://github.com/RooCodeInc/Roo-Code-Evals/pull/3

### Step 1: Understand the Exercise Structure

Each exercise follows a standardized directory structure:

```
/evals/{language}/{exercise-name}/
├── docs/
│   ├── instructions.md              # Main exercise description
│   └── instructions.append.md       # Additional instructions (optional)
├── {exercise-name}.{ext}            # Implementation stub
├── {exercise-name}_test.{ext}       # Test file
└── {language-specific-files}        # go.mod, package.json, etc.
```
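
For instance, the Python reverse-string exercise used as the running example below would be laid out like this (with `pyproject.toml` living in the parent `python/` directory, as described in Step 6):

```
/evals/python/reverse-string/
├── docs/
│   └── instructions.md
├── reverse_string.py
└── reverse_string_test.py
```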

### Step 2: Create Exercise Directory

1. **Clone the evals repository**:

```bash
git clone https://github.com/RooCodeInc/Roo-Code-Evals.git evals
cd evals
```

2. **Create exercise directory**:
```bash
mkdir {language}/{exercise-name}
cd {language}/{exercise-name}
```

### Step 3: Write Exercise Instructions

Create `docs/instructions.md` with a clear problem description:

```markdown
# Instructions

Create an implementation of [problem description].

## Problem Description

[Detailed explanation of what needs to be implemented]

## Examples

- Input: [example input]
- Output: [expected output]

## Constraints

- [Any constraints or requirements]
```

**Example from a simple reverse-string exercise**:

```markdown
# Instructions

Create a function that reverses a string.

## Problem Description

Write a function called `reverse` that takes a string as input and returns the string with its characters in reverse order.

## Examples

- Input: `reverse("hello")` → Output: `"olleh"`
- Input: `reverse("world")` → Output: `"dlrow"`
- Input: `reverse("")` → Output: `""`
- Input: `reverse("a")` → Output: `"a"`

## Constraints

- Input will always be a valid string
- Empty strings should return empty strings
```

### Step 4: Create Implementation Stub

Create the main implementation file with function signatures but no implementation:

**Python example** (`reverse_string.py`):

```python
def reverse(text):
    pass
```

**Go example** (`reverse_string.go`):

```go
package reversestring

// Reverse returns the input string with its characters in reverse order
func Reverse(s string) string {
	// TODO: implement
	return ""
}
```

### Step 5: Write Comprehensive Tests

Create test files that validate the implementation:

**Python example** (`reverse_string_test.py`):

```python
import unittest
from reverse_string import reverse

class ReverseStringTest(unittest.TestCase):
    def test_reverse_hello(self):
        self.assertEqual(reverse("hello"), "olleh")

    def test_reverse_world(self):
        self.assertEqual(reverse("world"), "dlrow")

    def test_reverse_empty_string(self):
        self.assertEqual(reverse(""), "")

    def test_reverse_single_character(self):
        self.assertEqual(reverse("a"), "a")
```

**Go example** (`reverse_string_test.go`):

```go
package reversestring

import "testing"

func TestReverse(t *testing.T) {
	tests := []struct {
		input    string
		expected string
	}{
		{"hello", "olleh"},
		{"world", "dlrow"},
		{"", ""},
		{"a", "a"},
	}

	for _, test := range tests {
		result := Reverse(test.input)
		if result != test.expected {
			t.Errorf("Reverse(%q) = %q, expected %q", test.input, result, test.expected)
		}
	}
}
```

### Step 6: Add Language-Specific Configuration

**For Go exercises**, create `go.mod`:

```go
module reverse-string

go 1.18
```

**For Python exercises**, ensure the parent directory has `pyproject.toml`:

```toml
[project]
name = "python-exercises"
version = "0.1.0"
description = "Python exercises for Roo Code evals"
requires-python = ">=3.9"
dependencies = [
"pytest>=8.3.5",
]
```

### Step 7: Test Locally

Before committing, test your exercise locally:

**Python**:

```bash
cd python/reverse-string
uv run python3 -m pytest -o markers=task reverse_string_test.py
```

**Go**:

```bash
cd go/reverse-string
go test
```

The tests should **fail** with the stub implementation and **pass** once the exercise is properly implemented. A quick sanity check is to write a reference solution locally, confirm the full suite passes, and then restore the stub before committing.

## Adding Support for New Programming Languages

Adding a new programming language requires changes to both the evals repository and the main Roo Code repository.

### Step 1: Update Language Configuration

1. **Add language to supported list** in [`packages/evals/src/exercises/index.ts`](../packages/evals/src/exercises/index.ts):

```typescript
export const exerciseLanguages = [
"go",
"java",
"javascript",
"python",
"rust",
"your-new-language", // Add here
] as const
```
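
If the file derives a union type from this array (a common TypeScript pattern; the actual file may organize this differently), the rest of the codebase picks up the new language automatically:

```typescript
// Union of all supported language strings, derived from the const array above.
export type ExerciseLanguage = (typeof exerciseLanguages)[number]
```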

### Step 2: Create Language-Specific Prompt

Create `prompts/{language}.md` in the evals repository:

```markdown
Your job is to complete a coding exercise described in the markdown files inside the `docs` directory.

A file with the implementation stubbed out has been created for you, along with a test file (the tests should be failing initially).

To successfully complete the exercise, you must pass all the tests in the test file.

To confirm that your solution is correct, run the tests with `{test-command}`. Do not alter the test file; it should be run as-is.

Do not use the "ask_followup_question" tool. Your job isn't done until the tests pass. Don't attempt completion until you run the tests and they pass.

You should start by reading the files in the `docs` directory so that you understand the exercise, and then examine the stubbed out implementation and the test file.
```

Replace `{test-command}` with the appropriate test command for your language (for example, `go test` for Go or `cargo test` for Rust).

### Step 3: Update Docker Configuration

Modify [`packages/evals/Dockerfile.runner`](../packages/evals/Dockerfile.runner) to install the new language runtime:

```dockerfile
# Install your new language runtime
RUN apt update && apt install -y your-language-runtime

# Or for languages that need special installation:
ARG YOUR_LANGUAGE_VERSION=1.0.0
RUN curl -sSL https://install-your-language.sh | sh -s -- --version ${YOUR_LANGUAGE_VERSION}
```

### Step 4: Update Test Runner Integration

If your language requires special test execution, update [`packages/evals/src/cli/runUnitTest.ts`](../packages/evals/src/cli/runUnitTest.ts) to handle the new language's testing framework.
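
As a rough illustration of what that involves: the runner ultimately needs to map a language to the command that executes its test suite, and report pass/fail based on the exit code. The sketch below is an assumption about the shape of that logic, not the actual contents of `runUnitTest.ts`; the commands shown come from the languages documented above.

```typescript
import { execFile } from "node:child_process"
import { promisify } from "node:util"

const execFileAsync = promisify(execFile)

// Illustrative mapping only; the real runner may be structured differently.
const testCommands: Record<string, { cmd: string; args: string[] }> = {
	go: { cmd: "go", args: ["test"] },
	rust: { cmd: "cargo", args: ["test"] },
	python: { cmd: "uv", args: ["run", "python3", "-m", "pytest", "-o", "markers=task"] },
	// "your-new-language": { cmd: "your-test-runner", args: [] },
}

export async function runUnitTests(language: string, exerciseDir: string): Promise<boolean> {
	const entry = testCommands[language]
	if (!entry) throw new Error(`No test command configured for language: ${language}`)

	try {
		// A zero exit code means every test passed.
		await execFileAsync(entry.cmd, entry.args, { cwd: exerciseDir })
		return true
	} catch {
		// A non-zero exit code means at least one test failed.
		return false
	}
}
```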

### Step 5: Create Initial Exercises

Create at least 2-3 exercises for the new language following the structure described in the previous section.