diff --git a/packages/evals/ADDING-EVALS.md b/packages/evals/ADDING-EVALS.md
new file mode 100644
index 0000000000..b4df756db8
--- /dev/null
+++ b/packages/evals/ADDING-EVALS.md
@@ -0,0 +1,305 @@

# Adding Additional Evals Exercises

This guide explains how to add new coding exercises to the Roo Code evals system. The evals system is a distributed evaluation platform that runs AI coding tasks in isolated VS Code environments to test AI coding capabilities across multiple programming languages.

## Table of Contents

1. [What is an "Eval"?](#what-is-an-eval)
2. [System Overview](#system-overview)
3. [Adding Exercises to Existing Languages](#adding-exercises-to-existing-languages)
4. [Adding Support for New Programming Languages](#adding-support-for-new-programming-languages)

## What is an "Eval"?

An **eval** (evaluation) is a coding exercise with a known solution, expressed as a set of unit tests that must pass to prove a candidate solution correct. Each eval consists of:

- **Problem Description**: Clear instructions explaining what needs to be implemented
- **Implementation Stub**: A skeleton file with function signatures but no implementation
- **Unit Tests**: A comprehensive test suite that validates the correctness of the solution
- **Success Criteria**: The AI must implement the solution such that all unit tests pass

The key principle is that the tests define the contract: if all tests pass, the solution is considered correct. This provides an objective, automated way to measure AI coding performance across different programming languages and problem domains.

**Example Flow**:

1. AI receives a problem description (e.g., "implement a function that reverses a string")
2. AI examines the stub implementation and test file
3. AI writes code to make all tests pass
4. System runs tests to verify correctness
5. Success is measured by test pass/fail rate

## System Overview

The evals system consists of several key components:

- **Exercises Repository**: [`Roo-Code-Evals`](https://github.com/RooCodeInc/Roo-Code-Evals) - Contains all exercise definitions
- **Web Interface**: [`apps/web-evals`](../apps/web-evals) - Management interface for creating and monitoring evaluation runs
- **Evals Package**: [`packages/evals`](../packages/evals) - Contains both the controller logic for orchestrating evaluation runs and the runner container code for executing individual tasks
- **Docker Configuration**: Container definitions for the `controller` and `runner`, plus a Docker Compose file that provisions the Postgres and Redis instances required for eval runs

### Current Language Support

The system currently supports these programming languages:

- **Go** - `go test` for testing
- **Java** - Maven/Gradle for testing
- **JavaScript** - Node.js with Jest/Mocha
- **Python** - pytest for testing
- **Rust** - `cargo test` for testing

## Adding Exercises to Existing Languages

TL;DR - Here's a pull request that adds a new JavaScript eval: https://github.com/RooCodeInc/Roo-Code-Evals/pull/3

### Step 1: Understand the Exercise Structure

Each exercise follows a standardized directory structure:

```
/evals/{language}/{exercise-name}/
├── docs/
│   ├── instructions.md            # Main exercise description
│   └── instructions.append.md     # Additional instructions (optional)
├── {exercise-name}.{ext}          # Implementation stub
├── {exercise-name}_test.{ext}     # Test file
└── {language-specific-files}      # go.mod, package.json, etc.
```
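For example, the Python `reverse-string` exercise built up in the steps below would end up looking roughly like this (the shared `pyproject.toml` lives one level up, in the `python/` directory):

```
/evals/python/reverse-string/
├── docs/
│   └── instructions.md        # Problem description
├── reverse_string.py          # Implementation stub
└── reverse_string_test.py     # Test file
```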
### Step 2: Create Exercise Directory

1. **Clone the evals repository**:

    ```bash
    git clone https://github.com/RooCodeInc/Roo-Code-Evals.git evals
    cd evals
    ```

2. **Create exercise directory**:

    ```bash
    mkdir {language}/{exercise-name}
    cd {language}/{exercise-name}
    ```

### Step 3: Write Exercise Instructions

Create `docs/instructions.md` with a clear problem description:

```markdown
# Instructions

Create an implementation of [problem description].

## Problem Description

[Detailed explanation of what needs to be implemented]

## Examples

- Input: [example input]
- Output: [expected output]

## Constraints

- [Any constraints or requirements]
```

**Example from a simple reverse-string exercise**:

```markdown
# Instructions

Create a function that reverses a string.

## Problem Description

Write a function called `reverse` that takes a string as input and returns the string with its characters in reverse order.

## Examples

- Input: `reverse("hello")` → Output: `"olleh"`
- Input: `reverse("world")` → Output: `"dlrow"`
- Input: `reverse("")` → Output: `""`
- Input: `reverse("a")` → Output: `"a"`

## Constraints

- Input will always be a valid string
- Empty strings should return empty strings
```

### Step 4: Create Implementation Stub

Create the main implementation file with function signatures but no implementation:

**Python example** (`reverse_string.py`):

```python
def reverse(text):
    pass
```

**Go example** (`reverse_string.go`):

```go
package reversestring

// Reverse returns the input string with its characters in reverse order
func Reverse(s string) string {
	// TODO: implement
	return ""
}
```

### Step 5: Write Comprehensive Tests

Create test files that validate the implementation:

**Python example** (`reverse_string_test.py`):

```python
import unittest
from reverse_string import reverse

class ReverseStringTest(unittest.TestCase):
    def test_reverse_hello(self):
        self.assertEqual(reverse("hello"), "olleh")

    def test_reverse_world(self):
        self.assertEqual(reverse("world"), "dlrow")

    def test_reverse_empty_string(self):
        self.assertEqual(reverse(""), "")

    def test_reverse_single_character(self):
        self.assertEqual(reverse("a"), "a")
```

**Go example** (`reverse_string_test.go`):

```go
package reversestring

import "testing"

func TestReverse(t *testing.T) {
	tests := []struct {
		input    string
		expected string
	}{
		{"hello", "olleh"},
		{"world", "dlrow"},
		{"", ""},
		{"a", "a"},
	}

	for _, test := range tests {
		result := Reverse(test.input)
		if result != test.expected {
			t.Errorf("Reverse(%q) = %q, expected %q", test.input, result, test.expected)
		}
	}
}
```

### Step 6: Add Language-Specific Configuration

**For Go exercises**, create `go.mod`:

```go
module reverse-string

go 1.18
```

**For Python exercises**, ensure the parent directory has `pyproject.toml`:

```toml
[project]
name = "python-exercises"
version = "0.1.0"
description = "Python exercises for Roo Code evals"
requires-python = ">=3.9"
dependencies = [
    "pytest>=8.3.5",
]
```

### Step 7: Test Locally

Before committing, test your exercise locally:

**Python**:

```bash
cd python/reverse-string
uv run python3 -m pytest -o markers=task reverse_string_test.py
```

**Go**:

```bash
cd go/reverse-string
go test
```

The tests should **fail** with the stub implementation and **pass** when properly implemented.
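For reference, a minimal sketch of a passing implementation for the Python example above (use it only to verify your tests locally; the committed stub should stay unimplemented):

```python
def reverse(text):
    # Reverse the string using slice notation with a step of -1.
    return text[::-1]
```

With this body in place of `pass`, the pytest command above should report all four tests passing; reverting to the stub should make them fail again.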
## Adding Support for New Programming Languages

Adding a new programming language requires changes to both the evals repository and the main Roo Code repository.

### Step 1: Update Language Configuration

1. **Add the language to the supported list** in [`packages/evals/src/exercises/index.ts`](../packages/evals/src/exercises/index.ts):

```typescript
export const exerciseLanguages = [
	"go",
	"java",
	"javascript",
	"python",
	"rust",
	"your-new-language", // Add here
] as const
```

### Step 2: Create Language-Specific Prompt

Create `prompts/{language}.md` in the evals repository:

```markdown
Your job is to complete a coding exercise described in the markdown files inside the `docs` directory.

A file with the implementation stubbed out has been created for you, along with a test file (the tests should be failing initially).

To successfully complete the exercise, you must pass all the tests in the test file.

To confirm that your solution is correct, run the tests with `{test-command}`. Do not alter the test file; it should be run as-is.

Do not use the "ask_followup_question" tool. Your job isn't done until the tests pass. Don't attempt completion until you run the tests and they pass.

You should start by reading the files in the `docs` directory so that you understand the exercise, and then examine the stubbed-out implementation and the test file.
```

Replace `{test-command}` with the appropriate testing command for your language.

### Step 3: Update Docker Configuration

Modify [`packages/evals/Dockerfile.runner`](../packages/evals/Dockerfile.runner) to install the new language runtime:

```dockerfile
# Install your new language runtime
RUN apt update && apt install -y your-language-runtime

# Or for languages that need special installation:
ARG YOUR_LANGUAGE_VERSION=1.0.0
RUN curl -sSL https://install-your-language.sh | sh -s -- --version ${YOUR_LANGUAGE_VERSION}
```

### Step 4: Update Test Runner Integration

If your language requires special test execution, update [`packages/evals/src/cli/runUnitTest.ts`](../packages/evals/src/cli/runUnitTest.ts) to handle the new language's testing framework. A rough sketch of the kind of mapping involved is shown at the end of this guide.

### Step 5: Create Initial Exercises

Create at least two or three exercises for the new language, following the structure described in the previous section.
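To illustrate the kind of change Step 4 refers to, here is a purely hypothetical sketch of a language-to-test-command mapping; the names `testCommands` and `getTestCommand` are illustrative and do not reflect the actual structure of `runUnitTest.ts`:

```typescript
// Hypothetical sketch only; the real runUnitTest.ts may be organized differently.
// Maps each supported language to the shell command used to run an exercise's tests.
const testCommands: Record<string, string> = {
	go: "go test",
	python: "uv run python3 -m pytest -o markers=task", // the runner appends the exercise's test file
	rust: "cargo test",
	"your-new-language": "your-test-runner test", // add the command for your new language here
}

export function getTestCommand(language: string): string {
	const command = testCommands[language]

	if (!command) {
		throw new Error(`Unsupported language: ${language}`)
	}

	return command
}
```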