Document how to add evals (RooCodeInc#4470)

cte · web-flow · commit 2bf4347cc324 · 2025-06-09T14:26:19.000-07:00
diff --git a/packages/evals/ADDING-EVALS.md b/packages/evals/ADDING-EVALS.md
@@ -0,0 +1,305 @@
+# Adding Additional Evals Exercises
+
+This guide explains how to add new coding exercises to the Roo Code evals system. The evals system is a distributed evaluation platform that runs AI coding tasks in isolated VS Code environments to test AI coding capabilities across multiple programming languages.
+
+## Table of Contents
+
+1. [What is an "Eval"?](#what-is-an-eval)
+2. [System Overview](#system-overview)
+3. [Adding Exercises to Existing Languages](#adding-exercises-to-existing-languages)
+4. [Adding Support for New Programming Languages](#adding-support-for-new-programming-languages)
+
+## What is an "Eval"?
+
+An **eval** (evaluation) is fundamentally a coding exercise with a known solution that is expressed as a set of unit tests that must pass in order to prove the correctness of a solution. Each eval consists of:
+
+- **Problem Description**: Clear instructions explaining what needs to be implemented
+- **Implementation Stub**: A skeleton file with function signatures but no implementation
+- **Unit Tests**: Comprehensive test suite that validates the correctness of the solution
+- **Success Criteria**: The AI must implement the solution such that all unit tests pass
+
+The key principle is that the tests define the contract - if all tests pass, the solution is considered correct. This provides an objective, automated way to measure AI coding performance across different programming languages and problem domains.
+
+**Example Flow**:
+
+1. AI receives a problem description (e.g., "implement a function that reverses a string")
+2. AI examines the stub implementation and test file
+3. AI writes code to make all tests pass
+4. System runs tests to verify correctness
+5. Success is measured by test pass/fail rate
+
+## System Overview
+
+The evals system consists of several key components:
+
+- **Exercises Repository**: [`Roo-Code-Evals`](https://github.com/RooCodeInc/Roo-Code-Evals) - Contains all exercise definitions
+- **Web Interface**: [`apps/web-evals`](../apps/web-evals) - Management interface for creating and monitoring evaluation runs
+- **Evals Package**: [`packages/evals`](../packages/evals) - Contains both controller logic for orchestrating evaluation runs and runner container code for executing individual tasks
+- **Docker Configuration**: Container definitions for the `controller` and `runner` as well as a Docker Compose file that provisions Postgres and Redis instances required for eval runs.
+
+### Current Language Support
+
+The system currently supports these programming languages:
+
+- **Go** - `go test` for testing
+- **Java** - Maven/Gradle for testing
+- **JavaScript** - Node.js with Jest/Mocha
+- **Python** - pytest for testing
+- **Rust** - `cargo test` for testing
+
+## Adding Exercises to Existing Languages
+
+TL;DR - Here's a pull request that adds a new JavaScript eval: https://github.com/RooCodeInc/Roo-Code-Evals/pull/3
+
+### Step 1: Understand the Exercise Structure
+
+Each exercise follows a standardized directory structure:
+
+```
+/evals/{language}/{exercise-name}/
+├── docs/
+│   ├── instructions.md          # Main exercise description
+│   └── instructions.append.md   # Additional instructions (optional)
+├── {exercise-name}.{ext}        # Implementation stub
+├── {exercise-name}_test.{ext}   # Test file
+└── {language-specific-files}    # go.mod, package.json, etc.
+```
+
+### Step 2: Create Exercise Directory
+
+1. **Clone the evals repository**:
+
+    ```bash
+    git clone https://github.com/RooCodeInc/Roo-Code-Evals.git evals
+    cd evals
+    ```
+
+2. **Create exercise directory**:
+    ```bash
+    mkdir {language}/{exercise-name}
+    cd {language}/{exercise-name}
+    ```
+
+### Step 3: Write Exercise Instructions
+
+Create `docs/instructions.md` with a clear problem description:
+
+```markdown
+# Instructions
+
+Create an implementation of [problem description].
+
+## Problem Description
+
+[Detailed explanation of what needs to be implemented]
+
+## Examples
+
+- Input: [example input]
+- Output: [expected output]
+
+## Constraints
+
+- [Any constraints or requirements]
+```
+
+**Example from a simple reverse-string exercise**:
+
+```markdown
+# Instructions
+
+Create a function that reverses a string.
+
+## Problem Description
+
+Write a function called `reverse` that takes a string as input and returns the string with its characters in reverse order.
+
+## Examples
+
+- Input: `reverse("hello")` → Output: `"olleh"`
+- Input: `reverse("world")` → Output: `"dlrow"`
+- Input: `reverse("")` → Output: `""`
+- Input: `reverse("a")` → Output: `"a"`
+
+## Constraints
+
+- Input will always be a valid string
+- Empty strings should return empty strings
+```
+
+### Step 4: Create Implementation Stub
+
+Create the main implementation file with function signatures but no implementation:
+
+**Python example** (`reverse_string.py`):
+
+```python
+def reverse(text):
+    pass
+```
+
+**Go example** (`reverse_string.go`):
+
+```go
+package reversestring
+
+// Reverse returns the input string with its characters in reverse order
+func Reverse(s string) string {
+    // TODO: implement
+    return ""
+}
+```
+
+### Step 5: Write Comprehensive Tests
+
+Create test files that validate the implementation:
+
+**Python example** (`reverse_string_test.py`):
+
+```python
+import unittest
+from reverse_string import reverse
+
+class ReverseStringTest(unittest.TestCase):
+    def test_reverse_hello(self):
+        self.assertEqual(reverse("hello"), "olleh")
+
+    def test_reverse_world(self):
+        self.assertEqual(reverse("world"), "dlrow")
+
+    def test_reverse_empty_string(self):
+        self.assertEqual(reverse(""), "")
+
+    def test_reverse_single_character(self):
+        self.assertEqual(reverse("a"), "a")
+```
+
+**Go example** (`reverse_string_test.go`):
+
+```go
+package reversestring
+
+import "testing"
+
+func TestReverse(t *testing.T) {
+    tests := []struct {
+        input    string
+        expected string
+    }{
+        {"hello", "olleh"},
+        {"world", "dlrow"},
+        {"", ""},
+        {"a", "a"},
+    }
+
+    for _, test := range tests {
+        result := Reverse(test.input)
+        if result != test.expected {
+            t.Errorf("Reverse(%q) = %q, expected %q", test.input, result, test.expected)
+        }
+    }
+}
+```
+
+### Step 6: Add Language-Specific Configuration
+
+**For Go exercises**, create `go.mod`:
+
+```go
+module reverse-string
+
+go 1.18
+```
+
+**For Python exercises**, ensure the parent directory has `pyproject.toml`:
+
+```toml
+[project]
+name = "python-exercises"
+version = "0.1.0"
+description = "Python exercises for Roo Code evals"
+requires-python = ">=3.9"
+dependencies = [
+    "pytest>=8.3.5",
+]
+```
+
+### Step 7: Test Locally
+
+Before committing, test your exercise locally:
+
+**Python**:
+
+```bash
+cd python/reverse-string
+uv run python3 -m pytest -o markers=task reverse_string_test.py
+```
+
+**Go**:
+
+```bash
+cd go/reverse-string
+go test
+```
+
+The tests should **fail** with the stub implementation and **pass** when properly implemented.
+
+## Adding Support for New Programming Languages
+
+Adding a new programming language requires changes to both the evals repository and the main Roo Code repository.
+
+### Step 1: Update Language Configuration
+
+1. **Add language to supported list** in [`packages/evals/src/exercises/index.ts`](../packages/evals/src/exercises/index.ts):
+
+```typescript
+export const exerciseLanguages = [
+	"go",
+	"java",
+	"javascript",
+	"python",
+	"rust",
+	"your-new-language", // Add here
+] as const
+```
+
+### Step 2: Create Language-Specific Prompt
+
+Create `prompts/{language}.md` in the evals repository:
+
+```markdown
+Your job is to complete a coding exercise described the markdown files inside the `docs` directory.
+
+A file with the implementation stubbed out has been created for you, along with a test file (the tests should be failing initially).
+
+To successfully complete the exercise, you must pass all the tests in the test file.
+
+To confirm that your solution is correct, run the tests with `{test-command}`. Do not alter the test file; it should be run as-is.
+
+Do not use the "ask_followup_question" tool. Your job isn't done until the tests pass. Don't attempt completion until you run the tests and they pass.
+
+You should start by reading the files in the `docs` directory so that you understand the exercise, and then examine the stubbed out implementation and the test file.
+```
+
+Replace `{test-command}` with the appropriate testing command for your language.
+
+### Step 3: Update Docker Configuration
+
+Modify [`packages/evals/Dockerfile.runner`](../packages/evals/Dockerfile.runner) to install the new language runtime:
+
+```dockerfile
+# Install your new language runtime
+RUN apt update && apt install -y your-language-runtime
+
+# Or for languages that need special installation:
+ARG YOUR_LANGUAGE_VERSION=1.0.0
+RUN curl -sSL https://install-your-language.sh | sh -s -- --version ${YOUR_LANGUAGE_VERSION}
+```
+
+### Step 4: Update Test Runner Integration
+
+If your language requires special test execution, update [`packages/evals/src/cli/runUnitTest.ts`](../packages/evals/src/cli/runUnitTest.ts) to handle the new language's testing framework.
+
+### Step 5: Create Initial Exercises
+
+Create at least 2-3 exercises for the new language following the structure described in the previous section.