# Adding Additional Evals Exercises

This guide explains how to add new coding exercises to the Roo Code evals system. The evals system is a distributed evaluation platform that runs AI coding tasks in isolated VS Code environments to test AI coding capabilities across multiple programming languages.

## Table of Contents

1. [What is an "Eval"?](#what-is-an-eval)
2. [System Overview](#system-overview)
3. [Adding Exercises to Existing Languages](#adding-exercises-to-existing-languages)
4. [Adding Support for New Programming Languages](#adding-support-for-new-programming-languages)

## What is an "Eval"?

An **eval** (evaluation) is fundamentally a coding exercise whose expected behavior is expressed as a set of unit tests that a solution must pass to prove its correctness. Each eval consists of:

- **Problem Description**: Clear instructions explaining what needs to be implemented
- **Implementation Stub**: A skeleton file with function signatures but no implementation
- **Unit Tests**: A comprehensive test suite that validates the correctness of the solution
- **Success Criteria**: The AI must implement the solution such that all unit tests pass

The key principle is that the tests define the contract: if all tests pass, the solution is considered correct. This provides an objective, automated way to measure AI coding performance across different programming languages and problem domains.

**Example Flow**:

1. The AI receives a problem description (e.g., "implement a function that reverses a string")
2. The AI examines the stub implementation and the test file
3. The AI writes code to make all tests pass
4. The system runs the tests to verify correctness
5. Success is measured by the test pass/fail rate

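To make the contract concrete, here is a purely illustrative sketch that squeezes a stub and its tests into a single TypeScript file using Node's built-in test runner (real exercises keep the stub and the tests in separate files and use one of the supported languages described below; Steps 4-5 show actual examples):

```typescript
import { test } from "node:test"
import assert from "node:assert/strict"

// Implementation stub: the signature exists, the behavior does not.
export function reverse(text: string): string {
	throw new Error("not implemented")
}

// Unit tests: the exercise is solved exactly when all of these pass.
test("reverses a word", () => {
	assert.equal(reverse("hello"), "olleh")
})

test("handles the empty string", () => {
	assert.equal(reverse(""), "")
})
```

Run against the stub, both tests fail; once `reverse` is implemented they pass, and that pass/fail outcome is the signal the evals system records.
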
## System Overview

The evals system consists of several key components:

- **Exercises Repository**: [`Roo-Code-Evals`](https://github.com/RooCodeInc/Roo-Code-Evals) - Contains all exercise definitions
- **Web Interface**: [`apps/web-evals`](../apps/web-evals) - Management interface for creating and monitoring evaluation runs
- **Evals Package**: [`packages/evals`](../packages/evals) - Contains both the controller logic for orchestrating evaluation runs and the runner container code for executing individual tasks
- **Docker Configuration**: Container definitions for the `controller` and `runner`, as well as a Docker Compose file that provisions the Postgres and Redis instances required for eval runs

### Current Language Support

The system currently supports these programming languages:

- **Go** - `go test` for testing
- **Java** - Maven/Gradle for testing
- **JavaScript** - Node.js with Jest/Mocha
- **Python** - pytest for testing
- **Rust** - `cargo test` for testing

## Adding Exercises to Existing Languages

TL;DR - Here's a pull request that adds a new JavaScript eval: https://github.com/RooCodeInc/Roo-Code-Evals/pull/3

### Step 1: Understand the Exercise Structure

Each exercise follows a standardized directory structure:

```
/evals/{language}/{exercise-name}/
├── docs/
│   ├── instructions.md          # Main exercise description
│   └── instructions.append.md   # Additional instructions (optional)
├── {exercise-name}.{ext}        # Implementation stub
├── {exercise-name}_test.{ext}   # Test file
└── {language-specific-files}    # go.mod, package.json, etc.
```

### Step 2: Create Exercise Directory

1. **Clone the evals repository**:

```bash
git clone https://github.com/RooCodeInc/Roo-Code-Evals.git evals
cd evals
```

2. **Create the exercise directory**:

```bash
mkdir {language}/{exercise-name}
cd {language}/{exercise-name}
```

### Step 3: Write Exercise Instructions

Create `docs/instructions.md` with a clear problem description:

```markdown
# Instructions

Create an implementation of [problem description].

## Problem Description

[Detailed explanation of what needs to be implemented]

## Examples

- Input: [example input]
- Output: [expected output]

## Constraints

- [Any constraints or requirements]
```

**Example from a simple reverse-string exercise**:

```markdown
# Instructions

Create a function that reverses a string.

## Problem Description

Write a function called `reverse` that takes a string as input and returns the string with its characters in reverse order.

## Examples

- Input: `reverse("hello")` → Output: `"olleh"`
- Input: `reverse("world")` → Output: `"dlrow"`
- Input: `reverse("")` → Output: `""`
- Input: `reverse("a")` → Output: `"a"`

## Constraints

- Input will always be a valid string
- Empty strings should return empty strings
```

### Step 4: Create Implementation Stub

Create the main implementation file with function signatures but no implementation:

**Python example** (`reverse_string.py`):

```python
def reverse(text):
    pass
```

**Go example** (`reverse_string.go`):

```go
package reversestring

// Reverse returns the input string with its characters in reverse order
func Reverse(s string) string {
	// TODO: implement
	return ""
}
```

### Step 5: Write Comprehensive Tests

Create test files that validate the implementation:

**Python example** (`reverse_string_test.py`):

```python
import unittest
from reverse_string import reverse


class ReverseStringTest(unittest.TestCase):
    def test_reverse_hello(self):
        self.assertEqual(reverse("hello"), "olleh")

    def test_reverse_world(self):
        self.assertEqual(reverse("world"), "dlrow")

    def test_reverse_empty_string(self):
        self.assertEqual(reverse(""), "")

    def test_reverse_single_character(self):
        self.assertEqual(reverse("a"), "a")
```

**Go example** (`reverse_string_test.go`):

```go
package reversestring

import "testing"

func TestReverse(t *testing.T) {
	tests := []struct {
		input    string
		expected string
	}{
		{"hello", "olleh"},
		{"world", "dlrow"},
		{"", ""},
		{"a", "a"},
	}

	for _, test := range tests {
		result := Reverse(test.input)
		if result != test.expected {
			t.Errorf("Reverse(%q) = %q, expected %q", test.input, result, test.expected)
		}
	}
}
```

### Step 6: Add Language-Specific Configuration

**For Go exercises**, create `go.mod`:

```go
module reverse-string

go 1.18
```

**For Python exercises**, ensure the parent directory has `pyproject.toml`:

```toml
[project]
name = "python-exercises"
version = "0.1.0"
description = "Python exercises for Roo Code evals"
requires-python = ">=3.9"
dependencies = [
    "pytest>=8.3.5",
]
```

### Step 7: Test Locally

Before committing, test your exercise locally:

**Python**:

```bash
cd python/reverse-string
uv run python3 -m pytest -o markers=task reverse_string_test.py
```

**Go**:

```bash
cd go/reverse-string
go test
```

The tests should **fail** with the stub implementation and **pass** when properly implemented.

## Adding Support for New Programming Languages

Adding a new programming language requires changes to both the evals repository and the main Roo Code repository.

### Step 1: Update Language Configuration

1. **Add language to supported list** in [`packages/evals/src/exercises/index.ts`](../packages/evals/src/exercises/index.ts):

```typescript
export const exerciseLanguages = [
	"go",
	"java",
	"javascript",
	"python",
	"rust",
	"your-new-language", // Add here
] as const
```
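The `as const` is what lets the rest of the package treat this array as a closed set of literal strings. If `index.ts` derives a union type from it (an assumption for illustration; the type name below is hypothetical), adding your language to the array automatically widens that type everywhere it is used:

```typescript
// Hypothetical illustration: a union type derived from the `as const` array above.
// Adding "your-new-language" to the array extends this type automatically.
export type ExerciseLanguage = (typeof exerciseLanguages)[number]
// => "go" | "java" | "javascript" | "python" | "rust" | "your-new-language"
```
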
### Step 2: Create Language-Specific Prompt

Create `prompts/{language}.md` in the evals repository:

```markdown
Your job is to complete a coding exercise described in the markdown files inside the `docs` directory.

A file with the implementation stubbed out has been created for you, along with a test file (the tests should be failing initially).

To successfully complete the exercise, you must pass all the tests in the test file.

To confirm that your solution is correct, run the tests with `{test-command}`. Do not alter the test file; it should be run as-is.

Do not use the "ask_followup_question" tool. Your job isn't done until the tests pass. Don't attempt completion until you run the tests and they pass.

You should start by reading the files in the `docs` directory so that you understand the exercise, and then examine the stubbed out implementation and the test file.
```

Replace `{test-command}` with the appropriate testing command for your language.

### Step 3: Update Docker Configuration

Modify [`packages/evals/Dockerfile.runner`](../packages/evals/Dockerfile.runner) to install the new language runtime:

```dockerfile
# Install your new language runtime
RUN apt update && apt install -y your-language-runtime

# Or for languages that need special installation:
ARG YOUR_LANGUAGE_VERSION=1.0.0
RUN curl -sSL https://install-your-language.sh | sh -s -- --version ${YOUR_LANGUAGE_VERSION}
```

### Step 4: Update Test Runner Integration

If your language requires special test execution, update [`packages/evals/src/cli/runUnitTest.ts`](../packages/evals/src/cli/runUnitTest.ts) to handle the new language's testing framework.

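The details of that file are specific to this repository, but the shape of the change is a mapping from a language to the command used to run its tests. Here is a minimal hedged sketch of that idea; the type and function names are illustrative, the Java and JavaScript commands are guesses, and none of this is the actual contents of `runUnitTest.ts`:

```typescript
// Hypothetical sketch only - not the real runUnitTest.ts. It illustrates the
// kind of per-language dispatch a test runner needs.

type Language = "go" | "java" | "javascript" | "python" | "rust"

interface TestCommand {
	command: string
	args: string[]
}

// One entry per supported language. A new language adds one entry here,
// ideally matching the `{test-command}` given in its prompt file.
const testCommands: Record<Language, TestCommand> = {
	go: { command: "go", args: ["test"] },
	java: { command: "./gradlew", args: ["test"] }, // guess; could equally be Maven
	javascript: { command: "npm", args: ["test"] }, // guess; depends on the exercise setup
	python: { command: "uv", args: ["run", "python3", "-m", "pytest", "-o", "markers=task"] },
	rust: { command: "cargo", args: ["test"] },
}

export function testCommandFor(language: Language): TestCommand {
	return testCommands[language]
}
```

Whatever form the real integration takes, the command it runs should agree with the `{test-command}` you put in the language's prompt file, so the AI and the runner verify the exercise the same way.
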
### Step 5: Create Initial Exercises

Create at least 2-3 exercises for the new language following the structure described in the previous section.