# Evaluation Mode

Nerve's **evaluation mode** makes benchmarking and validating agents easy, reproducible, and formalized.

> ⚡ Unlike most tools in the LLM ecosystem, Nerve offers a built-in framework to **test agents against structured cases**, log results, and compare performance across models. It introduces a standard formalism for agent evaluation that does not exist elsewhere.


## 🎯 Why Use It?
Evaluation mode is useful for:
- Verifying agent correctness during development
- Regression testing when updating prompts, tools, or models
- Comparing different model backends
- Collecting structured performance metrics


## 🧪 Running an Evaluation
An evaluation consists of an agent and a corresponding set of test cases. Run it with:
```bash
nerve eval path/to/evaluation --output results.json
```
The agent is executed for each case, optionally multiple times per case, and the results (task completion, runtime statistics, outputs) are saved to the output file.


## 🗂 Case Formats
Nerve supports three evaluation case formats:

### 1. `cases.yml`
For small test suites, place a `cases.yml` file in the agent folder. Example:
```yaml
- level1:
    program: "A# #A"
- level2:
    program: "A# #B B# #A"
```
Each case's variables are interpolated into the agent prompt:
```yaml
task: >
  Consider this program:

  {{ program }}

  Compute it step by step and submit the result.
```

This format is used in [eval-ab](https://github.com/evilsocket/eval-ab).

### 2. `cases.parquet`
For larger, more structured test suites you can use a `cases.parquet` file. For example, [eval-mmlu](https://github.com/evilsocket/eval-mmlu) loads the [MMLU (dev) dataset](https://huggingface.co/datasets/cais/mmlu) and uses it in the agent prompt:
```yaml
task: >
  ## Question

  {{ question }}

  Use the `select_choice` tool to pick the right answer:
  {% for choice in choices %}
  - [{{ loop.index0 }}] {{ choice }}
  {% endfor %}
```

This lets you use HuggingFace datasets (e.g. MMLU) directly.
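
If you build your own parquet-based evaluation, a small script can generate the cases file from a public dataset. The sketch below assumes that each row becomes one case and each column becomes a template variable (such as `question` and `choices`), and that the `datasets`, `pandas`, and `pyarrow` packages are installed; it is illustrative, not the exact script used by eval-mmlu.

```python
# Sketch: build a cases.parquet for an MMLU-style evaluation.
# Assumption: each parquet row is one case and each column is a template
# variable ({{ question }}, {{ choices }}, ...); check the eval-mmlu
# repository for the exact schema it expects.
from datasets import load_dataset  # pip install datasets pandas pyarrow

# Load the MMLU dev split from HuggingFace.
dataset = load_dataset("cais/mmlu", "all", split="dev")

# Keep only the columns referenced by the agent prompt template.
df = dataset.to_pandas()[["question", "choices", "answer"]]

# Write the cases file next to the agent definition.
df.to_parquet("cases.parquet", index=False)
```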

### 3. Folder-Based `cases/`
Alternatively, organize each case in its own folder inside a `cases/` directory:
```
cases/
  level0/
    input.txt
  level1/
    input.txt
```
This is useful when the inputs are [read at runtime](https://github.com/evilsocket/eval-regex/blob/main/tools.py#L11) by the agent's own tools, as in [eval-regex](https://github.com/evilsocket/eval-regex).
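
For folder-based evaluations, the inputs are typically loaded by a small helper in the evaluation's own `tools.py`. The sketch below is a generic illustration rather than the actual eval-regex implementation: the `case_dir` parameter and function name are hypothetical, and how the function is exposed as a tool depends on your agent configuration.

```python
from pathlib import Path


def read_case_input(case_dir: str) -> str:
    """Return the contents of a case's input.txt.

    `case_dir` is hypothetical: it stands for whatever mechanism your
    evaluation uses to point the tool at the current case folder
    (eval-regex resolves this in its own tools.py).
    """
    input_file = Path(case_dir) / "input.txt"
    return input_file.read_text().strip()
```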


## 🧪 Output
Results are written to a `.json` file with details such as the following (see the sketch after this list for one way to consume them):
- Case identifier
- Task outcome (success/failure)
- Runtime duration
- Agent/tool outputs
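
The exact JSON schema depends on your Nerve version, so inspect a generated `results.json` before relying on specific field names. As an illustration only (all field names below are hypothetical), a short script could summarize pass rates across cases:

```python
import json

# Illustrative only: "cases", "passed" and "duration" are hypothetical
# field names; check your own results.json for the actual schema.
with open("results.json") as f:
    results = json.load(f)

cases = results.get("cases", [])
passed = sum(1 for case in cases if case.get("passed"))
total_time = sum(case.get("duration", 0) for case in cases)

print(f"{passed}/{len(cases)} cases passed in {total_time:.1f}s")
```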


## 📎 Notes
- You can define multiple runs per case to collect more robust statistics
- Compatible with any agent setup (tools, MCP, workflows, etc.)
- All variables from each case are injected into the prompt via `{{ ... }}`


## 🧭 Related Docs
- [concepts.md](concepts.md#evaluation)
- [index.md](index.md): CLI usage
- [mcp.md](mcp.md): using remote agents or tools in an evaluation