Nerve provides an evaluation mode that allows you to test your agent's performance against a set of predefined test cases. This is useful for:

- Validating agent behavior during development
- Regression testing after making changes
- Benchmarking different models
- Collecting metrics on agent performance
An evaluation consists of an agent and a corresponding set of test cases. These cases can be defined in a `cases.yml` file, stored in a `cases.parquet` file, or organized as individual entries within separate folders.
Regardless of how you organize the evaluation cases, the agent will be executed for each one, with a specified number of runs per case. Task completion data and runtime statistics will be collected and saved to an output file.
## YAML

You can place a `cases.yml` file in the agent folder with the different test cases. For instance, this is used in the [ab evaluation](https://github.com/evilsocket/eval-ab), where the evaluation cases look like:
```yaml
- level1:
    program: "A# #A"
- level2:
    program: "A# #B B# #A"
# ... and so on
```
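Each case is a named entry whose fields become template variables. A minimal sketch of how such a `cases.yml` parses, using PyYAML (the actual loading is handled by Nerve internally):

```python
import yaml

CASES_YML = """
- level1:
    program: "A# #A"
- level2:
    program: "A# #B B# #A"
"""

# Each list item maps the case name to its fields; the fields
# (here "program") are later interpolated into the agent prompt.
cases = yaml.safe_load(CASES_YML)
print(cases[0]["level1"]["program"])  # -> A# #A
```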
These cases are interpolated in the agent prompt:
```yaml
task: >
  ## Problem

  Now, consider the following program:

  {{ program }}

  Fully compute it, step by step and then submit the final result.
```
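The `{{ program }}` placeholder is Jinja-style templating. A minimal sketch of the substitution using the `jinja2` library (the rendering itself happens inside Nerve; this only illustrates the mechanism):

```python
from jinja2 import Template

# A simplified version of the agent's task template; the case's
# fields are passed as template variables.
template = Template(
    "Now, consider the following program:\n\n"
    "{{ program }}\n\n"
    "Fully compute it, step by step and then submit the final result."
)

prompt = template.render(program="A# #B B# #A")
print(prompt)
```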
## Parquet
For more complex test suites, you can use a `cases.parquet` file. An example of this is [this MMLU evaluation](https://github.com/evilsocket/eval-mmlu), which loads data from the [MMLU (dev) dataset](https://huggingface.co/datasets/cais/mmlu) and uses it in the agent prompt:
```yaml
task: >
  ## Question

  {{ question }}

  Use the select_choice tool to select the correct answer from this list of possible answers:

  {% for choice in choices %}
  - [{{ loop.index0 }}] {{ choice }}
  {% endfor %}
```
## Folders
You can also split your cases into a `cases` folder, as in [the regex evaluation](https://github.com/evilsocket/eval-regex), where each input file is stored as `cases/level0`, `cases/level1`, and so on, and [read at runtime](https://github.com/evilsocket/eval-regex/blob/main/tools.py#L11) by the tools.
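The runtime read can be as simple as opening the file for the requested level. A minimal sketch, where the `cases/levelN` layout mirrors the eval-regex repository but the function name and signature are illustrative, not the actual `tools.py` API:

```python
from pathlib import Path

def read_case(base_dir: Path, level: int) -> str:
    """Return the raw input for one case, e.g. cases/level0, cases/level1.

    base_dir and the level-N naming follow the eval-regex layout; the
    function itself is an illustrative sketch, not the real tools.py code.
    """
    return (Path(base_dir) / f"level{level}").read_text()
```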