We should spec out an easy Eval system for testing agents:
- Write tests using simple Yaml
- Input prompt and expected output
- Indicate expected tool calls
Like:

```yaml
name: Weather Agent Eval
evals:
  - it: should not know the time
    input: What time is it?
    eval_response: I don't have access to time information.
  - it: should know the weather report
    input: what is the weather in Toronto?
    eval_judge: An accurate weather report was returned.
  - it: should call the weather tool
    input: what is the weather in Toronto?
    expected_tools:
      - name: get_weather
        eval_arguments: Lat and long for Toronto
        eval_response: the weather report for Toronto
```

This is just a first pass, but the idea is that the eval_xx attributes mean the LLM evaluates the response, or, in the case of a tool call, that it could evaluate both the inputs and the outputs.