Commit de84a7c: "introduction"

1 parent 0c6f76e, commit de84a7c

2 files changed: +131, -1 lines changed

evals/introduction.mdx

Lines changed: 124 additions & 0 deletions
@@ -0,0 +1,124 @@
---
title: Introduction
---

## What are Evals?

**Evals** are like tests for your Agents. Use them judiciously to evaluate the performance of your Agents and improve them over time.

We typically evaluate Agents on 3 dimensions:

- **Accuracy:** How accurate is the Agent's response?
- **Performance:** How fast does the Agent produce the output, and what is its memory footprint?
- **Reliability:** Does the Agent make the expected tool calls?

### Accuracy

Accuracy evals use input/output pairs to evaluate the Agent's performance. They use another model to score the Agent's responses (LLM-as-a-judge).

#### Example

```python calculate_accuracy.py
from typing import Optional

from agno.agent import Agent
from agno.eval.accuracy import AccuracyEval, AccuracyResult
from agno.models.openai import OpenAIChat
from agno.tools.calculator import CalculatorTools


def multiply_and_exponentiate():
    evaluation = AccuracyEval(
        agent=Agent(
            model=OpenAIChat(id="gpt-4o-mini"),
            tools=[CalculatorTools(add=True, multiply=True, exponentiate=True)],
        ),
        question="What is 10*5 then to the power of 2? do it step by step",
        expected_answer="2500",
        num_iterations=1,
    )
    result: Optional[AccuracyResult] = evaluation.run(print_results=True)

    assert result is not None and result.avg_score >= 8


if __name__ == "__main__":
    multiply_and_exponentiate()
```
### Performance

Performance evals measure the latency and memory footprint of the Agent operations.

<Note>
While latency will be dominated by the model API response time, we should still keep performance top of mind and track the Agent's performance with and without certain components. For example, it is useful to know the average latency with and without storage, with and without memory, with a new prompt, or with a new model.
</Note>

#### Example

```python storage_performance.py
"""Run `pip install openai agno` to install dependencies."""

from agno.agent import Agent
from agno.models.openai import OpenAIChat
from agno.eval.perf import PerfEval


def simple_response():
    agent = Agent(
        model=OpenAIChat(id='gpt-4o-mini'),
        system_message='Be concise, reply with one sentence.',
        add_history_to_messages=True,
    )
    response_1 = agent.run('What is the capital of France?')
    print(response_1.content)
    response_2 = agent.run('How many people live there?')
    print(response_2.content)
    return response_2.content


simple_response_perf = PerfEval(func=simple_response, num_iterations=1, warmup_runs=0)

if __name__ == "__main__":
    simple_response_perf.run(print_results=True)
```
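
The Note above suggests tracking performance with and without specific components. Here is a minimal sketch of that idea, reusing only the `Agent` and `PerfEval` APIs from the example above (the file name and function names are illustrative, not part of the original docs): it runs the same prompt once with and once without message history and prints both results for comparison.

```python compare_performance.py
"""Sketch: compare average latency with and without message history.

Assumes only the Agent and PerfEval APIs shown in the examples above.
"""

from agno.agent import Agent
from agno.eval.perf import PerfEval
from agno.models.openai import OpenAIChat


def respond_with_history():
    # Same setup as simple_response above, with history enabled.
    agent = Agent(
        model=OpenAIChat(id='gpt-4o-mini'),
        system_message='Be concise, reply with one sentence.',
        add_history_to_messages=True,
    )
    return agent.run('What is the capital of France?').content


def respond_without_history():
    # Identical agent without history, for a like-for-like comparison.
    agent = Agent(
        model=OpenAIChat(id='gpt-4o-mini'),
        system_message='Be concise, reply with one sentence.',
    )
    return agent.run('What is the capital of France?').content


if __name__ == "__main__":
    # Run each configuration a few times and compare the printed numbers.
    PerfEval(func=respond_with_history, num_iterations=3, warmup_runs=1).run(print_results=True)
    PerfEval(func=respond_without_history, num_iterations=3, warmup_runs=1).run(print_results=True)
```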
### Reliability

What makes an Agent reliable?

- Does the Agent make the expected tool calls?
- Does the Agent handle errors gracefully?
- Does the Agent respect the rate limits of the model API?

#### Example

The first check is to ensure the Agent makes the expected tool calls. Here's an example:

```python reliability.py
from typing import Optional

from agno.agent import Agent
from agno.eval.reliability import ReliabilityEval, ReliabilityResult
from agno.tools.calculator import CalculatorTools
from agno.models.openai import OpenAIChat
from agno.run.response import RunResponse


def multiply_and_exponentiate():
    agent = Agent(
        model=OpenAIChat(id="gpt-4o-mini"),
        tools=[CalculatorTools(add=True, multiply=True, exponentiate=True)],
    )
    response: RunResponse = agent.run("What is 10*5 then to the power of 2? do it step by step")
    evaluation = ReliabilityEval(
        agent_response=response,
        expected_tool_calls=["multiply", "exponentiate"],
    )
    result: Optional[ReliabilityResult] = evaluation.run(print_results=True)
    result.assert_passed()


if __name__ == "__main__":
    multiply_and_exponentiate()
```
<Note>
Reliability evals are currently in `beta`.
</Note>
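
Because evals are like tests, they can also be wired into a regular test suite. The sketch below assumes pytest as the runner (pytest is not mentioned in the original docs); it simply wraps the reliability example above in a test function so `result.assert_passed()` fails the test when the expected tool calls are missing.

```python test_reliability.py
"""Sketch: run the reliability eval as a pytest test (pytest assumed)."""

from typing import Optional

from agno.agent import Agent
from agno.eval.reliability import ReliabilityEval, ReliabilityResult
from agno.models.openai import OpenAIChat
from agno.run.response import RunResponse
from agno.tools.calculator import CalculatorTools


def test_agent_makes_expected_tool_calls():
    # Same agent and prompt as the reliability example above.
    agent = Agent(
        model=OpenAIChat(id="gpt-4o-mini"),
        tools=[CalculatorTools(add=True, multiply=True, exponentiate=True)],
    )
    response: RunResponse = agent.run("What is 10*5 then to the power of 2? do it step by step")

    evaluation = ReliabilityEval(
        agent_response=response,
        expected_tool_calls=["multiply", "exponentiate"],
    )
    result: Optional[ReliabilityResult] = evaluation.run(print_results=True)

    assert result is not None
    result.assert_passed()  # raises (fails the test) if the expected tool calls were not made
```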

mint.json

Lines changed: 7 additions & 1 deletion
```diff
@@ -340,6 +340,12 @@
         "workflows/session_state",
         "workflows/advanced"
       ]
+    },
+    {
+      "group": "Evals",
+      "pages": [
+        "evals/introduction"
+      ]
     }
   ]
 },
@@ -450,7 +456,7 @@
   "examples/workflows/blog-post-generator",
   "examples/workflows/investment-report-generator",
   "examples/workflows/personalized-email-generator",
-  "examples/workflows/startup-idea-validator",
+  "examples/workflows/startup-idea-validator",
   "examples/workflows/content-creator",
   "examples/workflows/product-manager",
   "examples/workflows/team-workflow"
```
