Commit 78b54c6

docs: agent metrics (#1314)

docs/concepts/metrics/agents.md (+109, -0 lines)
# Agentic or tool use metrics

Agentic or tool-use workflows can be evaluated along multiple dimensions. Here are some of the metrics that can be used to evaluate the performance of agents or tools in a given task.
## Tool call accuracy

`ToolCallAccuracy` evaluates how well the LLM identifies and calls the required tools to complete a given task. The metric needs `user_input` and `reference_tool_calls`, and is computed by comparing `reference_tool_calls` with the tool calls actually made by the AI. Values range between 0 and 1, with higher values indicating better performance.
```{code-block} python
from ragas.dataset_schema import MultiTurnSample
from ragas.messages import HumanMessage, AIMessage, ToolCall
from ragas.metrics._tool_call_accuracy import ToolCallAccuracy

sample = MultiTurnSample(
    user_input=[
        HumanMessage(content="Hey, book a table at the nearest best Chinese restaurant for 8:00pm"),
        AIMessage(content="Sure, let me find the best options for you.", tool_calls=[
            ToolCall(name="restaurant_search", args={"cuisine": "Asian", "time": "8:00pm"})
        ]),
    ],
    reference_tool_calls=[
        ToolCall(name="restaurant_book", args={"name": "Golden", "time": "8:00pm"})
    ],
)

scorer = ToolCallAccuracy()
await scorer.multi_turn_ascore(sample)
```
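
Note that `multi_turn_ascore` is a coroutine, so the `await` above assumes an async context such as a notebook. In a plain script you would drive it with the standard `asyncio` module, for example:

```{code-block} python
import asyncio

# Run the async scorer from synchronous code, reusing the sample and scorer above.
score = asyncio.run(scorer.multi_turn_ascore(sample))
print(score)
```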
By default, tool names and arguments are compared using exact string matching. This might not always be optimal, for instance when the arguments are natural language strings. In that case you can use any ragas metric (with values between 0 and 1) as a distance measure for comparing the arguments. For example,
```{code-block} python
from ragas.metrics._string import NonLLMStringSimilarity
from ragas.metrics._tool_call_accuracy import ToolCallAccuracy

metric = ToolCallAccuracy()
metric.arg_comparison_metric = NonLLMStringSimilarity()
```
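
To build intuition for what the pluggable comparison changes, here is a minimal, hypothetical sketch (not the ragas implementation) of per-argument scoring with a swappable similarity function:

```{code-block} python
# Hypothetical sketch; ragas's actual implementation may differ.
def exact_match(a: str, b: str) -> float:
    return 1.0 if a == b else 0.0

def compare_args(predicted: dict, reference: dict, sim=exact_match) -> float:
    """Average per-argument similarity of a predicted call against a reference."""
    if not reference:
        return 1.0
    return sum(
        sim(str(predicted.get(key, "")), str(value))
        for key, value in reference.items()
    ) / len(reference)

# Under exact matching the cuisine argument contributes 0.0, so this returns 0.5;
# a string-similarity measure would award partial credit for "Asian" vs "Chinese".
compare_args({"cuisine": "Asian", "time": "8:00pm"}, {"cuisine": "Chinese", "time": "8:00pm"})
```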

## Agent goal accuracy

Agent goal accuracy evaluates how well the LLM identifies and achieves the goals of the user. It is a binary metric: 1 indicates that the AI has achieved the user's goal, and 0 indicates that it has not.
### With reference

Calculating agent goal accuracy with a reference needs `user_input` and `reference`. The annotated `reference` is used as the ideal outcome, and the metric is computed by comparing it with the goal actually achieved by the end of the workflow.
```{code-block} python
from ragas.dataset_schema import MultiTurnSample
from ragas.messages import HumanMessage, AIMessage, ToolMessage, ToolCall
from ragas.metrics._agent_goal_accuracy import AgentGoalAccuracyWithReference

sample = MultiTurnSample(
    user_input=[
        HumanMessage(content="Hey, book a table at the nearest best Chinese restaurant for 8:00pm"),
        AIMessage(content="Sure, let me find the best options for you.", tool_calls=[
            ToolCall(name="restaurant_search", args={"cuisine": "Chinese", "time": "8:00pm"})
        ]),
        ToolMessage(content="Found a few options: 1. Golden Dragon, 2. Jade Palace"),
        AIMessage(content="I found some great options: Golden Dragon and Jade Palace. Which one would you prefer?"),
        HumanMessage(content="Let's go with Golden Dragon."),
        AIMessage(content="Great choice! I'll book a table for 8:00pm at Golden Dragon.", tool_calls=[
            ToolCall(name="restaurant_book", args={"name": "Golden Dragon", "time": "8:00pm"})
        ]),
        ToolMessage(content="Table booked at Golden Dragon for 8:00pm."),
        AIMessage(content="Your table at Golden Dragon is booked for 8:00pm. Enjoy your meal!"),
        HumanMessage(content="thanks"),
    ],
    reference="Table booked at one of the Chinese restaurants at 8 pm",
)

scorer = AgentGoalAccuracyWithReference()
await scorer.multi_turn_ascore(sample)
```
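
Agent goal accuracy is an LLM-based metric, so the scorer needs an evaluator LLM attached before scoring. A minimal sketch, assuming a LangChain chat model wrapped with ragas's `LangchainLLMWrapper` (the model name here is illustrative):

```{code-block} python
from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper

scorer = AgentGoalAccuracyWithReference()
scorer.llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))  # illustrative model choice
```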

### Without reference

In the without-reference mode, the metric evaluates how well the LLM identifies and achieves the user's goals without any annotated reference. Here the desired outcome is inferred from the human interactions in the workflow.
```{code-block} python
from ragas.dataset_schema import MultiTurnSample
from ragas.messages import HumanMessage, AIMessage, ToolMessage, ToolCall
from ragas.metrics._agent_goal_accuracy import AgentGoalAccuracyWithoutReference

sample = MultiTurnSample(
    user_input=[
        HumanMessage(content="Hey, book a table at the nearest best Chinese restaurant for 8:00pm"),
        AIMessage(content="Sure, let me find the best options for you.", tool_calls=[
            ToolCall(name="restaurant_search", args={"cuisine": "Chinese", "time": "8:00pm"})
        ]),
        ToolMessage(content="Found a few options: 1. Golden Dragon, 2. Jade Palace"),
        AIMessage(content="I found some great options: Golden Dragon and Jade Palace. Which one would you prefer?"),
        HumanMessage(content="Let's go with Golden Dragon."),
        AIMessage(content="Great choice! I'll book a table for 8:00pm at Golden Dragon.", tool_calls=[
            ToolCall(name="restaurant_book", args={"name": "Golden Dragon", "time": "8:00pm"})
        ]),
        ToolMessage(content="Table booked at Golden Dragon for 8:00pm."),
        AIMessage(content="Your table at Golden Dragon is booked for 8:00pm. Enjoy your meal!"),
        HumanMessage(content="thanks"),
    ]
)

scorer = AgentGoalAccuracyWithoutReference()
await scorer.multi_turn_ascore(sample)
```
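
The same samples and metrics can also be scored in batch with ragas's `evaluate` helper; a hedged sketch, assuming `EvaluationDataset` accepts multi-turn samples and that an evaluator LLM has been configured as shown earlier:

```{code-block} python
from ragas import evaluate
from ragas.dataset_schema import EvaluationDataset

# `sample` is the MultiTurnSample defined above; a real dataset would hold many.
dataset = EvaluationDataset(samples=[sample])
result = evaluate(dataset=dataset, metrics=[AgentGoalAccuracyWithoutReference()])
print(result)
```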
