add agent reasoning to evaluator, improve prompts & parameterize max steps (#1050)
# why
- Evaluator is often too strict in its evaluation, resulting in false
negatives
- max step limits in evals are very brittle and can cause false
negatives on tasks that exceed the current limit
- Evaluator can sometimes treat screenshots as a single source of truth,
resulting in false negatives. Providing the agent's reasoning alongside
them helps mitigate this
# what changed
- parameterized max steps to allow for easy configuration across all
evals through env
- adjusted evaluator prompting
- added "agent reasoning" param which can be used to pass in the agent's
reasoning alongside bulk screenshots in a format the evaluator can
understand well
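
A minimal sketch of what reading a max step limit from the environment could look like. The helper name `resolveMaxSteps`, the variable name `EVAL_MAX_STEPS`, and the default of 50 are assumptions for illustration only; this commit's actual variable name and default are not shown in the diff below.

```typescript
// Hypothetical helper: the env variable name and default are assumptions,
// not values taken from this commit.
function resolveMaxSteps(
  env: Record<string, string | undefined>,
  fallback = 50,
): number {
  const raw = env["EVAL_MAX_STEPS"];
  // Unset or blank values use the fallback.
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = Number.parseInt(raw, 10);
  // Fall back when the value does not parse to a positive integer.
  return Number.isInteger(parsed) && parsed > 0 ? parsed : fallback;
}
```

In a Node.js eval runner this would be called once, for example as `resolveMaxSteps(process.env)`, and the result passed to every eval so the limit is configured in one place.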
# test plan
tested locally
evals/evaluator.ts (17 additions & 7 deletions)
@@ -55,6 +55,7 @@ export class Evaluator {
       screenshot = true,
       systemPrompt,
       screenshotDelayMs = 250,
+      agentReasoning,
     } = options;
     if (!question) {
       throw new Error("Question cannot be an empty string");
@@ -69,12 +70,12 @@ export class Evaluator {
         question,
         screenshots: screenshot,
         systemPrompt,
+        agentReasoning,
       });
     }
 
     // Single screenshot case (existing logic)
-    const defaultSystemPrompt = `You are an expert evaluator that confidently returns YES or NO given a question and the state of a task (in the form of a screenshot, or an answer). Provide a detailed reasoning for your answer.
-          Be critical about the question and the answer, the slightest detail might be the difference between yes and no. for text, be lenient and allow for slight variations in wording. we will be comparing the agents trajectory to see if it contains the information we were looking for in the answer.
+    const defaultSystemPrompt = `You are an expert evaluator that confidently returns YES or NO based on if the original goal was achieved. You have access to ${screenshot ? "a screenshot" : "the agents reasoning and actions throughout the task"} that you can use to evaluate the tasks completion. Provide detailed reasoning for your answer.
           Today's date is ${new Date().toLocaleDateString()}`;
+              ? `Question: ${question}\n\nAgent's reasoning and actions taken:\n${agentReasoning}`
+              : question,
+          },
           ...(screenshot
             ? [
                 {
@@ -153,8 +159,7 @@ export class Evaluator {
     const {
       questions,
       screenshot = true,
-      systemPrompt = `You are an expert evaluator that confidently returns YES or NO for each question given the state of a task (in the form of a screenshot, or an answer). Provide a detailed reasoning for your answer.
-          Be critical about the question and the answer, the slightest detail might be the difference between yes and no. for text, be lenient and allow for slight variations in wording. we will be comparing the agents trajectory to see if it contains the information we were looking for in the answer.
+      systemPrompt = `You are an expert evaluator that confidently returns YES or NO for each question given the state of a task based on the original goal. You have access to ${screenshot ? "a screenshot" : "the agents reasoning and actions throughout the task"} that you can use to evaluate the tasks completion. Provide detailed reasoning for your answer.
           Today's date is ${new Date().toLocaleDateString()}`,
       screenshotDelayMs = 1000,
     } = options;
@@ -260,14 +265,17 @@ export class Evaluator {
     question: string;
     screenshots: Buffer[];
     systemPrompt?: string;
+    agentReasoning?: string;
   }): Promise<EvaluationResult> {
     const {
       question,
       screenshots,
+      agentReasoning,
       systemPrompt = `You are an expert evaluator that confidently returns YES or NO given a question and multiple screenshots showing the progression of a task.
+          ${agentReasoning ? "You also have access to the agent's detailed reasoning and thought process throughout the task." : ""}
           Analyze ALL screenshots to understand the complete journey. Look for evidence of task completion across all screenshots, not just the last one.
           Success criteria may appear at different points in the sequence (confirmation messages, intermediate states, etc).
-          Be critical about the question but consider the ENTIRE sequence when making your determination.
+          ${agentReasoning ? "The agent's reasoning provides crucial context about what actions were attempted, what was observed, and the decision-making process. Use this alongside the visual evidence to make a comprehensive evaluation." : ""}
           Today's date is ${new Date().toLocaleDateString()}`,
     } = options;
 
@@ -303,7 +311,9 @@ export class Evaluator {
           content: [
             {
               type: "text",
-              text: `${question}\n\nI'm providing ${screenshots.length} screenshots showing the progression of the task. Please analyze all of them to determine if the task was completed successfully.`,
+              text: agentReasoning
+                ? `Question: ${question}\n\nAgent's reasoning and actions throughout the task:\n${agentReasoning}\n\nI'm providing ${screenshots.length} screenshots showing the progression of the task. Please analyze both the agent's reasoning and all screenshots to determine if the task was completed successfully.`
+                : `${question}\n\nI'm providing ${screenshots.length} screenshots showing the progression of the task. Please analyze all of them to determine if the task was completed successfully.`,
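
The branching in this last hunk can be read in isolation as a small pure function. The sketch below copies the two message templates from the diff; the function name `buildEvaluatorText` and the `screenshotCount` parameter are hypothetical stand-ins for the inline expression and `screenshots.length`, not part of the commit.

```typescript
// Hypothetical standalone version of the inline ternary in the hunk above.
function buildEvaluatorText(
  question: string,
  screenshotCount: number,
  agentReasoning?: string,
): string {
  const screenshotNote = `I'm providing ${screenshotCount} screenshots showing the progression of the task.`;
  // An undefined or empty agentReasoning is falsy, so it takes the
  // original screenshots-only wording, as in the diff.
  return agentReasoning
    ? `Question: ${question}\n\nAgent's reasoning and actions throughout the task:\n${agentReasoning}\n\n${screenshotNote} Please analyze both the agent's reasoning and all screenshots to determine if the task was completed successfully.`
    : `${question}\n\n${screenshotNote} Please analyze all of them to determine if the task was completed successfully.`;
}
```

Because the reasoning is optional, existing callers that pass only screenshots would keep receiving the previous message text unchanged.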
"Search for a recipe for Beef Wellington on Allrecipes that has at least 200 reviews and an average rating of 4.5 stars or higher. List the main ingredients required for the dish.",