# why
Adds `evaluateText` to the evaluator.
# what changed
Evals can now call the `evaluateText` method, which uses the LLM's
messages to evaluate whether the agent came across the information the
eval requires. This removes the dependency on `extract` within evals and
makes it easier to evaluate the WebVoyager / GAIA evals once they are
implemented.
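
As a rough illustration of the text-based flow described above, an eval might hand the agent's messages to the evaluator as an answer instead of calling `extract`. This is only a sketch under assumptions: the option names `question`, `answer`, and `screenshot` come from the diff, while `EvaluateOptions`, `EvaluationResult`, `checkTrajectory`, and `fakeEvaluate` are hypothetical names introduced here and are not the project's actual API.

```typescript
// Hypothetical shapes; the evaluator's real types are not shown in this change.
interface EvaluateOptions {
  question: string;
  answer?: string; // text to judge, e.g. the agent's messages
  screenshot?: boolean; // whether a screenshot should be attached
}

interface EvaluationResult {
  evaluation: "YES" | "NO";
  reasoning: string;
}

type Evaluate = (options: EvaluateOptions) => Promise<EvaluationResult>;

// Join the agent's messages into one answer string and ask the evaluator
// a text-only question about it (no screenshot).
async function checkTrajectory(
  evaluate: Evaluate,
  agentMessages: string[],
  question: string,
): Promise<boolean> {
  const result = await evaluate({
    question,
    answer: agentMessages.join("\n"),
    screenshot: false,
  });
  return result.evaluation === "YES";
}

// Stand-in evaluator for illustration only; a real one would call an LLM.
const fakeEvaluate: Evaluate = async ({ answer }) => ({
  evaluation: answer?.includes("$") ? "YES" : "NO",
  reasoning: "stub: looks for a dollar amount in the text",
});
```

Passing the evaluator in as a function keeps the sketch independent of how the real evaluator is constructed.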
# test plan
tested locally
---------
Co-authored-by: Miguel <[email protected]>
-  systemPrompt = `You are an expert evaluator that confidently returns YES or NO given the state of a task (most times in the form of a screenshot) and a question. Provide a detailed reasoning for your answer.
-  Return your response as a JSON object with the following format:
-  { "evaluation": "YES" | "NO", "reasoning": "detailed reasoning for your answer" }
-  Be critical about the question and the answer, the slightest detail might be the difference between yes and no.`,
-  screenshotDelayMs = 1000,
+  answer,
+  screenshot = true,
+  systemPrompt = `You are an expert evaluator that confidently returns YES or NO given a question and the state of a task (in the form of a screenshot, or an answer). Provide a detailed reasoning for your answer.
+  Be critical about the question and the answer, the slightest detail might be the difference between yes and no. for text, be lenient and allow for slight variations in wording. we will be comparing the agents trajectory to see if it contains the information we were looking for in the answer.
+  Today's date is ${new Date().toLocaleDateString()}`,
+  screenshotDelayMs = 250,
   } = options;
+  if (!question) {
+    throw new Error("Question cannot be an empty string");
+  }
+  if (!answer && !screenshot) {
+    throw new Error("Either answer (text) or screenshot must be provided");
-  systemPrompt = `You are an expert evaluator that confidently returns YES or NO for each question given the state of a task in the screenshot. Provide a detailed reasoning for your answer.
-  Return your response as a JSON array, where each object corresponds to a question and has the following format:
-  { "evaluation": "YES" | "NO", "reasoning": "detailed reasoning for your answer" }
-  Be critical about the question and the answer, the slightest detail might be the difference between yes and no.`,
+  screenshot = true,
+  systemPrompt = `You are an expert evaluator that confidently returns YES or NO for each question given the state of a task (in the form of a screenshot, or an answer). Provide a detailed reasoning for your answer.
+  Be critical about the question and the answer, the slightest detail might be the difference between yes and no. for text, be lenient and allow for slight variations in wording. we will be comparing the agents trajectory to see if it contains the information we were looking for in the answer.
+  Today's date is ${new Date().toLocaleDateString()}`,
   screenshotDelayMs = 1000,
   } = options;

+  // Validate inputs
+  if (!questions || questions.length === 0) {
+    throw new Error("Questions array cannot be empty");
+  }
+
+  for (const item of questions) {
+    if (!item.question) {
+      throw new Error("Question cannot be an empty string");
+    }
+    if (!item.answer && !screenshot) {
+      throw new Error(
+        "Either answer (text) or screenshot must be provided for each question",
+      );
+    }
+  }
+
   // Wait for the specified delay before taking screenshot
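
The batch validation added in this hunk can be read as a standalone function. The following is a minimal sketch, assuming each question item carries a `question` string and an optional `answer`; `QuestionItem`, `validateBatch`, and `validationError` are hypothetical names used for illustration, while the error messages are copied from the diff.

```typescript
// Hypothetical input shape, inferred from the checks in the diff.
interface QuestionItem {
  question: string;
  answer?: string;
}

// Throws on the first invalid input, in the same order as the checks in the diff.
function validateBatch(questions: QuestionItem[], screenshot: boolean): void {
  if (!questions || questions.length === 0) {
    throw new Error("Questions array cannot be empty");
  }
  for (const item of questions) {
    if (!item.question) {
      throw new Error("Question cannot be an empty string");
    }
    // With no screenshot, every question needs answer text to judge.
    if (!item.answer && !screenshot) {
      throw new Error(
        "Either answer (text) or screenshot must be provided for each question",
      );
    }
  }
}

// Helper for the usage lines: returns the error message, or null if valid.
function validationError(
  questions: QuestionItem[],
  screenshot: boolean,
): string | null {
  try {
    validateBatch(questions, screenshot);
    return null;
  } catch (err) {
    return (err as Error).message;
  }
}
```

Because the check runs per item, a single question without an answer is enough to reject a text-only batch, whereas any batch with `screenshot` enabled passes the second check.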
-  content: `${systemPrompt}\n\nYou will be given multiple questions. Answer each question by returning an object in the specified JSON format. Return a single JSON array containing one object for each question in the order they were asked.`,
+  content: `${systemPrompt}\n\nYou will be given multiple questions${screenshot ? " with a screenshot" : ""}. ${questions.some((q) => q.answer) ? "Some questions include answers to evaluate." : ""} Answer each question by returning an object in the specified JSON format. Return a single JSON array containing one object for each question in the order they were asked.`,
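
To make the conditional fragments in the new `content` template easier to follow, here is a small sketch that assembles the same string in a helper. `buildBatchSystemContent` and `QuestionItem` are hypothetical names, not part of the change; the wording of the prompt is taken from the added line.

```typescript
// Hypothetical input shape; only `answer` matters for the prompt text.
interface QuestionItem {
  question: string;
  answer?: string;
}

// Builds the system message content for a batch, adding the screenshot and
// answer notes only when they apply.
function buildBatchSystemContent(
  systemPrompt: string,
  questions: QuestionItem[],
  screenshot: boolean,
): string {
  const screenshotNote = screenshot ? " with a screenshot" : "";
  const answerNote = questions.some((q) => q.answer)
    ? "Some questions include answers to evaluate."
    : "";
  return `${systemPrompt}\n\nYou will be given multiple questions${screenshotNote}. ${answerNote} Answer each question by returning an object in the specified JSON format. Return a single JSON array containing one object for each question in the order they were asked.`;
}
```

One detail this makes visible: when no question has an answer, the empty fragment leaves two consecutive spaces before "Answer each question", which is harmless in a prompt but easy to miss in the inline template.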
"Extract the trade-in value for an iPhone 13 Pro Max in good condition on the Apple website. it will be inside this text : Get x trade-in credit toward a new iPhone', provide just the number",