
Commit 41f8f69

Update summarizer_prompts.py
1 parent 2be23e5 commit 41f8f69

1 file changed: +39 -67 lines changed

src/agentlab/analyze/error_analysis/summarizer_prompts.py

Lines changed: 39 additions & 67 deletions
@@ -53,55 +53,37 @@

 ERROR_CLASSIFICATION_PROMPT = """
 You are an expert evaluator that classifies web agent failures according to a predefined taxonomy.
-Below are the high-level definitions of each top-level category (Agent Errors, Language Model Errors, and Benchmark/Environment Errors),
-followed by an explanation of the inputs you will receive (planning history, chain of thought, etc.),
+Below are the high-level definitions of each category,
+followed by an explanation of the inputs of the interaction you will receive (planning history, chain of thought, etc.),
 a set of labeled examples for reference (few-shot), and finally the classification task you must complete.

 --------------------------------------------------------------------------------
 TAXONOMY DEFINITIONS
 --------------------------------------------------------------------------------

-1. AGENT ERRORS
-These errors arise when agents interact with web interfaces and fail due to limitations in perception, navigation, or manipulation.
+1. Navigation & Planning Errors
+The agent cannot construct or execute a correct sequence of actions to reach its goal
+(e.g., getting lost on a website, failing to recover from missteps, or using incorrect search terms).

-- Navigation & Planning Errors
-The agent cannot construct or execute a correct sequence of actions to reach its goal
-(e.g., getting lost on a website, failing to recover from missteps, or using incorrect search terms).
+2. Interaction Execution Errors
+The agent enters data in the wrong format, forgets to click "Submit" after typing,
+repeats the same failing action without adaptation, or loses track of the changing webpage state.

-- Interaction Execution Errors
-The agent enters data in the wrong format, forgets to click "Submit" after typing,
-repeats the same failing action without adaptation, or loses track of the changing webpage state.
+3. Information Processing Errors
+The agent misreads or misinterprets visible data (e.g., extracting the wrong field values),
+misconstrues relationships between pieces of information, or fails to validate data against task requirements.

-- Information Processing Errors
-The agent misreads or misinterprets visible data (e.g., extracting the wrong field values),
-misconstrues relationships between pieces of information, or fails to validate data against task requirements.
+4. Observation & Action Errors
+The agent fails to observe important updates in the environment (e.g., not noticing the page reloaded)
+or misaligns its actions (clicks the wrong element or stale link).

-- Observation & Action Errors
-The agent fails to observe important updates in the environment (e.g., not noticing the page reloaded)
-or misaligns its actions (clicks the wrong element or stale link).
+5. Task Understanding Errors
+The agent misreads or misunderstands the user's objective (goal interpretation),
+loses crucial context (context loss), or performs actions beyond or short of the intended scope.

-2. LANGUAGE MODEL ERRORS
-These errors result from the model's inability to correctly interpret or reason about the task at a higher level,
-independent of the low-level web interactions.
-
-- Task Understanding Errors
-The agent misreads or misunderstands the user's objective (goal interpretation),
-loses crucial context (context loss), or performs actions beyond or short of the intended scope.
-
-- Reasoning Failures
-The agent's logic is flawed (logical inference errors), behaves inconsistently across multiple steps,
-or fails to prioritize important subtasks when handling complex goals.
-
-3. BENCHMARK & ENVIRONMENT ERRORS
-These errors are external to the agent's logic and the language model's reasoning,
-arising from flaws in the system, network, or evaluation framework itself.
-
-- System Errors
-Network failures, API downtime, or dynamic web changes that break the agent's assumptions (e.g., layout shifts).
-
-- Benchmark Design Errors
-Ambiguous or contradictory task specifications, incorrect validation criteria (where correct solutions are flagged as failures),
-or inflexible evaluation systems that fail to account for valid alternative solutions.
+6. Reasoning Failures
+The agent's logic is flawed (logical inference errors), behaves inconsistently across multiple steps,
+or fails to prioritize important subtasks when handling complex goals.

 --------------------------------------------------------------------------------
 INPUT DESCRIPTION
@@ -124,34 +106,19 @@
 FEW-SHOT CLASSIFICATION EXAMPLES
 --------------------------------------------------------------------------------

-1) EXAMPLE A (Benchmark Error - Benchmark Design Error)
-• Context: The agent correctly finds a cheaper product meeting the user's criteria,
-but the benchmark expects a more expensive product and marks the solution as wrong.
-• Classification: ["Benchmark Design Error"]
-• Justification: The agent's solution is objectively valid, but the evaluation framework is too rigid
-and does not allow an alternative correct solution.
-
-2) EXAMPLE B (Agent Error - Interaction Execution)
+1) EXAMPLE A (Interaction Execution)
 • Context: The agent repeatedly clicks "Show report" after entering dates in the wrong format.
 Each time, the site resets to default dates. The agent never notices and keeps doing the same thing.
-• Classification: ["Agent Error - Interaction Execution"]
+• Classification: ["Interaction Execution"]
 • Justification: The agent used an invalid input format ("Format Errors"), then repeated the failing action
 without adaptation ("Action Repetition").

-3) EXAMPLE C (Benchmark Error - Benchmark Design Error)
-• Context: The user asks, "Where is the nearest In-N-Out to Upitts?"
-The query is ambiguous because "Upitts" is not a standard location.
-The agent flounders, eventually returning "No In-N-Out found," which is incorrect for the region.
-• Classification: ["Benchmark Design Error"]
-• Justification: The task goal is poorly specified ("Upitts" is ambiguous or unrealistic),
-leading the agent astray due to unclear context.
-
-4) EXAMPLE D (Language Model Error - Task Understanding)
+2) EXAMPLE B (Task Understanding)
 • Context: The user says, "In the repository myorg/myrepo, locate any issues labeled 'help wanted'
 that are older than 30 days and add a comment saying 'I can help fix this.'"
 The agent's planning notes mention searching for existing issues but quickly pivot to creating a brand-new issue
 with label 'help wanted,' ignoring the user's actual request to find and comment on old issues.
-• Classification: ["Language Model Error - Task Understanding"]
+• Classification: ["Task Understanding"]
 • Justification: The agent misunderstood the user's goal. Instead of searching for and commenting on existing issues,
 it focused on creating a new issue. This is a misinterpretation of the instructions,
 not a mechanical error in clicking or input format.
@@ -166,23 +133,28 @@
 - The current HTML or AX Tree observation
 - The user goal

-2. Decide if the failure is:
-- An Agent Error (which subcategory/subcategories),
-- A Language Model Error (which subcategory/subcategories),
-- A Benchmark/Environment Error (which subcategory/subcategories),
-- Or a combination thereof (multi-label if needed).
+2. In case you think the task was unsuccessful, decide the category, or a combination thereof, under which the reason for failure lies.
+If the task is successful, you can keep the error category as blank.

 3. Provide a brief explanation justifying your classification, referencing specific steps if helpful.

-4. If the agent succeeds (no error), label the errorCategory accordingly as "Success".
+Output format example for an unsuccessful interaction:
+{{
+"explanation": "The agent opened the wrong GitLab page and never recovered...",
+"success": False,
+"errorCategory": ["Navigation & Planning"],
+}}

-Output Format Example:
+Output format example for a successful interaction:
 {{
-"errorCategory": ["Agent Error - Navigation & Planning"],
-"explanation": "The agent opened the wrong GitLab page and never recovered..."
+"explanation": "The agent opened the correct GitLab page and ...",
+"success": True,
+"errorCategory": [],
 }}

-Please follow this structure at every step. Keep your responses concise and clear. Below are the details.
+Please follow this structure at every step. Keep your responses concise and clear.
+
+Below are the details for the interaction.

 Overall goal: {goal}

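Note on consuming the new output format: the two examples added in this commit use Python literals (`False`, `True`) and a trailing comma, so a reply that copies them exactly is not valid JSON. The sketch below shows one way a caller might parse such a reply; it is not code from this repository. The function name `parse_classification`, the `KNOWN_CATEGORIES` set (short labels inferred from the prompt's examples), and the `unknownCategories` field are all assumptions made for illustration.

```python
import ast
import json

# Assumed short labels for the six categories of the flattened taxonomy.
# The prompt's examples use this shortened form for three of them; the
# other three are inferred here and may not match what the model emits.
KNOWN_CATEGORIES = {
    "Navigation & Planning",
    "Interaction Execution",
    "Information Processing",
    "Observation & Action",
    "Task Understanding",
    "Reasoning Failures",
}


def parse_classification(raw: str) -> dict:
    """Parse a reply shaped like the output examples in the prompt."""
    text = raw.strip()
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Python-style booleans and trailing commas are rejected by
        # json.loads but accepted by ast.literal_eval, which only
        # evaluates literals and never executes code.
        data = ast.literal_eval(text)
    if not isinstance(data, dict):
        raise ValueError("expected a mapping at the top level")

    categories = list(data.get("errorCategory") or [])
    return {
        "explanation": str(data.get("explanation", "")),
        "success": bool(data.get("success", False)),
        "errorCategory": categories,
        # Labels outside the assumed set are reported, not dropped.
        "unknownCategories": [c for c in categories if c not in KNOWN_CATEGORIES],
    }


reply = """
{
"explanation": "The agent opened the wrong GitLab page and never recovered...",
"success": False,
"errorCategory": ["Navigation & Planning"],
}
"""
result = parse_classification(reply)
print(result["success"], result["errorCategory"])
# prints: False ['Navigation & Planning']
```

Trying `json.loads` first keeps strict JSON replies on the fast path, and the `ast.literal_eval` fallback covers replies that mirror the prompt's examples verbatim. Malformed replies still raise (`ValueError` or `SyntaxError`), so a caller would likely want to catch those and retry or log.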