|
53 | 53 |
|
54 | 54 | ERROR_CLASSIFICATION_PROMPT = """ |
55 | 55 | You are an expert evaluator that classifies web agent failures according to a predefined taxonomy. |
56 | | -Below are the high-level definitions of each top-level category (Agent Errors, Language Model Errors, and Benchmark/Environment Errors), |
57 | | -followed by an explanation of the inputs you will receive (planning history, chain of thought, etc.), |
| 56 | +Below are the high-level definitions of each category, |
| 57 | +followed by an explanation of the interaction inputs you will receive (planning history, chain of thought, etc.),
58 | 58 | a set of labeled examples for reference (few-shot), and finally the classification task you must complete. |
59 | 59 |
|
60 | 60 | -------------------------------------------------------------------------------- |
61 | 61 | TAXONOMY DEFINITIONS |
62 | 62 | -------------------------------------------------------------------------------- |
63 | 63 |
|
64 | | -1. AGENT ERRORS |
65 | | -These errors arise when agents interact with web interfaces and fail due to limitations in perception, navigation, or manipulation. |
| 64 | +1. Navigation & Planning Errors |
| 65 | + The agent cannot construct or execute a correct sequence of actions to reach its goal |
| 66 | + (e.g., getting lost on a website, failing to recover from missteps, or using incorrect search terms). |
66 | 67 |
|
67 | | - - Navigation & Planning Errors |
68 | | - The agent cannot construct or execute a correct sequence of actions to reach its goal |
69 | | - (e.g., getting lost on a website, failing to recover from missteps, or using incorrect search terms). |
| 68 | +2. Interaction Execution Errors |
| 69 | + The agent enters data in the wrong format, forgets to click "Submit" after typing, |
| 70 | + repeats the same failing action without adaptation, or loses track of the changing webpage state. |
70 | 71 |
|
71 | | - - Interaction Execution Errors |
72 | | - The agent enters data in the wrong format, forgets to click "Submit" after typing, |
73 | | - repeats the same failing action without adaptation, or loses track of the changing webpage state. |
| 72 | +3. Information Processing Errors |
| 73 | + The agent misreads or misinterprets visible data (e.g., extracting the wrong field values), |
| 74 | + misconstrues relationships between pieces of information, or fails to validate data against task requirements. |
74 | 75 |
|
75 | | - - Information Processing Errors |
76 | | - The agent misreads or misinterprets visible data (e.g., extracting the wrong field values), |
77 | | - misconstrues relationships between pieces of information, or fails to validate data against task requirements. |
| 76 | +4. Observation & Action Errors |
| 77 | + The agent fails to observe important updates in the environment (e.g., not noticing the page reloaded) |
| 78 | + or misaligns its actions (clicks the wrong element or stale link). |
78 | 79 |
|
79 | | - - Observation & Action Errors |
80 | | - The agent fails to observe important updates in the environment (e.g., not noticing the page reloaded) |
81 | | - or misaligns its actions (clicks the wrong element or stale link). |
| 80 | +5. Task Understanding Errors |
| 81 | + The agent misreads or misunderstands the user's objective (goal interpretation), |
| 82 | + loses crucial context (context loss), or performs actions beyond or short of the intended scope. |
82 | 83 |
|
83 | | -2. LANGUAGE MODEL ERRORS |
84 | | -These errors result from the model's inability to correctly interpret or reason about the task at a higher level, |
85 | | -independent of the low-level web interactions. |
86 | | -
|
87 | | - - Task Understanding Errors |
88 | | - The agent misreads or misunderstands the user's objective (goal interpretation), |
89 | | - loses crucial context (context loss), or performs actions beyond or short of the intended scope. |
90 | | -
|
91 | | - - Reasoning Failures |
92 | | - The agent's logic is flawed (logical inference errors), behaves inconsistently across multiple steps, |
93 | | - or fails to prioritize important subtasks when handling complex goals. |
94 | | -
|
95 | | -3. BENCHMARK & ENVIRONMENT ERRORS |
96 | | -These errors are external to the agent's logic and the language model's reasoning, |
97 | | -arising from flaws in the system, network, or evaluation framework itself. |
98 | | -
|
99 | | - - System Errors |
100 | | - Network failures, API downtime, or dynamic web changes that break the agent's assumptions (e.g., layout shifts). |
101 | | -
|
102 | | - - Benchmark Design Errors |
103 | | - Ambiguous or contradictory task specifications, incorrect validation criteria (where correct solutions are flagged as failures), |
104 | | - or inflexible evaluation systems that fail to account for valid alternative solutions. |
| 84 | +6. Reasoning Failures |
| 85 | + The agent's logic is flawed (logical inference errors), behaves inconsistently across multiple steps, |
| 86 | + or fails to prioritize important subtasks when handling complex goals. |
105 | 87 |
|
106 | 88 | -------------------------------------------------------------------------------- |
107 | 89 | INPUT DESCRIPTION |
|
124 | 106 | FEW-SHOT CLASSIFICATION EXAMPLES |
125 | 107 | -------------------------------------------------------------------------------- |
126 | 108 |
|
127 | | -1) EXAMPLE A (Benchmark Error - Benchmark Design Error) |
128 | | - • Context: The agent correctly finds a cheaper product meeting the user's criteria, |
129 | | - but the benchmark expects a more expensive product and marks the solution as wrong. |
130 | | - • Classification: ["Benchmark Design Error"] |
131 | | - • Justification: The agent's solution is objectively valid, but the evaluation framework is too rigid |
132 | | - and does not allow an alternative correct solution. |
133 | | -
|
134 | | -2) EXAMPLE B (Agent Error - Interaction Execution) |
| 109 | +1) EXAMPLE A (Interaction Execution) |
135 | 110 | • Context: The agent repeatedly clicks "Show report" after entering dates in the wrong format. |
136 | 111 | Each time, the site resets to default dates. The agent never notices and keeps doing the same thing. |
137 | | - • Classification: ["Agent Error - Interaction Execution"] |
| 112 | + • Classification: ["Interaction Execution"] |
138 | 113 | • Justification: The agent used an invalid input format ("Format Errors"), then repeated the failing action |
139 | 114 | without adaptation ("Action Repetition"). |
140 | 115 |
|
141 | | -3) EXAMPLE C (Benchmark Error - Benchmark Design Error) |
142 | | - • Context: The user asks, "Where is the nearest In-N-Out to Upitts?" |
143 | | - The query is ambiguous because "Upitts" is not a standard location. |
144 | | - The agent flounders, eventually returning "No In-N-Out found," which is incorrect for the region. |
145 | | - • Classification: ["Benchmark Design Error"] |
146 | | - • Justification: The task goal is poorly specified ("Upitts" is ambiguous or unrealistic), |
147 | | - leading the agent astray due to unclear context. |
148 | | -
|
149 | | -4) EXAMPLE D (Language Model Error - Task Understanding) |
| 116 | +2) EXAMPLE B (Task Understanding) |
150 | 117 | • Context: The user says, "In the repository myorg/myrepo, locate any issues labeled 'help wanted' |
151 | 118 | that are older than 30 days and add a comment saying 'I can help fix this.'" |
152 | 119 | The agent's planning notes mention searching for existing issues but quickly pivot to creating a brand-new issue |
153 | 120 | with label 'help wanted,' ignoring the user's actual request to find and comment on old issues. |
154 | | - • Classification: ["Language Model Error - Task Understanding"] |
| 121 | + • Classification: ["Task Understanding"] |
155 | 122 | • Justification: The agent misunderstood the user's goal. Instead of searching for and commenting on existing issues, |
156 | 123 | it focused on creating a new issue. This is a misinterpretation of the instructions, |
157 | 124 | not a mechanical error in clicking or input format. |
|
166 | 133 | - The current HTML or AX Tree observation |
167 | 134 | - The user goal |
168 | 135 |
|
169 | | -2. Decide if the failure is: |
170 | | - - An Agent Error (which subcategory/subcategories), |
171 | | - - A Language Model Error (which subcategory/subcategories), |
172 | | - - A Benchmark/Environment Error (which subcategory/subcategories), |
173 | | - - Or a combination thereof (multi-label if needed). |
| 136 | +2. If you judge the interaction unsuccessful, decide which category (or combination of categories) the failure falls under.
| 137 | +   If the interaction was successful, leave the error category blank.
174 | 138 |
|
175 | 139 | 3. Provide a brief explanation justifying your classification, referencing specific steps if helpful. |
176 | 140 |
|
177 | | -4. If the agent succeeds (no error), label the errorCategory accordingly as "Success". |
| 141 | +Output format example for an unsuccessful interaction: |
| 142 | +{{ |
| 143 | + "explanation": "The agent opened the wrong GitLab page and never recovered...", |
| 144 | +  "success": false,
| 145 | +  "errorCategory": ["Navigation & Planning"]
| 146 | +}} |
178 | 147 |
|
179 | | -Output Format Example: |
| 148 | +Output format example for a successful interaction: |
180 | 149 | {{ |
181 | | - "errorCategory": ["Agent Error - Navigation & Planning"], |
182 | | - "explanation": "The agent opened the wrong GitLab page and never recovered..." |
| 150 | + "explanation": "The agent opened the correct GitLab page and ...", |
| 151 | +  "success": true,
| 152 | +  "errorCategory": []
183 | 153 | }} |
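Downstream code will presumably need to parse replies in the format shown above. A minimal validation sketch, assuming the model returns the JSON object the prompt requests; the helper name `parse_classification` and the exact category strings are assumptions drawn from the taxonomy and examples in this diff, not part of the changed file:

```python
import json

# Category labels as they appear in the few-shot examples above (assumed set).
VALID_CATEGORIES = {
    "Navigation & Planning",
    "Interaction Execution",
    "Information Processing",
    "Observation & Action",
    "Task Understanding",
    "Reasoning Failures",
}

def parse_classification(reply: str) -> dict:
    """Parse a model reply and check it against the prompt's output schema."""
    data = json.loads(reply)
    if not isinstance(data.get("explanation"), str):
        raise ValueError("missing 'explanation' string")
    if not isinstance(data.get("success"), bool):
        raise ValueError("'success' must be a boolean")
    categories = data.get("errorCategory")
    if not isinstance(categories, list):
        raise ValueError("'errorCategory' must be a list")
    unknown = set(categories) - VALID_CATEGORIES
    if unknown:
        raise ValueError(f"unknown categories: {unknown}")
    # Per the prompt, a successful interaction carries no error category.
    if data["success"] and categories:
        raise ValueError("successful interactions should have an empty errorCategory")
    return data
```

Such a check catches the most common failure mode of structured-output prompts: a reply that is syntactically valid JSON but uses a label outside the taxonomy.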
184 | 154 |
|
185 | | -Please follow this structure at every step. Keep your responses concise and clear. Below are the details. |
| 155 | +Please follow this structure at every step. Keep your responses concise and clear. |
| 156 | +
|
| 157 | +Below are the details for the interaction. |
186 | 158 |
|
187 | 159 | Overall goal: {goal} |
188 | 160 |
|
|