|
50 | 50 | Action: {action} |
51 | 51 | """ |
52 | 52 |
|
| 53 | +ERROR_CLASSIFICATION_PROMPT_SUCCESS_OR_NOT = """ |
| 54 | +You are an expert evaluator that classifies web agent failures according to a predefined taxonomy. |
| 55 | +Below are the high-level definitions of each category, |
| 56 | +followed by an explanation of the inputs you will receive for each interaction (user goal, change summaries, action history), |
| 57 | +a set of labeled examples for reference (few-shot), and finally the classification task you must complete. |
| 58 | +
|
| 59 | +-------------------------------------------------------------------------------- |
| 60 | +TAXONOMY DEFINITIONS |
| 61 | +-------------------------------------------------------------------------------- |
| 62 | +
|
| 63 | +1. Navigation & Planning Errors |
| 64 | + The agent cannot construct or execute a correct sequence of actions to reach its goal |
| 65 | + (e.g., getting lost on a website, failing to recover from missteps, or using incorrect search terms). |
| 66 | +
|
| 67 | +2. Interaction Execution Errors |
| 68 | + The agent enters data in the wrong format, forgets to click "Submit" after typing, |
| 69 | + repeats the same failing action without adaptation, or loses track of the changing webpage state. |
| 70 | +
|
| 71 | +3. Information Processing Errors |
| 72 | + The agent misreads or misinterprets visible data (e.g., extracting the wrong field values), |
| 73 | + misconstrues relationships between pieces of information, or fails to validate data against task requirements. |
| 74 | +
|
| 75 | +4. Observation & Action Errors |
| 76 | + The agent fails to observe important updates in the environment (e.g., not noticing the page reloaded) |
| 77 | + or misaligns its actions (clicks the wrong element or stale link). |
| 78 | +
|
| 79 | +5. Task Understanding Errors |
| 80 | + The agent misreads or misunderstands the user's objective (goal interpretation), |
| 81 | + loses crucial context (context loss), or performs actions beyond or short of the intended scope. |
| 82 | +
|
| 83 | +6. Reasoning Failures |
| 84 | + The agent's logic is flawed (logical inference errors), behaves inconsistently across multiple steps, |
| 85 | + or fails to prioritize important subtasks when handling complex goals. |
| 86 | +
|
| 87 | +-------------------------------------------------------------------------------- |
| 88 | +INPUT DESCRIPTION |
| 89 | +-------------------------------------------------------------------------------- |
| 90 | +
|
| 91 | +You will receive the following for each scenario: |
| 92 | +1. User Goal |
| 93 | + - The original objective provided by the user (e.g., "Open a GitLab issue labeled 'help wanted'"). |
| 94 | +
|
| 95 | +2. Historical change summaries |
| 96 | + - A list of summaries of changes in the observation that the agent has seen during the course of actions. |
| 97 | +
|
| 98 | +3. Action History |
| 99 | + - A record of the agent's step-by-step actions in the web environment (clicks, form entries, navigations, etc.) |
| 100 | + along with immediate outcomes or errors. |
| 101 | +
|
| 102 | +Using these inputs, you must categorize the observed failure (or success) under the appropriate category or categories. |
| 103 | +
|
| 104 | +-------------------------------------------------------------------------------- |
| 105 | +FEW-SHOT CLASSIFICATION EXAMPLES |
| 106 | +-------------------------------------------------------------------------------- |
| 107 | +
|
| 108 | +1) EXAMPLE A (Interaction Execution) |
| 109 | + • Context: The agent repeatedly clicks "Show report" after entering dates in the wrong format. |
| 110 | + Each time, the site resets to default dates. The agent never notices and keeps doing the same thing. |
| 111 | + • Classification: ["Interaction Execution"] |
| 112 | + • Justification: The agent used an invalid input format ("Format Errors"), then repeated the failing action |
| 113 | + without adaptation ("Action Repetition"). |
| 114 | +
|
| 115 | +2) EXAMPLE B (Task Understanding) |
| 116 | + • Context: The user says, "In the repository myorg/myrepo, locate any issues labeled 'help wanted' |
| 117 | + that are older than 30 days and add a comment saying 'I can help fix this.'" |
| 118 | + The agent's planning notes mention searching for existing issues but quickly pivot to creating a brand-new issue |
| 119 | + with label 'help wanted,' ignoring the user's actual request to find and comment on old issues. |
| 120 | + • Classification: ["Task Understanding"] |
| 121 | + • Justification: The agent misunderstood the user's goal. Instead of searching for and commenting on existing issues, |
| 122 | + it focused on creating a new issue. This is a misinterpretation of the instructions, |
| 123 | + not a mechanical error in clicking or input format. |
| 124 | +
|
| 125 | +-------------------------------------------------------------------------------- |
| 126 | +CLASSIFICATION TASK |
| 127 | +-------------------------------------------------------------------------------- |
| 128 | +
|
| 129 | +1. Read through: |
| 130 | + - The user goal |
| 131 | + - The historical change summaries |
| 132 | + - The action history |
| 133 | + - The extra information, when present |
| 134 | +
|
| 135 | +2. If you judge the task unsuccessful, decide which category, or combination of categories, explains the failure. |
| 136 | + If the task is successful, leave the error category empty ([]). |
| 137 | +
|
| 138 | +3. Provide a brief explanation justifying your classification, referencing specific steps if helpful. |
| 139 | +
|
| 140 | +Output format example for an unsuccessful interaction: |
| 141 | +
|
| 142 | +<explanation>The agent opened the wrong GitLab page and never recovered...</explanation> |
| 143 | +<success>False</success> |
| 144 | +<errorCategory>["Navigation & Planning"]</errorCategory> |
| 145 | +
|
| 146 | +Output format example for a successful interaction: |
| 147 | +
|
| 148 | +<explanation>The agent opened the correct GitLab page and ...</explanation> |
| 149 | +<success>True</success> |
| 150 | +<errorCategory>[]</errorCategory> |
| 151 | + |
| 152 | +Please follow this structure at every step. Keep your responses concise and clear. |
| 153 | +
|
| 154 | +Below are the details for the interaction. Extra information provides additional context from the environment; it may be absent or irrelevant. |
| 155 | +
|
| 156 | +Overall goal: {goal} |
| 157 | +
|
| 158 | +Historical change summaries: {historical_summaries} |
| 159 | +
|
| 160 | +Action history: {action_history} |
| 161 | +
|
| 162 | +Extra information: {extra_info} |
| 163 | +""" |
| 164 | + |
| 165 | + |
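Note on usage: as far as this diff shows, the only braces in the new template are the four placeholders `{goal}`, `{historical_summaries}`, `{action_history}` and `{extra_info}`, so plain `str.format` should be enough to fill it. The sketch below is illustrative only. `PROMPT_TAIL` is a shortened stand-in for the real constant, and `build_prompt` with its numbered-list joining is a hypothetical helper, not something this PR defines.

```python
# Stand-in for the tail of ERROR_CLASSIFICATION_PROMPT_SUCCESS_OR_NOT.
# The real constant is much longer; only the placeholder names are taken from the diff.
PROMPT_TAIL = """
Overall goal: {goal}

Historical change summaries: {historical_summaries}

Action history: {action_history}

Extra information: {extra_info}
"""


def build_prompt(template, goal, historical_summaries, action_history, extra_info=""):
    """Fill the four placeholders (hypothetical helper).

    Lists and tuples are rendered as numbered lines; this layout is an
    assumption, since the diff does not show how callers format the inputs.
    """
    def as_text(value):
        if isinstance(value, (list, tuple)):
            return "\n".join(f"{i}. {item}" for i, item in enumerate(value, start=1))
        return str(value)

    return template.format(
        goal=goal,
        historical_summaries=as_text(historical_summaries),
        action_history=as_text(action_history),
        extra_info=as_text(extra_info),
    )


prompt = build_prompt(
    PROMPT_TAIL,
    goal="Open a GitLab issue labeled 'help wanted'",
    historical_summaries=["Issue list loaded", "Label filter applied"],
    action_history=["click('Issues')", "click('help wanted')"],
)
print(prompt)
```

If any input value may itself contain `{` or `}` (for example raw HTML or JSON in the extra information), that is safe here because `str.format` only interprets braces in the template, not in the substituted values.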
53 | 166 | ERROR_CLASSIFICATION_PROMPT = """ |
54 | 167 | You are an expert evaluator that classifies web agent failures according to a predefined taxonomy. |
55 | 168 | Below are the high-level definitions of each category, |
|
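Note on consuming the output: the new prompt asks for `<explanation>`, `<success>` and `<errorCategory>` tags, with the category given as a JSON-style list. A minimal parsing sketch follows, assuming the model keeps to that format. `parse_classification` and `VALID_CATEGORIES` are hypothetical names, and the short category labels are inferred from the prompt's own examples (taxonomy names with the trailing "Errors" dropped); the label for the sixth category in particular is a guess.

```python
import json
import re

# Short labels as used in the prompt's examples. "Reasoning Failures" is an
# assumption: the prompt never shows a short label for that category.
VALID_CATEGORIES = {
    "Navigation & Planning",
    "Interaction Execution",
    "Information Processing",
    "Observation & Action",
    "Task Understanding",
    "Reasoning Failures",
}


def _tag(name, text):
    """Return the stripped contents of the first <name>...</name> pair, or None."""
    match = re.search(rf"<{name}>(.*?)</{name}>", text, re.DOTALL)
    return match.group(1).strip() if match else None


def parse_classification(response):
    """Parse the evaluator's reply into a dict (hypothetical helper)."""
    success_text = _tag("success", response)
    if success_text is None:
        raise ValueError("missing <success> tag")

    # A missing or blank <errorCategory> is treated as an empty list.
    raw_categories = _tag("errorCategory", response) or "[]"
    categories = json.loads(raw_categories)

    return {
        "explanation": _tag("explanation", response) or "",
        "success": success_text.lower() == "true",
        "categories": categories,
        # Labels outside the taxonomy are reported rather than silently dropped.
        "unknown_categories": [c for c in categories if c not in VALID_CATEGORIES],
    }


sample = (
    "<explanation>The agent opened the wrong GitLab page and never recovered...</explanation>\n"
    "<success>False</success>\n"
    '<errorCategory>["Navigation & Planning"]</errorCategory>'
)
result = parse_classification(sample)
print(result["success"], result["categories"])
```

`json.loads` raises on single-quoted lists such as `['Task Understanding']`, so a caller may want to catch `json.JSONDecodeError` and retry or log the raw reply.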