     prompt += "In order to meet the user\'s requirements, you need to select one of the following operations to operate on the current screen:\n"
-    prompt += "Note that to open an app, use the Open App action, rather than tapping the app's icon. "
-    prompt += "For certain items that require selection, such as font and font size, direct input is more efficient than scrolling through choices."
+
+    # Provide different coordinate instructions depending on whether perception info is available
+    if clickable_infos and len(clickable_infos) > 0:
+        prompt += "Note: The coordinates in the ### Screenshot information ### section are in pixel format [x, y]. When you output Tap actions, use the same pixel coordinates.\n"
+    else:
+        prompt += "Note: Since no extracted information is provided, you need to directly analyze the screenshot and output normalized coordinates.\n"
+        prompt += "For Tap actions, use normalized coordinates where x and y are in the range [0, 999], with (0, 0) at the top-left corner and (999, 999) at the bottom-right corner.\n"
+
+    prompt += "For certain items that require selection, such as font and font size, direct input is more efficient than scrolling through choices.\n"
     prompt += "You must choose one of the actions below:\n"
-    prompt += "Open App (app name): If you want to open an app, you should use this action to open the app named 'app name'."
+    # prompt += "Open App (app name): If you want to open an app, you should use this action to open the app named 'app name'."
     prompt += "Right Tap (x, y): Right tap the position (x, y) in current page. This can be used to create a new file.\n"
     prompt += "Tap (x, y): Tap the position (x, y) in current page. This can be used to select an item.\n"
     prompt += "Double Tap (x, y): Double tap the position (x, y) in the current page. This can be used to open a file. If Tap (x, y) in the last step doesn't work, you can try double tap the position (x, y) in the current page.\n"
     prompt += "Append (x, y), (text): Append the \"text\" content after the content at (x, y) location. This action is useful when you want to append new content into a word document.\n"

     prompt += "Tell (answer): Tell me the answer of the input query.\n"
-    prompt += "Stop: If all the operations to meet the user\'s requirements have been completed in ### History operation ###, use this operation to stop the whole process."
+    prompt += "Stop: ONLY use this action when you can VERIFY from the CURRENT SCREENSHOT that ALL requirements in the user's instruction have been ACTUALLY COMPLETED. Do NOT stop just because you performed some operations - you must verify the final result is achieved on the screen."
     prompt += "\n\n"

     prompt += "### Output format ###\n"
     # modified 2.10
     prompt += "You should output in the following json format:"
+    # prompt += '''
+    # {"Thought": "This is your thinking about how to proceed the next operation, please output the thoughts about the history operations explicitly.", "Action": "Open App () or Tap () or Double Tap () or Triple Tap () or Shortcut () or Press() or Type () or Tell () or Stop. Only one action can be output at one time.", "Summary": "This is a one sentence summary of this operation."}
+    # '''
     prompt += '''
-{"Thought": "This is your thinking about how to proceed the next operation, please output the thoughts about the history operations explicitly.", "Action": "Open App () or Tap () or Double Tap () or Triple Tap () or Shortcut () or Press() or Type () or Tell () or Stop. Only one action can be output at one time.", "Summary": "This is a one sentence summary of this operation."}
-'''
+{"Thought": "This is your thinking about how to proceed the next operation, please output the thoughts about the history operations explicitly.", "Action": "Tap () or Double Tap () or Triple Tap () or Shortcut () or Press() or Type () or Tell () or Stop. Only one action can be output at one time.", "Summary": "This is a one sentence summary of this operation."}
+'''
+    prompt += "The output must contain the following fields: Thought (your reasoning about the next operation), Action (the specific action to take), and Summary (a one-sentence summary of the operation)."
     prompt += "\n\n"


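The updated action prompt asks the model for a JSON object with `Thought`, `Action`, and `Summary` fields and, when no `clickable_infos` are available (the `else` branch), for Tap coordinates normalized to [0, 999]. The sketch below shows one way a caller might consume such a reply. The helper names `parse_action_response` and `to_pixels`, the regex for `Tap (x, y)`, and the linear rounding used to map [0, 999] onto the screen are all assumptions for illustration; the diff does not show how the agent actually parses or rescales the output.

```python
import json
import re

# Hypothetical pattern for a single-tap action such as "Tap (512, 300)".
# Anchored at the start so "Double Tap", "Right Tap" and "Triple Tap" do not match.
TAP_PATTERN = re.compile(r"^Tap \((\d+),\s*(\d+)\)$")


def to_pixels(nx: int, ny: int, width: int, height: int) -> tuple[int, int]:
    """Map normalized [0, 999] coordinates to pixels (assumed linear mapping).

    (0, 0) maps to the top-left pixel and (999, 999) to the bottom-right
    pixel, matching the corner convention stated in the prompt.
    """
    if not (0 <= nx <= 999 and 0 <= ny <= 999):
        raise ValueError(f"normalized coordinates out of range: {(nx, ny)}")
    px = round(nx * (width - 1) / 999)
    py = round(ny * (height - 1) / 999)
    return px, py


def parse_action_response(raw: str, width: int, height: int, normalized: bool) -> dict:
    """Parse the model's JSON reply and resolve a Tap action to pixel coordinates."""
    data = json.loads(raw)
    for field in ("Thought", "Action", "Summary"):
        if field not in data:
            raise KeyError(f"missing required field: {field}")
    action = data["Action"].strip()
    result = {"action": action, "summary": data["Summary"], "tap": None}
    match = TAP_PATTERN.match(action)
    if match:
        x, y = int(match.group(1)), int(match.group(2))
        # With perception info the prompt asks for pixel coordinates, used as-is.
        result["tap"] = to_pixels(x, y, width, height) if normalized else (x, y)
    return result
```

For example, on a hypothetical 1080x2400 screen without perception info, `parse_action_response('{"Thought": "...", "Action": "Tap (500, 250)", "Summary": "..."}', 1080, 2400, normalized=True)["tap"]` yields `(540, 600)`. Non-Tap actions such as `Stop` leave `tap` as `None`, so the caller can dispatch on `action` separately.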
@@ -243,19 +256,22 @@ def get_reflect_prompt(instruction, clickable_infos1, clickable_infos2, width, h
     prompt += "The format of the coordinates is [x, y], x is the pixel from left to right and y is the pixel from top to bottom; the content is a text or an icon description respectively "
     prompt += f"The user\'s instruction is: {instruction}."
@@ -271,16 +287,19 @@ def get_reflect_prompt(instruction, clickable_infos1, clickable_infos2, width, h
         prompt += "Now you need to output the following content based on the screenshots information before and after the current operation:\n"
     else:
         prompt += "Now you need to output the following content based on the screenshots before and after the current operation:\n"
-    prompt += "Whether the result of the \"Operation action\" meets your expectation of \"Operation thought\"?\n"
-    prompt += "A: The result of the \"Operation action\" meets my expectation of \"Operation thought\".\n"
+    prompt += "1. Whether the result of the \"Operation action\" meets your expectation of \"Operation thought\"?\n"
+    prompt += "2. IMPORTANT: By carefully examining the screenshot after the operation, verify if the actual goal described in the user's instruction is achieved.\n"
+    prompt += "Choose one of the following:\n"
+    prompt += "A: The result of the \"Operation action\" meets my expectation of \"Operation thought\" AND the actual goal in the instruction is achieved based on the current screenshot.\n"
     prompt += "B: The \"Operation action\" results in a wrong page and I need to do something to correct this.\n"
-    prompt += "C: The \"Operation action\" produces no changes."
+    prompt += "C: The \"Operation action\" produces no changes.\n"
+    prompt += "D: The \"Operation action\" seems to complete, but the actual goal in the instruction is NOT achieved based on the current screenshot (e.g., clicked wrong position, wrong item selected)."
     prompt += "\n\n"

     prompt += "### Output format ###\n"
     prompt += "Your output format is:\n"
-    prompt += "### Thought ###\nYour thought about the question\n"
-    prompt += "### Answer ###\nA or B or C"
+    prompt += "### Thought ###\nYour thought about the question. Please explicitly verify if the goal in the instruction is achieved by checking the screenshot.\n"
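The reflection prompt now offers four outcomes, adding D for an operation that appears to complete without achieving the instruction's goal, so whatever reads the reply must accept a fourth letter. The sketch below assumes the reply still ends with a `### Answer ###` section followed by a single letter (the old `### Answer ###\nA or B or C` line is removed in this diff, but its replacement is not visible here). The `Reflection` enum and the `parse_reflection` helper are hypothetical names, not part of the code above.

```python
import re
from enum import Enum


class Reflection(Enum):
    """Outcome labels, paraphrasing options A-D in the reflection prompt."""
    A = "expected result and goal achieved"
    B = "wrong page, needs correction"
    C = "no changes"
    D = "appears complete, but goal not achieved"


# Assumed reply layout: "### Thought ###\n...\n### Answer ###\n<letter>".
ANSWER_PATTERN = re.compile(r"###\s*Answer\s*###\s*([ABCD])\b")


def parse_reflection(reply: str) -> tuple[str, Reflection]:
    """Return (thought, outcome) extracted from a reflection reply."""
    match = ANSWER_PATTERN.search(reply)
    if match is None:
        raise ValueError("no A/B/C/D answer found in reply")
    # Everything before the answer header, minus the thought header, is the thought.
    thought = reply[:match.start()].split("### Thought ###", 1)[-1].strip()
    return thought, Reflection[match.group(1)]
```

With a reply such as `"### Thought ###\nThe song list opened, but 稻香 is not playing.\n### Answer ###\nD"`, the helper returns the thought text and `Reflection.D`. Treating D separately from A is the point of the change: it lets the caller avoid recording a step as completed when the screenshot does not confirm the goal.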
-    prompt += "Now you need to update the \"Completed contents\". Completed contents is a general summary of the current contents that have been completed based on the ### History operations ###.\n\n"
+    prompt += "Now you need to update the \"Completed contents\" by comparing the user's instruction with the current screenshot.\n"
+    prompt += "IMPORTANT: You must verify if the actual goal is achieved by checking the current screenshot information, not just assuming based on operation history.\n"
+    prompt += "For example, if the instruction is to 'play 稻香', you need to verify if 稻香 is actually playing on the screen, not just because you clicked something.\n\n"

     prompt += "### Output format ###\n"
     prompt += "Your output format is:\n"
-    prompt += "### Completed contents ###\nUpdated Completed contents. Don\'t output the purpose of any operation. Just summarize the contents that have been actually completed in the ### History operations ###."
+    prompt += "### Completed contents ###\nUpdated Completed contents. Don\'t output the purpose of any operation. Just summarize the contents that have been actually completed AND VERIFIED on the current screenshot."
     # prompt += "A reflection model was adopted to analyze whether the last step's operation meets the expectation, you should combine its reflection thought to produce the \"Completed contents\"."
-    # prompt += "Below is its reflection thought:\n"
-    # prompt += reflection_thought + "\n"

     prompt += "### Response requirements ###\n"
     prompt += "Now you need to combine all of the above to generate the \"Completed contents\".\n"
+    prompt += "IMPORTANT: You must verify if the actual goal is achieved by checking the current screenshot information, not just assuming based on operation.\n"
     prompt += "Completed contents is a general summary of the current contents that have been completed. You need to first focus on the requirements of user\'s instruction, and then summarize the contents that have been completed.\n\n"

     prompt += "### Output format ###\n"
     prompt += "Your output format is:\n"
-    prompt += "### Completed contents ###\nGenerated Completed contents. Don\'t output the purpose of any operation. Just summarize the contents that have been actually completed in the ### Current operation ###.\n"
+    prompt += "### Completed contents ###\nGenerated Completed contents. Don\'t output the purpose of any operation. Just summarize the contents that have been actually completed AND VERIFIED on the current screenshot.\n"