Commit 5d92fee

Merge pull request #15 from njucckevin/pr-fix-env-qwen3vl

Add AndroidWorld env module and AndroidWorld Qwen3-VL evaluation

2 parents 229ea8d + ba8106a, commit 5d92fee

25 files changed: +5876 −7 lines

.gitignore

Lines changed: 4 additions & 1 deletion

```diff
@@ -217,4 +217,7 @@ playground/.env
 
 !launcher/**
 multi_app_tasks_backup
-multi_app_tasks
+multi_app_tasks
+
+!evaluation/AndroidWorld/android_world/env/
+!evaluation/AndroidWorld/android_world/env/**/*
```

evaluation/AndroidWorld/README.md

Lines changed: 16 additions & 0 deletions

````diff
@@ -24,3 +24,19 @@ Please refer to [AndroidWorld README](docs/README_AndroidWorld.md)
 * **\[CONSOLE\_PORT]** is the port for the agent's console
 * **\[CHECKPOINT\_DIR]** is the path to the directory containing your model checkpoints
 * **\[GRPC\_PORT]** is the port for the gRPC service
+
+## Qwen3-VL Model Evaluation
+
+We have adapted the prompts and action space of the Qwen3-VL series models to reproduce their evaluation results.
+
+1. **Launch the Android emulator first (example):**
+```bash
+emulator -avd AndroidWorldAVD -no-snapshot -grpc 8554
+```
+
+2. **After deploying your model API with `vLLM` (refer to [model development](../README.md#-model-development)), configure the `model_url` and `model_name`, e.g., `http://<ip>:8000/v1` and `Qwen3-VL-8B-Instruct`.**
+
+3. **Run the evaluation using the following script (example):**
+```
+python run.py --agent_name qwen3vl --console_port 5554 --grpc_port 8554 --perform_emulator_setup=true --qwen3vl_model_base_url model_url --qwen3vl_model_name model_name --qwen3vl_model_api_key EMPTY --checkpoint_dir runs/qwen3vl_8b_instruct
+```
````
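For scripting batch runs, the flags from step 3 can be assembled programmatically. A minimal sketch; `build_eval_cmd` is a hypothetical helper, and the endpoint, model name, and checkpoint path are illustrative placeholders, not values mandated by the repo:

```python
# Hypothetical helper: assemble the run.py invocation for the qwen3vl agent.
# model_url / model_name are placeholders you must replace, as in the README.
def build_eval_cmd(model_url: str, model_name: str,
                   console_port: int = 5554, grpc_port: int = 8554,
                   checkpoint_dir: str = "runs/qwen3vl_8b_instruct") -> list[str]:
    return [
        "python", "run.py",
        "--agent_name", "qwen3vl",
        "--console_port", str(console_port),
        "--grpc_port", str(grpc_port),
        "--perform_emulator_setup=true",
        "--qwen3vl_model_base_url", model_url,
        "--qwen3vl_model_name", model_name,
        "--qwen3vl_model_api_key", "EMPTY",
        "--checkpoint_dir", checkpoint_dir,
    ]

cmd = build_eval_cmd("http://127.0.0.1:8000/v1", "Qwen3-VL-8B-Instruct")
```

Passing the list to `subprocess.run(cmd)` avoids shell-quoting issues with the long flag string.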

evaluation/AndroidWorld/android_world/agents/PROMPT.py

Lines changed: 8 additions & 0 deletions

```diff
@@ -79,6 +79,14 @@
 """
 )
 
+# =========================
+# Qwen3VL tool-call prompts
+# =========================
+
+QWEN3VL_SYSTEM_PROMPT = "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>\n{\"type\": \"function\", \"function\": {\"name\": \"mobile_use\", \"description\": \"Use a touchscreen to interact with a mobile device, and take screenshots.\\n* This is an interface to a mobile device with touchscreen. You can perform actions like clicking, typing, swiping, etc.\\n* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions.\\n* The screen's resolution is 999x999.\\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don't click boxes on their edges unless asked.\", \"parameters\": {\"properties\": {\"action\": {\"description\": \"The action to perform. The available actions are:\\n* `click`: Click the point on the screen with coordinate (x, y).\\n* `long_press`: Press the point on the screen with coordinate (x, y) for specified seconds.\\n* `swipe`: Swipe from the starting point with coordinate (x, y) to the end point with coordinates2 (x2, y2).\\n* `type`: Input the specified text into the activated input box.\\n* `answer`: Output the answer.\\n* `system_button`: Press the system button.\\n* `wait`: Wait specified seconds for the change to happen.\\n* `terminate`: Terminate the current task and report its completion status.\", \"enum\": [\"click\", \"long_press\", \"swipe\", \"type\", \"answer\", \"system_button\", \"wait\", \"terminate\"], \"type\": \"string\"}, \"coordinate\": {\"description\": \"(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=click`, `action=long_press`, and `action=swipe`.\", \"type\": \"array\"}, \"coordinate2\": {\"description\": \"(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=swipe`.\", \"type\": \"array\"}, \"text\": {\"description\": \"Required only by `action=type` and `action=answer`.\", \"type\": \"string\"}, \"time\": {\"description\": \"The seconds to wait. Required only by `action=long_press` and `action=wait`.\", \"type\": \"number\"}, \"button\": {\"description\": \"Back means returning to the previous interface, Home means returning to the desktop, Menu means opening the application background menu, and Enter means pressing the enter. Required only by `action=system_button`\", \"enum\": [\"Back\", \"Home\", \"Menu\", \"Enter\"], \"type\": \"string\"}, \"status\": {\"description\": \"The status of the task. Required only by `action=terminate`.\", \"type\": \"string\", \"enum\": [\"success\", \"failure\"]}}, \"required\": [\"action\"], \"type\": \"object\"}}}\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call>\n\n# Response format\n\nResponse format for every step:\n1) Thought: one concise sentence explaining the next move (no multi-step reasoning).\n2) Action: a short imperative describing what to do in the UI.\n3) A single <tool_call>...</tool_call> block containing only the JSON: {\"name\": <function-name>, \"arguments\": <args-json-object>}.\n\nRules:\n- Output exactly in the order: Thought, Action, <tool_call>.\n- Be brief: one sentence for Thought, one for Action.\n- Do not output anything else outside those three parts.\n- If finishing, use action=terminate in the tool call."
+
+QWEN3VL_USER_PROMPT = "The user query: {instruction}.\nTask progress (You have done the following operation on the current device): {history}.\n"
+
 
 SUMMARY_PROMPT_TEMPLATE = (
     PROMPT_PREFIX
```
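The system prompt above enforces a three-part response (Thought, Action, one `<tool_call>` block). The agent code in this commit pulls the pieces apart with two regexes; a self-contained sketch using those same patterns against a hypothetical model response:

```python
import json
import re

# A response shaped the way QWEN3VL_SYSTEM_PROMPT asks for (hypothetical example).
response = (
    'Thought: The Settings icon is near the top of the screen.\n'
    'Action: Tap the Settings icon.\n'
    '<tool_call>\n'
    '{"name": "mobile_use", "arguments": {"action": "click", "coordinate": [512, 130]}}\n'
    '</tool_call>'
)

# Same pattern the agent uses to pull the JSON payload out of the XML tags.
payload = re.search(r"<tool_call>\s*([\s\S]*?)\s*</tool_call>", response).group(1)
tool_call = json.loads(payload)

# The 'Action:' line is only recorded into the step history, not executed.
action_line = re.search(
    r"Action:\s*(.+?)(?:\n<tool_call>|$)", response, flags=re.S
).group(1).strip()
```

Note the coordinates are in the 0–999 space declared in the prompt ("The screen's resolution is 999x999"), not device pixels.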

evaluation/AndroidWorld/android_world/agents/seeact_v.py

Lines changed: 189 additions & 1 deletion

```diff
@@ -18,6 +18,7 @@
 import time
 
 import ast
+import json
 import numpy as np
 from PIL import Image
 from openai import OpenAI
@@ -40,7 +41,13 @@
 from android_world.env import interface
 from android_world.env import json_action
 from android_world.env import representation_utils
-from transformers.models.qwen2_vl.image_processing_qwen2_vl_fast import smart_resize
+
+try:
+    from transformers.models.qwen2_vl.image_processing_qwen2_vl_fast import (  # type: ignore
+        smart_resize,
+    )
+except Exception:  # pragma: no cover
+    smart_resize = None  # type: ignore[assignment]
 
 # Utils for Visual Grounding
 
@@ -932,3 +939,184 @@ def _to_base64_png(image: np.ndarray) -> str:
     buf = BytesIO()
     PILImage.fromarray(image).save(buf, format="PNG")
     return f"data:image/png;base64,{base64.b64encode(buf.getvalue()).decode()}"
+
+
+def _extract_action_text_qwen3vl(block: str) -> str:
+    """Extracts the 'Action:' line from Qwen3VL text output for step history (does not affect execution)."""
+    m = re.search(r"Action:\s*(.+?)(?:\n<tool_call>|$)", block, flags=re.S)
+    if not m:
+        return ""
+    text = m.group(1).strip()
+    # Some models wrap Action: "..." with quotes.
+    if text.startswith('"') and text.endswith('"'):
+        text = text[1:-1]
+    return text.replace("\n", " ")
+
+
+def _parse_tool_call_json(block: str) -> dict[str, Any] | None:
+    """Parse JSON inside <tool_call>...</tool_call>."""
+    m = re.search(r"<tool_call>\s*([\s\S]*?)\s*</tool_call>", block)
+    if not m:
+        return None
+    payload = m.group(1).strip()
+    try:
+        return json.loads(payload)
+    except Exception:
+        return None
+
+
+class Qwen3VL(base_agent.EnvironmentInteractingAgent):
+    """Android GUI Agent based on Qwen3VL tool-call output (for AndroidWorld eval).
+
+    - Input: Screenshot + instruction + history
+    - Output: <tool_call>{...}</tool_call>
+    - Execution: Map to JSONAction by qwen3vl_action_transform(...)
+    """
+
+    def __init__(
+        self,
+        env: interface.AsyncEnv,
+        llm: infer.MultimodalLlmWrapper,
+        name: str = "Qwen3VL",
+        wait_after_action_seconds: float = 2.0,
+        model_base_url: str = "http://127.0.0.1:8000/v1",
+        model_api_key: str = "EMPTY",
+        model_name: str = "",
+        extra_headers: dict[str, str] | None = None,
+    ):
+        super().__init__(env, name)
+        self.llm = llm
+        self.wait_after_action_seconds = wait_after_action_seconds
+        self.model_name = model_name
+        self.client = OpenAI(
+            api_key=model_api_key,
+            base_url=model_base_url,
+            default_headers=extra_headers,
+        )
+        self.step_his: str = ""
+        self.turn_number: int = 0
+        # Used to detect repeated actions (avoid infinite loops)
+        self.last_action: str | None = None
+        self.repeat_time: int = 0
+
+    def reset(self, go_home_on_reset: bool = False):
+        super().reset(go_home_on_reset)
+        self.env.hide_automation_ui()
+        self.step_his = ""
+        self.turn_number = 0
+        self.last_action = None
+        self.repeat_time = 0
+
+    @staticmethod
+    def _to_base64_png(image: np.ndarray) -> str:
+        import base64
+        from io import BytesIO
+        from PIL import Image as PILImage
+        buf = BytesIO()
+        PILImage.fromarray(image).save(buf, format='PNG')
+        return f"data:image/png;base64,{base64.b64encode(buf.getvalue()).decode()}"
+
+    def step(self, instruction: str) -> base_agent.AgentInteractionResult:
+        self.turn_number += 1
+
+        state = self.get_post_transition_state()
+        screenshot = state.pixels.copy()
+        # To be consistent with other agents in this file: BGR->RGB (for saving/encoding)
+        screenshot = screenshot[:, :, ::-1]
+        height, width = screenshot.shape[:2]
+
+        system_prompt = QWEN3VL_SYSTEM_PROMPT
+        user_prompt = QWEN3VL_USER_PROMPT.format(
+            instruction=instruction, history=self.step_his
+        )
+
+        messages = [
+            {
+                "role": "system",
+                "content": [{"type": "text", "text": system_prompt}],
+            },
+            {
+                "role": "user",
+                "content": [
+                    {"type": "text", "text": user_prompt},
+                    {"type": "image_url", "image_url": {"url": self._to_base64_png(screenshot)}},
+                ],
+            },
+        ]
+
+        completion = self.client.chat.completions.create(
+            model=self.model_name,
+            messages=messages,
+            temperature=0,
+        )
+        response = completion.choices[0].message.content or ""
+        print(response)
+        print("=" * 50)
+
+        tool_call = _parse_tool_call_json(response)
+        if not tool_call:
+            return base_agent.AgentInteractionResult(
+                True, {"summary": "No <tool_call> JSON found in model output.", "response": response}
+            )
+
+        op_text = _extract_action_text_qwen3vl(response)
+        if op_text:
+            self.step_his += f"Step {self.turn_number}: {op_text}; "
+
+        # Compatible: tool_call may look like {"name":"mobile_use","arguments":{...}}
+        args = tool_call.get("arguments", {}) if isinstance(tool_call, dict) else {}
+        action_name = args.get("action", "")
+        try:
+            parsed = qwen3vl_action_transform(action_name, args, width, height)
+            print(parsed)
+        except Exception as e:
+            return base_agent.AgentInteractionResult(
+                True,
+                {
+                    "summary": f"Failed to transform tool-call into action: {e}",
+                    "response": response,
+                    "tool_call": tool_call,
+                },
+            )
+
+        # Record last_action + repeat_time (previous code had these fields but not working)
+        # Here, use the tool-call's arguments as the "action signature", which is more robust than checking 'terminate' in a string.
+        try:
+            action_sig = json.dumps(args, ensure_ascii=False, sort_keys=True)
+        except Exception:
+            action_sig = str(args)
+        if self.last_action == action_sig:
+            self.repeat_time += 1
+        else:
+            self.repeat_time = 0
+        self.last_action = action_sig
+
+        try:
+            act = json_action.JSONAction(**parsed)
+            self.env.execute_action(act)
+            time.sleep(self.wait_after_action_seconds)
+        except Exception:
+            # continue
+            print("Failed to execute action:", parsed)
+
+        if parsed.get("action_type") == "status":
+            return base_agent.AgentInteractionResult(
+                True, {"response": response, "step_history": self.step_his, "parsed": parsed}
+            )
+
+        # If repeated actions reach the threshold: terminate immediately to avoid deadlock in evaluation
+        if self.repeat_time >= 3:
+            return base_agent.AgentInteractionResult(
+                True,
+                {
+                    "summary": "Terminated due to repeated identical actions.",
+                    "response": response,
+                    "step_history": self.step_his,
+                    "parsed": parsed,
+                    "repeat_time": self.repeat_time,
+                },
+            )
+
+        return base_agent.AgentInteractionResult(
+            False, {"response": response, "step_history": self.step_his, "parsed": parsed}
+        )
```
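The loop detection added in `step` keys on a canonical JSON dump of the tool-call arguments (`json.dumps(..., sort_keys=True)`) rather than on the raw response text. A small standalone sketch of why sorted keys matter when comparing consecutive actions; the two dicts are illustrative, not taken from the repo:

```python
import json

# Two tool-call argument dicts that are semantically identical but carry
# their keys in different orders (as separate decoding runs may produce).
a = {"action": "click", "coordinate": [500, 300]}
b = {"coordinate": [500, 300], "action": "click"}

# Canonical signature, as in the agent: key order no longer matters.
sig_a = json.dumps(a, ensure_ascii=False, sort_keys=True)
sig_b = json.dumps(b, ensure_ascii=False, sort_keys=True)

# Without sort_keys, Python's insertion-ordered dicts serialize differently,
# so the repeat counter would miss the repetition.
naive_a, naive_b = json.dumps(a), json.dumps(b)
```

With the canonical form, three identical signatures in a row trip the `repeat_time >= 3` early-termination guard.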

evaluation/AndroidWorld/android_world/agents/utils.py

Lines changed: 61 additions & 1 deletion

```diff
@@ -7,7 +7,13 @@
 from android_world.env import interface
 from android_world.env import json_action
 from android_world.agents import base_agent
-from transformers.models.qwen2_vl.image_processing_qwen2_vl_fast import smart_resize
+
+try:
+    from transformers.models.qwen2_vl.image_processing_qwen2_vl_fast import (  # type: ignore
+        smart_resize,
+    )
+except Exception:  # pragma: no cover
+    smart_resize = None  # type: ignore[assignment]
 
 
 def _extract_xy(s: str) -> Tuple[float, float] | None:
@@ -156,6 +162,60 @@ def action_transform(action: str, width: int, height: int) -> Dict[str, Any] | N
     return None
 
 
+def qwen3vl_action_transform(action, arguments, width, height) -> Dict[str, Any]:
+    if action == "key":
+        return {"action_type": "wait"}
+    elif action == "click" or action == "left_click":
+        coordinate = arguments.get("coordinate", [0, 0])
+        x, y = coordinate
+        x = x / 1000 * width
+        y = y / 1000 * height
+        return {"action_type": "click", "x": x, "y": y}
+    elif action == "long_press":
+        coordinate = arguments.get("coordinate", [0, 0])
+        x, y = coordinate
+        x = x / 1000 * width
+        y = y / 1000 * height
+        return {"action_type": "long_press", "x": x, "y": y}
+    elif action == "swipe":
+        coordinate = arguments.get("coordinate", [0, 0])
+        coordinate2 = arguments.get("coordinate2", [0, 0])
+        x0, y0 = coordinate[0]/1000 * width, coordinate[1]/1000 * height
+        x1, y1 = coordinate2[0]/1000 * width, coordinate2[1]/1000 * height
+        dir_ = _dir_from_coords(x0, y0, x1, y1)
+        return {"action_type": "scroll", "direction": reverse_direction(dir_)}
+    elif action == "type":
+        text = arguments.get("text", "")
+        return {"action_type": "input_text", "text": text}
+    elif action == "system_button":
+        button = arguments.get("button", "").lower()
+        if button == "home":
+            return {"action_type": "navigate_home"}
+        elif button == "back":
+            return {"action_type": "navigate_back"}
+        else:
+            raise ValueError(f"Unknown system button: {button}")
+    elif action == "open":
+        text = arguments.get("text", "")
+        return {"action_type": "open_app", "app_name": text}
+    elif action == "wait":
+        return {"action_type": "wait"}
+    elif action == "answer":
+        return {"action_type": "answer", "text": arguments.get("text", "")}
+    elif action == "terminate":
+        status = arguments.get("status", "").lower()
+        if status == "success":
+            return {"action_type": "status", "goal_status": "complete"}
+        elif status == "failure":
+            return {"action_type": "status", "goal_status": "infeasible"}
+        else:
+            raise ValueError(f"Unknown terminate status: {status}")
+    # else:
+    #     raise ValueError(f"Unknown action: {action}")
+    else:
+        return {'action_type': 'wait'}
+
+
 def action_coord(action):
     def extract_click_json(s):
         m = re.search(
```
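`qwen3vl_action_transform` rescales every coordinate from the 0–999 space the system prompt declares ("The screen's resolution is 999x999") to actual device pixels via `x / 1000 * width`, `y / 1000 * height`. A standalone sketch of just that mapping (the helper name and device size are illustrative):

```python
def scale_to_pixels(coordinate, width, height):
    """Map a model coordinate from the 0-999 prompt space to device pixels,
    using the same x/1000*width, y/1000*height scaling as the transform."""
    x, y = coordinate
    return x / 1000 * width, y / 1000 * height

# e.g. on a 1080x2400 device, the model's (500, 500) lands mid-screen.
x, y = scale_to_pixels([500, 500], 1080, 2400)
```

Because the divisor is 1000 while the prompt advertises a 999x999 grid, a maximal coordinate of 999 maps to 99.9% of the screen edge, which keeps taps safely inside the display bounds.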

0 commit comments