* feat(core): enable thinking for vqa
* fix(core): ci
* feat(core): show thought in report
* docs(core): update docs for qwen3
* docs(core): update docs for qwen3
* docs(core): update docs for qwen3

In addition to understanding text and image input, VL (Visual-Language) models can locate the coordinates of target elements on the page.

We recommend VL models for UI automation because they can natively "see" screenshots and return the coordinates of target elements, which is more reliable and efficient in complex scenarios.

VL models can be used for UI automation across any kind of interface.

These VL models are already adapted for Midscene.js:

* [Qwen VL](#qwen3-vl-or-qwen-25-vl)
* [Doubao visual-language models](#doubao-vision)
* [`Gemini-2.5-Pro`](#gemini-25-pro)
* [`UI-TARS`](#ui-tars)

If you want to learn the detailed configuration for each model provider, see [Config Model and Provider](./model-provider).
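
All of the model configs on this page are plain environment variables. As a minimal sketch (placeholder values; the exact variables you need are listed in the per-model sections below), you can export them in your shell before running your automation:

```bash
# Placeholder values; pick the variables from the relevant model section below.
export OPENAI_API_KEY="......"
export OPENAI_BASE_URL="https://custom-endpoint.com/compatible-mode/v1"
export MIDSCENE_MODEL_NAME="gpt-4o"
```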

### LLM models

Models that can understand text and image input. GPT-4o is this kind of model.

LLM models can only be used in web automation.

* [`GPT-4o`](#gpt-4o)
* [Other LLM models](#other-llm-models)

## Models in depth

<div id="qwen3-vl-or-qwen-25-vl"></div>

### Qwen VL (✨ recommended)

Qwen-VL is an open-source model series released by Alibaba. It offers visual grounding and can accurately return the coordinates of target elements on a page. The models show strong performance for interaction, assertion, and querying tasks. Deployed versions are available on [Alibaba Cloud](https://help.aliyun.com/zh/model-studio/vision) and [OpenRouter](https://openrouter.ai/qwen).

Midscene.js supports the following versions:

* Qwen2.5-VL series (higher parameter counts provide better quality)
* Qwen3-VL series, including `qwen3-vl-plus` (commercial) and `qwen3-vl-235b-a22b-instruct` (open source)

We recommend the Qwen3-VL series, which clearly outperforms Qwen2.5-VL. Qwen3-VL requires Midscene v0.29.3 or later.

**Config for Qwen3-VL**

Using the Alibaba Cloud `qwen3-vl-plus` model as an example:
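
A minimal sketch of the corresponding environment variables; the endpoint URL below is an assumption (the Alibaba Cloud OpenAI-compatible endpoint), so adjust it and the API key to your account:

```bash
OPENAI_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1" # assumed Alibaba Cloud OpenAI-compatible endpoint
OPENAI_API_KEY="......"
MIDSCENE_MODEL_NAME="qwen3-vl-plus"
MIDSCENE_USE_QWEN3_VL=1 # Note: cannot be set together with MIDSCENE_USE_QWEN_VL
```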

For Qwen2.5-VL, set the `MIDSCENE_USE_QWEN_VL` flag instead:

```bash
MIDSCENE_USE_QWEN_VL=1 # Note: cannot be set together with MIDSCENE_USE_QWEN3_VL
```

**Limitations when used in Midscene.js**

- **Not good at recognizing small icons**: When recognizing small icons, you may need to [enable the `deepThink` parameter](./blog-introducing-instant-actions-and-deep-think) and optimize the description, otherwise the recognition results may not be accurate.
- **Unstable assertion performance**: We observed that it may not perform as well as `GPT-4o` or `Doubao-1.5-thinking-vision-pro` in assertion.

**Note about model deployment on Alibaba Cloud**

While the open-source version of Qwen-2.5-VL (72B) is named `qwen2.5-vl-72b-instruct`, there is also an enhanced and more stable version named `qwen-vl-max-latest` officially hosted on Alibaba Cloud. When using `qwen-vl-max-latest` on Alibaba Cloud, you get larger context support and a much lower price (possibly only 19% of the price of the open-source version).

### Doubao series visual language models (Volcano Engine, ✨ recommended)

Volcano Engine provides multiple visual language models, including:

* `Doubao-1.5-thinking-vision-pro`
* `Doubao-seed-1.6-vision`

They perform quite well in visual grounding and assertion in complex scenarios. With clear instructions, they can meet most business scenario requirements and are currently the most recommended visual language models for Midscene.

**Config**

After obtaining an API key from [Volcano Engine](https://volcengine.com), you can use the following configuration:
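
A minimal sketch; the endpoint URL and the `MIDSCENE_USE_DOUBAO_VISION` flag name are assumptions, and the endpoint ID is a placeholder:

```bash
OPENAI_BASE_URL="https://ark.cn-beijing.volces.com/api/v3" # assumed Volcano Engine (Ark) OpenAI-compatible endpoint
OPENAI_API_KEY="......"
MIDSCENE_MODEL_NAME="ep-..." # your Doubao inference endpoint ID or model name
MIDSCENE_USE_DOUBAO_VISION=1 # assumed flag name for Doubao vision models; check the provider docs
```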

### `Gemini-2.5-Pro`

Starting from version 0.15.1, Midscene.js supports the Gemini-2.5-Pro model. Gemini 2.5 Pro is a proprietary model provided by Google Cloud.

When using Gemini-2.5-Pro, set `MIDSCENE_USE_GEMINI=1` to enable Gemini-specific behavior.

**Config**

After applying for the API key on [Google Gemini](https://gemini.google.com/), you can use the following config:
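
A minimal sketch; the endpoint URL and the exact model name below are assumptions, while `MIDSCENE_USE_GEMINI=1` is the flag mentioned above:

```bash
OPENAI_BASE_URL="https://generativelanguage.googleapis.com/v1beta/openai/" # assumed Gemini OpenAI-compatible endpoint
OPENAI_API_KEY="......"
MIDSCENE_MODEL_NAME="gemini-2.5-pro" # use the exact model name available to your account
MIDSCENE_USE_GEMINI=1
```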

**Links**

- [Gemini 2.5 on Google Cloud](https://cloud.google.com/gemini-api/docs/gemini-25-overview)

<div id="ui-tars"></div>

### `UI-TARS`

UI-TARS is an end-to-end GUI agent model based on a VLM architecture. It takes screenshots as input and performs human-like interactions (keyboard, mouse, etc.), achieving state-of-the-art performance across 10+ GUI benchmarks. UI-TARS is open source and available in multiple sizes.

With UI-TARS you can use goal-driven prompts, such as "Log in with username foo and password bar". The model will plan the steps needed to accomplish the task.

**Config**

You can use the deployed `doubao-1.5-ui-tars` on [Volcano Engine](https://volcengine.com).
MIDSCENE_MODEL_NAME="ep-2025..."# Inference endpoint ID or model name from Volcano Engine
161
143
MIDSCENE_USE_VLM_UI_TARS=DOUBAO
162
144
```
**Limitations**

- **Weak assertion performance**: It may not perform as well as GPT-4o or Qwen 2.5 for assertion and query tasks.
- **Unstable action planning**: It may attempt different paths on each run, so the operation path is not deterministic.

**About the `MIDSCENE_USE_VLM_UI_TARS` configuration**

Use `MIDSCENE_USE_VLM_UI_TARS` to specify the UI-TARS version with one of the following values:

- `1.0` - for model version `1.0`
- `1.5` - for model version `1.5`
- `DOUBAO` - for the Volcano Engine deployment
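
For example, when running model version `1.5` yourself rather than using the Volcano Engine deployment:

```bash
MIDSCENE_USE_VLM_UI_TARS=1.5
```
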
**Links**

- [UI-TARS on 🤗 HuggingFace](https://huggingface.co/bytedance-research/UI-TARS-72B-SFT)
- [UI-TARS on GitHub](https://github.com/bytedance/ui-tars)
- [UI-TARS - Model Deployment Guide](https://juniper-switch-f10.notion.site/UI-TARS-Model-Deployment-Guide-17b5350241e280058e98cea60317de71)
- [UI-TARS on Volcengine](https://www.volcengine.com/docs/82379/1536429)

<div id="gpt-4o"></div>

### `GPT-4o`

GPT-4o is a multimodal LLM by OpenAI that supports image input. This is the default model for Midscene.js. When using GPT-4o, step-by-step prompting generally works best.

The token cost of GPT-4o is relatively high because Midscene sends DOM information and screenshots to the model, and it can be unstable in complex scenarios.

**Config**

```bash
OPENAI_API_KEY="......"
OPENAI_BASE_URL="https://custom-endpoint.com/compatible-mode/v1" # Optional, if you want an endpoint other than the default OpenAI one.
MIDSCENE_MODEL_NAME="gpt-4o-2024-11-20" # Optional. The default is "gpt-4o".
```