
Commit a76235e

docs(core): update docs for qwen3 (#1252)
* feat(core): enable thinking for vqa
* fix(core): ci
* feat(core): show thought in report
* docs(core): update docs for qwen3
* docs(core): update docs for qwen3
* docs(core): update docs for qwen3
1 parent 5951cb8 commit a76235e

2 files changed: +127 -133 lines changed


apps/site/docs/en/choose-a-model.mdx

Lines changed: 83 additions & 87 deletions
@@ -6,179 +6,175 @@ Choose one of the following models, obtain the API key, complete the configurati
## Adapted models for using Midscene.js

Midscene.js supports two types of models:

### Visual-Language models (VL models, ✨ recommended)

In addition to understanding text and image input, VL (Visual-Language) models can locate the coordinates of target elements on the page.

We recommend VL models for UI automation because they can natively "see" screenshots and return the coordinates of target elements, which is more reliable and efficient in complex scenarios.

VL models can be used for UI automation across any kind of interface.

These VL models are already adapted for Midscene.js:

* [Qwen VL](#qwen3-vl-or-qwen-25-vl)
* [Doubao visual-language models](#doubao-vision)
* [`Gemini-2.5-Pro`](#gemini-25-pro)
* [`UI-TARS`](#ui-tars)

If you want to learn the detailed configuration for each model provider, see [Config Model and Provider](./model-provider).

### LLM models

Models that can understand text and image input, such as GPT-4o.

LLM models can only be used in web automation.

* [`GPT-4o`](#gpt-4o)
* [Other LLM models](#other-llm-models)

## Models in depth

<div id="qwen3-vl-or-qwen-25-vl"></div>

### Qwen VL (✨ recommended)

Qwen-VL is an open-source model series released by Alibaba. It offers visual grounding and can accurately return the coordinates of target elements on a page. The models show strong performance for interaction, assertion, and querying tasks. Deployed versions are available on [Alibaba Cloud](https://help.aliyun.com/zh/model-studio/vision) and [OpenRouter](https://openrouter.ai/qwen).

Midscene.js supports the following versions:
* Qwen2.5-VL series (higher parameter counts provide better quality)
* Qwen3-VL series, including `qwen3-vl-plus` (commercial) and `qwen3-vl-235b-a22b-instruct` (open source)

We recommend the Qwen3-VL series, which clearly outperforms Qwen2.5-VL. Qwen3-VL requires Midscene v0.29.3 or later.

**Config for Qwen3-VL**

Using the Alibaba Cloud `qwen3-vl-plus` model as an example:

```bash
OPENAI_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
OPENAI_API_KEY="......"
MIDSCENE_MODEL_NAME="qwen3-vl-plus"
MIDSCENE_USE_QWEN3_VL=1 # Note: cannot be set together with MIDSCENE_USE_QWEN_VL
```
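Once these variables are set in your shell (or loaded from a `.env` file), no model-specific code is needed in the automation script itself. Below is a minimal sketch of a Puppeteer-based script that would pick up the Qwen3-VL configuration above; the script structure follows the Puppeteer integration from the Midscene quick start rather than anything on this page, and the target URL and prompts are only placeholders.

```typescript
import 'dotenv/config'; // optional: load the variables above from a .env file (assumes the dotenv package)
import puppeteer from 'puppeteer';
import { PuppeteerAgent } from '@midscene/web/puppeteer';

async function main() {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.setViewport({ width: 1280, height: 800 });
  await page.goto('https://www.example.com'); // placeholder URL

  // The agent itself needs no model-specific code: it reads OPENAI_BASE_URL,
  // OPENAI_API_KEY, MIDSCENE_MODEL_NAME and MIDSCENE_USE_QWEN3_VL from the environment.
  const agent = new PuppeteerAgent(page);

  // Natural-language steps; the VL model locates the target elements by coordinates.
  await agent.aiAction('type "headphones" in the search box and press Enter');
  await agent.aiAssert('a list of search results is visible');

  await browser.close();
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```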
**Config for Qwen2.5-VL**

Using the Alibaba Cloud `qwen-vl-max-latest` model as an example:

```bash
OPENAI_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"
OPENAI_API_KEY="......"
MIDSCENE_MODEL_NAME="qwen-vl-max-latest"
MIDSCENE_USE_QWEN_VL=1 # Note: cannot be set together with MIDSCENE_USE_QWEN3_VL
```

**Links**
- [Alibaba Cloud - Qwen-VL series](https://help.aliyun.com/zh/model-studio/vision)
- [Qwen on 🤗 HuggingFace](https://huggingface.co/Qwen)
- [Qwen on GitHub](https://github.com/QwenLM/)
- [Qwen on openrouter.ai](https://openrouter.ai/qwen)

<div id="doubao-vision"></div>

### Doubao visual-language models

Volcano Engine provides multiple visual-language models, including:
* `Doubao-1.5-thinking-vision-pro`
* `Doubao-seed-1.6-vision`

They perform strongly for visual grounding and assertion in complex scenarios. With clear instructions they can handle most business needs.

**Config**

After obtaining an API key from [Volcano Engine](https://volcengine.com), you can use the following configuration:

```bash
OPENAI_BASE_URL="https://ark.cn-beijing.volces.com/api/v3"
OPENAI_API_KEY="...."
MIDSCENE_MODEL_NAME="ep-..." # Inference endpoint ID or model name from Volcano Engine
MIDSCENE_USE_DOUBAO_VISION=1
```
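Assertion and query are where these models are said to do well. As a rough, hedged sketch of what those calls look like, assuming an `agent` created as in the Qwen example above and with invented page content and prompts:

```typescript
import type { PuppeteerAgent } from '@midscene/web/puppeteer';

// Illustrative only: the prompts and the expected data shape are made up.
export async function checkCart(agent: PuppeteerAgent) {
  // aiAssert throws if the model judges the statement to be false.
  await agent.aiAssert('the shopping cart badge shows at least one item');

  // aiQuery takes a plain-language description of the data and its shape.
  const names = await agent.aiQuery(
    'string[], the names of the products currently listed in the cart',
  );
  console.log(names);
}
```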
**Links**
- [Volcano Engine - Doubao-1.5-thinking-vision-pro](https://www.volcengine.com/docs/82379/1536428)
- [Volcano Engine - Doubao-Seed-1.6-Vision](https://www.volcengine.com/docs/82379/1799865)

<div id="gemini-25-pro"></div>

### `Gemini-2.5-Pro`

Starting from version 0.15.1, Midscene.js supports the Gemini-2.5-Pro model. Gemini 2.5 Pro is a proprietary model provided by Google Cloud.

When using Gemini-2.5-Pro, set `MIDSCENE_USE_GEMINI=1` to enable Gemini-specific behavior.

**Config**

After obtaining an API key from [Google Gemini](https://gemini.google.com/), you can use the following config:

```bash
OPENAI_BASE_URL="https://generativelanguage.googleapis.com/v1beta/openai/"
OPENAI_API_KEY="......"
MIDSCENE_MODEL_NAME="gemini-2.5-pro-preview-05-06"
MIDSCENE_USE_GEMINI=1
```

**Links**
- [Gemini 2.5 on Google Cloud](https://cloud.google.com/gemini-api/docs/gemini-25-overview)

<div id="ui-tars"></div>

### `UI-TARS`

UI-TARS is an end-to-end GUI agent model based on a VLM architecture. It takes screenshots as input and performs human-like interactions (keyboard, mouse, etc.), achieving state-of-the-art performance across 10+ GUI benchmarks. UI-TARS is open source and available in multiple sizes.

With UI-TARS you can use goal-driven prompts, such as "Log in with username foo and password bar". The model will plan the steps needed to accomplish the task.
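In code, that means the whole goal can go into a single `aiAction` call and the model decomposes it into concrete steps itself. A minimal sketch, assuming an `agent` set up as in the Qwen example above and a hypothetical login form:

```typescript
import type { PuppeteerAgent } from '@midscene/web/puppeteer';

export async function login(agent: PuppeteerAgent) {
  // One goal-driven instruction; UI-TARS plans the intermediate steps
  // (find the fields, type the credentials, click the submit button) on its own.
  await agent.aiAction('Log in with username foo and password bar');
}
```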
**Config**

You can use the deployed `doubao-1.5-ui-tars` on [Volcano Engine](https://volcengine.com).

```bash
OPENAI_BASE_URL="https://ark.cn-beijing.volces.com/api/v3"
OPENAI_API_KEY="...."
MIDSCENE_MODEL_NAME="ep-2025..." # Inference endpoint ID or model name from Volcano Engine
MIDSCENE_USE_VLM_UI_TARS=DOUBAO
```

**Limitations**

- **Weak assertion performance**: It may not perform as well as GPT-4o or Qwen 2.5 for assertion and query tasks.
- **Unstable action planning**: It may attempt different paths on each run, so the operation path is not deterministic.

**About the `MIDSCENE_USE_VLM_UI_TARS` configuration**

Use `MIDSCENE_USE_VLM_UI_TARS` to specify the UI-TARS version with one of the following values:
- `1.0` - for model version `1.0`
- `1.5` - for model version `1.5`
- `DOUBAO` - for the Volcano Engine deployment

**Links**
- [UI-TARS on 🤗 HuggingFace](https://huggingface.co/bytedance-research/UI-TARS-72B-SFT)
- [UI-TARS on GitHub](https://github.com/bytedance/ui-tars)
- [UI-TARS - Model Deployment Guide](https://juniper-switch-f10.notion.site/UI-TARS-Model-Deployment-Guide-17b5350241e280058e98cea60317de71)
- [UI-TARS on Volcengine](https://www.volcengine.com/docs/82379/1536429)

<div id="gpt-4o"></div>

### `GPT-4o`

GPT-4o is a multimodal LLM by OpenAI that supports image input. This is the default model for Midscene.js. When using GPT-4o, step-by-step prompting generally works best.
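"Step-by-step" means giving the model one small, concrete instruction at a time rather than a broad goal. A hedged sketch of the difference, assuming an `agent` set up as in the Qwen example above and an invented search flow:

```typescript
import type { PuppeteerAgent } from '@midscene/web/puppeteer';

// With GPT-4o, prefer several explicit steps over a single broad goal
// such as "find and buy the cheapest headphones".
export async function searchStepByStep(agent: PuppeteerAgent) {
  await agent.aiAction('click the search box');
  await agent.aiAction('type "headphones" in the search box');
  await agent.aiAction('press Enter');
  await agent.aiAssert('the search results page is shown');
}
```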
The token cost of GPT-4o is relatively high because Midscene sends DOM information and screenshots to the model, and it can be unstable in complex scenarios.

**Config**

```bash
OPENAI_API_KEY="......"
OPENAI_BASE_URL="https://custom-endpoint.com/compatible-mode/v1" # Optional, if you want an endpoint other than the default OpenAI one.
MIDSCENE_MODEL_NAME="gpt-4o-2024-11-20" # Optional. The default is "gpt-4o".
```

<div id="other-llm-models"></div>

## Choose other multimodal LLMs
