apps/site/docs/en/api.mdx
Lines changed: 80 additions & 23 deletions
@@ -98,12 +98,12 @@ Tap something.
98
98
- Type
99
99
100
100
```typescript
101
-
function aiTap(locate:string, options?:Object):Promise<void>;
101
+
function aiTap(locate:string|Object, options?:Object):Promise<void>;
102
102
```
103
103
104
104
- Parameters:
105
105
106
-
-`locate: string` - A natural language description of the element to tap.
106
+
-`locate: string | Object` - A natural language description of the element to tap, or an object in the [prompting with images](#prompting-with-images) format.
107
107
-`options?: Object` - Optional, a configuration object containing:
108
108
-`deepThink?: boolean` - If true, Midscene will call the AI model twice to locate the element more precisely. False by default.
109
109
-`xpath?: string` - The XPath of the element to operate on. If provided, Midscene will use this XPath to locate the element first, before using the cache and the AI model. Empty by default.
@@ -133,12 +133,12 @@ Move mouse over something.
133
133
- Type
134
134
135
135
```typescript
136
-
function aiHover(locate:string, options?:Object):Promise<void>;
136
+
function aiHover(locate:string|Object, options?:Object):Promise<void>;
137
137
```
138
138
139
139
- Parameters:
140
140
141
-
-`locate: string` - A natural language description of the element to hover over.
141
+
-`locate: string | Object` - A natural language description of the element to hover over, or an object in the [prompting with images](#prompting-with-images) format.
142
142
-`options?: Object` - Optional, a configuration object containing:
143
143
-`deepThink?: boolean` - If true, Midscene will call the AI model twice to locate the element more precisely. False by default.
144
144
-`xpath?: string` - The XPath of the element to operate on. If provided, Midscene will use this XPath to locate the element first, before using the cache and the AI model. Empty by default.
@@ -161,13 +161,17 @@ Input text into something.
161
161
- Type
162
162
163
163
```typescript
164
-
function aiInput(text:string, locate:string, options?:Object):Promise<void>;
164
+
function aiInput(
165
+
text:string,
166
+
locate:string|Object,
167
+
options?:Object,
168
+
):Promise<void>;
165
169
```
166
170
167
171
- Parameters:
168
172
169
173
-`text: string` - The final text content that should be placed in the input element. Use an empty string to clear the input.
170
-
-`locate: string` - A natural language description of the element to input text into.
174
+
-`locate: string | Object` - A natural language description of the element to input text into, or an object in the [prompting with images](#prompting-with-images) format.
171
175
-`options?: Object` - Optional, a configuration object containing:
172
176
-`deepThink?: boolean` - If true, Midscene will call the AI model twice to locate the element more precisely. False by default.
173
177
-`xpath?: string` - The XPath of the element to operate on. If provided, Midscene will use this XPath to locate the element first, before using the cache and the AI model. Empty by default.
@@ -193,15 +197,15 @@ Press a keyboard key.
193
197
```typescript
194
198
function aiKeyboardPress(
195
199
key:string,
196
-
locate?:string,
200
+
locate?:string|Object,
197
201
options?:Object,
198
202
):Promise<void>;
199
203
```
200
204
201
205
- Parameters:
202
206
203
207
-`key: string` - The web key to press, e.g. 'Enter', 'Tab', 'Escape', etc. Key combinations are not supported.
204
-
-`locate?: string` - Optional, a natural language description of the element to press the key on.
208
+
-`locate?: string | Object` - Optional, a natural language description of the element to press the key on, or an object in the [prompting with images](#prompting-with-images) format.
205
209
-`options?: Object` - Optional, a configuration object containing:
206
210
-`deepThink?: boolean` - If true, Midscene will call the AI model twice to locate the element more precisely. False by default.
207
211
-`xpath?: string` - The XPath of the element to operate on. If provided, Midscene will use this XPath to locate the element first, before using the cache and the AI model. Empty by default.
@@ -226,7 +230,7 @@ Scroll a page or an element.
226
230
```typescript
227
231
function aiScroll(
228
232
scrollParam:PlanningActionParamScroll,
229
-
locate?:string,
233
+
locate?:string|Object,
230
234
options?:Object,
231
235
):Promise<void>;
232
236
```
@@ -237,7 +241,7 @@ function aiScroll(
237
241
-`direction: 'up' | 'down' | 'left' | 'right'` - The direction to scroll.
238
242
-`scrollType: 'once' | 'untilBottom' | 'untilTop' | 'untilRight' | 'untilLeft'` - Optional, the type of scroll to perform.
239
243
-`distance: number` - Optional, the distance to scroll in px.
240
-
-`locate?: string` - Optional, a natural language description of the element to scroll on. If not provided, Midscene will perform scroll on the current mouse position.
244
+
-`locate?: string | Object` - Optional, a natural language description of the element to scroll on, or an object in the [prompting with images](#prompting-with-images) format. If not provided, Midscene will scroll at the current mouse position.
241
245
-`options?: Object` - Optional, a configuration object containing:
242
246
-`deepThink?: boolean` - If true, Midscene will call the AI model twice to locate the element more precisely. False by default.
243
247
-`xpath?: string` - The XPath of the element to operate on. If provided, Midscene will use this XPath to locate the element first, before using the cache and the AI model. Empty by default.
@@ -270,7 +274,7 @@ function aiRightClick(locate: string, options?: Object): Promise<void>;
270
274
271
275
- Parameters:
272
276
273
-
-`locate: string` - A natural language description of the element to right-click on.
277
+
-`locate: string | Object` - A natural language description of the element to right-click on, or an object in the [prompting with images](#prompting-with-images) format.
274
278
-`options?: Object` - Optional, a configuration object containing:
275
279
-`deepThink?: boolean` - If true, Midscene will call the AI model twice to locate the element more precisely. False by default.
276
280
-`xpath?: string` - The XPath of the element to operate on. If provided, Midscene will use this XPath to locate the element first, before using the cache and the AI model. Empty by default.
@@ -306,12 +310,12 @@ Ask the AI model any question about the current page. It returns the answer in s
306
310
- Type
307
311
308
312
```typescript
309
-
function aiAsk(prompt:string, options?:Object):Promise<string>;
313
+
function aiAsk(prompt:string|Object, options?:Object):Promise<string>;
310
314
```
311
315
312
316
- Parameters:
313
317
314
-
-`prompt: string` - A natural language description of the question.
318
+
-`prompt: string | Object` - A natural language description of the question, or an object in the [prompting with images](#prompting-with-images) format.
315
319
-`options?: Object` - Optional, a configuration object containing:
316
320
-`domIncluded?: boolean | 'visible-only'` - Whether to send simplified DOM information to the model, usually used for extracting invisible attributes like image links. If set to `'visible-only'`, only the visible elements will be sent. Default: False.
317
321
-`screenshotIncluded?: boolean` - Whether to send a screenshot to the model. Default: True.
@@ -386,11 +390,11 @@ Extract a boolean value from the UI.
386
390
- Type
387
391
388
392
```typescript
389
-
function aiBoolean(prompt:string, options?:Object):Promise<boolean>;
393
+
function aiBoolean(prompt:string|Object, options?:Object):Promise<boolean>;
390
394
```
391
395
392
396
- Parameters:
393
-
-`prompt: string` - A natural language description of the expected value.
397
+
-`prompt: string | Object` - A natural language description of the expected value, or an object in the [prompting with images](#prompting-with-images) format.
394
398
-`options?: Object` - Optional, a configuration object containing:
395
399
-`domIncluded?: boolean | 'visible-only'` - Whether to send simplified DOM information to the model, usually used for extracting invisible attributes like image links. If set to `'visible-only'`, only the visible elements will be sent. Default: False.
396
400
-`screenshotIncluded?: boolean` - Whether to send a screenshot to the model. Default: True.
@@ -416,11 +420,11 @@ Extract a number value from the UI.
416
420
- Type
417
421
418
422
```typescript
419
-
function aiNumber(prompt:string, options?:Object):Promise<number>;
423
+
function aiNumber(prompt:string|Object, options?:Object):Promise<number>;
420
424
```
421
425
422
426
- Parameters:
423
-
-`prompt: string` - A natural language description of the expected value.
427
+
-`prompt: string | Object` - A natural language description of the expected value, or an object in the [prompting with images](#prompting-with-images) format.
424
428
-`options?: Object` - Optional, a configuration object containing:
425
429
-`domIncluded?: boolean | 'visible-only'` - Whether to send simplified DOM information to the model, usually used for extracting invisible attributes like image links. If set to `'visible-only'`, only the visible elements will be sent. Default: False.
426
430
-`screenshotIncluded?: boolean` - Whether to send a screenshot to the model. Default: True.
@@ -447,11 +451,11 @@ Extract a string value from the UI.
447
451
- Type
448
452
449
453
```typescript
450
-
function aiString(prompt:string, options?:Object):Promise<string>;
454
+
function aiString(prompt:string|Object, options?:Object):Promise<string>;
451
455
```
452
456
453
457
- Parameters:
454
-
-`prompt: string` - A natural language description of the expected value.
458
+
-`prompt: string | Object` - A natural language description of the expected value, or an object in the [prompting with images](#prompting-with-images) format.
455
459
-`options?: Object` - Optional, a configuration object containing:
456
460
-`domIncluded?: boolean | 'visible-only'` - Whether to send simplified DOM information to the model, usually used for extracting invisible attributes like image links. If set to `'visible-only'`, only the visible elements will be sent. Default: False.
457
461
-`screenshotIncluded?: boolean` - Whether to send a screenshot to the model. Default: True.
@@ -479,12 +483,12 @@ Specify an assertion in natural language, and the AI determines whether the cond
479
483
- Type
480
484
481
485
```typescript
482
-
function aiAssert(assertion:string, errorMsg?:string):Promise<void>;
486
+
function aiAssert(assertion:string|Object, errorMsg?:string):Promise<void>;
483
487
```
484
488
485
489
- Parameters:
486
490
487
-
-`assertion: string` - The assertion described in natural language.
491
+
-`assertion: string | Object` - The assertion described in natural language, or an object in the [prompting with images](#prompting-with-images) format.
488
492
-`errorMsg?: string` - An optional error message to append if the assertion fails.
489
493
490
494
- Return Value:
@@ -521,7 +525,7 @@ Locate an element using natural language.
521
525
522
526
```typescript
523
527
function aiLocate(
524
-
locate:string,
528
+
locate:string|Object,
525
529
options?:Object,
526
530
):Promise<{
527
531
rect: {
@@ -537,7 +541,7 @@ function aiLocate(
537
541
538
542
- Parameters:
539
543
540
-
-`locate: string` - A natural language description of the element to locate.
544
+
-`locate: string | Object` - A natural language description of the element to locate, or an object in the [prompting with images](#prompting-with-images) format.
541
545
-`options?: Object` - Optional, a configuration object containing:
542
546
-`deepThink?: boolean` - If true, Midscene will call the AI model twice to locate the element more precisely. False by default.
543
547
-`xpath?: string` - The XPath of the element to operate on. If provided, Midscene will use this XPath to locate the element first, before using the cache and the AI model. Empty by default.
@@ -807,3 +811,56 @@ After starting Midscene, you should see logs similar to:
807
811
```log
808
812
DEBUGGING MODE: langsmith wrapper enabled
809
813
```
814
+
815
+
## Advanced features
816
+
817
+
### Prompting with images
818
+
819
+
You can supplement a prompt with images to describe things that are hard to express in natural language.
820
+
821
+
When prompting with images, the format of the prompt parameters is as follows:
822
+
823
+
```javascript
824
+
{
825
+
// Prompt text, in which images can be referred to by name
826
+
prompt: string,
827
+
// The images referred to in the prompt text
828
+
images?: {
829
+
// Image name, matching the name used in the prompt text
830
+
name: string,
831
+
// Image URL: a local image path, a Base64 string, or an http link
832
+
url: string
833
+
}[],
834
+
// When convertHttpImage2Base64 is true, image links in http format are converted to Base64 before being sent to the LLM.
835
+
// This is useful when the image links are not publicly accessible.
836
+
convertHttpImage2Base64?: boolean
837
+
}
838
+
```
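As a rough illustration of the format above, the following TypeScript sketch types the prompt object and builds one. The interface names `PromptImage` and `ImagePrompt`, the helper `unreferencedImages`, and the image path are hypothetical and are not part of Midscene; only the field names come from the format described here.

```typescript
// Hypothetical type names; the fields mirror the documented prompt format.
interface PromptImage {
  name: string; // name used to refer to the image in the prompt text
  url: string; // local image path, Base64 string, or http link
}

interface ImagePrompt {
  prompt: string;
  images?: PromptImage[];
  convertHttpImage2Base64?: boolean;
}

// Hypothetical helper (not a Midscene API): lists images that are declared
// but never mentioned by name in the prompt text.
function unreferencedImages(p: ImagePrompt): string[] {
  return (p.images || [])
    .map((img) => img.name)
    .filter((name) => p.prompt.indexOf(name) === -1);
}

// Example prompt object; the path below is a placeholder.
const tapTarget: ImagePrompt = {
  prompt: 'The button that matches the logo image',
  images: [{ name: 'logo', url: './assets/logo.png' }],
  convertHttpImage2Base64: false,
};

// With an agent already set up, this object would be passed wherever a
// `string | Object` prompt is accepted, e.g. `await agent.aiTap(tapTarget)`.
```

Checking that every declared image is actually named in the prompt text is one simple way to catch a mismatch before the request reaches the model.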
839
+
840
+
- Example 1: use images to inspect the tap position.