Commit 29b4ead

docs(core): doc for prompting with images (#984)
1 parent 5b1930d commit 29b4ead

2 files changed

+160
-47
lines changed

apps/site/docs/en/api.mdx

Lines changed: 80 additions & 23 deletions
@@ -98,12 +98,12 @@ Tap something.
9898
- Type
9999

100100
```typescript
101-
function aiTap(locate: string, options?: Object): Promise<void>;
101+
function aiTap(locate: string | Object, options?: Object): Promise<void>;
102102
```
103103

104104
- Parameters:
105105

106-
- `locate: string` - A natural language description of the element to tap.
106+
- `locate: string | Object` - A natural language description of the element to tap, or [prompting with images](#prompting-with-images).
107107
- `options?: Object` - Optional, a configuration object containing:
108108
- `deepThink?: boolean` - If true, Midscene will call the AI model twice to precisely locate the element. False by default.
109109
- `xpath?: string` - The xpath of the element to operate. If provided, Midscene will first use this xpath to locate the element before using the cache and the AI model. Empty by default.
@@ -133,12 +133,12 @@ Move mouse over something.
133133
- Type
134134

135135
```typescript
136-
function aiHover(locate: string, options?: Object): Promise<void>;
136+
function aiHover(locate: string | Object, options?: Object): Promise<void>;
137137
```
138138

139139
- Parameters:
140140

141-
- `locate: string` - A natural language description of the element to hover over.
141+
- `locate: string | Object` - A natural language description of the element to hover over, or [prompting with images](#prompting-with-images).
142142
- `options?: Object` - Optional, a configuration object containing:
143143
- `deepThink?: boolean` - If true, Midscene will call the AI model twice to precisely locate the element. False by default.
144144
- `xpath?: string` - The xpath of the element to operate. If provided, Midscene will first use this xpath to locate the element before using the cache and the AI model. Empty by default.
@@ -161,13 +161,17 @@ Input text into something.
161161
- Type
162162

163163
```typescript
164-
function aiInput(text: string, locate: string, options?: Object): Promise<void>;
164+
function aiInput(
165+
text: string,
166+
locate: string | Object,
167+
options?: Object,
168+
): Promise<void>;
165169
```
166170

167171
- Parameters:
168172

169173
- `text: string` - The final text content that should be placed in the input element. Use blank string to clear the input.
170-
- `locate: string` - A natural language description of the element to input text into.
174+
- `locate: string | Object` - A natural language description of the element to input text into, or [prompting with images](#prompting-with-images).
171175
- `options?: Object` - Optional, a configuration object containing:
172176
- `deepThink?: boolean` - If true, Midscene will call the AI model twice to precisely locate the element. False by default.
173177
- `xpath?: string` - The xpath of the element to operate. If provided, Midscene will first use this xpath to locate the element before using the cache and the AI model. Empty by default.
@@ -193,15 +197,15 @@ Press a keyboard key.
193197
```typescript
194198
function aiKeyboardPress(
195199
key: string,
196-
locate?: string,
200+
locate?: string | Object,
197201
options?: Object,
198202
): Promise<void>;
199203
```
200204

201205
- Parameters:
202206

203207
- `key: string` - The web key to press, e.g. 'Enter', 'Tab', 'Escape', etc. Key combinations are not supported.
204-
- `locate?: string` - Optional, a natural language description of the element to press the key on.
208+
- `locate?: string | Object` - Optional, a natural language description of the element to press the key on, or [prompting with images](#prompting-with-images).
205209
- `options?: Object` - Optional, a configuration object containing:
206210
- `deepThink?: boolean` - If true, Midscene will call the AI model twice to precisely locate the element. False by default.
207211
- `xpath?: string` - The xpath of the element to operate. If provided, Midscene will first use this xpath to locate the element before using the cache and the AI model. Empty by default.
@@ -226,7 +230,7 @@ Scroll a page or an element.
226230
```typescript
227231
function aiScroll(
228232
scrollParam: PlanningActionParamScroll,
229-
locate?: string,
233+
locate?: string | Object,
230234
options?: Object,
231235
): Promise<void>;
232236
```
@@ -237,7 +241,7 @@ function aiScroll(
237241
- `direction: 'up' | 'down' | 'left' | 'right'` - The direction to scroll.
238242
- `scrollType: 'once' | 'untilBottom' | 'untilTop' | 'untilRight' | 'untilLeft'` - Optional, the type of scroll to perform.
239243
- `distance: number` - Optional, the distance to scroll in px.
240-
- `locate?: string` - Optional, a natural language description of the element to scroll on. If not provided, Midscene will perform scroll on the current mouse position.
244+
- `locate?: string | Object` - Optional, a natural language description of the element to scroll on, or [prompting with images](#prompting-with-images). If not provided, Midscene will scroll at the current mouse position.
241245
- `options?: Object` - Optional, a configuration object containing:
242246
- `deepThink?: boolean` - If true, Midscene will call the AI model twice to precisely locate the element. False by default.
243247
- `xpath?: string` - The xpath of the element to operate. If provided, Midscene will first use this xpath to locate the element before using the cache and the AI model. Empty by default.
@@ -270,7 +274,7 @@ function aiRightClick(locate: string, options?: Object): Promise<void>;
270274

271275
- Parameters:
272276

273-
- `locate: string` - A natural language description of the element to right-click on.
277+
- `locate: string | Object` - A natural language description of the element to right-click on, or [prompting with images](#prompting-with-images).
274278
- `options?: Object` - Optional, a configuration object containing:
275279
- `deepThink?: boolean` - If true, Midscene will call the AI model twice to precisely locate the element. False by default.
276280
- `xpath?: string` - The xpath of the element to operate. If provided, Midscene will first use this xpath to locate the element before using the cache and the AI model. Empty by default.
@@ -306,12 +310,12 @@ Ask the AI model any question about the current page. It returns the answer in s
306310
- Type
307311

308312
```typescript
309-
function aiAsk(prompt: string, options?: Object): Promise<string>;
313+
function aiAsk(prompt: string | Object, options?: Object): Promise<string>;
310314
```
311315

312316
- Parameters:
313317

314-
- `prompt: string` - A natural language description of the question.
318+
- `prompt: string | Object` - A natural language description of the question, or [prompting with images](#prompting-with-images).
315319
- `options?: Object` - Optional, a configuration object containing:
316320
- `domIncluded?: boolean | 'visible-only'` - Whether to send simplified DOM information to the model, usually used for extracting invisible attributes like image links. If set to `'visible-only'`, only the visible elements will be sent. Default: False.
317321
- `screenshotIncluded?: boolean` - Whether to send a screenshot to the model. Default: True.
@@ -386,11 +390,11 @@ Extract a boolean value from the UI.
386390
- Type
387391

388392
```typescript
389-
function aiBoolean(prompt: string, options?: Object): Promise<boolean>;
393+
function aiBoolean(prompt: string | Object, options?: Object): Promise<boolean>;
390394
```
391395

392396
- Parameters:
393-
- `prompt: string` - A natural language description of the expected value.
397+
- `prompt: string | Object` - A natural language description of the expected value, or [prompting with images](#prompting-with-images).
394398
- `options?: Object` - Optional, a configuration object containing:
395399
- `domIncluded?: boolean | 'visible-only'` - Whether to send simplified DOM information to the model, usually used for extracting invisible attributes like image links. If set to `'visible-only'`, only the visible elements will be sent. Default: False.
396400
- `screenshotIncluded?: boolean` - Whether to send a screenshot to the model. Default: True.
@@ -416,11 +420,11 @@ Extract a number value from the UI.
416420
- Type
417421

418422
```typescript
419-
function aiNumber(prompt: string, options?: Object): Promise<number>;
423+
function aiNumber(prompt: string | Object, options?: Object): Promise<number>;
420424
```
421425

422426
- Parameters:
423-
- `prompt: string` - A natural language description of the expected value.
427+
- `prompt: string | Object` - A natural language description of the expected value, or [prompting with images](#prompting-with-images).
424428
- `options?: Object` - Optional, a configuration object containing:
425429
- `domIncluded?: boolean | 'visible-only'` - Whether to send simplified DOM information to the model, usually used for extracting invisible attributes like image links. If set to `'visible-only'`, only the visible elements will be sent. Default: False.
426430
- `screenshotIncluded?: boolean` - Whether to send a screenshot to the model. Default: True.
@@ -447,11 +451,11 @@ Extract a string value from the UI.
447451
- Type
448452

449453
```typescript
450-
function aiString(prompt: string, options?: Object): Promise<string>;
454+
function aiString(prompt: string | Object, options?: Object): Promise<string>;
451455
```
452456

453457
- Parameters:
454-
- `prompt: string` - A natural language description of the expected value.
458+
- `prompt: string | Object` - A natural language description of the expected value, or [prompting with images](#prompting-with-images).
455459
- `options?: Object` - Optional, a configuration object containing:
456460
- `domIncluded?: boolean | 'visible-only'` - Whether to send simplified DOM information to the model, usually used for extracting invisible attributes like image links. If set to `'visible-only'`, only the visible elements will be sent. Default: False.
457461
- `screenshotIncluded?: boolean` - Whether to send a screenshot to the model. Default: True.
@@ -479,12 +483,12 @@ Specify an assertion in natural language, and the AI determines whether the cond
479483
- Type
480484

481485
```typescript
482-
function aiAssert(assertion: string, errorMsg?: string): Promise<void>;
486+
function aiAssert(assertion: string | Object, errorMsg?: string): Promise<void>;
483487
```
484488

485489
- Parameters:
486490

487-
- `assertion: string` - The assertion described in natural language.
491+
- `assertion: string | Object` - The assertion described in natural language, or [prompting with images](#prompting-with-images).
488492
- `errorMsg?: string` - An optional error message to append if the assertion fails.
489493

490494
- Return Value:
@@ -521,7 +525,7 @@ Locate an element using natural language.
521525

522526
```typescript
523527
function aiLocate(
524-
locate: string,
528+
locate: string | Object,
525529
options?: Object,
526530
): Promise<{
527531
rect: {
@@ -537,7 +541,7 @@ function aiLocate(
537541

538542
- Parameters:
539543

540-
- `locate: string` - A natural language description of the element to locate.
544+
- `locate: string | Object` - A natural language description of the element to locate, or [prompting with images](#prompting-with-images).
541545
- `options?: Object` - Optional, a configuration object containing:
542546
- `deepThink?: boolean` - If true, Midscene will call the AI model twice to precisely locate the element. False by default.
543547
- `xpath?: string` - The xpath of the element to operate. If provided, Midscene will first use this xpath to locate the element before using the cache and the AI model. Empty by default.
@@ -807,3 +811,56 @@ After starting Midscene, you should see logs similar to:
807811
```log
808812
DEBUGGING MODE: langsmith wrapper enabled
809813
```
814+
815+
## Advanced features
816+
817+
### Prompting with images
818+
819+
You can supplement a prompt with images to describe things that are hard to express in natural language.
820+
821+
When prompting with images, the format of the prompt parameters is as follows:
822+
823+
```javascript
824+
{
825+
// Prompt text, which can refer to images by name
826+
prompt: string,
827+
// The images referred to in the prompt text
828+
images?: {
829+
// Image name, matching the name used in the prompt text
830+
name: string,
831+
// Image URL: a local image path, a Base64 string, or an http link
832+
url: string
833+
}[],
834+
// When convertHttpImage2Base64 is true, http image links are converted to Base64 before being sent to the LLM.
835+
// This is useful when the image links are not publicly accessible.
836+
convertHttpImage2Base64?: boolean
837+
}
838+
```
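
The format above can also be written as TypeScript types. The following is a minimal sketch, not Midscene's own definitions: the names `PromptImage`, `ImagePrompt`, and `checkImagePrompt` are hypothetical and chosen only for illustration, and the checks (non-empty fields, unique image names) are assumptions about what a well-formed prompt object looks like rather than rules the library is known to enforce.

```typescript
// Hypothetical type names; the fields mirror the documented format.
interface PromptImage {
  name: string; // name used in the prompt text
  url: string; // local image path, Base64 string, or http link
}

interface ImagePrompt {
  prompt: string;
  images?: PromptImage[];
  convertHttpImage2Base64?: boolean;
}

// Returns a list of problems; an empty list means the object looks well-formed.
// These checks are illustrative assumptions, not library behavior.
function checkImagePrompt(p: ImagePrompt): string[] {
  const problems: string[] = [];
  if (p.prompt.trim() === '') problems.push('prompt is empty');

  const seen = new Set<string>();
  for (const img of p.images ?? []) {
    if (img.name.trim() === '') problems.push('image name is empty');
    if (img.url.trim() === '') problems.push(`image "${img.name}" has no url`);
    if (seen.has(img.name)) problems.push(`duplicate image name "${img.name}"`);
    seen.add(img.name);
  }
  return problems;
}

const logoPrompt: ImagePrompt = {
  prompt: 'The specific logo',
  images: [
    {
      name: 'The specific logo',
      url: 'https://github.githubassets.com/assets/GitHub-Mark-ea2971cee799.png',
    },
  ],
};
```

Running a check like this before passing the object to an agent method can surface typos in image names early, without spending a model call.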
839+
840+
- Example 1: use an image to specify the element to tap.
841+
842+
```javascript
843+
await agent.aiTap({
844+
prompt: 'The specific logo',
845+
images: [
846+
{
847+
name: 'The specific logo',
848+
url: 'https://github.githubassets.com/assets/GitHub-Mark-ea2971cee799.png',
849+
},
850+
],
851+
});
852+
```
853+
854+
- Example 2: use images to assert the page content.
855+
856+
```javascript
857+
await agent.aiAssert({
858+
prompt: 'Whether there is a specific logo on the page.',
859+
images: [
860+
{
861+
name: 'The specific logo',
862+
url: 'https://github.githubassets.com/assets/GitHub-Mark-ea2971cee799.png',
863+
},
864+
],
865+
});
866+
```
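
When the referenced images live on a host that is not publicly accessible, the `convertHttpImage2Base64` flag from the format above applies. The sketch below shows one way to set it automatically. The helper `forPrivateImages` and the type `ImagePromptInput` are hypothetical names, and detecting http links with a URL-prefix test is an assumption made for this example, not a description of how Midscene classifies the `url` field.

```typescript
// Shape repeated here so the block is self-contained; names are hypothetical.
interface ImagePromptInput {
  prompt: string;
  images?: { name: string; url: string }[];
  convertHttpImage2Base64?: boolean;
}

// Returns a copy with convertHttpImage2Base64 enabled when at least one image
// is referenced by an http(s) link; otherwise returns the input unchanged.
function forPrivateImages(input: ImagePromptInput): ImagePromptInput {
  const hasHttpImage = (input.images ?? []).some((img) =>
    /^https?:\/\//i.test(img.url),
  );
  return hasHttpImage ? { ...input, convertHttpImage2Base64: true } : input;
}

const assertion = forPrivateImages({
  prompt: 'Whether there is a specific logo on the page.',
  images: [
    {
      name: 'The specific logo',
      // Hypothetical internal URL, used only for illustration.
      url: 'http://intranet.example/logo.png',
    },
  ],
});
```

The resulting object can then be passed wherever a `string | Object` prompt is accepted, for example `await agent.aiAssert(assertion)`.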
