
Commit 38e2cef

Add multimodal tool outputs
1 parent 8f43009 commit 38e2cef

File tree: 2 files changed (+104, -47 lines)

README.md

Lines changed: 95 additions & 42 deletions
@@ -137,48 +137,6 @@ const result = await multiUserSession.prompt([

Because of their special behavior of being preserved on context window overflow, system prompts cannot be provided this way.

(This hunk removes the "### Tool use" section, which the commit moves below "### Multimodal inputs". The relocated section, together with the new "#### Tool return values" subsection, appears in full in the addition hunk below.)

### Multimodal inputs

All of the above examples have been of text prompts. Some language models also support other inputs. Our design initially includes the potential to support images and audio clips as inputs. This is done by using objects in the form `{ type: "image", value }` and `{ type: "audio", value }` instead of strings. The `value` can be the following:

@@ -269,6 +227,101 @@ Details:

Future extensions may include more ambitious multimodal inputs, such as video clips, or realtime audio or video. (Realtime might require a different API design, more based around events or streams instead of messages.)

### Tool use

The Prompt API supports **tool use** via the `tools` option, allowing you to define external capabilities that a language model can invoke in a model-agnostic way. Each tool is represented by an object that includes an `execute` member specifying the JavaScript function to be called. When the language model initiates a tool use request, the user agent calls the corresponding `execute` function and sends the result back to the model.

Here’s an example of how to use the `tools` option:

```js
const session = await LanguageModel.create({
  initialPrompts: [
    {
      role: "system",
      content: `You are a helpful assistant. You can use tools to help the user.`
    }
  ],
  tools: [
    {
      name: "getWeather",
      description: "Get the weather in a location.",
      inputSchema: {
        type: "object",
        properties: {
          location: {
            type: "string",
            description: "The city to check for the weather condition.",
          },
        },
        required: ["location"],
      },
      async execute({ location }) {
        const res = await fetch("https://weatherapi.example/?location=" + location);
        // Returns the result as a JSON string.
        return JSON.stringify(await res.json());
      },
    }
  ]
});

const result = await session.prompt("What is the weather in Seattle?");
```

In this example, the `tools` array defines a `getWeather` tool, specifying its name, description, input schema, and `execute` implementation. When the language model determines that a tool call is needed, the user agent invokes the `getWeather` tool's `execute()` function with the provided arguments and returns the result to the model, which can then incorporate it into its response.

#### Tool return values

The above example shows tools returning a string (in fact, stringified JSON). Models which support [multimodal inputs](#multimodal-inputs) might also support interpreting image or audio results from tool calls.

Just like the `content` option to a `prompt()` call can accept either a string or an array of `{ type, value }` objects, web developer-provided tools can return either a string or such an array. Here's an example:

```js
let mutex, resolveMutex;

const session = await LanguageModel.create({
  tools: [
    {
      name: "grabKeyframe",
      description: "Grab a keyframe from the video we're analyzing at the given time",
      inputSchema: {
        type: "number",
        minimum: 0,
        exclusiveMaximum: videoEl.duration
      },
      expectedOutputs: {
        types: ["image"]
      },
      async execute(timestamp) {
        if (mutex) {
          // Since we're seeking a single video element, guard against concurrent calls.
          await mutex;
        }
        try {
          mutex = new Promise(r => resolveMutex = r);

          if (Math.abs(videoEl.currentTime - timestamp) > 0.001) {
            videoEl.currentTime = timestamp;
            await new Promise(r => videoEl.addEventListener("seeked", r, { once: true }));
          }
          await new Promise(r => videoEl.requestVideoFrameCallback(r));

          return [{ type: "image", value: videoEl }];
        } finally {
          resolveMutex();
          mutex = null;
        }
      }
    }
  ]
});
```

Note how the output types need to be specified in the tool definition, so that session creation can fail early if the model doesn't support processing multimodal tool outputs. If the return value contains non-text components that are not declared in the tool specification, the tool call will fail at prompting time, even if the model could support them.
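
Callers can use that early failure for feature detection. A minimal sketch, assuming `grabKeyframeTool` is the tool definition from the example above, and that creation rejects with a `"NotSupportedError"` `DOMException` as it does for other unsupported options:

```js
let session;
try {
  session = await LanguageModel.create({ tools: [grabKeyframeTool] });
} catch (e) {
  if (e.name !== "NotSupportedError") {
    throw e;
  }
  // The model can't process image tool outputs;
  // fall back to a session without the keyframe tool.
  session = await LanguageModel.create();
}
```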

Similarly, expected output languages can be provided (via `expectedOutputs: { languages: ["ja"] }` or similar), to get an early failure if the model doesn't support processing tool outputs in those languages. However, unlike modalities, there is no prompt-time checking of the tool call result's languages.
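
For example, here is a sketch combining the two `expectedOutputs` forms shown above; the tool itself is hypothetical:

```js
const session = await LanguageModel.create({
  tools: [
    {
      name: "getJapaneseCaption", // hypothetical example tool
      description: "Get the Japanese caption for the photo being discussed.",
      inputSchema: { type: "object", properties: {} },
      // Declares both output modality and language up front, so session
      // creation fails early on models that can't process Japanese text
      // tool outputs.
      expectedOutputs: {
        types: ["text"],
        languages: ["ja"]
      },
      async execute() {
        // Per the above, the language of this result is not checked at prompt time.
        return "猫がソファで寝ています。";
      }
    }
  ]
});
```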

The above example shows a single-item array, but just as with prompt inputs, multiple tool outputs can be included. The same rules are followed as for inputs, e.g., adjacent text chunks are concatenated with a single space character.
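
For instance, the `grabKeyframe` tool's `execute()` member above could label the frame it returns (a sketch of just that member; per the rule above, the two adjacent text chunks are joined with a single space):

```js
async execute(timestamp) {
  // ... seek the video as in the full example above ...
  return [
    { type: "text", value: "Keyframe at" },
    { type: "text", value: `${timestamp} seconds:` }, // joined to the previous chunk
    { type: "image", value: videoEl }
  ];
}
```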

### Structured output with JSON schema or RegExp constraints

To help with programmatic processing of language model responses, the prompt API supports constraining the response with either a JSON schema object or a `RegExp` passed as the `responseConstraint` option:
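
A minimal sketch of that option, assuming an existing `session` and a `post` string to classify:

```js
// Constrain the response to a boolean via a JSON schema.
const schema = { type: "boolean" };

const result = await session.prompt(
  `Is this post about pottery?\n\n${post}`,
  { responseConstraint: schema }
);

// The result string conforms to the schema, so it parses as JSON.
console.log(JSON.parse(result)); // true or false
```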

index.bs

Lines changed: 9 additions & 5 deletions
@@ -82,6 +82,7 @@ callback LanguageModelToolFunction = Promise<DOMString> (any... arguments);
 dictionary LanguageModelTool {
   required DOMString name;
   required DOMString description;
+  LanguageModelExpected expectedOutputs;
   // JSON schema for the input parameters.
   required object inputSchema;
   // The function to be invoked by user agent on behalf of language model.

@@ -135,14 +136,17 @@ typedef (
 
 dictionary LanguageModelMessage {
   required LanguageModelMessageRole role;
-
-  // The DOMString branch is shorthand for `[{ type: "text", value: providedValue }]`
-  required (DOMString or sequence<LanguageModelMessageContent>) content;
-
+  required LanguageModelMessageContent content;
   boolean prefix = false;
 };
 
-dictionary LanguageModelMessageContent {
+typedef (
+  sequence<LanguageModelMessageContentChunk>
+  // Shorthand for `[{ type: "text", value: providedValue }]`
+  or DOMString
+) LanguageModelMessageContent;
+
+dictionary LanguageModelMessageContentChunk {
   required LanguageModelMessageType type;
   required LanguageModelMessageValue value;
 };
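
Per the shorthand noted in the typedef, the two message forms below are equivalent (a sketch, assuming an existing `session`):

```js
// A DOMString as the content member is shorthand for a single text chunk...
await session.prompt([{ role: "user", content: "Hello!" }]);

// ...which is equivalent to the explicit chunk-array form.
await session.prompt([
  { role: "user", content: [{ type: "text", value: "Hello!" }] }
]);
```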
