## README.md (109 additions, 54 deletions)

@@ -137,60 +137,6 @@ const result = await multiUserSession.prompt([

Because of their special behavior of being preserved on context window overflow, system prompts cannot be provided this way.

### Multimodal inputs

All of the above examples have been of text prompts. Some language models also support other inputs. Our design initially includes the potential to support images and audio clips as inputs. This is done by using objects in the form `{ type: "image", value }` and `{ type: "audio", value }` instead of strings. The `value` can be any of the following:
@@ -281,6 +227,115 @@ Details:

Future extensions may include more ambitious multimodal inputs, such as video clips, or realtime audio or video. (Realtime might require a different API design, more based around events or streams instead of messages.)

### Tool use

The Prompt API supports **tool use** via the `tools` option, allowing you to define external capabilities that a language model can invoke in a model-agnostic way. Each tool is represented by an object that includes an `execute` member that specifies the JavaScript function to be called. When the language model initiates a tool use request, the user agent calls the corresponding `execute` function and sends the result back to the model.

Here’s an example of how to use the `tools` option:

```js
const session = await LanguageModel.create({
  initialPrompts: [
    {
      role: "system",
      content: `You are a helpful assistant. You can use tools to help the user.`
    }
  ],
  tools: [
    {
      name: "getWeather",
      description: "Get the weather in a location.",
      inputSchema: {
        type: "object",
        properties: {
          location: {
            type: "string",
            description: "The city to check for the weather condition.",
          },
        },
        required: ["location"],
      },
      async execute({ location }) {
        // Encode the location so city names with spaces etc. form a valid URL.
        const res = await fetch(`https://weatherapi.example/?location=${encodeURIComponent(location)}`);
        // Return the result to the model as a JSON string.
        return JSON.stringify(await res.json());
      },
    }
  ]
});

const result = await session.prompt("What is the weather in Seattle?");
```

In this example, the `tools` array defines a `getWeather` tool, specifying its name, description, input schema, and `execute` implementation. When the language model determines that a tool call is needed, the user agent invokes the `getWeather` tool's `execute()` function with the provided arguments and returns the result to the model, which can then incorporate it into its response.

#### Concurrent tool use

Developers should be aware that the model might call their tool multiple times, concurrently. For example, code such as

```js
const result = await session.prompt("Which of these locations currently has the highest temperature? Seattle, Tokyo, Berlin");
```

might call the above `"getWeather"` tool's `execute()` function three times. The model would wait for all tool call results to return, using the equivalent of `Promise.all()` internally, before it composes its final response.
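
Conceptually, the user agent's handling of these parallel calls resembles the following sketch. This is illustrative only, not anything a web developer writes; `getWeatherTool` stands in for the tool definition above.

```js
// Hypothetical user-agent internals: the model requested three calls at once.
const toolResults = await Promise.all([
  getWeatherTool.execute({ location: "Seattle" }),
  getWeatherTool.execute({ location: "Tokyo" }),
  getWeatherTool.execute({ location: "Berlin" }),
]);
// Only after all three settle are the results fed back to the model.
```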

> **Reviewer:** If one of the tool calls fails, which error would be surfaced to the `prompt()` call?
>
> **Author:** The error thrown by the tool. I think this is implied by the `Promise.all()` reference?
>
> **Reviewer:** Then this means `session.prompt()` may fail with, for instance, a `NotSupportedError` that does not come from the errors developers currently expect from the prompt spec, but from the tool itself. Is this a pattern that already exists in the web platform world?
>
> **Author:** It would not fail with a "NotSupportedError" `DOMException`, unless that's what the web developer threw from their `execute()` function. It would fail with whatever exception the developer threw. Rethrowing exceptions that developers throw is common; e.g., it's done by `setTimeout()` or other async scheduling functions.
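
A minimal sketch of the behavior discussed in this thread, assuming the model decides to call the (deliberately failing) tool:

```js
const session = await LanguageModel.create({
  tools: [{
    name: "getWeather",
    description: "Get the weather in a location.",
    inputSchema: {
      type: "object",
      properties: { location: { type: "string" } },
      required: ["location"]
    },
    async execute({ location }) {
      // Simulate a broken backend.
      throw new Error(`Weather service unavailable for ${location}`);
    }
  }]
});

try {
  await session.prompt("What is the weather in Seattle?");
} catch (e) {
  // If the model invoked the tool, the same Error thrown by execute()
  // is what rejects the prompt() promise.
  console.error(e.message); // "Weather service unavailable for Seattle"
}
```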


Similarly, the model might call multiple different tools if it believes they are all relevant to the given prompt.

If a developer's `execute()` function is not safe against being called multiple times concurrently, e.g., because it accesses some shared resource, then the developer is responsible for writing appropriate code to suspend execution until the resource is available. The following section contains an example of such code.

#### Tool return values

The above example shows a tool returning a string (in fact, stringified JSON). Models which support [multimodal inputs](#multimodal-inputs) might also support interpreting image or audio results from tool calls.

Just like the `content` option to a `prompt()` call can accept either a string or an array of `{ type, value }` objects, web developer-provided tools can return either a string or such an array. Here's an example:

```js
let mutex, resolveMutex;

const session = await LanguageModel.create({
  tools: [
    {
      name: "grabKeyframe",
      description: "Grab a keyframe from the video we're analyzing at the given time",
      inputSchema: {
        type: "number",
        minimum: 0,
        exclusiveMaximum: videoEl.duration
      },
      expectedOutputs: {
        type: ["image"]
      },
      async execute(timestamp) {
        if (mutex) {
          // Since we're seeking a single video element, guard against concurrent calls.
          await mutex;
        }
        try {
          mutex = new Promise(r => resolveMutex = r);

          if (Math.abs(videoEl.currentTime - timestamp) > 0.001) {
            videoEl.currentTime = timestamp;
            await new Promise(r => videoEl.addEventListener("seeked", r, { once: true }));
          }
          await new Promise(r => videoEl.requestVideoFrameCallback(r));

          return [{ type: "image", value: videoEl }];
        } finally {
          resolveMutex();
          mutex = null;
        }
      }
    }
  ]
});
```

> **Reviewer** (on `exclusiveMaximum: videoEl.duration`): I believe `video.currentTime = video.duration` is valid to get the last frame, so we should consider using `maximum` instead of `exclusiveMaximum`.

> **Reviewer** (on `expectedOutputs`): I know we considered requiring `expectedInputTypes` to include the modalities returned by tools. Should that be mentioned, and should this example follow that requirement/guidance?
>
> **Author:** I don't think that's necessary. Similar to the above, prompt inputs and tool outputs are separate things. You seem to be thinking that tool outputs are a subset of prompt inputs, but I don't think that's the right model. Both developer-supplied lists need to be checked to see if the overall prompt API implementation supports those modalities/languages, but one is not a subset of the other.

Note how the output types need to be specified in the tool definition, so that session creation can fail early if the model doesn't support processing multimodal tool outputs. If the return value contains non-text components that are not declared in the tool specification, the tool call will fail at prompting time, even if the model could support it.

> **Reviewer:** Do we already know which type of error session creation will fail with if the model doesn't support processing multimodal tool outputs?
>
> **Author:** It would be a "NotSupportedError" `DOMException`. I'll incorporate that.
>
> **Reviewer:** Thanks!
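
A minimal sketch of that early failure, assuming an implementation whose model cannot process image tool outputs:

```js
try {
  await LanguageModel.create({
    tools: [{
      name: "grabKeyframe",
      description: "Grab a keyframe from the video at the given time",
      inputSchema: { type: "number" },
      expectedOutputs: { type: ["image"] },
      async execute(timestamp) { /* ... */ }
    }]
  });
} catch (e) {
  // Creation fails before any prompting happens; per the thread above,
  // this would be a "NotSupportedError" DOMException.
  console.assert(e.name === "NotSupportedError");
}
```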


Similarly, expected output languages can be provided (e.g., via `expectedOutputs: { languages: ["ja"] }`) to get an early failure if the model doesn't support processing tool outputs in those languages. However, unlike modalities, there is no prompt-time checking of the tool call result's languages.

The above example shows a single-item array, but just like with prompt inputs, it's allowed to include multiple tool outputs. The same rules are followed as for inputs, e.g., concatenation of adjacent text chunks is simple string concatenation with no space or other characters inserted.
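
For instance, a tool might return a mixed array; this sketch reuses `videoEl` and `timestamp` from the example above. The two adjacent text chunks would be collapsed into a single text chunk (e.g., `"Keyframe at 12.5s:"` for `timestamp === 12.5`), with no separator inserted:

```js
return [
  { type: "text", value: "Keyframe at " },
  { type: "text", value: `${timestamp}s:` }, // joined directly onto the previous chunk
  { type: "image", value: videoEl }
];
```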

### Structured output with JSON schema or RegExp constraints

To help with programmatic processing of language model responses, the prompt API supports constraining the response with either a JSON schema object or a `RegExp` passed as the `responseConstraint` option:
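
For example, a `RegExp` constraint might look like the following minimal sketch (the option name `responseConstraint` is from this README; the prompt text is illustrative):

```js
const result = await session.prompt(
  "Is this message spam? Answer yes or no: 'FREE PRIZE!! Click now!'",
  { responseConstraint: /^(yes|no)$/i }
);
// result is constrained to match the RegExp, e.g. "yes".
```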
## index.bs (24 additions, 19 deletions)
@@ -82,6 +82,7 @@ callback LanguageModelToolFunction = Promise&lt;DOMString&gt; (any... arguments);

```webidl
dictionary LanguageModelTool {
  required DOMString name;
  required DOMString description;
  LanguageModelExpected expectedOutputs;
  // JSON schema for the input parameters.
  required object inputSchema;
  // The function to be invoked by user agent on behalf of language model.
```
@@ -135,14 +136,17 @@

```webidl
dictionary LanguageModelMessage {
  required LanguageModelMessageRole role;
  required LanguageModelMessageContent content;
  boolean prefix = false;
};

typedef (
  sequence<LanguageModelMessageContentChunk>
  // Shorthand for `[{ type: "text", value: providedValue }]`
  or DOMString
) LanguageModelMessageContent;

dictionary LanguageModelMessageContentChunk {
  required LanguageModelMessageType type;
  required LanguageModelMessageValue value;
};
```
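
As a sketch of what the `DOMString` shorthand means in practice, these two prompts canonicalize to the same longhand form:

```js
await session.prompt("Tell me a joke");

await session.prompt([{
  role: "user",
  content: [{ type: "text", value: "Tell me a joke" }]
}]);
```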
@@ -164,7 +168,8 @@ typedef (
<p class="note">This will be incorporated into a proper part of the specification later. For now, we're just writing out this algorithm as a full spec, since it's complicated.</p>

<div algorithm>
<!-- TODO remove noexport once there are actual references to this algorithm in the spec. It is only being used now to silence a build warning. -->
To <dfn noexport>validate and canonicalize a prompt</dfn> given a {{LanguageModelPrompt}} |input|, a [=list=] of {{LanguageModelMessageType}}s |expectedTypes|, and a boolean |isInitial|, perform the following steps. The return value will be a non-empty [=list=] of {{LanguageModelMessage}}s in their "longhand" form.

1. [=Assert=]: |expectedTypes| [=list/contains=] "{{LanguageModelMessageType/text}}".

@@ -173,8 +178,8 @@ typedef (
"{{LanguageModelMessage/role}}" → "{{LanguageModelMessageRole/user}}",
"{{LanguageModelMessage/content}}" → «
«[
"{{LanguageModelMessageContent/type}}" → "{{LanguageModelMessageType/text}}",
"{{LanguageModelMessageContent/value}}" → |input|
"{{LanguageModelMessageContentChunk/type}}" → "{{LanguageModelMessageType/text}}",
"{{LanguageModelMessageContentChunk/value}}" → |input|
»,
"{{LanguageModelMessage/prefix}}" → false
@@ -193,8 +198,8 @@ typedef (
"{{LanguageModelMessage/role}}" → |message|["{{LanguageModelMessage/role}}"],
"{{LanguageModelMessage/content}}" → «
«[
"{{LanguageModelMessageContent/type}}" → "{{LanguageModelMessageType/text}}",
"{{LanguageModelMessageContent/value}}" → |message|
"{{LanguageModelMessageContentChunk/type}}" → "{{LanguageModelMessageType/text}}",
"{{LanguageModelMessageContentChunk/value}}" → |message|
»,
"{{LanguageModelMessage/prefix}}" → |message|["{{LanguageModelMessage/prefix}}"]
@@ -218,39 +223,39 @@

1. If |message|["{{LanguageModelMessage/role}}"] is not "{{LanguageModelMessageRole/system}}", then set |seenNonSystemRole| to true.

1. If |message|["{{LanguageModelMessage/role}}"] is "{{LanguageModelMessageRole/assistant}}" and |content|["{{LanguageModelMessageContentChunk/type}}"] is not "{{LanguageModelMessageType/text}}", then throw a "{{NotSupportedError}}" {{DOMException}}.

1. If |content|["{{LanguageModelMessageContentChunk/type}}"] is "{{LanguageModelMessageType/text}}" and |content|["{{LanguageModelMessageContentChunk/value}}"] is not a [=string=], then throw a {{TypeError}}.

1. If |content|["{{LanguageModelMessageContentChunk/type}}"] is "{{LanguageModelMessageType/image}}", then:

1. If |expectedTypes| does not [=list/contain=] "{{LanguageModelMessageType/image}}", then throw a "{{NotSupportedError}}" {{DOMException}}.

1. If |content|["{{LanguageModelMessageContentChunk/value}}"] is not an {{ImageBitmapSource}} or {{BufferSource}}, then throw a {{TypeError}}.

1. If |content|["{{LanguageModelMessageContentChunk/type}}"] is "{{LanguageModelMessageType/audio}}", then:

1. If |expectedTypes| does not [=list/contain=] "{{LanguageModelMessageType/audio}}", then throw a "{{NotSupportedError}}" {{DOMException}}.

1. If |content|["{{LanguageModelMessageContentChunk/value}}"] is not an {{AudioBuffer}}, {{BufferSource}}, or {{Blob}}, then throw a {{TypeError}}.

1. Let |contentWithContiguousTextCollapsed| be an empty [=list=] of {{LanguageModelMessageContentChunk}}s.

1. Let |lastTextContent| be null.

1. [=list/For each=] |content| of |message|["{{LanguageModelMessage/content}}"]:

1. If |content|["{{LanguageModelMessageContentChunk/type}}"] is "{{LanguageModelMessageType/text}}":

1. If |lastTextContent| is null:

1. [=list/Append=] |content| to |contentWithContiguousTextCollapsed|.

1. Set |lastTextContent| to |content|.

1. Otherwise, set |lastTextContent|["{{LanguageModelMessageContentChunk/value}}"] to the concatenation of |lastTextContent|["{{LanguageModelMessageContentChunk/value}}"] and |content|["{{LanguageModelMessageContentChunk/value}}"].

<p class="note">No space or other character is added. Thus, « «[ "{{LanguageModelMessageContent/type}}" → "{{LanguageModelMessageType/text}}", "`foo`" ]», «[ "{{LanguageModelMessageContent/type}}" → "{{LanguageModelMessageType/text}}", "`bar`" ]» » is canonicalized to « «[ "{{LanguageModelMessageContent/type}}" → "{{LanguageModelMessageType/text}}", "`foobar`" ]».</p>
<p class="note">No space or other character is added. Thus, « «[ "{{LanguageModelMessageContentChunk/type}}" → "{{LanguageModelMessageType/text}}", "`foo`" ]», «[ "{{LanguageModelMessageContentChunk/type}}" → "{{LanguageModelMessageType/text}}", "`bar`" ]» » is canonicalized to « «[ "{{LanguageModelMessageContentChunk/type}}" → "{{LanguageModelMessageType/text}}", "`foobar`" ]».</p>

1. Otherwise:

Expand Down