Add multimodal tool outputs #149
Conversation
Heads up @FrankLi-MSFT, @sushraja-msft, @bokand, @bwalderman; your thoughts are appreciated!
lgtm with mostly minor comments/qs, ty!
README.md (Outdated)
Note how the output types need to be specified in the tool definition, so that session creation can fail early if the model doesn't support processing multimodal tool outputs. If the return value contains non-text components without them being present in the tool specification, then the tool call will fail at prompting time, even if the model could support it.
Similarly, expected output languages can be provided (via `expectedOutputs: { languages: ["ja"] }`) or similar, to get an early failure if the model doesn't support processing tool outputs in those languages. However, unlike modalities, there is no prompt-time checking of the tool call result's languages.
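For concreteness, a hedged sketch of what such a tool definition might look like, based on the README text above (the tool name, description, and `execute()` body are illustrative, not spec text):

```javascript
// Illustrative tool definition (assumed shape): declaring expectedOutputs
// lets session creation fail early if the model can't process image outputs
// or Japanese-language outputs.
const frameTool = {
  name: "getVideoFrame", // hypothetical tool name
  description: "Returns the frame at a given timestamp as an image",
  inputSchema: { type: "number", minimum: 0 },
  expectedOutputs: { types: ["image"], languages: ["ja"] },
  async execute(timestamp) {
    // A real implementation would capture a frame here. Note the non-text
    // return value: it must be declared in expectedOutputs.types above,
    // or the tool call fails at prompting time.
    return [{ type: "image", value: null /* e.g. an ImageBitmap */ }];
  },
};
```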
Can you clarify "there is no prompt-time checking of the tool call result's languages"?

IIUC: impls needn't check the language of tool response strings against the expected set? Also, impls can (and probably should?) check the tool `expectedOutputs` languages against the specified `expectedInputLanguages` in the call to `create()`, right?
> IIUC: impls needn't check the language of tool response strings against the expected set?
Yes, that's what I meant. In more detail:
- If you have `expectedOutputs: { types: ["text"] }`, or just omit `expectedOutputs` so you get the default of only-text, and then your tool returns `[{ type: "image", value: whatever }]`, the implementation will fail the tool call.
- However, if you have `expectedOutputs: { languages: ["ja"] }`, and then your tool returns `"Hello this is English"`, the implementation will not fail your tool call.
> Also, impls can (and probably should?) check the tool `expectedOutputs` languages against the specified `expectedInputLanguages` in the call to `create()`, right?
I think they're separate. If your tool is a translation tool, for example, your expected prompt input languages and your expected tool output languages are quite different.
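To make the asymmetry concrete, here is a hedged sketch of a checker behaving as described; the function and its exact error are my own illustration, not the spec's algorithm. Declared output *types* are enforced against the tool's return value, declared *languages* are not:

```javascript
// Illustrative only: mirrors the described behavior.
function checkToolResult(result, expectedOutputs = {}) {
  // A bare string is treated as a single text chunk.
  const chunks = typeof result === "string"
    ? [{ type: "text", value: result }]
    : result;
  const allowedTypes = expectedOutputs.types ?? ["text"]; // default: text only
  for (const chunk of chunks) {
    if (!allowedTypes.includes(chunk.type)) {
      // Exact error type at prompting time is not pinned down here.
      throw new Error(`Tool returned undeclared output type "${chunk.type}"`);
    }
  }
  // Deliberately no language check: a tool declaring languages: ["ja"] can
  // still return English text without failing the call.
}
```

Under this sketch, `checkToolResult([{ type: "image", value: img }])` throws, while `checkToolResult("Hello this is English", { languages: ["ja"] })` does not.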
```js
  minimum: 0,
  exclusiveMaximum: videoEl.duration
},
expectedOutputs: {
```
I know we considered requiring `expectedInputTypes` to include the modalities returned by tools; should that be mentioned, and should this example follow that requirement/guidance?
I don't think that's necessary. Similar to the above, prompt inputs and tool outputs are separate things. You seem to be thinking that tool outputs are a subset of prompt inputs, but I don't think that's the right model.
Both developer-supplied lists need to be checked to see if the overall prompt API implementation supports those modalities/languages. But one is not a subset of the other.
Force-pushed from 69d7bbe to 5bbaad4.
```js
inputSchema: {
  type: "number",
  minimum: 0,
  exclusiveMaximum: videoEl.duration
```
I believe `video.currentTime = video.duration` is valid to get the last frame, so we should consider using `maximum` instead of `exclusiveMaximum`.
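A quick sketch of the difference, using plain JSON Schema semantics (the bounds-checking helper is illustrative, not part of any implementation):

```javascript
// Illustrative JSON Schema bounds check: with exclusiveMaximum, the exact
// duration is rejected; with maximum, it is allowed (capturing the last frame).
function fitsSchema(value, schema) {
  if (schema.minimum !== undefined && value < schema.minimum) return false;
  if (schema.maximum !== undefined && value > schema.maximum) return false;
  if (schema.exclusiveMaximum !== undefined && value >= schema.exclusiveMaximum) return false;
  return true;
}

const duration = 120; // stand-in for videoEl.duration
fitsSchema(duration, { minimum: 0, exclusiveMaximum: duration }); // → false
fitsSchema(duration, { minimum: 0, maximum: duration });          // → true
```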
```js
});
```
Note how the output types need to be specified in the tool definition, so that session creation can fail early if the model doesn't support processing multimodal tool outputs. If the return value contains non-text components without them being present in the tool specification, then the tool call will fail at prompting time, even if the model could support it.
Do we know already the type of error the session creation will fail with if the model doesn't support processing multimodal tool outputs?
It would be a `"NotSupportedError"` `DOMException`. I'll incorporate that.
Thanks!
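For illustration, the agreed-upon behavior could look like the sketch below, where `mockCreate` is a hypothetical stand-in for the real `LanguageModel.create()` and the supported-types check is an assumption about how an implementation might do it:

```javascript
// Illustrative stand-in for LanguageModel.create(): rejects early with a
// "NotSupportedError" DOMException when a tool declares output types the
// model cannot process.
async function mockCreate({ tools = [] }, supportedOutputTypes = ["text"]) {
  for (const tool of tools) {
    for (const type of tool.expectedOutputs?.types ?? ["text"]) {
      if (!supportedOutputTypes.includes(type)) {
        throw new DOMException(
          `Model cannot process tool outputs of type "${type}"`,
          "NotSupportedError");
      }
    }
  }
  return {}; // a real implementation would return a session object
}

mockCreate({ tools: [{ name: "getVideoFrame", expectedOutputs: { types: ["image"] } }] })
  .catch(e => console.log(e.name)); // logs "NotSupportedError"
```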
```js
const result = await session.prompt("Which of these locations currently has the highest temperature? Seattle, Tokyo, Berlin");
```

might call the above `"getWeather"` tool's `execute()` function three times. The model would wait for all tool call results to return, using the equivalent of `Promise.all()` internally, before it composes its final response.
If one of the tool calls fails, which error would be surfaced to the `prompt()` call?
The error thrown by the tool. I think this is implied by the `Promise.all()` reference?
Then, this means `session.prompt()` may fail with, for instance, a `"NotSupportedError"` that does not come from the prompt spec errors developers are currently expecting, but from the tool itself.

Is this a pattern that already exists in the web platform world?
It would not fail with a `"NotSupportedError"` `DOMException`, unless that's what the web developer threw from their `execute()` function. It would fail with whatever exception the developer threw.

Rethrowing exceptions that developers throw is common, e.g., it's done by `setTimeout()` or other async scheduling functions.
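A minimal sketch of those semantics, under the assumption (from this thread, not the spec text) that tool calls run like `Promise.all()` and the first rejection propagates to the caller unchanged:

```javascript
// Illustrative: tool calls run concurrently; a rejection from any execute()
// propagates as-is, exactly like Promise.all().
async function runToolCalls(tools, argsList) {
  return Promise.all(argsList.map((args, i) => tools[i].execute(...args)));
}

const tools = [
  { execute: async (city) => `${city}: 72°F` },
  { execute: async () => { throw new Error("weather service down"); } },
];

runToolCalls(tools, [["Seattle"], ["Tokyo"]])
  .catch(e => console.log(e.message)); // logs "weather service down"
```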
Co-authored-by: François Beaufort <[email protected]>
(Note that the PR diff involves moving the whole tool use section down below the multimodal inputs section. The new parts are in the "Tool return values" subsection.)
Potential points of discussion:
- How do we feel about the `expectedOutputs` design I added here? It reuses existing types and patterns, so is kind of nice. And it could be expanded in the future with `expectedOutputs: { schema: ... }` for "Tool-calling: would output schemas be useful?" #137. (It's slightly displeasing to have a nested object instead of matching MCP's `outputSchema`, though.)
- In my example I used a non-object for my input schema. I wonder if that will actually work with our current implementations; has anyone tested?
- IDL bikeshedding: I renamed the `{ type, value }` tuple from `LanguageModelMessageContent` to `LanguageModelMessageContentChunk`, so that we could use `LanguageModelMessageContent` for the typedef of `string or { type, value }`. Does that seem OK? (It's unobservable to web content, like all dictionary and typedef names.)
- Note that we should probably merge this after #148, and then we can add a forward-reference discussing the connection between avoiding concurrency and the mutex pattern I use here.