Conversation

domenic
Collaborator

@domenic commented Aug 28, 2025

(Note that the PR diff involves moving the whole tool use section down below the multimodal inputs section. The new parts are in the "Tool return values" subsection.)

Potential points of discussion:

  • How do we feel about the expectedOutputs design I added here? It reuses existing types and patterns, which is kind of nice. And it could be expanded in the future with `expectedOutputs: { schema: ... }` for "Tool-calling: would output schemas be useful?" #137. (It's slightly displeasing to have a nested object instead of matching MCP's outputSchema, though.)

  • In my example I used a non-object for my input schema. I wonder if that will actually work with our current implementations; has anyone tested?

  • IDL bikeshedding: I renamed the { type, value } tuple from LanguageModelMessageContent to LanguageModelMessageContentChunk, so that we could use LanguageModelMessageContent for the typedef of string or { type, value }. Does that seem OK? (It's unobservable to web content, like all dictionary and typedef names.)
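On the second bullet, a tool with a non-object input schema would look something like the following. This is a hypothetical sketch: the `seekVideo` name and fields are illustrative, loosely following the tool-definition shape used elsewhere in the README.

```javascript
// Hypothetical tool whose inputSchema is a bare number, not an object --
// the case the bullet above wonders about.
const seekTool = {
  name: "seekVideo",
  description: "Seek the video to the given time, in seconds.",
  inputSchema: {
    type: "number", // non-object schema: the model passes a single number
    minimum: 0
  },
  async execute(seconds) {
    // In a real page this would set videoEl.currentTime = seconds.
    return `Seeked to ${seconds}s`;
  }
};
```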

Note that we should probably merge this after #148, and then we can add a forward-reference discussing the connection between avoiding concurrency and the mutex pattern I use here.


@michaelwasserman

Heads up @FrankLi-MSFT, @sushraja-msft, @bokand, @bwalderman; your thoughts are appreciated!


@michaelwasserman left a comment


lgtm with mostly minor comments/qs, ty!

README.md Outdated

Note how the output types need to be specified in the tool definition, so that session creation can fail early if the model doesn't support processing multimodal tool outputs. If the return value contains non-text components without them being present in the tool specification, then the tool call will fail at prompting time, even if the model could support it.

Similarly, expected output languages can be provided (via `expectedOutputs: { languages: ["ja"] }` or similar) to get an early failure if the model doesn't support processing tool outputs in those languages. However, unlike modalities, there is no prompt-time checking of the tool call result's languages.
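Putting the two together, a tool declaring both output modalities and output languages might look like this. This is an illustrative sketch only; the `getWeather` name and field values are assumptions based on the README's running example.

```javascript
// Hypothetical tool declaring both the output modalities and the output
// languages it may return, so that session creation can fail early if the
// model cannot handle them.
const getWeatherTool = {
  name: "getWeather",
  description: "Returns the current weather for a city.",
  inputSchema: {
    type: "object",
    properties: { city: { type: "string" } },
    required: ["city"]
  },
  expectedOutputs: {
    types: ["text", "image"], // checked at create() time and at prompt time
    languages: ["ja"]         // checked at create() time, but not at prompt time
  },
  async execute({ city }) {
    // A real implementation would fetch weather data here.
    return [{ type: "text", value: `${city}: 18°C` }];
  }
};
```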


Can you clarify "there is no prompt-time checking of the tool call result's languages"?

IIUC: impls needn't check the language of tool response strings against the expected set? Also, impls can (and probably should?) check the tool expectedOutputs languages against the expectedInputLanguages specified in the call to create(), right?

Collaborator Author


IIUC: impls needn't check the language of tool response strings against the expected set?

Yes, that's what I meant. In more detail:

  • If you have expectedOutputs: { types: ["text"] }, or just omit expectedOutputs so you get the default of only-text, and then your tool returns [{ type: "image", value: whatever }], the implementation will fail the tool call.

  • However, if you have expectedOutputs: { languages: ["ja"] }, and then your tool returns "Hello this is English", the implementation will not fail your tool call.

Also, impls can (and probably should?), check the tool expectedOutputs languages against the specified expectedInputLanguages in the call to create(), right?

I think they're separate. If your tool is a translation tool, for example, your expected prompt input languages and your expected tool output languages are quite different.
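The asymmetry described above could be sketched as a prompt-time check along these lines. This is purely illustrative; implementations are not specified to use any particular code, and `checkToolResult` is a hypothetical name.

```javascript
// Hypothetical prompt-time validation of a tool's return value.
// Output types are enforced; output languages are deliberately not.
function checkToolResult(result, expectedOutputs = {}) {
  const allowedTypes = expectedOutputs.types ?? ["text"]; // default: text only
  const chunks = typeof result === "string"
    ? [{ type: "text", value: result }]
    : result;
  for (const chunk of chunks) {
    if (!allowedTypes.includes(chunk.type)) {
      throw new Error(`Tool returned a "${chunk.type}" chunk, but only ` +
                      `${allowedTypes.join(", ")} were declared`);
    }
  }
  // Note: no language detection happens here, so an English string is
  // accepted even when expectedOutputs.languages is ["ja"].
  return chunks;
}
```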

minimum: 0,
exclusiveMaximum: videoEl.duration
},
expectedOutputs: {


I know we considered requiring expectedInputTypes to include the modalities returned by tools, should that be mentioned, and should this example follow that requirement/guidance?

Collaborator Author


I don't think that's necessary. Similar to the above, prompt inputs and tool outputs are separate things. You seem to be thinking that tool outputs are a subset of prompt inputs, but I don't think that's the right model.

Both developer-supplied lists need to be checked to see if the overall prompt API implementation supports those modalities/languages. But one is not a subset of the other.

inputSchema: {
type: "number",
minimum: 0,
exclusiveMaximum: videoEl.duration


I believe `video.currentTime = video.duration` is valid for getting the last frame, so we should consider using `maximum` instead of `exclusiveMaximum`.
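The suggested schema would then read as follows. This is a sketch: `duration` stands in for `videoEl.duration` from the example, since there is no video element here, and the `isValid` helper is a hypothetical stand-in for the implementation's schema checking.

```javascript
// Sketch of the suggested change: `maximum` keeps currentTime === duration
// (the last frame) as a valid input, unlike `exclusiveMaximum`.
const duration = 10; // stand-in value for videoEl.duration
const inputSchema = {
  type: "number",
  minimum: 0,
  maximum: duration // inclusive upper bound
};

// Minimal check mirroring JSON Schema's numeric keywords:
const isValid = t =>
  typeof t === "number" && t >= inputSchema.minimum && t <= inputSchema.maximum;
```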

});
```

Note how the output types need to be specified in the tool definition, so that session creation can fail early if the model doesn't support processing multimodal tool outputs. If the return value contains non-text components without them being present in the tool specification, then the tool call will fail at prompting time, even if the model could support it.


Do we know already the type of error the session creation will fail with if the model doesn't support processing multimodal tool outputs?

Collaborator Author


It would be a "NotSupportedError" DOMException. I'll incorporate that.
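Handling that failure would look something like the following. The `mockCreate` function here only simulates the described behavior (the real check happens inside the implementation), and the fallback logic is illustrative.

```javascript
// Mock of a create() that rejects with a "NotSupportedError" DOMException
// when the model cannot process the declared multimodal tool outputs.
async function mockCreate({ tools }) {
  const needsImages = tools.some(t =>
    (t.expectedOutputs?.types ?? ["text"]).includes("image"));
  if (needsImages) {
    throw new DOMException("Model cannot process image tool outputs.",
                           "NotSupportedError");
  }
  return { /* session */ };
}

// Callers can branch on the DOMException name:
async function tryCreate(tools) {
  try {
    return await mockCreate({ tools });
  } catch (e) {
    if (e instanceof DOMException && e.name === "NotSupportedError") {
      return null; // e.g., fall back to a text-only tool set
    }
    throw e;
  }
}
```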


Thanks!

const result = await session.prompt("Which of these locations currently has the highest temperature? Seattle, Tokyo, Berlin");
```

might call the above `"getWeather"` tool's `execute()` function three times. The model would wait for all tool call results to return, using the equivalent of `Promise.all()` internally, before it composes its final response.
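That behavior can be simulated with plain promises. This is a mock sketch: `getWeather` here stands in for the tool's `execute()` function, and the canned temperatures are made up for illustration.

```javascript
// Simulates the model issuing three tool calls concurrently and waiting
// for all of them before composing its final response.
const getWeather = async city => {
  // A real tool would fetch live data; this mock returns canned values.
  const temps = { Seattle: 18, Tokyo: 27, Berlin: 22 };
  return `${city}: ${temps[city]}°C`;
};

async function runToolCalls(cities) {
  // The equivalent of the Promise.all() the text describes: the combined
  // result is available only once every call has settled.
  return Promise.all(cities.map(getWeather));
}
```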


If one of the tool calls fails, which error would be surfaced to the prompt() call?

Collaborator Author


The error thrown by the tool. I think this is implied by the Promise.all() reference?


Then, this means session.prompt() may fail with, for instance, a "NotSupportedError" that does not come from the prompt spec errors developers are currently expecting, but from the tool itself.
Is this a pattern that already exists in the web platform world?

Collaborator Author


It would not fail with a "NotSupportedError" DOMException, unless that's what the web developer threw from their execute() function. It would fail with whatever exception the developer threw.

Rethrowing exceptions that developers throw is common, e.g., it's done by setTimeout() or other async scheduling functions.
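Concretely, under the Promise.all() model, the developer's own exception is what surfaces. This sketch uses mock tool functions; the `WeatherBackendError` class and all names are illustrative.

```javascript
// One failing tool call rejects the combined Promise.all(), and the
// rejection reason is exactly the error the tool's execute() threw.
class WeatherBackendError extends Error {}

const toolCalls = [
  async () => "Seattle: 18°C",
  async () => { throw new WeatherBackendError("backend down"); },
  async () => "Berlin: 22°C"
];

async function promptWithTools() {
  // The caller sees the WeatherBackendError thrown by the tool, not a
  // DOMException minted by the implementation.
  return Promise.all(toolCalls.map(fn => fn()));
}
```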

Co-authored-by: François Beaufort <[email protected]>