Add multimodal tool outputs #149
Conversation
Heads up @FrankLi-MSFT, @sushraja-msft, @bokand, @bwalderman; your thoughts are appreciated!
lgtm with mostly minor comments/qs, ty!
README.md (Outdated)
Note how the output types need to be specified in the tool definition, so that session creation can fail early if the model doesn't support processing multimodal tool outputs. If the return value contains non-text components without them being present in the tool specification, then the tool call will fail at prompting time, even if the model could support it.
Similarly, expected output languages can be provided (via `expectedOutputs: { languages: ["ja"] }`) or similar, to get an early failure if the model doesn't support processing tool outputs in those languages. However, unlike modalities, there is no prompt-time checking of the tool call result's languages.
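For concreteness, a hedged sketch of what such a tool definition might look like, based on the README text above (the tool name, description, and `execute()` body are illustrative, not spec text):

```javascript
// Illustrative tool definition (assumed shape): declaring expectedOutputs
// lets session creation fail early if the model can't process image outputs
// or Japanese-language outputs.
const frameTool = {
  name: "getVideoFrame", // hypothetical tool name
  description: "Returns the frame at a given timestamp as an image",
  inputSchema: { type: "number", minimum: 0 },
  expectedOutputs: { types: ["image"], languages: ["ja"] },
  async execute(timestamp) {
    // A real implementation would capture a frame here. Note the non-text
    // return value: it must be declared in expectedOutputs.types above,
    // or the tool call fails at prompting time.
    return [{ type: "image", value: null /* e.g. an ImageBitmap */ }];
  },
};
```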
Can you clarify "there is no prompt-time checking of the tool call result's languages"?

IIUC: impls needn't check the language of tool response strings against the expected set? Also, impls can (and probably should?) check the tool `expectedOutputs` languages against the specified `expectedInputLanguages` in the call to `create()`, right?
> IIUC: impls needn't check the language of tool response strings against the expected set?
Yes, that's what I meant. In more detail:
- If you have `expectedOutputs: { types: ["text"] }`, or just omit `expectedOutputs` so you get the default of only-text, and then your tool returns `[{ type: "image", value: whatever }]`, the implementation will fail the tool call.
- However, if you have `expectedOutputs: { languages: ["ja"] }`, and then your tool returns `"Hello this is English"`, the implementation will not fail your tool call.
> Also, impls can (and probably should?) check the tool `expectedOutputs` languages against the specified `expectedInputLanguages` in the call to `create()`, right?
I think they're separate. If your tool is a translation tool, for example, your expected prompt input languages and your expected tool output languages are quite different.
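To make the asymmetry concrete, here is a hedged sketch of a checker behaving as described; the function and its exact error are my own illustration, not the spec's algorithm. Declared output *types* are enforced against the tool's return value, declared *languages* are not:

```javascript
// Illustrative only: mirrors the described behavior.
function checkToolResult(result, expectedOutputs = {}) {
  // A bare string is treated as a single text chunk.
  const chunks = typeof result === "string"
    ? [{ type: "text", value: result }]
    : result;
  const allowedTypes = expectedOutputs.types ?? ["text"]; // default: text only
  for (const chunk of chunks) {
    if (!allowedTypes.includes(chunk.type)) {
      // Exact error type at prompting time is not pinned down here.
      throw new Error(`Tool returned undeclared output type "${chunk.type}"`);
    }
  }
  // Deliberately no language check: a tool declaring languages: ["ja"] can
  // still return English text without failing the call.
}
```

Under this sketch, `checkToolResult([{ type: "image", value: img }])` throws, while `checkToolResult("Hello this is English", { languages: ["ja"] })` does not.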
```js
  minimum: 0,
  exclusiveMaximum: videoEl.duration
},
expectedOutputs: {
```
I know we considered requiring `expectedInputTypes` to include the modalities returned by tools; should that be mentioned, and should this example follow that requirement/guidance?
I don't think that's necessary. Similar to the above, prompt inputs and tool outputs are separate things. You seem to be thinking that tool outputs are a subset of prompt inputs, but I don't think that's the right model.
Both developer-supplied lists need to be checked to see if the overall prompt API implementation supports those modalities/languages. But one is not a subset of the other.
Force-pushed from 69d7bbe to 5bbaad4.
```js
inputSchema: {
  type: "number",
  minimum: 0,
  exclusiveMaximum: videoEl.duration
```
I believe `video.currentTime = video.duration` is valid to get the last frame, so we should consider using `maximum` instead of `exclusiveMaximum`.
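A quick sketch of the difference, using plain JSON Schema semantics (the bounds-checking helper is illustrative, not part of any implementation):

```javascript
// Illustrative JSON Schema bounds check: with exclusiveMaximum, the exact
// duration is rejected; with maximum, it is allowed (capturing the last frame).
function fitsSchema(value, schema) {
  if (schema.minimum !== undefined && value < schema.minimum) return false;
  if (schema.maximum !== undefined && value > schema.maximum) return false;
  if (schema.exclusiveMaximum !== undefined && value >= schema.exclusiveMaximum) return false;
  return true;
}

const duration = 120; // stand-in for videoEl.duration
fitsSchema(duration, { minimum: 0, exclusiveMaximum: duration }); // → false
fitsSchema(duration, { minimum: 0, maximum: duration });          // → true
```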
```js
});
```
Note how the output types need to be specified in the tool definition, so that session creation can fail early if the model doesn't support processing multimodal tool outputs. If the return value contains non-text components without them being present in the tool specification, then the tool call will fail at prompting time, even if the model could support it.
Do we know already the type of error the session creation will fail with if the model doesn't support processing multimodal tool outputs?
It would be a `"NotSupportedError"` `DOMException`. I'll incorporate that.
Thanks!
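For illustration, the agreed-upon behavior could look like the sketch below, where `mockCreate` is a hypothetical stand-in for the real `LanguageModel.create()` and the supported-types check is an assumption about how an implementation might do it:

```javascript
// Illustrative stand-in for LanguageModel.create(): rejects early with a
// "NotSupportedError" DOMException when a tool declares output types the
// model cannot process.
async function mockCreate({ tools = [] }, supportedOutputTypes = ["text"]) {
  for (const tool of tools) {
    for (const type of tool.expectedOutputs?.types ?? ["text"]) {
      if (!supportedOutputTypes.includes(type)) {
        throw new DOMException(
          `Model cannot process tool outputs of type "${type}"`,
          "NotSupportedError");
      }
    }
  }
  return {}; // a real implementation would return a session object
}

mockCreate({ tools: [{ name: "getVideoFrame", expectedOutputs: { types: ["image"] } }] })
  .catch(e => console.log(e.name)); // logs "NotSupportedError"
```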
```js
const result = await session.prompt("Which of these locations currently has the highest temperature? Seattle, Tokyo, Berlin");
```

might call the above `"getWeather"` tool's `execute()` function three times. The model would wait for all tool call results to return, using the equivalent of `Promise.all()` internally, before it composes its final response.
If one of the tool calls fails, which error would be surfaced to the `prompt()` call?
The error thrown by the tool. I think this is implied by the `Promise.all()` reference?
Then, this means `session.prompt()` may fail with, for instance, a `"NotSupportedError"` that does not come from the prompt spec errors developers are currently expecting, but from the tool itself.

Is this a pattern that already exists in the web platform world?
It would not fail with a `"NotSupportedError"` `DOMException`, unless that's what the web developer threw from their `execute()` function. It would fail with whatever exception the developer threw.

Rethrowing exceptions that developers throw is common, e.g., it's done by `setTimeout()` or other async scheduling functions.
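A minimal sketch of those semantics, under the assumption (from this thread, not the spec text) that tool calls run like `Promise.all()` and the first rejection propagates to the caller unchanged:

```javascript
// Illustrative: tool calls run concurrently; a rejection from any execute()
// propagates as-is, exactly like Promise.all().
async function runToolCalls(tools, argsList) {
  return Promise.all(argsList.map((args, i) => tools[i].execute(...args)));
}

const tools = [
  { execute: async (city) => `${city}: 72°F` },
  { execute: async () => { throw new Error("weather service down"); } },
];

runToolCalls(tools, [["Seattle"], ["Tokyo"]])
  .catch(e => console.log(e.message)); // logs "weather service down"
```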
Co-authored-by: François Beaufort <[email protected]>
(Note that the PR diff involves moving the whole tool use section down below the multimodal inputs section. The new parts are in the "Tool return values" subsection.)
Potential points of discussion:
- How do we feel about the `expectedOutputs` design I added here? It reuses existing types and patterns, so is kind of nice. And it could be expanded in the future with `expectedOutputs: { schema: ... }` for "Tool-calling: would output schemas be useful?" #137. (It's slightly displeasing to have a nested object instead of matching MCP's `outputSchema`, though.)
- In my example I used a non-object for my input schema. I wonder if that will actually work with our current implementations; has anyone tested?
- IDL bikeshedding: I renamed the `{ type, value }` tuple from `LanguageModelMessageContent` to `LanguageModelMessageContentChunk`, so that we could use `LanguageModelMessageContent` for the typedef of `string or { type, value }`. Does that seem OK? (It's unobservable to web content, like all dictionary and typedef names.)
- Note that we should probably merge this after #148, and then we can add a forward-reference discussing the connection between avoiding concurrency and the mutex pattern I use here.