The question is:

We have the native audio-to-audio model "gemini-2.5-flash-preview-native-audio-dialog".

Looking at the model details at https://ai.google.dev/gemini-api/docs/models#gemini-2.5-flash-native-audio, it does not support structured outputs.

Looking at the Gemini Live API (https://ai.google.dev/gemini-api/docs/live-guide#establish-connection), which is how this model is supposed to be used:

> You can only set one modality in the response_modalities field. This means that you can configure the model to respond with either text or audio, but not both in the same session.

So you set the modality to AUDIO and that's it: there is no text output left that an agentic workflow could pass along and process.
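For reference, here is roughly what that constraint looks like with the google-genai Python SDK (a minimal sketch; parameter and field names may differ between SDK versions):

```python
# Minimal Live API connection sketch: only ONE response modality can be set.
import asyncio
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key

MODEL = "gemini-2.5-flash-preview-native-audio-dialog"

# ["AUDIO"] or ["TEXT"], never both in the same session.
config = types.LiveConnectConfig(response_modalities=["AUDIO"])

async def main():
    async with client.aio.live.connect(model=MODEL, config=config) as session:
        # From here on, every model response in this session is audio only.
        ...

asyncio.run(main())
```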
All you can actually do is enable audio transcription (https://ai.google.dev/gemini-api/docs/live-guide#audio-transcription), which gives you a word-for-word text transcript of the audio conversation.
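What I ended up with is something like the following, based on the attribute names in the Live API docs (again just a sketch, the exact SDK surface may vary between releases):

```python
# Enable input/output transcription on the Live session and collect the text.
from google.genai import types

config = types.LiveConnectConfig(
    response_modalities=["AUDIO"],
    input_audio_transcription=types.AudioTranscriptionConfig(),
    output_audio_transcription=types.AudioTranscriptionConfig(),
)

async def collect_transcript(session) -> str:
    """Accumulate user/model transcript lines from a live session."""
    lines = []
    async for message in session.receive():
        content = message.server_content
        if content is None:
            continue
        if content.input_transcription and content.input_transcription.text:
            lines.append(f"USER: {content.input_transcription.text}")
        if content.output_transcription and content.output_transcription.text:
            lines.append(f"MODEL: {content.output_transcription.text}")
    return "\n".join(lines)
```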
Is this actually how it is meant to be used? Just a plain audio conversation with transcription (and maybe tool calling), where at the end you hand the transcript to another agent, built on another model, that analyzes it and produces a report?

If so, how exactly are we supposed to use these models?
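The only pattern I can come up with is exactly that two-step pipeline: collect the transcript during the live session, then hand it to a separate text model that does support structured output. Rough sketch (the report schema and the model name are my own assumptions, not anything the docs prescribe):

```python
# Hand-off step: turn the collected transcript into a structured report
# using a regular text model that supports structured (JSON) output.
from pydantic import BaseModel
from google import genai
from google.genai import types

class CallReport(BaseModel):
    summary: str
    action_items: list[str]

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key

def analyze_transcript(transcript: str) -> CallReport:
    response = client.models.generate_content(
        model="gemini-2.5-flash",  # any text model with structured-output support
        contents=f"Summarize this conversation and list action items:\n{transcript}",
        config=types.GenerateContentConfig(
            response_mime_type="application/json",
            response_schema=CallReport,
        ),
    )
    return response.parsed
```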
LangGraph has no support for Google audio models, so you have to build your own custom node (sketch below).
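The custom node then ends up looking something like this: one node runs the live audio session and stores the transcript in graph state, the next node runs the analysis. The two helpers are hypothetical placeholders for the sketches above:

```python
# Custom LangGraph wiring around the Gemini Live audio session.
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class VoiceState(TypedDict):
    transcript: str
    report: str

def live_audio_node(state: VoiceState) -> dict:
    # Hypothetical helper: runs the Live audio session and returns the
    # accumulated transcript (see the collect_transcript sketch above).
    transcript = run_live_session_and_collect_transcript()
    return {"transcript": transcript}

def analysis_node(state: VoiceState) -> dict:
    # Hypothetical helper: the structured-output analysis sketch above,
    # here assumed to return the report as text.
    report = analyze_transcript(state["transcript"])
    return {"report": report}

builder = StateGraph(VoiceState)
builder.add_node("live_audio", live_audio_node)
builder.add_node("analysis", analysis_node)
builder.add_edge(START, "live_audio")
builder.add_edge("live_audio", "analysis")
builder.add_edge("analysis", END)
graph = builder.compile()
```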
But wait, Google now has the Agent Development Kit (ADK). They built a simple agent with the google_search tool that actually uses the Gemini Live API with AI agents: https://google.github.io/adk-docs/streaming/
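Their streaming quickstart boils down to roughly this (I may be off on the exact model string, and the ADK API is moving fast):

```python
# ADK agent from the streaming quickstart, roughly: a Live-API-backed agent
# with the built-in google_search tool.
from google.adk.agents import Agent
from google.adk.tools import google_search

root_agent = Agent(
    name="basic_search_agent",
    model="gemini-2.0-flash-exp",  # a Live-API-capable model per the quickstart
    description="Agent to answer questions using Google Search.",
    instruction="Answer the user's questions, using Google Search when needed.",
    tools=[google_search],
)
```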
But wait, there is no implementation for input transcription there??

So please, someone explain to me: how are we actually supposed to use these models?

Are they just a technology preview for now, so that if you want something serious you have to look at OpenAI's GPT-4o audio models, which are currently the only other ones with an audio-to-audio modality?

Thanks in advance.