The question is:

We have the native audio-to-audio model "gemini-2.5-flash-preview-native-audio-dialog".

Looking at the model details at https://ai.google.dev/gemini-api/docs/models#gemini-2.5-flash-native-audio, it does not support structured outputs.

Looking at the Gemini Live API (https://ai.google.dev/gemini-api/docs/live-guide#establish-connection), which is how this model is supposed to be used:

> You can only set one modality in the response_modalities field. This means that you can configure the model to respond with either text or audio, but not both in the same session.

So you set the modality to AUDIO and that's it: there is no text output left that an agentic workflow could pass along and process.
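For reference, here is roughly what that constraint looks like with the google-genai Python SDK (a minimal sketch; parameter and field names may differ between SDK versions):

```python
# Minimal Live API connection sketch: only ONE response modality can be set.
import asyncio
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key

MODEL = "gemini-2.5-flash-preview-native-audio-dialog"

# ["AUDIO"] or ["TEXT"], never both in the same session.
config = types.LiveConnectConfig(response_modalities=["AUDIO"])

async def main():
    async with client.aio.live.connect(model=MODEL, config=config) as session:
        # From here on, every model response in this session is audio only.
        ...

asyncio.run(main())
```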
All you can actually do is enable audio transcription (https://ai.google.dev/gemini-api/docs/live-guide#audio-transcription), which gives you a word-for-word text transcript of the audio conversation.
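What I ended up with is something like the following, based on the attribute names in the Live API docs (again just a sketch, the exact SDK surface may vary between releases):

```python
# Enable input/output transcription on the Live session and collect the text.
from google.genai import types

config = types.LiveConnectConfig(
    response_modalities=["AUDIO"],
    input_audio_transcription=types.AudioTranscriptionConfig(),
    output_audio_transcription=types.AudioTranscriptionConfig(),
)

async def collect_transcript(session) -> str:
    """Accumulate user/model transcript lines from a live session."""
    lines = []
    async for message in session.receive():
        content = message.server_content
        if content is None:
            continue
        if content.input_transcription and content.input_transcription.text:
            lines.append(f"USER: {content.input_transcription.text}")
        if content.output_transcription and content.output_transcription.text:
            lines.append(f"MODEL: {content.output_transcription.text}")
    return "\n".join(lines)
```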
Is this actually how it is meant to be used? Just a plain audio conversation with transcription (and maybe tool calling), where at the end you hand the transcript to another agent, built on another model, that analyzes it and produces a report?

If so, how exactly are we supposed to use these models?
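The only pattern I can come up with is exactly that two-step pipeline: collect the transcript during the live session, then hand it to a separate text model that does support structured output. Rough sketch (the report schema and the model name are my own assumptions, not anything the docs prescribe):

```python
# Hand-off step: turn the collected transcript into a structured report
# using a regular text model that supports structured (JSON) output.
from pydantic import BaseModel
from google import genai
from google.genai import types

class CallReport(BaseModel):
    summary: str
    action_items: list[str]

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key

def analyze_transcript(transcript: str) -> CallReport:
    response = client.models.generate_content(
        model="gemini-2.5-flash",  # any text model with structured-output support
        contents=f"Summarize this conversation and list action items:\n{transcript}",
        config=types.GenerateContentConfig(
            response_mime_type="application/json",
            response_schema=CallReport,
        ),
    )
    return response.parsed
```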
LangGraph has no support for Google audio models, so you have to build your own custom node (sketch below).
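The custom node then ends up looking something like this: one node runs the live audio session and stores the transcript in graph state, the next node runs the analysis. The two helpers are hypothetical placeholders for the sketches above:

```python
# Custom LangGraph wiring around the Gemini Live audio session.
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class VoiceState(TypedDict):
    transcript: str
    report: str

def live_audio_node(state: VoiceState) -> dict:
    # Hypothetical helper: runs the Live audio session and returns the
    # accumulated transcript (see the collect_transcript sketch above).
    transcript = run_live_session_and_collect_transcript()
    return {"transcript": transcript}

def analysis_node(state: VoiceState) -> dict:
    # Hypothetical helper: the structured-output analysis sketch above,
    # here assumed to return the report as text.
    report = analyze_transcript(state["transcript"])
    return {"report": report}

builder = StateGraph(VoiceState)
builder.add_node("live_audio", live_audio_node)
builder.add_node("analysis", analysis_node)
builder.add_edge(START, "live_audio")
builder.add_edge("live_audio", "analysis")
builder.add_edge("analysis", END)
graph = builder.compile()
```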
But wait, Google now has the Agent Development Kit (ADK). They built a simple agent with the google_search tool that actually uses the Gemini Live API with AI agents: https://google.github.io/adk-docs/streaming/
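Their streaming quickstart boils down to roughly this (I may be off on the exact model string, and the ADK API is moving fast):

```python
# ADK agent from the streaming quickstart, roughly: a Live-API-backed agent
# with the built-in google_search tool.
from google.adk.agents import Agent
from google.adk.tools import google_search

root_agent = Agent(
    name="basic_search_agent",
    model="gemini-2.0-flash-exp",  # a Live-API-capable model per the quickstart
    description="Agent to answer questions using Google Search.",
    instruction="Answer the user's questions, using Google Search when needed.",
    tools=[google_search],
)
```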
But wait, there is no implementation for input transcription there??

So please, someone explain to me: how are we actually supposed to use these models?

Are they just a technology preview for now, so that if you want something serious you have to look at OpenAI's GPT-4o audio models, which are currently the only other ones with an audio-to-audio modality?

Thanks in advance.