Commit 5edcbee

IO2025: TTS and Live Native Audio models (#772)
* Adding a new TTS guide about speech generation
* Adding a new Native Audio-out with Live API python script
1 parent b47bcab commit 5edcbee

8 files changed: +995 -30 lines changed

README.md

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ Here are the recent additions and updates to the Gemini API and the Cookbook:
 
 * **Gemini 2.5 models:** Explore the capabilities of the latest Gemini 2.5 models (Flash and Pro)! See the [Get Started Guide](./quickstarts/Get_started.ipynb) and the [thinking guide](./quickstarts/Get_started_thinking.ipynb) as they'll all be thinking ones.
 * **Imagen and Veo**: Get started with our media generation model with this brand new [Veo guide](./quickstarts/Get_started_Veo.ipynb) and [Imagen guide](./quickstarts/Get_started_imagen.ipynb)!
-* **Lyria**: Get started and music generation with the [Lyria RealTime](./quickstarts/Get_started_LyriaRealTime.ipynb) model.
+* **Lyria and TTS**: Get started with podcast and music generation with the [TTS](./quickstarts/Get_started_TTS.ipynb) and [Lyria RealTime](./quickstarts/Get_started_LyriaRealTime.ipynb) models.
 * **LiveAPI**: Get started with the [multimodal Live API](./quickstarts/Get_started_LiveAPI.ipynb) and unlock new interactivity with Gemini.
 * **Recently Added Guides:**
   * [Browser as a tool](./examples/Browser_as_a_tool.ipynb): Use a web browser for live and internal (intranet) web interactions

quickstarts/Get_started_LiveAPI.ipynb

Lines changed: 22 additions & 18 deletions
@@ -57,15 +57,19 @@
  "source": [
  "**Preview**: The Live API is in preview.\n",
  "\n",
- "This notebook demonstrates simple usage of the Gemini 2.0 Multimodal Live API. For an overview of new capabilities refer to the [Gemini 2.0 docs](https://ai.google.dev/gemini-api/docs/models/gemini-v2).\n",
+ "This notebook demonstrates simple usage of the Gemini Multimodal Live API. For an overview of new capabilities refer to the [Gemini Live API docs](https://ai.google.dev/gemini-api/docs/live).\n",
  "\n",
  "This notebook implements a simple turn-based chat where you send messages as text, and the model replies with audio. The API is capable of much more than that. The goal here is to demonstrate with **simple code**.\n",
  "\n",
- "Some features of the API are not working in Colab, to try them it is recommended to have a look at this [python script](./Get_started_LiveAPI.py) and run it locally.\n",
+ "Some features of the API are not working in Colab, to try them it is recommended to have a look at this [Python script](./Get_started_LiveAPI.py) and run it locally.\n",
  "\n",
  "If you aren't looking for code, and just want to try multimedia streaming use [Live API in Google AI Studio](https://aistudio.google.com/app/live).\n",
  "\n",
- "The [Next steps](#next_steps) section at the end of this tutorial provides links to additional resources."
+ "The [Next steps](#next_steps) section at the end of this tutorial provides links to additional resources.\n",
+ "\n",
+ "#### Native audio output\n",
+ "\n",
+ "**Info**: Gemini 2.5 introduces [native audio generation](https://ai.google.dev/gemini-api/docs/live#native-audio-output), which directly generates audio output, providing more natural-sounding audio, more expressive voices, more awareness of additional context (e.g., tone), and more proactive responses. You can try a native audio example in this [script](./Get_started_LiveAPI_NativeAudio.py)."
  ]
  },
  {
@@ -92,7 +96,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 1,
+ "execution_count": null,
  "metadata": {
  "id": "46zEFO2a9FFd"
  },
@@ -123,7 +127,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 2,
+ "execution_count": null,
  "metadata": {
  "id": "A1pkoyZb9Jm3"
  },
@@ -148,7 +152,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 3,
+ "execution_count": null,
  "metadata": {
  "id": "HghvVpbU0Uap"
  },
@@ -172,7 +176,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 4,
+ "execution_count": null,
  "metadata": {
  "id": "27Fikag0xSaB"
  },
@@ -194,7 +198,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 5,
+ "execution_count": null,
  "metadata": {
  "id": "Yd1vs3cP8EmS"
  },
@@ -228,7 +232,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 37,
+ "execution_count": null,
  "metadata": {
  "id": "dDfslcyIOqgI"
  },
@@ -284,7 +288,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 10,
+ "execution_count": null,
  "metadata": {
  "id": "7mEDGwJfLRrm"
  },
@@ -312,7 +316,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 36,
+ "execution_count": null,
  "metadata": {
  "id": "VFD4VleVKj1-"
  },
@@ -413,7 +417,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 13,
+ "execution_count": null,
  "metadata": {
  "id": "bWTaU8j-X3AJ"
  },
@@ -436,7 +440,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 16,
+ "execution_count": null,
  "metadata": {
  "id": "3zAjMOZXFuxI"
  },
@@ -579,7 +583,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 17,
+ "execution_count": null,
  "metadata": {
  "id": "WxdwgTKIGIlY"
  },
@@ -669,7 +673,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 18,
+ "execution_count": null,
  "metadata": {
  "id": "cbkoDa1ve_C5"
  },
@@ -768,7 +772,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 19,
+ "execution_count": null,
  "metadata": {
  "id": "yqBTtKvGmKI4"
  },
@@ -872,7 +876,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 20,
+ "execution_count": null,
  "metadata": {
  "id": "Y5ZVUQ5vJrEJ"
  },
@@ -906,7 +910,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 21,
+ "execution_count": null,
  "metadata": {
  "id": "xH_iZhTxKFtF"
  },
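
Every hunk in this notebook diff resets a code cell's `execution_count` to `null`. The commit does not say how that was done. As a rough sketch only, the same result could be produced with Python's standard `json` module; the function name `clear_execution_counts` and the sample notebook dictionary are hypothetical and exist purely for illustration.

```python
import json


def clear_execution_counts(notebook: dict) -> dict:
    """Return a copy of the notebook with every code cell's execution_count set to None."""
    # Deep copy via a JSON round-trip so the caller's dictionary is left untouched.
    cleaned = json.loads(json.dumps(notebook))
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["execution_count"] = None
    return cleaned


# Hypothetical two-cell notebook; the id and count mirror one hunk in the diff.
sample = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Title"]},
        {
            "cell_type": "code",
            "execution_count": 37,
            "metadata": {"id": "dDfslcyIOqgI"},
            "source": [],
        },
    ]
}

cleaned = clear_execution_counts(sample)
# Python None is serialized as JSON null, which is what the diff shows.
print(json.dumps(cleaned["cells"][1]["execution_count"]))
```

Markdown cells are left alone because they carry no `execution_count` key.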
quickstarts/Get_started_LiveAPI_NativeAudio.py

Lines changed: 163 additions & 0 deletions
@@ -0,0 +1,163 @@
+# -*- coding: utf-8 -*-
+# Copyright 2025 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""
+## Setup
+
+To install the dependencies for this script, run:
+
+```
+brew install portaudio
+pip install -U google-genai pyaudio
+```
+
+## API key
+
+Ensure the `GOOGLE_API_KEY` environment variable is set to the API key
+you obtained from Google AI Studio.
+
+## Run
+
+To run the script:
+
+```
+python Get_started_LiveAPI_NativeAudio.py
+```
+
+Start talking to Gemini
+"""
+
+import asyncio
+import sys
+import traceback
+
+import pyaudio
+
+from google import genai
+
+if sys.version_info < (3, 11, 0):
+    import taskgroup, exceptiongroup
+
+    asyncio.TaskGroup = taskgroup.TaskGroup
+    asyncio.ExceptionGroup = exceptiongroup.ExceptionGroup
+
+FORMAT = pyaudio.paInt16
+CHANNELS = 1
+SEND_SAMPLE_RATE = 16000
+RECEIVE_SAMPLE_RATE = 24000
+CHUNK_SIZE = 1024
+
+pya = pyaudio.PyAudio()
+
+
+client = genai.Client()  # GOOGLE_API_KEY must be set as env variable
+
+MODEL = "gemini-2.5-flash-preview-native-audio-dialog"
+CONFIG = {"response_modalities": ["AUDIO"]}
+
+
+class AudioLoop:
+    def __init__(self):
+        self.audio_in_queue = None
+        self.out_queue = None
+
+        self.session = None
+
+        self.audio_stream = None
+
+        self.receive_audio_task = None
+        self.play_audio_task = None
+
+
+    async def listen_audio(self):
+        mic_info = pya.get_default_input_device_info()
+        self.audio_stream = await asyncio.to_thread(
+            pya.open,
+            format=FORMAT,
+            channels=CHANNELS,
+            rate=SEND_SAMPLE_RATE,
+            input=True,
+            input_device_index=mic_info["index"],
+            frames_per_buffer=CHUNK_SIZE,
+        )
+        if __debug__:
+            kwargs = {"exception_on_overflow": False}
+        else:
+            kwargs = {}
+        while True:
+            data = await asyncio.to_thread(self.audio_stream.read, CHUNK_SIZE, **kwargs)
+            await self.out_queue.put({"data": data, "mime_type": "audio/pcm"})
+
+    async def send_realtime(self):
+        while True:
+            msg = await self.out_queue.get()
+            await self.session.send_realtime_input(audio=msg)
+
+    async def receive_audio(self):
+        "Background task that reads from the websocket and writes PCM chunks to the output queue"
+        while True:
+            turn = self.session.receive()
+            async for response in turn:
+                if data := response.data:
+                    self.audio_in_queue.put_nowait(data)
+                    continue
+                if text := response.text:
+                    print(text, end="")
+
+            # If you interrupt the model, it sends a turn_complete.
+            # For interruptions to work, we need to stop playback.
+            # So empty out the audio queue because it may have loaded
+            # much more audio than has played yet.
+            while not self.audio_in_queue.empty():
+                self.audio_in_queue.get_nowait()
+
+    async def play_audio(self):
+        stream = await asyncio.to_thread(
+            pya.open,
+            format=FORMAT,
+            channels=CHANNELS,
+            rate=RECEIVE_SAMPLE_RATE,
+            output=True,
+        )
+        while True:
+            bytestream = await self.audio_in_queue.get()
+            await asyncio.to_thread(stream.write, bytestream)
+
+    async def run(self):
+        try:
+            async with (
+                client.aio.live.connect(model=MODEL, config=CONFIG) as session,
+                asyncio.TaskGroup() as tg,
+            ):
+                self.session = session
+
+                self.audio_in_queue = asyncio.Queue()
+                self.out_queue = asyncio.Queue(maxsize=5)
+
+                tg.create_task(self.send_realtime())
+                tg.create_task(self.listen_audio())
+                tg.create_task(self.receive_audio())
+                tg.create_task(self.play_audio())
+        except asyncio.CancelledError:
+            pass
+        except ExceptionGroup as EG:
+            if self.audio_stream:
+                self.audio_stream.close()
+            traceback.print_exception(EG)
+
+
+if __name__ == "__main__":
+    loop = AudioLoop()
+    asyncio.run(loop.run())

quickstarts/Get_started_LyriaRealTime.ipynb

Lines changed: 1 addition & 0 deletions
@@ -664,6 +664,7 @@
  "# What's next?\n",
  "\n",
  "Now that you know how to generate music, here are other cool things to try:\n",
+ "* Instead of music, learn how to generate multi-speaker conversations using the [TTS models](./Get_started_TTS.ipynb),\n",
  "* Discover how to generate [images](./Get_started_imagen.ipynb) or [videos](./Get_started_Veo.ipynb),\n",
  "* Instead of generating music or audio, find out how Gemini can [understand Audio files](./Audio.ipynb),\n",
  "* Have a real-time conversation with Gemini using the [Live API](./Get_started_LiveAPI.ipynb)."
