* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

/* Markdown (render)
# Gemini API: Audio Quickstart
This notebook provides an example of how to prompt Gemini Flash using an audio file. In this case, you'll use a [sound recording](https://www.jfklibrary.org/asset-viewer/archives/jfkwha-006) of President John F. Kennedy’s 1961 State of the Union address.
*/

/* Markdown (render)
## Setup
### Install SDK and set up the client

### API Key Configuration

To ensure security, avoid hardcoding the API key in frontend code. Instead, set it as an environment variable on the server or local machine.

When using the Gemini API client libraries, the key will be automatically detected if set as either `GEMINI_API_KEY` or `GOOGLE_API_KEY`. If both are set, `GOOGLE_API_KEY` takes precedence.
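
The precedence rule above can be illustrated with a small helper. This is only a sketch for cases where you resolve the key yourself: `resolveApiKey` is a hypothetical name, not part of the SDK, and the client libraries perform their own detection.

```js
// Hypothetical helper (not part of the SDK): picks an API key from an
// environment-like object using the precedence described above.
function resolveApiKey(env) {
  // GOOGLE_API_KEY wins when both variables are set.
  return env.GOOGLE_API_KEY || env.GEMINI_API_KEY || null;
}

// Example with a plain object standing in for process.env.
console.log(resolveApiKey({ GEMINI_API_KEY: "gemini", GOOGLE_API_KEY: "google" })); // "google"
```

Passing the environment as an argument keeps the helper easy to test without touching `process.env`.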

For instructions on setting environment variables across different operating systems, refer to the official documentation: [Set API Key as Environment Variable](https://ai.google.dev/gemini-api/docs/api-key#set-api-env-var)

In code, the key can then be accessed as:

```js
ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
```
Now select the model you want to use in this guide, either by picking one from the list or typing its name. Keep in mind that some models, like the 2.5 ones, are thinking models and thus take slightly longer to respond (see the [thinking notebook](https://github.com/google-gemini/cookbook/blob/main/quickstarts-js/Get_started_thinking.ipynb) for more details, in particular how to switch thinking off).

For more information about all Gemini models, check the [documentation](https://ai.google.dev/gemini-api/docs/models/gemini).
To use an audio file in your prompt, you must first upload it using the [File API](https://github.com/google-gemini/cookbook/blob/main/quickstarts-js/File_API.ipynb).
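
Once an upload has returned a file handle, the prompt can reference it by URI. The sketch below only assembles the request object and makes no network call. `buildAudioRequest` is a hypothetical helper, and the `uri` and `mimeType` fields on the uploaded file are assumptions about what the File API returns.

```js
// Hypothetical helper: builds a request that pairs an uploaded audio file
// with a text instruction. It does not call the API itself.
function buildAudioRequest(uploadedFile, modelId, promptText) {
  return {
    model: modelId,
    contents: [
      // Assumed shape: the uploaded file exposes `uri` and `mimeType`.
      { fileData: { fileUri: uploadedFile.uri, mimeType: uploadedFile.mimeType } },
      { text: promptText },
    ],
  };
}

// Placeholder values for illustration only.
const exampleRequest = buildAudioRequest(
  { uri: "files/example-audio", mimeType: "audio/mpeg" },
  "MODEL_ID_PLACEHOLDER",
  "Listen carefully to the following audio file. Provide a brief summary."
);
console.log(exampleRequest.contents.length); // 2
```

The resulting object could then be passed to `ai.models.generateContent(...)`.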
    {text: "Listen carefully to the following audio file. Provide a brief summary."}
  ],
});

console.log(audioResponse.text);
// [CODE ENDS]

/* Output Sample

In his State of the Union address on January 30, 1961, President John F. Kennedy presented a somber yet determined assessment of the nation's domestic and international challenges.

Domestically, he highlighted a struggling economy marked by recession, high unemployment, stagnant growth, falling farm income, and rising bankruptcies. He also pointed to critical social issues like substandard housing, overcrowded schools, and inadequate healthcare for the aged.

Internationally, Kennedy emphasized the severity of Cold War threats, citing communist pressures in Asia (Laos), civil unrest in Africa (Congo), and the presence of a communist base in Cuba. He also noted the weakening unity within NATO and the ambition for world domination held by the Soviet Union and China. While acknowledging a significant balance of payments deficit and gold outflow, he affirmed the dollar's underlying strength and pledged no devaluation.

To address these issues, Kennedy outlined a comprehensive program:
1. **Economic Revival:** Proposing measures to boost employment, stimulate housing, increase the minimum wage, and provide tax incentives for investment, aiming to showcase the strength of a free economy.
2. **Military Strengthening:** Ordering a full reappraisal of defense strategy, including increased airlift, accelerated Polaris submarine construction, and improved missile programs, to ensure an "invulnerable" and "futilizing" deterrent.
3. **Enhanced Foreign Aid & Development:** Calling for a new, more effective economic assistance program, including a $500 million "Alliance for Progress" for Latin America, and expanding the Food for Peace initiative.
4. **Global Cooperation:** Proposing the formation of a National Peace Corps and inviting the Soviet Union to cooperate on scientific endeavors like weather prediction, satellite communications, and space exploration, seeking to leverage science for peace.
5. **Strengthening Diplomacy:** Emphasizing arms control as a central goal, and bolstering support for the United Nations as a critical instrument for world peace.

Kennedy acknowledged that the path ahead would be difficult, with likely further setbacks before improvement. He called for national unity, candor, and a renewed sense of public service, stating that rank would be determined by the "size of the job he does." He concluded with a call for unwavering dedication, recognizing that the hopes of all mankind rested on America's ability to face these challenges with pride and perseverance.

*/

/* Markdown (render)
## Inline Audio
For small requests you can inline the audio data into the request, like you can with images.
First slice a small part from the audio blob.
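
To choose a slice size, it helps to convert a target duration into a byte count. The sketch below assumes a constant-bitrate file and ignores container headers and metadata, so the result is only an estimate. `bytesForDuration` is a hypothetical helper name.

```js
// Hypothetical helper: approximate byte count for `seconds` of audio
// encoded at a constant bitrate given in kilobits per second.
function bytesForDuration(seconds, kbps) {
  // 1 kilobit = 1000 bits, and 8 bits make one byte.
  return Math.floor((seconds * kbps * 1000) / 8);
}

console.log(bytesForDuration(10, 128)); // 160000
```

A slice of `160 * 1024` bytes is slightly larger than this estimate, roughly 10.24 seconds at 128 kbps.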
*/

// [CODE STARTS]
slicedBlob = audioBlob.slice(0, 160 * 1024); // ~10,000 ms audio slice for 128 kbps audio file
This audio clip features a **male voice speaking clearly and formally**, like an announcer or orator. The content of his speech refers to "The President's State of the Union address to a joint session of the Congress from the rostrum of the House of Representatives, Washington."

The clip begins with a **distinct mechanical click or thud sound**, possibly indicating a microphone being engaged or adjusted. Throughout the entire speech, there is a **consistent, low hum or electrical buzzing sound** present in the background, which might suggest an older recording, broadcast equipment, or ambient noise from the venue. The speech concludes, and the background hum continues briefly before the clip ends.

*/

/* Markdown (render)
Note the following about providing audio as inline data:

- The maximum request size is 20 MB, which includes text prompts, system instructions, and files provided inline. If your file's size will make the total request size exceed 20 MB, then [use the File API](https://ai.google.dev/gemini-api/docs/audio#upload-audio) to upload files.
- If you're using an audio sample multiple times, it is more efficient to [use the File API](https://ai.google.dev/gemini-api/docs/audio#upload-audio).
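
A rough pre-flight check can help decide between inline data and the File API. The sketch below assumes inline audio is sent base64-encoded (which grows the payload to about 4/3 of its raw size) and treats the 20 MB limit as 20 MiB; both are assumptions, and `base64Length` and `fitsInline` are hypothetical helper names.

```js
// Assumption: the 20 MB limit is interpreted here as 20 MiB.
const MAX_REQUEST_BYTES = 20 * 1024 * 1024;

// Length in characters of the padded base64 encoding of `byteLength` bytes.
function base64Length(byteLength) {
  return 4 * Math.ceil(byteLength / 3);
}

// Hypothetical check: true if the encoded audio plus the rest of the
// request (prompt text, system instructions) stays within the limit.
function fitsInline(audioBytes, otherRequestBytes = 0) {
  return base64Length(audioBytes) + otherRequestBytes <= MAX_REQUEST_BYTES;
}

console.log(fitsInline(160 * 1024)); // true
```

When the check fails, or when the same audio is reused across requests, uploading through the File API is the safer choice.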

*/

/* Markdown (render)
### Get a transcript of the audio file
To get a transcript, ask for it in the prompt. For example:
`prompt = "Generate a transcript of the speech."`
*/

/* Markdown (render)
### Refer to timestamps in the audio file
A prompt can specify timestamps of the form `MM:SS` to refer to particular sections in an audio file. For example:
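
If offsets are tracked in seconds, a small formatter can produce the `MM:SS` strings for the prompt. This is a minimal sketch assuming whole, non-negative seconds; `toTimestamp` and `transcriptPrompt` are hypothetical helper names.

```js
// Hypothetical helper: formats whole seconds as zero-padded MM:SS.
function toTimestamp(totalSeconds) {
  const minutes = Math.floor(totalSeconds / 60);
  const seconds = totalSeconds % 60;
  return `${String(minutes).padStart(2, "0")}:${String(seconds).padStart(2, "0")}`;
}

// Hypothetical helper: builds a transcript prompt for a time range.
function transcriptPrompt(startSeconds, endSeconds) {
  return `Provide a transcript of the speech between the timestamps ${toTimestamp(startSeconds)} and ${toTimestamp(endSeconds)}.`;
}

console.log(transcriptPrompt(150, 209));
// Provide a transcript of the speech between the timestamps 02:30 and 03:29.
```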
*/

// [CODE STARTS]
prompt = "Provide a transcript of the speech between the timestamps 02:30 and 03:29.";
I speak today in an hour of national peril and national opportunity. Before my term has ended, we shall have to test anew whether a nation organized and governed, such as ours, can endure. The outcome is by no means certain. The answers are by no means clear. All of us together, this administration, this Congress, this nation, must forge those answers. But today were I to offer after little more than a week in office, detailed legislation to remedy every national ill, the Congress would rightly wonder whether the desire for speed had replaced the duty of responsibility. My remarks therefore will be limited, but they will also be candid. To state the facts frankly is not to despair the future, nor indict the past.
Analyze the following YouTube video content. Provide a concise summary covering:

1. **Main Thesis/Claim:** What is the central point the creator is making?
2. **Key Topics:** List the main subjects discussed, referencing specific examples or technologies mentioned (e.g., AI models, programming languages, projects).
3. **Call to Action:** Identify any explicit requests made to the viewer.
4. **Summary:** Provide a concise summary of the video content.

Use the provided title, chapter timestamps/descriptions, and description text for your analysis.
`;

response = await ai.models.generateContent({
  model: MODEL_ID,
  contents: [
    {text: prompt},
    {fileData: {fileUri: youtubeUrl}}
  ]
});

console.log(response.text);

// [CODE ENDS]

/* Output Sample

Here's an analysis of the YouTube video content:

---

**1. Main Thesis/Claim:**
Google's Gemini 2.5 Pro Experimental is the best coding AI the creator has ever used, demonstrating exceptional capabilities in logical code generation and particularly in code refactoring, despite some current limitations in accurately interpreting and building complex user interfaces from visual inputs.

**2. Key Topics:**

* **Overall AI Model Evaluation:** The video benchmarks Gemini 2.5 Pro Experimental against other leading AI models (OpenAI GPT-3 mini, GPT-4.5, Claude 3.7 Sonnet, Grok 3 Beta, DeepSeek R1) across various domains including Reasoning & Knowledge, Science, Mathematics, Code Generation, Code Editing, Agentic Coding, Factuality, Visual Reasoning, Image Understanding, Long Context, and Multilingual Performance. Gemini consistently performs well, often leading or being competitive.
* **Code Generation Projects:**
    * **Ultimate Tic-Tac-Toe (Java Swing):** Successfully "one-shotted" (generated a complete, functional game in a single prompt) demonstrating strong multi-file Java application creation.
    * **Kitten Cannon Clone (p5.js):** Required three prompts ("three-shotted") to correct initial errors (e.g., `TypeError: Cannot set properties of undefined`, `TypeError: sketch.sign is not a function`), showcasing the AI's effective debugging capabilities through iterative feedback.
    * **Landing Page Build (Vite, React, Tailwind CSS from Mockup):** Performed poorly in accurately recreating a landing page based on an image mockup, struggling with visual interpretation and requiring significant manual setup.
    * **X (Twitter) Website UI Recreation (Single HTML file):** Achieved a decent static visual representation of the Twitter/X desktop UI, although without functionality, reinforcing that front-end visual creation from prompts is still developing.
* **Code Refactoring:** Showcased impressive ability to refactor Rust code by replacing traditional `for` loops with more idiomatic `iter()` methods, including complex logic for conditions and data processing, which the creator notes was superior to other models.
* **Knowledge and Currency:** Gemini 2.5 Pro's training data is highly current (up to March 2025). It successfully leveraged "Grounding with Google Search" to provide the most up-to-date version of React.js.
* **Developer Workflow:** The video highlights the potential for seamless integration of AI into developer workflows, especially for tasks like debugging and refactoring, though manual steps (like copying code into appropriate files) are still present.

**3. Call to Action:**
The creator explicitly asks viewers who have used Gemini 2.5 Pro Experimental (for coding or other tasks) to **leave their thoughts/experiences in the comments below**. He also implicitly encourages viewers to subscribe, like the video, and enable the notification bell.

**4. Summary:**
The video presents a compelling review of Google's new Gemini 2.5 Pro Experimental AI, asserting its superiority as a coding assistant based on the creator's extensive testing. Through practical coding challenges, Gemini successfully generated a full Java Swing Ultimate Tic-Tac-Toe game in one go and iteratively debugged a p5.js Kitten Cannon clone. Its most remarkable performance was in refactoring complex Rust code, where it significantly improved efficiency and idiomatic style, outperforming competing models. While the AI struggled with precise front-end UI generation from image mockups, it demonstrated strong foundational knowledge, current information recall (aided by Google Search integration), and robust problem-solving. The creator concludes that Gemini 2.5 Pro is an "awesome" tool for developers, particularly for backend logic and code optimization, and looks forward to further integration into development environments.

*/

/* Markdown (render)
## Count audio tokens

You can count the number of tokens in your audio file using the [countTokens](https://googleapis.github.io/js-genai/release_docs/classes/models.Models.html#counttokens) method.

Audio files have a fixed per-second token rate (more details in the dedicated [count token quickstart](https://github.com/google-gemini/cookbook/blob/main/quickstarts-js/Counting_Tokens.ipynb)).
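
Because the rate is fixed, a token count can be estimated from the duration before calling the API. The sketch below assumes the rate of 32 tokens per second of audio given in the Gemini API audio documentation; treat it as an estimate and confirm with `countTokens`. `estimateAudioTokens` is a hypothetical helper name.

```js
// Assumption: 32 tokens per second of audio, per the Gemini API audio docs.
const AUDIO_TOKENS_PER_SECOND = 32;

// Hypothetical helper: estimates the audio token count from a duration.
function estimateAudioTokens(durationSeconds) {
  return Math.ceil(durationSeconds * AUDIO_TOKENS_PER_SECOND);
}

console.log(estimateAudioTokens(60)); // 1920
```

The estimate covers the audio only; any text in the prompt adds its own tokens.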
More details about the Gemini API's [vision capabilities](https://ai.google.dev/gemini-api/docs/vision) are available in the documentation.

If you want to know more about the File API, check its [API reference](https://ai.google.dev/api/files) or the [File API](https://github.com/google-gemini/cookbook/blob/main/quickstarts-js/File_API.js) quickstart.

### Related examples

Check this example using audio files for more ideas on what the Gemini API can do with them:
* Share [Voice memos](https://github.com/google-gemini/cookbook/blob/main/examples/Voice_memos.ipynb) with Gemini API and brainstorm ideas

### Continue your discovery of the Gemini API

Have a look at the [Video_Understanding](https://github.com/google-gemini/cookbook/blob/main/quickstarts-js/Video_understanding.js) quickstart to learn about another type of media file, then learn more about [prompting with media files](https://ai.google.dev/gemini-api/docs/files#prompt-guide) in the docs, including the supported formats and maximum length for audio files.
quickstarts-js/README.md (1 addition, 0 deletions)
@@ -19,5 +19,6 @@ Stay tuned, more JavaScript notebooks are on the way!
 | Get Started | A comprehensive introduction to the Gemini JS/TS SDK, demonstrating features such as text and multimodal prompting, token counting, system instructions, safety filters, multi-turn chat, output control, function calling, content streaming, file uploads, and using URL or YouTube video context. | Explore core Gemini capabilities in JS/TS |[](https://aistudio.google.com/apps/bundled/get_started?showPreview=true)| <img src="https://cdn.jsdelivr.net/gh/devicons/devicon/icons/javascript/javascript-original.svg" alt="JS" width="20"/> [Get_Started.js](./Get_Started.js)|
 | Image Output | Generate and iterate on images using Gemini’s multimodal capabilities. Learn to use text+image responses, edit images mid-conversation, and handle multiple image outputs with chat-style prompting. | Image generation, multimodal output, image editing, iterative refinement |[](https://aistudio.google.com/apps/bundled/get_started_image_out?showPreview=true)| <img src="https://cdn.jsdelivr.net/gh/devicons/devicon/icons/javascript/javascript-original.svg" alt="JS" width="20"/> [ImageOutput.js](./ImageOutput.js)|
 | File API | Learn how to upload, use, retrieve, and delete files (text, image, audio, code) with the Gemini File API for multimodal prompts. | File upload, multimodal prompts, text/code/media files |[](https://aistudio.google.com/apps/bundled/file_api?showPreview=true)| <img src="https://cdn.jsdelivr.net/gh/devicons/devicon/icons/javascript/javascript-original.svg" alt="JS" width="20"/> [File_API.js](./File_API.js)|
+| Audio | Demonstrates how to use audio files with Gemini: upload, prompt, summarize, transcribe, and analyze audio and YouTube content. | Audio file upload, inline audio, transcription, YouTube analysis |[](https://aistudio.google.com/apps/bundled/audio?showPreview=true)| <img src="https://cdn.jsdelivr.net/gh/devicons/devicon/icons/javascript/javascript-original.svg" alt="JS" width="20"/> [Audio.js](./Audio.js)|
 | Get Started LearnLM | Explore LearnLM, an experimental model for AI tutoring, with examples of system instructions for test prep, concept teaching, learning activities, and homework help. | AI tutoring, system instructions, adaptive learning, education |[](https://aistudio.google.com/apps/bundled/get_started_learnlm?showPreview=true)| <img src="https://cdn.jsdelivr.net/gh/devicons/devicon/icons/javascript/javascript-original.svg" alt="JS" width="20"/> [Get_started_LearnLM.js](./Get_started_LearnLM.js)|