Commit 60447a4

Added Audio.js notebook for JS (#859)
1 parent cf9eab3 commit 60447a4

2 files changed

+300
-0
lines changed

quickstarts-js/Audio.js

Lines changed: 299 additions & 0 deletions
@@ -0,0 +1,299 @@
1+
/*
2+
* Copyright 2025 Google LLC
3+
*
4+
* Licensed under the Apache License, Version 2.0 (the "License");
5+
* you may not use this file except in compliance with the License.
6+
* You may obtain a copy of the License at
7+
*
8+
* http://www.apache.org/licenses/LICENSE-2.0
9+
*
10+
* Unless required by applicable law or agreed to in writing, software
11+
* distributed under the License is distributed on an "AS IS" BASIS,
12+
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13+
* See the License for the specific language governing permissions and
14+
* limitations under the License.
15+
*/
16+
17+
/* Markdown (render)
18+
# Gemini API: Audio Quickstart
19+
This notebook provides an example of how to prompt Gemini Flash using an audio file. In this case, you'll use a [sound recording](https://www.jfklibrary.org/asset-viewer/archives/jfkwha-006) of President John F. Kennedy’s 1961 State of the Union address.
20+
*/
21+
22+
/* Markdown (render)
23+
## Setup
24+
### Install the SDK and set up the client
25+
26+
### API Key Configuration
27+
28+
To ensure security, avoid hardcoding the API key in frontend code. Instead, set it as an environment variable on the server or local machine.
29+
30+
When using the Gemini API client libraries, the key will be automatically detected if set as either `GEMINI_API_KEY` or `GOOGLE_API_KEY`. If both are set, `GOOGLE_API_KEY` takes precedence.
31+
32+
For instructions on setting environment variables across different operating systems, refer to the official documentation: [Set API Key as Environment Variable](https://ai.google.dev/gemini-api/docs/api-key#set-api-env-var)
33+
34+
In code, the key can then be accessed as:
35+
36+
```js
37+
ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
38+
```
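
If you prefer to make the key lookup explicit rather than rely on automatic detection, the precedence described above can be sketched as follows. `resolveApiKey` is a hypothetical helper for illustration, not part of `@google/genai`, and it takes a plain object so it can be tried without real environment variables.

```javascript
// Hypothetical helper (not an SDK function): mirrors the precedence stated
// above, where GOOGLE_API_KEY wins when both variables are set.
function resolveApiKey(env) {
  const key = env.GOOGLE_API_KEY || env.GEMINI_API_KEY;
  if (!key) {
    throw new Error("Set GEMINI_API_KEY or GOOGLE_API_KEY before creating the client.");
  }
  return key;
}

// Usage with a plain object standing in for process.env:
const sampleKey = resolveApiKey({ GEMINI_API_KEY: "gemini-key", GOOGLE_API_KEY: "google-key" });
console.log(sampleKey); // google-key
```

The result could then be passed as `apiKey` when constructing the client.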
39+
*/
40+
41+
// [CODE STARTS]
42+
module = await import("https://esm.sh/@google/genai");
43+
GoogleGenAI = module.GoogleGenAI;
44+
ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
45+
// [CODE ENDS]
46+
47+
/* Markdown (render)
48+
### Choose a model
49+
50+
Select the model to use in this guide by setting `MODEL_ID` to one of the names listed in the code below. Keep in mind that some models, such as the 2.5 series, are thinking models and therefore take slightly longer to respond (see the [thinking notebook](https://github.com/google-gemini/cookbook/blob/main/quickstarts-js/Get_started_thinking.ipynb) for details, in particular on how to switch thinking off).
51+
52+
For more information about each Gemini model, see the [documentation](https://ai.google.dev/gemini-api/docs/models/gemini).
53+
*/
54+
55+
// [CODE STARTS]
56+
MODEL_ID = "gemini-2.5-flash"; // "gemini-2.5-flash", "gemini-2.5-pro", "gemini-2.0-flash", "gemini-2.5-flash-lite-preview-06-17"
57+
// [CODE ENDS]
58+
59+
/* Markdown (render)
60+
### Upload an audio file with the File API
61+
62+
One way to use an audio file in your prompt is to upload it first using the [File API](https://github.com/google-gemini/cookbook/blob/main/quickstarts-js/File_API.ipynb).
63+
64+
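The upload below passes a MIME type and falls back to `audio/mpeg` when the fetched blob reports none. For files that are not MP3, a fallback based on the file extension may be more accurate. The sketch below is one way to do that; `guessAudioMime` and the extension table are assumptions for illustration, not part of the SDK, so check the audio documentation for the formats your model accepts.

```javascript
// Assumed extension-to-MIME table for common audio formats.
const AUDIO_MIME_BY_EXTENSION = new Map([
  ["mp3", "audio/mpeg"],
  ["wav", "audio/wav"],
  ["flac", "audio/flac"],
  ["ogg", "audio/ogg"],
  ["aac", "audio/aac"],
  ["aiff", "audio/aiff"],
]);

// Hypothetical helper: prefer the type reported by the blob, then the URL's
// file extension, then the same "audio/mpeg" default used in this notebook.
function guessAudioMime(url, blobType = "") {
  if (blobType) return blobType;
  const pathname = new URL(url).pathname;
  const extension = pathname.split(".").pop().toLowerCase();
  return AUDIO_MIME_BY_EXTENSION.get(extension) ?? "audio/mpeg";
}

console.log(guessAudioMime("https://example.com/clips/interview.WAV?dl=1")); // audio/wav
```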
*/
65+
66+
// [CODE STARTS]
67+
AUDIO_URL = "https://storage.googleapis.com/generativeai-downloads/data/State_of_the_Union_Address_30_January_1961.mp3";
68+
audioBlob = await fetch(AUDIO_URL).then(res => res.blob());
69+
audioMime = audioBlob.type || "audio/mpeg";
70+
71+
audioFile = await ai.files.upload({
72+
file: audioBlob,
73+
config: { mimeType: audioMime },
74+
});
75+
76+
// [CODE ENDS]
77+
78+
/* Markdown (render)
79+
## Use the file in your prompt
80+
*/
81+
82+
// [CODE STARTS]
83+
audioResponse = await ai.models.generateContent({
84+
model: MODEL_ID,
85+
contents: [
86+
{ fileData: { fileUri: audioFile.uri, mimeType: audioMime } },
87+
{ text: "Listen carefully to the following audio file. Provide a brief summary." }
88+
],
89+
});
90+
91+
console.log(audioResponse.text);
92+
// [CODE ENDS]
93+
94+
/* Output Sample
95+
96+
In his State of the Union address on January 30, 1961, President John F. Kennedy presented a somber yet determined assessment of the nation's domestic and international challenges.
97+
98+
Domestically, he highlighted a struggling economy marked by recession, high unemployment, stagnant growth, falling farm income, and rising bankruptcies. He also pointed to critical social issues like substandard housing, overcrowded schools, and inadequate healthcare for the aged.
99+
100+
Internationally, Kennedy emphasized the severity of Cold War threats, citing communist pressures in Asia (Laos), civil unrest in Africa (Congo), and the presence of a communist base in Cuba. He also noted the weakening unity within NATO and the ambition for world domination held by the Soviet Union and China. While acknowledging a significant balance of payments deficit and gold outflow, he affirmed the dollar's underlying strength and pledged no devaluation.
101+
102+
To address these issues, Kennedy outlined a comprehensive program:
103+
1. **Economic Revival:** Proposing measures to boost employment, stimulate housing, increase the minimum wage, and provide tax incentives for investment, aiming to showcase the strength of a free economy.
104+
2. **Military Strengthening:** Ordering a full reappraisal of defense strategy, including increased airlift, accelerated Polaris submarine construction, and improved missile programs, to ensure an "invulnerable" and "futilizing" deterrent.
105+
3. **Enhanced Foreign Aid & Development:** Calling for a new, more effective economic assistance program, including a $500 million "Alliance for Progress" for Latin America, and expanding the Food for Peace initiative.
106+
4. **Global Cooperation:** Proposing the formation of a National Peace Corps and inviting the Soviet Union to cooperate on scientific endeavors like weather prediction, satellite communications, and space exploration, seeking to leverage science for peace.
107+
5. **Strengthening Diplomacy:** Emphasizing arms control as a central goal, and bolstering support for the United Nations as a critical instrument for world peace.
108+
109+
Kennedy acknowledged that the path ahead would be difficult, with likely further setbacks before improvement. He called for national unity, candor, and a renewed sense of public service, stating that rank would be determined by the "size of the job he does." He concluded with a call for unwavering dedication, recognizing that the hopes of all mankind rested on America's ability to face these challenges with pride and perseverance.
110+
111+
*/
112+
113+
/* Markdown (render)
114+
## Inline Audio
115+
For small requests, you can inline the audio data into the request, as you can with images.
116+
First slice a small part from the audio blob.
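
The slice size follows from the bitrate: at a constant bitrate, bytes and seconds convert directly into each other. The helpers below are a rough sketch with hypothetical names; they assume a constant bitrate with 1 kbit = 1000 bits and ignore headers and ID3 tags, so the real duration of a slice may differ slightly.

```javascript
// Bytes needed for `seconds` of audio at a constant bitrate given in kbit/s.
function bytesForDuration(seconds, kbps) {
  return Math.ceil((seconds * kbps * 1000) / 8);
}

// Seconds of audio covered by `byteLength` bytes at the same bitrate.
function durationForBytes(byteLength, kbps) {
  return (byteLength * 8) / (kbps * 1000);
}

console.log(bytesForDuration(10, 128)); // 160000
console.log(durationForBytes(160 * 1024, 128)); // 10.24
```

So a 160 KiB slice of a 128 kbps file holds a little over ten seconds of audio.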
117+
*/
118+
119+
// [CODE STARTS]
120+
slicedBlob = audioBlob.slice(0, 160 * 1024); // ~10 s of audio at 128 kbps (16 KB per second)

// Read the slice as a data URL and keep only the base64 payload after the comma.
slicedBase64 = await new Promise((resolve, reject) => {
  const reader = new FileReader();
  reader.onload = () => resolve(reader.result.split(',')[1]);
  reader.onerror = () => reject(reader.error);
  reader.readAsDataURL(slicedBlob);
});
127+
// [CODE ENDS]
128+
129+
/* Markdown (render)
130+
Add it to the list of parts in the prompt:
131+
*/
132+
133+
// [CODE STARTS]
134+
response = await ai.models.generateContent({
135+
model: MODEL_ID,
136+
contents: [
137+
"Describe this audio clip",
138+
{
139+
inlineData: {
140+
data: slicedBase64,
141+
mimeType: "audio/mpeg"
142+
}
143+
}
144+
]
145+
});
146+
147+
console.log(response.text);
148+
// [CODE ENDS]
149+
150+
/* Output Sample
151+
152+
This audio clip features a **male voice speaking clearly and formally**, like an announcer or orator. The content of his speech refers to "The President's State of the Union address to a joint session of the Congress from the rostrum of the House of Representatives, Washington."
153+
154+
The clip begins with a **distinct mechanical click or thud sound**, possibly indicating a microphone being engaged or adjusted. Throughout the entire speech, there is a **consistent, low hum or electrical buzzing sound** present in the background, which might suggest an older recording, broadcast equipment, or ambient noise from the venue. The speech concludes, and the background hum continues briefly before the clip ends.
155+
156+
*/
157+
158+
/* Markdown (render)
159+
Note the following about providing audio as inline data:
160+
161+
- The maximum request size is 20 MB, which includes text prompts, system instructions, and files provided inline. If your file's size will make the total request size exceed 20 MB, then [use the File API](https://ai.google.dev/gemini-api/docs/audio#upload-audio) to upload files.
162+
- If you're using an audio sample multiple times, it is more efficient to [use the File API](https://ai.google.dev/gemini-api/docs/audio#upload-audio).
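
A quick size check can help choose between the two approaches. The sketch below estimates the base64 size of a blob and compares it with the limit. `fitsInline` is a hypothetical helper, the limit is read conservatively as 20 × 10^6 bytes (an assumption), and the overhead of the JSON envelope is not counted.

```javascript
const MAX_REQUEST_BYTES = 20 * 1000 * 1000; // assumed conservative reading of "20 MB"

// Base64 encodes every 3 input bytes as 4 output characters (with padding).
function base64Length(byteLength) {
  return Math.ceil(byteLength / 3) * 4;
}

// Hypothetical helper: true when the encoded audio plus the rest of the
// request (prompt text, system instructions) stays within the limit.
function fitsInline(audioBytes, otherRequestBytes = 0) {
  return base64Length(audioBytes) + otherRequestBytes <= MAX_REQUEST_BYTES;
}

console.log(base64Length(160 * 1024)); // 218456
console.log(fitsInline(160 * 1024)); // true
console.log(fitsInline(16 * 1024 * 1024)); // false
```

Because base64 inflates the payload by about a third, a file well under 20 MB on disk can still push a request over the limit.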
163+
164+
*/
165+
166+
/* Markdown (render)
167+
### Refer to timestamps in the audio file
168+
A prompt can specify timestamps of the form `MM:SS` to refer to particular sections in an audio file. For example:
169+
`prompt = "Generate a transcript of the speech."`
170+
*/
171+
172+
/* Markdown (render)
173+
### Refer to timestamps in the audio file
174+
A prompt can specify timestamps of the form `MM:SS` to refer to particular sections in an audio file. For example:
175+
*/
176+
177+
// [CODE STARTS]
178+
prompt = "Provide a transcript of the speech between the timestamps 02:30 and 03:29."
179+
180+
response = await ai.models.generateContent({
181+
model: MODEL_ID,
182+
contents: [
183+
prompt,
184+
{ fileData: { fileUri: audioFile.uri, mimeType: audioMime } },
185+
],
186+
});
187+
188+
console.log(response.text);
189+
// [CODE ENDS]
190+
191+
/* Output Sample
192+
193+
I speak today in an hour of national peril and national opportunity. Before my term has ended, we shall have to test anew whether a nation organized and governed, such as ours, can endure. The outcome is by no means certain. The answers are by no means clear. All of us together, this administration, this Congress, this nation, must forge those answers. But today were I to offer after little more than a week in office, detailed legislation to remedy every national ill, the Congress would rightly wonder whether the desire for speed had replaced the duty of responsibility. My remarks therefore will be limited, but they will also be candid. To state the facts frankly is not to despair the future, nor indict the past.
194+
195+
*/
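
/* Markdown (render)
When the section boundaries come from code, for example as offsets in seconds, the `MM:SS` strings can be generated rather than typed by hand. This is a minimal sketch with hypothetical helper names; it does not wrap minutes into hours, and whether the model accepts timestamps past 59 minutes in this form is an assumption to verify.

```javascript
// Formats a non-negative number of seconds as MM:SS.
function toTimestamp(totalSeconds) {
  const whole = Math.floor(totalSeconds);
  const minutes = String(Math.floor(whole / 60)).padStart(2, "0");
  const seconds = String(whole % 60).padStart(2, "0");
  return `${minutes}:${seconds}`;
}

// Builds the same transcript prompt used above from numeric offsets.
function transcriptPrompt(startSeconds, endSeconds) {
  return `Provide a transcript of the speech between the timestamps ${toTimestamp(startSeconds)} and ${toTimestamp(endSeconds)}.`;
}

console.log(transcriptPrompt(150, 209));
// Provide a transcript of the speech between the timestamps 02:30 and 03:29.
```
*/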
196+
197+
/* Markdown (render)
198+
## Use a YouTube video
199+
*/
200+
201+
// [CODE STARTS]
202+
youtubeUrl = "https://www.youtube.com/watch?v=RDOMKIw1aF4";
203+
204+
prompt = `
205+
Analyze the following YouTube video content. Provide a concise summary covering:
206+
207+
1. **Main Thesis/Claim:** What is the central point the creator is making?
208+
2. **Key Topics:** List the main subjects discussed, referencing specific examples or technologies mentioned (e.g., AI models, programming languages, projects).
209+
3. **Call to Action:** Identify any explicit requests made to the viewer.
210+
4. **Summary:** Provide a concise summary of the video content.
211+
212+
Use the provided title, chapter timestamps/descriptions, and description text for your analysis.
213+
`;
214+
215+
response = await ai.models.generateContent({
216+
model: MODEL_ID,
217+
contents: [
218+
{ text: prompt },
219+
{ fileData: { fileUri: youtubeUrl } }
220+
]
221+
});
222+
223+
console.log(response.text);
224+
225+
// [CODE ENDS]
226+
227+
/* Output Sample
228+
229+
Here's an analysis of the YouTube video content:
230+
231+
---
232+
233+
**1. Main Thesis/Claim:**
234+
Google's Gemini 2.5 Pro Experimental is the best coding AI the creator has ever used, demonstrating exceptional capabilities in logical code generation and particularly in code refactoring, despite some current limitations in accurately interpreting and building complex user interfaces from visual inputs.
235+
236+
**2. Key Topics:**
237+
238+
* **Overall AI Model Evaluation:** The video benchmarks Gemini 2.5 Pro Experimental against other leading AI models (OpenAI GPT-3 mini, GPT-4.5, Claude 3.7 Sonnet, Grok 3 Beta, DeepSeek R1) across various domains including Reasoning & Knowledge, Science, Mathematics, Code Generation, Code Editing, Agentic Coding, Factuality, Visual Reasoning, Image Understanding, Long Context, and Multilingual Performance. Gemini consistently performs well, often leading or being competitive.
239+
* **Code Generation Projects:**
240+
* **Ultimate Tic-Tac-Toe (Java Swing):** Successfully "one-shotted" (generated a complete, functional game in a single prompt) demonstrating strong multi-file Java application creation.
241+
* **Kitten Cannon Clone (p5.js):** Required three prompts ("three-shotted") to correct initial errors (e.g., `TypeError: Cannot set properties of undefined`, `TypeError: sketch.sign is not a function`), showcasing the AI's effective debugging capabilities through iterative feedback.
242+
* **Landing Page Build (Vite, React, Tailwind CSS from Mockup):** Performed poorly in accurately recreating a landing page based on an image mockup, struggling with visual interpretation and requiring significant manual setup.
243+
* **X (Twitter) Website UI Recreation (Single HTML file):** Achieved a decent static visual representation of the Twitter/X desktop UI, although without functionality, reinforcing that front-end visual creation from prompts is still developing.
244+
* **Code Refactoring:** Showcased impressive ability to refactor Rust code by replacing traditional `for` loops with more idiomatic `iter()` methods, including complex logic for conditions and data processing, which the creator notes was superior to other models.
245+
* **Knowledge and Currency:** Gemini 2.5 Pro's training data is highly current (up to March 2025). It successfully leveraged "Grounding with Google Search" to provide the most up-to-date version of React.js.
246+
* **Developer Workflow:** The video highlights the potential for seamless integration of AI into developer workflows, especially for tasks like debugging and refactoring, though manual steps (like copying code into appropriate files) are still present.
247+
248+
**3. Call to Action:**
249+
The creator explicitly asks viewers who have used Gemini 2.5 Pro Experimental (for coding or other tasks) to **leave their thoughts/experiences in the comments below**. He also implicitly encourages viewers to subscribe, like the video, and enable the notification bell.
250+
251+
**4. Summary:**
252+
The video presents a compelling review of Google's new Gemini 2.5 Pro Experimental AI, asserting its superiority as a coding assistant based on the creator's extensive testing. Through practical coding challenges, Gemini successfully generated a full Java Swing Ultimate Tic-Tac-Toe game in one go and iteratively debugged a p5.js Kitten Cannon clone. Its most remarkable performance was in refactoring complex Rust code, where it significantly improved efficiency and idiomatic style, outperforming competing models. While the AI struggled with precise front-end UI generation from image mockups, it demonstrated strong foundational knowledge, current information recall (aided by Google Search integration), and robust problem-solving. The creator concludes that Gemini 2.5 Pro is an "awesome" tool for developers, particularly for backend logic and code optimization, and looks forward to further integration into development environments.
253+
254+
*/
255+
256+
/* Markdown (render)
257+
## Count audio tokens
258+
259+
You can count the number of tokens in your audio file using the [countTokens](https://googleapis.github.io/js-genai/release_docs/classes/models.Models.html#counttokens) method.
260+
261+
Audio files have a fixed per-second token rate (more details in the dedicated [count token quickstart](https://github.com/google-gemini/cookbook/blob/main/quickstarts-js/Counting_Tokens.ipynb)).
262+
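
Because the rate is fixed, token counts and durations can be estimated from each other without calling the API. The sketch below assumes the rate of 32 tokens per second given in the Gemini audio documentation; the helper names are hypothetical, and the rate should be verified for the model you use.

```javascript
const AUDIO_TOKENS_PER_SECOND = 32; // assumed rate, taken from the Gemini audio docs

// Estimated token count for an audio clip of the given length.
function estimateAudioTokens(durationSeconds) {
  return Math.ceil(durationSeconds * AUDIO_TOKENS_PER_SECOND);
}

// Approximate clip length implied by a token count.
function estimateDurationSeconds(tokenCount) {
  return tokenCount / AUDIO_TOKENS_PER_SECOND;
}

console.log(estimateAudioTokens(60)); // 1920
console.log(estimateDurationSeconds(83528)); // 2610.25
```

Under that assumption, a count of 83528 tokens corresponds to roughly 43.5 minutes of audio.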
*/
263+
264+
// [CODE STARTS]
265+
countTokensResponse = await ai.models.countTokens({
266+
model: MODEL_ID,
267+
contents: [
268+
{ fileData: { fileUri: audioFile.uri, mimeType: audioMime } },
269+
]
270+
});
271+
272+
console.log("Audio file tokens:", countTokensResponse.totalTokens);
273+
274+
// [CODE ENDS]
275+
276+
/* Output Sample
277+
278+
Audio file tokens: 83528
279+
280+
*/
281+
282+
/* Markdown (render)
283+
## Next Steps
284+
### Useful API references:
285+
286+
For more details about the Gemini API's [audio capabilities](https://ai.google.dev/gemini-api/docs/audio), see the documentation.
287+
288+
If you want to know more about the File API, check its [API reference](https://ai.google.dev/api/files) or the [File API](https://github.com/google-gemini/cookbook/blob/main/quickstarts-js/File_API.js) quickstart.
289+
290+
### Related examples
291+
292+
Check this example, which uses audio files, for more ideas on what the Gemini API can do with them:
293+
* Share [Voice memos](https://github.com/google-gemini/cookbook/blob/main/examples/Voice_memos.ipynb) with Gemini API and brainstorm ideas
294+
295+
### Continue your discovery of the Gemini API
296+
297+
Have a look at the [Video_Understanding](https://github.com/google-gemini/cookbook/blob/main/quickstarts-js/Video_understanding.js) quickstart to learn about another type of media file, then learn more about [prompting with media files](https://ai.google.dev/gemini-api/docs/files#prompt-guide) in the docs, including the supported formats and maximum length for audio files.
298+
299+
*/

quickstarts-js/README.md

Lines changed: 1 addition & 0 deletions
@@ -19,5 +19,6 @@ Stay tuned, more JavaScript notebooks are on the way!
1919
| Get Started | A comprehensive introduction to the Gemini JS/TS SDK, demonstrating features such as text and multimodal prompting, token counting, system instructions, safety filters, multi-turn chat, output control, function calling, content streaming, file uploads, and using URL or YouTube video context. | Explore core Gemini capabilities in JS/TS | [![Open in AI Studio](https://storage.googleapis.com/generativeai-downloads/images/Open_in_AIStudio.svg)](https://aistudio.google.com/apps/bundled/get_started?showPreview=true) | <img src="https://cdn.jsdelivr.net/gh/devicons/devicon/icons/javascript/javascript-original.svg" alt="JS" width="20"/> [Get_Started.js](./Get_Started.js) |
2020
| Image Output | Generate and iterate on images using Gemini’s multimodal capabilities. Learn to use text+image responses, edit images mid-conversation, and handle multiple image outputs with chat-style prompting. | Image generation, multimodal output, image editing, iterative refinement | [![Open in AI Studio](https://storage.googleapis.com/generativeai-downloads/images/Open_in_AIStudio.svg)](https://aistudio.google.com/apps/bundled/get_started_image_out?showPreview=true) | <img src="https://cdn.jsdelivr.net/gh/devicons/devicon/icons/javascript/javascript-original.svg" alt="JS" width="20"/> [ImageOutput.js](./ImageOutput.js) |
2121
| File API | Learn how to upload, use, retrieve, and delete files (text, image, audio, code) with the Gemini File API for multimodal prompts. | File upload, multimodal prompts, text/code/media files | [![Open in AI Studio](https://storage.googleapis.com/generativeai-downloads/images/Open_in_AIStudio.svg)](https://aistudio.google.com/apps/bundled/file_api?showPreview=true) | <img src="https://cdn.jsdelivr.net/gh/devicons/devicon/icons/javascript/javascript-original.svg" alt="JS" width="20"/> [File_API.js](./File_API.js) |
22+
| Audio | Demonstrates how to use audio files with Gemini: upload, prompt, summarize, transcribe, and analyze audio and YouTube content. | Audio file upload, inline audio, transcription, YouTube analysis | [![Open in AI Studio](https://storage.googleapis.com/generativeai-downloads/images/Open_in_AIStudio.svg)](https://aistudio.google.com/apps/bundled/audio?showPreview=true) | <img src="https://cdn.jsdelivr.net/gh/devicons/devicon/icons/javascript/javascript-original.svg" alt="JS" width="20"/> [Audio.js](./Audio.js) |
2223
| Get Started LearnLM | Explore LearnLM, an experimental model for AI tutoring, with examples of system instructions for test prep, concept teaching, learning activities, and homework help. | AI tutoring, system instructions, adaptive learning, education | [![Open in AI Studio](https://storage.googleapis.com/generativeai-downloads/images/Open_in_AIStudio.svg)](https://aistudio.google.com/apps/bundled/get_started_learnlm?showPreview=true) | <img src="https://cdn.jsdelivr.net/gh/devicons/devicon/icons/javascript/javascript-original.svg" alt="JS" width="20"/> [Get_started_LearnLM.js](./Get_started_LearnLM.js) |
2324
