
Commit 536c1d6

Merge pull request #2152 from mohantym:audio_sample_audio
PiperOrigin-RevId: 496383025
2 parents: d5ad3ee + 47f6600

File tree

1 file changed (+15 -17 lines)


site/en/tutorials/audio/simple_audio.ipynb

Lines changed: 15 additions & 17 deletions
@@ -74,9 +74,9 @@
  "id": "SPfDNFlb66XF"
  },
  "source": [
- "This tutorial demonstrates how to preprocess audio files in the WAV format and build and train a basic <a href=\"https://en.wikipedia.org/wiki/Speech_recognition\" class=\"external\">automatic speech recognition</a> (ASR) model for recognizing ten different words. You will use a portion of the [Speech Commands dataset](https://www.tensorflow.org/datasets/catalog/speech_commands) (<a href=\"https://arxiv.org/abs/1804.03209\" class=\"external\">Warden, 2018</a>), which contains short (one-second or less) audio clips of commands, such as \"down\", \"go\", \"left\", \"no\", \"right\", \"stop\", \"up\" and \"yes\".\n",
+ "This tutorial demonstrates how to preprocess audio files in the WAV format and build and train a basic [automatic speech recognition](https://en.wikipedia.org/wiki/Speech_recognition) (ASR) model for recognizing ten different words. You will use a portion of the [Speech Commands dataset](https://www.tensorflow.org/datasets/catalog/speech_commands) ([Warden, 2018](https://arxiv.org/abs/1804.03209)), which contains short (one-second or less) audio clips of commands, such as \"down\", \"go\", \"left\", \"no\", \"right\", \"stop\", \"up\" and \"yes\".\n",
  "\n",
- "Real-world speech and audio recognition <a href=\"https://ai.googleblog.com/search/label/Speech%20Recognition\" class=\"external\">systems</a> are complex. But, like [image classification with the MNIST dataset](../quickstart/beginner.ipynb), this tutorial should give you a basic understanding of the techniques involved."
+ "Real-world speech and audio recognition [systems](https://ai.googleblog.com/search/label/Speech%20Recognition) are complex. But, like [image classification with the MNIST dataset](../quickstart/beginner.ipynb), this tutorial should give you a basic understanding of the techniques involved."
  ]
  },
  {
@@ -87,7 +87,7 @@
  "source": [
  "## Setup\n",
  "\n",
- "Import necessary modules and dependencies. Note that you'll be using <a href=\"https://seaborn.pydata.org/\" class=\"external\">seaborn</a> for visualization in this tutorial."
+ "Import necessary modules and dependencies. You'll be using `tf.keras.utils.audio_dataset_from_directory` (introduced in TensorFlow 2.10), which helps generate audio classification datasets from directories of `.wav` files. You'll also need [seaborn](https://seaborn.pydata.org) for visualization in this tutorial."
  ]
  },
  {
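The rewritten setup note above points to `tf.keras.utils.audio_dataset_from_directory`. As a rough illustration of how that API is typically called (the directory path, batch size, split, and seed below are assumptions for this sketch, not values taken from this commit):

```python
import tensorflow as tf

# Hypothetical layout: one subfolder per command word, each holding 16 kHz mono .wav clips.
data_dir = 'data/mini_speech_commands'

# Available in TensorFlow 2.10+. output_sequence_length pads or trims every clip
# to one second (16000 samples) so batches have a uniform shape.
train_ds, val_ds = tf.keras.utils.audio_dataset_from_directory(
    directory=data_dir,
    batch_size=64,
    validation_split=0.2,
    seed=0,
    output_sequence_length=16000,
    subset='both')

label_names = train_ds.class_names  # class names are inferred from the subfolder names
```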
@@ -98,9 +98,8 @@
  },
  "outputs": [],
  "source": [
- "# Upgrade environment to support TF 2.10 in Colab\n",
- "!pip install -U --pre tensorflow tensorflow_datasets\n",
- "!apt install --allow-change-held-packages libcudnn8=8.1.0.77-1+cuda11.2"
+ "!pip install -U -q tensorflow tensorflow_datasets\n",
+ "!apt install --allow-change-held-packages libcudnn8=8.1.0.77-1+cuda11.2"
  ]
  },
  {
@@ -137,7 +136,7 @@
  "source": [
  "## Import the mini Speech Commands dataset\n",
  "\n",
- "To save time with data loading, you will be working with a smaller version of the Speech Commands dataset. The [original dataset](https://www.tensorflow.org/datasets/catalog/speech_commands) consists of over 105,000 audio files in the <a href=\"https://www.aelius.com/njh/wavemetatools/doc/riffmci.pdf\" class=\"external\">WAV (Waveform) audio file format</a> of people saying 35 different words. This data was collected by Google and released under a CC BY license.\n",
+ "To save time with data loading, you will be working with a smaller version of the Speech Commands dataset. The [original dataset](https://www.tensorflow.org/datasets/catalog/speech_commands) consists of over 105,000 audio files in the [WAV (Waveform) audio file format](https://www.aelius.com/njh/wavemetatools/doc/riffmci.pdf) of people saying 35 different words. This data was collected by Google and released under a CC BY license.\n",
  "\n",
  "Download and extract the `mini_speech_commands.zip` file containing the smaller Speech Commands datasets with `tf.keras.utils.get_file`:"
  ]
@@ -351,14 +350,14 @@
  "source": [
  "## Convert waveforms to spectrograms\n",
  "\n",
- "The waveforms in the dataset are represented in the time domain. Next, you'll transform the waveforms from the time-domain signals into the time-frequency-domain signals by computing the <a href=\"https://en.wikipedia.org/wiki/Short-time_Fourier_transform\" class=\"external\">short-time Fourier transform (STFT)</a> to convert the waveforms to as <a href=\"https://en.wikipedia.org/wiki/Spectrogram\" clas=\"external\">spectrograms</a>, which show frequency changes over time and can be represented as 2D images. You will feed the spectrogram images into your neural network to train the model.\n",
+ "The waveforms in the dataset are represented in the time domain. Next, you'll transform the waveforms from the time-domain signals into the time-frequency-domain signals by computing the [short-time Fourier transform (STFT)](https://en.wikipedia.org/wiki/Short-time_Fourier_transform) to convert the waveforms into [spectrograms](https://en.wikipedia.org/wiki/Spectrogram), which show frequency changes over time and can be represented as 2D images. You will feed the spectrogram images into your neural network to train the model.\n",
  "\n",
  "A Fourier transform (`tf.signal.fft`) converts a signal to its component frequencies, but loses all time information. In comparison, STFT (`tf.signal.stft`) splits the signal into windows of time and runs a Fourier transform on each window, preserving some time information, and returning a 2D tensor that you can run standard convolutions on.\n",
  "\n",
  "Create a utility function for converting waveforms to spectrograms:\n",
  "\n",
  "- The waveforms need to be of the same length, so that when you convert them to spectrograms, the results have similar dimensions. This can be done by simply zero-padding the audio clips that are shorter than one second (using `tf.zeros`).\n",
- "- When calling `tf.signal.stft`, choose the `frame_length` and `frame_step` parameters such that the generated spectrogram \"image\" is almost square. For more information on the STFT parameters choice, refer to <a href=\"https://www.coursera.org/lecture/audio-signal-processing/stft-2-tjEQe\" class=\"external\">this Coursera video</a> on audio signal processing and STFT.\n",
+ "- When calling `tf.signal.stft`, choose the `frame_length` and `frame_step` parameters such that the generated spectrogram \"image\" is almost square. For more information on the choice of STFT parameters, refer to [this Coursera video](https://www.coursera.org/lecture/audio-signal-processing/stft-2-tjEQe) on audio signal processing and STFT.\n",
  "- The STFT produces an array of complex numbers representing magnitude and phase. However, in this tutorial you'll only use the magnitude, which you can derive by applying `tf.abs` on the output of `tf.signal.stft`."
  ]
  },
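The bullets in this hunk describe the waveform-to-spectrogram utility. Here is a minimal sketch under the assumption of 16 kHz clips no longer than one second (the function name and parameter values are illustrative, not taken from this commit):

```python
import tensorflow as tf

def get_spectrogram(waveform):
  # Zero-pad clips shorter than one second so every example has 16000 samples.
  waveform = tf.cast(waveform, tf.float32)
  padding = tf.zeros([16000] - tf.shape(waveform), dtype=tf.float32)
  waveform = tf.concat([waveform, padding], axis=0)
  # frame_length and frame_step are chosen so the spectrogram "image" is roughly square.
  stft = tf.signal.stft(waveform, frame_length=255, frame_step=128)
  # Keep only the magnitude and add a channel axis so it can be treated like an image.
  spectrogram = tf.abs(stft)
  return spectrogram[..., tf.newaxis]
```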
@@ -751,7 +750,7 @@
  "source": [
  "### Display a confusion matrix\n",
  "\n",
- "Use a <a href=\"https://developers.google.com/machine-learning/glossary#confusion-matrix\" class=\"external\">confusion matrix</a> to check how well the model did classifying each of the commands in the test set:\n"
+ "Use a [confusion matrix](https://developers.google.com/machine-learning/glossary#confusion-matrix) to check how well the model did classifying each of the commands in the test set:\n"
  ]
  },
  {
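As a rough companion to the confusion-matrix text above, a sketch of how the matrix can be computed and plotted with seaborn, assuming `commands` holds the label names and `y_true`/`y_pred` hold the integer test labels and model predictions (all three names are assumptions for illustration):

```python
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf

# y_true: integer labels for the test set; y_pred: argmax of the model's logits (assumed to exist).
confusion_mtx = tf.math.confusion_matrix(y_true, y_pred)
plt.figure(figsize=(10, 8))
sns.heatmap(confusion_mtx,
            xticklabels=commands, yticklabels=commands,
            annot=True, fmt='g')
plt.xlabel('Prediction')
plt.ylabel('Label')
plt.show()
```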
@@ -834,7 +833,8 @@
  "x = x[tf.newaxis,...]\n",
  "\n",
  "prediction = model(x)\n",
- "plt.bar(commands, tf.nn.softmax(prediction[0]))\n",
+ "x_labels = ['no', 'yes', 'down', 'go', 'left', 'up', 'right', 'stop']\n",
+ "plt.bar(x_labels, tf.nn.softmax(prediction[0]))\n",
  "plt.title('No')\n",
  "plt.show()\n",
  "\n",
@@ -961,12 +961,12 @@
  "This tutorial demonstrated how to carry out simple audio classification/automatic speech recognition using a convolutional neural network with TensorFlow and Python. To learn more, consider the following resources:\n",
  "\n",
  "- The [Sound classification with YAMNet](https://www.tensorflow.org/hub/tutorials/yamnet) tutorial shows how to use transfer learning for audio classification.\n",
- "- The notebooks from <a href=\"https://www.kaggle.com/c/tensorflow-speech-recognition-challenge/overview\" class=\"external\">Kaggle's TensorFlow speech recognition challenge</a>.\n",
+ "- The notebooks from [Kaggle's TensorFlow speech recognition challenge](https://www.kaggle.com/c/tensorflow-speech-recognition-challenge/overview).\n",
  "- The \n",
- "<a href=\"https://codelabs.developers.google.com/codelabs/tensorflowjs-audio-codelab/index.html#0\" class=\"external\">TensorFlow.js - Audio recognition using transfer learning codelab</a> teaches how to build your own interactive web app for audio classification.\n",
- "- <a href=\"https://arxiv.org/abs/1709.04396\" class=\"external\">A tutorial on deep learning for music information retrieval</a> (Choi et al., 2017) on arXiv.\n",
+ "[TensorFlow.js - Audio recognition using transfer learning codelab](https://codelabs.developers.google.com/codelabs/tensorflowjs-audio-codelab/index.html#0) teaches how to build your own interactive web app for audio classification.\n",
+ "- [A tutorial on deep learning for music information retrieval](https://arxiv.org/abs/1709.04396) (Choi et al., 2017) on arXiv.\n",
  "- TensorFlow also has additional support for [audio data preparation and augmentation](https://www.tensorflow.org/io/tutorials/audio) to help with your own audio-based projects.\n",
- "- Consider using the <a href=\"https://librosa.org/\" class=\"external\">librosa</a> library—a Python package for music and audio analysis."
+ "- Consider using the [librosa](https://librosa.org/) library for music and audio analysis."
  ]
  }
 ],
@@ -975,8 +975,6 @@
  "colab": {
  "collapsed_sections": [],
  "name": "simple_audio.ipynb",
- "private_outputs": true,
- "provenance": [],
  "toc_visible": true
  },
  "kernelspec": {
