|
74 | 74 | "id": "SPfDNFlb66XF"
|
75 | 75 | },
|
76 | 76 | "source": [
|
77 |
| - "This tutorial demonstrates how to preprocess audio files in the WAV format and build and train a basic <a href=\"https://en.wikipedia.org/wiki/Speech_recognition\" class=\"external\">automatic speech recognition</a> (ASR) model for recognizing ten different words. You will use a portion of the [Speech Commands dataset](https://www.tensorflow.org/datasets/catalog/speech_commands) (<a href=\"https://arxiv.org/abs/1804.03209\" class=\"external\">Warden, 2018</a>), which contains short (one-second or less) audio clips of commands, such as \"down\", \"go\", \"left\", \"no\", \"right\", \"stop\", \"up\" and \"yes\".\n", |
| 77 | + "This tutorial demonstrates how to preprocess audio files in the WAV format and build and train a basic [automatic speech recognition](https://en.wikipedia.org/wiki/Speech_recognition) (ASR) model for recognizing ten different words. You will use a portion of the [Speech Commands dataset](https://www.tensorflow.org/datasets/catalog/speech_commands) ([Warden, 2018](https://arxiv.org/abs/1804.03209)), which contains short (one-second or less) audio clips of commands, such as \"down\", \"go\", \"left\", \"no\", \"right\", \"stop\", \"up\" and \"yes\".\n", |
78 | 78 | "\n",
|
79 |
| - "Real-world speech and audio recognition <a href=\"https://ai.googleblog.com/search/label/Speech%20Recognition\" class=\"external\">systems</a> are complex. But, like [image classification with the MNIST dataset](../quickstart/beginner.ipynb), this tutorial should give you a basic understanding of the techniques involved." |
| 79 | + "Real-world speech and audio recognition [systems](https://ai.googleblog.com/search/label/Speech%20Recognition) are complex. But, like [image classification with the MNIST dataset](../quickstart/beginner.ipynb), this tutorial should give you a basic understanding of the techniques involved." |
80 | 80 | ]
|
81 | 81 | },
|
82 | 82 | {
|
|
87 | 87 | "source": [
|
88 | 88 | "## Setup\n",
|
89 | 89 | "\n",
|
90 |
| - "Import necessary modules and dependencies. You'll be using `tf.keras.utils.audio_dataset_from_directory` (introduced in TensorFlow 2.10), which helps generate audio classification datasets from directories of `.wav` files. You'll also need <a href=\"https://seaborn.pydata.org/\" class=\"external\">seaborn</a> for visualization in this tutorial." |
| 90 | + "Import necessary modules and dependencies. You'll be using `tf.keras.utils.audio_dataset_from_directory` (introduced in TensorFlow 2.10), which helps generate audio classification datasets from directories of `.wav` files. You'll also need [seaborn](https://seaborn.pydata.org) for visualization in this tutorial." |
91 | 91 | ]
|
92 | 92 | },
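For reference, the import cell this setup section refers to might look roughly like the sketch below; the seed value is illustrative, and only modules used later in the tutorial (NumPy, seaborn, TensorFlow, and the Keras `layers`/`models` namespaces) are shown.

```python
import numpy as np
import seaborn as sns
import tensorflow as tf
from tensorflow.keras import layers, models

# Fix the random seeds so dataset shuffling and weight initialization
# are reproducible across runs (the seed value itself is arbitrary).
seed = 42
tf.random.set_seed(seed)
np.random.seed(seed)
```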
|
93 | 93 | {
|
|
136 | 136 | "source": [
|
137 | 137 | "## Import the mini Speech Commands dataset\n",
|
138 | 138 | "\n",
|
139 |
| - "To save time with data loading, you will be working with a smaller version of the Speech Commands dataset. The [original dataset](https://www.tensorflow.org/datasets/catalog/speech_commands) consists of over 105,000 audio files in the <a href=\"https://www.aelius.com/njh/wavemetatools/doc/riffmci.pdf\" class=\"external\">WAV (Waveform) audio file format</a> of people saying 35 different words. This data was collected by Google and released under a CC BY license.\n", |
| 139 | + "To save time with data loading, you will be working with a smaller version of the Speech Commands dataset. The [original dataset](https://www.tensorflow.org/datasets/catalog/speech_commands) consists of over 105,000 audio files in the [WAV (Waveform) audio file format](https://www.aelius.com/njh/wavemetatools/doc/riffmci.pdf) of people saying 35 different words. This data was collected by Google and released under a CC BY license.\n", |
140 | 140 | "\n",
|
141 | 141 | "Download and extract the `mini_speech_commands.zip` file containing the smaller Speech Commands datasets with `tf.keras.utils.get_file`:"
|
142 | 142 | ]
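A minimal sketch of that download step is shown below. The `origin` URL is assumed to be the copy hosted under `storage.googleapis.com/download.tensorflow.org`, and the archive is extracted into a local `data/` directory via the `cache_dir`/`cache_subdir` arguments of `tf.keras.utils.get_file`.

```python
import pathlib
import tensorflow as tf

DATASET_PATH = 'data/mini_speech_commands'
data_dir = pathlib.Path(DATASET_PATH)

if not data_dir.exists():
  # Download the archive on the first run and extract it under ./data.
  tf.keras.utils.get_file(
      'mini_speech_commands.zip',
      origin='http://storage.googleapis.com/download.tensorflow.org/data/mini_speech_commands.zip',
      extract=True,
      cache_dir='.',
      cache_subdir='data')
```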
|
|
350 | 350 | "source": [
|
351 | 351 | "## Convert waveforms to spectrograms\n",
|
352 | 352 | "\n",
|
353 |
| - "The waveforms in the dataset are represented in the time domain. Next, you'll transform the waveforms from the time-domain signals into the time-frequency-domain signals by computing the <a href=\"https://en.wikipedia.org/wiki/Short-time_Fourier_transform\" class=\"external\">short-time Fourier transform (STFT)</a> to convert the waveforms to as <a href=\"https://en.wikipedia.org/wiki/Spectrogram\" clas=\"external\">spectrograms</a>, which show frequency changes over time and can be represented as 2D images. You will feed the spectrogram images into your neural network to train the model.\n", |
| 353 | + "The waveforms in the dataset are represented in the time domain. Next, you'll transform the waveforms from the time-domain signals into the time-frequency-domain signals by computing the [short-time Fourier transform (STFT)](https://en.wikipedia.org/wiki/Short-time_Fourier_transform) to convert the waveforms to as [spectrograms](https://en.wikipedia.org/wiki/Spectrogram), which show frequency changes over time and can be represented as 2D images. You will feed the spectrogram images into your neural network to train the model.\n", |
354 | 354 | "\n",
|
355 | 355 | "A Fourier transform (`tf.signal.fft`) converts a signal to its component frequencies, but loses all time information. In comparison, STFT (`tf.signal.stft`) splits the signal into windows of time and runs a Fourier transform on each window, preserving some time information, and returning a 2D tensor that you can run standard convolutions on.\n",
|
356 | 356 | "\n",
|
357 | 357 | "Create a utility function for converting waveforms to spectrograms:\n",
|
358 | 358 | "\n",
|
359 | 359 | "- The waveforms need to be of the same length, so that when you convert them to spectrograms, the results have similar dimensions. This can be done by simply zero-padding the audio clips that are shorter than one second (using `tf.zeros`).\n",
|
360 |
| - "- When calling `tf.signal.stft`, choose the `frame_length` and `frame_step` parameters such that the generated spectrogram \"image\" is almost square. For more information on the STFT parameters choice, refer to <a href=\"https://www.coursera.org/lecture/audio-signal-processing/stft-2-tjEQe\" class=\"external\">this Coursera video</a> on audio signal processing and STFT.\n", |
| 360 | + "- When calling `tf.signal.stft`, choose the `frame_length` and `frame_step` parameters such that the generated spectrogram \"image\" is almost square. For more information on the STFT parameters choice, refer to [this Coursera video](https://www.coursera.org/lecture/audio-signal-processing/stft-2-tjEQe) on audio signal processing and STFT.\n", |
361 | 361 | "- The STFT produces an array of complex numbers representing magnitude and phase. However, in this tutorial you'll only use the magnitude, which you can derive by applying `tf.abs` on the output of `tf.signal.stft`."
|
362 | 362 | ]
|
363 | 363 | },
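Putting those points together, a minimal sketch of such a utility function is shown below. It assumes one-second clips sampled at 16 kHz (hence the 16,000-sample target length), and uses `frame_length=255` with `frame_step=128` as one reasonable choice that yields a roughly square spectrogram; treat these as illustrative defaults.

```python
import tensorflow as tf

def get_spectrogram(waveform, target_samples=16000):
  # Zero-pad clips shorter than `target_samples` (one second at 16 kHz)
  # so every spectrogram ends up with the same shape.
  waveform = tf.cast(waveform, tf.float32)
  zero_padding = tf.zeros(
      [target_samples] - tf.shape(waveform), dtype=tf.float32)
  equal_length = tf.concat([waveform, zero_padding], 0)

  # Short-time Fourier transform: frame_length and frame_step control the
  # time-frequency resolution trade-off and the spectrogram's dimensions.
  spectrogram = tf.signal.stft(
      equal_length, frame_length=255, frame_step=128)

  # Keep only the magnitude and discard the phase.
  spectrogram = tf.abs(spectrogram)

  # Add a trailing `channels` axis so the spectrogram can be fed to
  # convolution layers as a single-channel image.
  return spectrogram[..., tf.newaxis]
```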
|
|
750 | 750 | "source": [
|
751 | 751 | "### Display a confusion matrix\n",
|
752 | 752 | "\n",
|
753 |
| - "Use a <a href=\"https://developers.google.com/machine-learning/glossary#confusion-matrix\" class=\"external\">confusion matrix</a> to check how well the model did classifying each of the commands in the test set:\n" |
| 753 | + "Use a [confusion matrix](https://developers.google.com/machine-learning/glossary#confusion-matrix) to check how well the model did classifying each of the commands in the test set:\n" |
754 | 754 | ]
|
755 | 755 | },
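A sketch of that step, assuming `y_true` and `y_pred` already hold integer class IDs for the test set and `label_names` holds the ten command words, might look like this:

```python
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf

# Count how often each true label (rows) was predicted as each class (columns).
confusion_mtx = tf.math.confusion_matrix(y_true, y_pred)

plt.figure(figsize=(10, 8))
sns.heatmap(confusion_mtx,
            xticklabels=label_names, yticklabels=label_names,
            annot=True, fmt='g')
plt.xlabel('Prediction')
plt.ylabel('Label')
plt.show()
```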
|
756 | 756 | {
|
|
961 | 961 | "This tutorial demonstrated how to carry out simple audio classification/automatic speech recognition using a convolutional neural network with TensorFlow and Python. To learn more, consider the following resources:\n",
|
962 | 962 | "\n",
|
963 | 963 | "- The [Sound classification with YAMNet](https://www.tensorflow.org/hub/tutorials/yamnet) tutorial shows how to use transfer learning for audio classification.\n",
|
964 |
| - "- The notebooks from <a href=\"https://www.kaggle.com/c/tensorflow-speech-recognition-challenge/overview\" class=\"external\">Kaggle's TensorFlow speech recognition challenge</a>.\n", |
| 964 | + "- The notebooks from [Kaggle's TensorFlow speech recognition challenge](https://www.kaggle.com/c/tensorflow-speech-recognition-challenge/overview).\n", |
965 | 965 | "- The \n",
|
966 |
| - "<a href=\"https://codelabs.developers.google.com/codelabs/tensorflowjs-audio-codelab/index.html#0\" class=\"external\">TensorFlow.js - Audio recognition using transfer learning codelab</a> teaches how to build your own interactive web app for audio classification.\n", |
967 |
| - "- <a href=\"https://arxiv.org/abs/1709.04396\" class=\"external\">A tutorial on deep learning for music information retrieval</a> (Choi et al., 2017) on arXiv.\n", |
| 966 | + "[TensorFlow.js - Audio recognition using transfer learning codelab](https://codelabs.developers.google.com/codelabs/tensorflowjs-audio-codelab/index.html#0) teaches how to build your own interactive web app for audio classification.\n", |
| 967 | + "- [A tutorial on deep learning for music information retrieval](https://arxiv.org/abs/1709.04396) (Choi et al., 2017) on arXiv.\n", |
968 | 968 | "- TensorFlow also has additional support for [audio data preparation and augmentation](https://www.tensorflow.org/io/tutorials/audio) to help with your own audio-based projects.\n",
|
969 |
| - "- Consider using the <a href=\"https://librosa.org/\" class=\"external\">librosa</a> library—a Python package for music and audio analysis." |
| 969 | + "- Consider using the [librosa](https://librosa.org/) library for music and audio analysis." |
970 | 970 | ]
|
971 | 971 | }
|
972 | 972 | ],
|
|