Commit e4ef6d5

Update and lint the Transfer Learning with YAMNet tutorial
1 parent 1ddc6e6 commit e4ef6d5

1 file changed

+19
-21
lines changed

site/en/tutorials/audio/transfer_learning_audio.ipynb

Lines changed: 19 additions & 21 deletions
Original file line number | Diff line number | Diff line change
@@ -64,7 +64,7 @@
6464
"source": [
6565
"# Transfer Learning with YAMNet for environmental sound classification\n",
6666
"\n",
67-
"[YAMNet](https://tfhub.dev/google/yamnet/1) is a pretrained deep neural network that can predict audio events from [521 classes](https://github.com/tensorflow/models/blob/master/research/audioset/yamnet/yamnet_class_map.csv), like laughter, barking, or a siren. \n",
67+
"[YAMNet](https://tfhub.dev/google/yamnet/1) is a pre-trained deep neural network that can predict audio events from [521 classes](https://github.com/tensorflow/models/blob/master/research/audioset/yamnet/yamnet_class_map.csv), like laughter, barking, or a siren. \n",
6868
"\n",
6969
" In this tutorial you will learn how to:\n",
7070
"\n",
@@ -130,19 +130,17 @@
130130
"source": [
131131
"## About YAMNet\n",
132132
"\n",
133-
"[YAMNet](https://github.com/tensorflow/models/tree/master/research/audioset/yamnet) is a pretrained neural network that employs the [MobileNetV1](https://arxiv.org/abs/1704.04861) depthwise-separable convolution architecture. It can use an audio waveform as input and classify 521 audio events from the [AudioSet](http://g.co/audioset) corpus.\n",
133+
"[YAMNet](https://github.com/tensorflow/models/tree/master/research/audioset/yamnet) is a pre-trained neural network that employs the [MobileNetV1](https://arxiv.org/abs/1704.04861) depthwise-separable convolution architecture. It can use an audio waveform as input and classify 521 audio events from the [AudioSet](http://g.co/audioset) corpus.\n",
134134
"\n",
135135
"Internally, the model extracts \"frames\" from the audio signal and processes batches of these frames. This version of the model uses frames that are 0.96 second long and extracts one frame every 0.48 second.\n",
136136
"\n",
137-
"The model accepts a 1-D float32 Tensor or NumPy array containing a waveform of arbitrary length, represented as mono 16 kHz samples in the range `[-1.0, +1.0]`. This tutorial contains code to help you convert a `.wav` file into the correct format.\n",
137+
"The model accepts a 1-D float32 Tensor or NumPy array containing a waveform of arbitrary length, represented as single-channel (mono) 16 kHz samples in the range `[-1.0, +1.0]`. This tutorial contains code to help you convert WAV files into the supported format.\n",
138138
"\n",
139-
"The model returns 3 outputs, including the class scores, embeddings (which you will use for transfer learning), and the log mel spectrogram. You can find more details [here](https://tfhub.dev/google/yamnet/1), and this tutorial will walk you through using these in practice.\n",
139+
"The model returns 3 outputs, including the class scores, embeddings (which you will use for transfer learning), and the log mel [spectrogram](https://www.tensorflow.org/tutorials/audio/simple_audio#spectrogram). You can find more details [here](https://tfhub.dev/google/yamnet/1).\n",
140140
"\n",
141-
"One specific use of YAMNet is as a high-level feature extractor: the 1024-dimensional embedding output of YAMNet can be used as the input features of another shallow model which can then be trained on a small amount of data for a particular task. This allows the quick creation of specialized audio classifiers without requiring a lot of labeled data and without having to train a large model end-to-end.\n",
141+
"One specific use of YAMNet is as a high-level feature extractor: its 1,024-dimensional embedding output. You will use the base (YAMNet) model's embeddings as input features and feed them into your shallower model consisting of one hidden [Dense](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense) layer. Then, you will train the network on a small amount of data for audio classification _without_ requiring a lot of labeled data or end-to-end training. (This is similar to [transfer learning for image classification with TensorFlow Hub](https://www.tensorflow.org/tutorials/images/transfer_learning_with_hub).)\n",
142142
"\n",
143-
"You will use YAMNet's embeddings output for transfer learning and train one or more [Dense](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense) layers on top of this.\n",
144-
"\n",
145-
"First, you will try the model and see the results of classifying audio. You will then construct the data pre-processing pipeline.\n",
143+
"First, you will test the model and see the results of classifying audio. You will then construct the data pre-processing pipeline.\n",
146144
"\n",
147145
"### Loading YAMNet from TensorFlow Hub\n",
148146
"\n",
@@ -196,9 +194,9 @@
196194
"id": "mBm9y9iV2U_-"
197195
},
198196
"source": [
199-
"You will need a function to load audio files, which will also be used later when working with the training data.\n",
197+
"You will need a function to load audio files, which will also be used later when working with the training data. (Learn more about reading audio files and their labels in [Simple audio recognition](https://www.tensorflow.org/tutorials/audio/simple_audio#reading_audio_files_and_their_labels).)\n",
200198
"\n",
201-
"Note: The returned `wav_data` from `load_wav_16k_mono` is already normalized to values in the `[-1.0, 1.0]` range (as stated in the model's [documentation](https://tfhub.dev/google/yamnet/1))."
199+
"Note: The returned `wav_data` from `load_wav_16k_mono` is already normalized to values in the `[-1.0, 1.0]` range (for more information, go to [YAMNet's documentation on TF Hub](https://tfhub.dev/google/yamnet/1))."
202200
]
203201
},
204202
{
@@ -213,7 +211,7 @@
213211
"\n",
214212
"@tf.function\n",
215213
"def load_wav_16k_mono(filename):\n",
216-
" \"\"\" read in a waveform file and convert to 16 kHz mono \"\"\"\n",
214+
" \"\"\" Load a WAV file, convert it to a float tensor, resample to 16 kHz single-channel audio. \"\"\"\n",
217215
" file_contents = tf.io.read_file(filename)\n",
218216
" wav, sample_rate = tf.audio.decode_wav(\n",
219217
" file_contents,\n",
@@ -312,7 +310,7 @@
312310
"source": [
313311
"## ESC-50 dataset\n",
314312
"\n",
315-
"The [ESC-50 dataset](https://github.com/karolpiczak/ESC-50#repository-content) - described in detail [here](https://www.karolpiczak.com/papers/Piczak2015-ESC-Dataset.pdf) - is a labeled collection of 2,000 five-second long environmental audio recordings. The data consists of 50 classes, with 40 examples per class.\n",
313+
"The [ESC-50 dataset](https://github.com/karolpiczak/ESC-50#repository-content) ([Piczak, 2015](https://www.karolpiczak.com/papers/Piczak2015-ESC-Dataset.pdf)) is a labeled collection of 2,000 five-second long environmental audio recordings. The dataset consists of 50 classes, with 40 examples per class.\n",
316314
"\n",
317315
"Download the dataset and extract it. \n"
318316
]
@@ -407,9 +405,9 @@
407405
"source": [
408406
"### Load the audio files and retrieve embeddings\n",
409407
"\n",
410-
"Here you'll apply the `load_wav_16k_mono` and prepare the wav data for the model.\n",
408+
"Here you'll apply the `load_wav_16k_mono` function and prepare the WAV data for the model.\n",
411409
"\n",
412-
"When extracting embeddings from the wav data, you get an array of shape `(N, 1024)` where `N` is the number of frames that YAMNet found (one for every 0.48 seconds of audio)."
410+
"When extracting embeddings from the WAV data, you get an array of shape `(N, 1024)` where `N` is the number of frames that YAMNet found (one for every 0.48 seconds of audio)."
413411
]
414412
},
415413
{
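
Note on `N`: as a rough check of the `(N, 1024)` shape described in this hunk, the frame count can be estimated from the 0.96-second window and the 0.48-second hop. This is only a sketch that assumes plain, unpadded framing. The helper name `approx_frame_count` is hypothetical, and YAMNet may pad the waveform internally, so the authoritative value is the first dimension of the embeddings the model actually returns.

```python
# Assumption: plain sliding-window framing with no padding.
SAMPLE_RATE = 16_000      # samples per second expected by YAMNet
WINDOW_SAMPLES = 15_360   # 0.96 s at 16 kHz
HOP_SAMPLES = 7_680       # 0.48 s at 16 kHz


def approx_frame_count(num_samples: int) -> int:
    """Estimate how many full 0.96 s frames fit in a waveform (hypothetical helper)."""
    if num_samples < WINDOW_SAMPLES:
        # Not enough audio for a single full frame under the no-padding assumption.
        return 0
    return (num_samples - WINDOW_SAMPLES) // HOP_SAMPLES + 1


# A five-second clip at 16 kHz has 80,000 samples.
print(approx_frame_count(5 * SAMPLE_RATE))
```

Under this assumption, a five-second ESC-50 clip yields 9 frames; compare this with `embeddings.shape[0]` on real data before relying on it.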
@@ -418,9 +416,9 @@
418416
"id": "AKDT5RomaDKO"
419417
},
420418
"source": [
421-
"Your model will use each frame as one input. Therefore, you need to create a new column that has one frame per row. You also need to expand the labels and fold column to proper reflect these new rows.\n",
419+
"Your model will use each frame as one input. Therefore, you need to create a new column that has one frame per row. You also need to expand the labels and the `fold` column to properly reflect these new rows.\n",
422420
"\n",
423-
"The expanded fold column keeps the original value. You cannot mix frames because, when performing the splits, you might end up having parts of the same audio on different splits - that would make your validation and test steps less effective."
421+
"The expanded `fold` column keeps the original values. You cannot mix frames because, when performing the splits, you might end up having parts of the same audio on different splits, which would make your validation and test steps less effective."
424422
]
425423
},
426424
{
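
The per-frame expansion described in this hunk can be illustrated without TensorFlow. This is a minimal sketch, assuming each clip's embeddings are available as a list of frame vectors; `expand_per_frame` is a hypothetical helper, not the notebook's code.

```python
def expand_per_frame(embeddings, label, fold):
    """Return one (frame_embedding, label, fold) row per frame of a single clip.

    Every frame inherits the clip's label and fold, so all frames of a clip
    stay in the same fold when the data is split later.
    """
    return [(frame, label, fold) for frame in embeddings]


# Toy clip with 3 frames of 4-dimensional embeddings (real ones are 1,024-dimensional).
clip_embeddings = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
]
rows = expand_per_frame(clip_embeddings, label=1, fold=3)
for frame, label, fold in rows:
    print(frame, label, fold)
```

Because the label and fold are copied rather than recomputed, the number of output rows always equals the number of frames in the clip.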
@@ -484,11 +482,11 @@
484482
"source": [
485483
"### Split the data\n",
486484
"\n",
487-
"You will use the `fold` column to split the dataset into train, validation and test.\n",
485+
"You will use the `fold` column to split the dataset into train, validation and test sets.\n",
488486
"\n",
489-
"The fold values are so that files from the same original wav file are keep on the same split, you can find more information on the [paper](https://www.karolpiczak.com/papers/Piczak2015-ESC-Dataset.pdf) describing the dataset.\n",
487+
"ESC-50 is arranged into five uniformly-sized cross-validation `fold`s, such that clips from the same original source are always in the same `fold` - find out more in the [ESC: Dataset for Environmental Sound Classification](https://www.karolpiczak.com/papers/Piczak2015-ESC-Dataset.pdf) paper.\n",
490488
"\n",
491-
"The last step is to remove the `fold` column from the dataset since we're not going to use it anymore on the training process.\n"
489+
"The last step is to remove the `fold` column from the dataset since you're not going to use it during training.\n"
492490
]
493491
},
494492
{
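
The fold-based split described in this hunk can be sketched in plain Python. Which folds serve as validation and test sets is an assumption made here for illustration (fold 4 and fold 5), and `split_by_fold` is a hypothetical helper. The point is only that rows are routed by their `fold` value, so frames from one clip never land in two different sets, and that `fold` is dropped afterwards.

```python
def split_by_fold(rows, val_fold=4, test_fold=5):
    """Split (embedding, label, fold) rows into train, validation and test lists.

    The fold value decides the destination and is then discarded, since it is
    not used during training.
    """
    train, val, test = [], [], []
    for embedding, label, fold in rows:
        example = (embedding, label)  # drop the fold column
        if fold == test_fold:
            test.append(example)
        elif fold == val_fold:
            val.append(example)
        else:
            train.append(example)
    return train, val, test


# Toy rows: one single-value "embedding" per fold, folds 1 through 5.
toy_rows = [([float(f)], f % 2, f) for f in range(1, 6)]
train, val, test = split_by_fold(toy_rows)
print(len(train), len(val), len(test))
```

With five folds, this choice leaves three folds for training and one each for validation and testing.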
@@ -525,7 +523,7 @@
525523
"## Create your model\n",
526524
"\n",
527525
"You did most of the work!\n",
528-
"Next, define a very simple Sequential model with one hidden layer and two outputs to recognize cats and dogs.\n"
526+
"Next, define a very simple [Sequential](https://www.tensorflow.org/guide/keras/sequential_model) model with one hidden layer and two outputs to recognize cats and dogs from sounds.\n"
529527
]
530528
},
531529
{
@@ -649,7 +647,7 @@
649647
"\n",
650648
"To do that, you will combine YAMNet with your model into a single model that you can export for other applications.\n",
651649
"\n",
652-
"To make it easier to use the model's result, the final layer will be a `reduce_mean` operation. When using this model for serving, as you will see bellow, you will need the name of the final layer. If you don't define one, TensorFlow will auto-define an incremental one that makes it hard to test, as it will keep changing every time you train the model. When using a raw tf operation you can't assign a name to it. To address this issue, you'll create a custom layer that just apply `reduce_mean` and you will call it 'classifier'.\n"
650+
"To make it easier to use the model's result, the final layer will be a `reduce_mean` operation. When using this model for serving, as you will see below, you will need the name of the final layer. If you don't define one, TensorFlow will auto-define an incremental one that makes it hard to test, as it will keep changing every time you train the model. When using a raw TensorFlow operation, you can't assign a name to it. To address this issue, you'll create a custom layer that applies `reduce_mean` and call it `'classifier'`.\n"
653651
]
654652
},
655653
{
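
What the final `reduce_mean` step computes can be illustrated in plain Python: the per-frame class scores of one clip are averaged over the frame axis, giving a single score vector for the whole clip. This is a sketch of the arithmetic only, not the notebook's custom layer; `mean_over_frames` and the toy scores are hypothetical.

```python
def mean_over_frames(frame_scores):
    """Average per-frame class scores over the frame axis (axis 0)."""
    num_frames = len(frame_scores)
    # zip(*...) regroups the scores by class, so each column is averaged separately.
    return [sum(column) / num_frames for column in zip(*frame_scores)]


# Toy scores for 3 frames and 2 classes.
frame_scores = [
    [2.0, 0.0],
    [1.0, 3.0],
    [3.0, 0.0],
]
clip_scores = mean_over_frames(frame_scores)
predicted_class = clip_scores.index(max(clip_scores))
print(clip_scores, predicted_class)
```

Averaging means a single frame with a high score for the wrong class (the second frame here) does not necessarily decide the clip-level prediction.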
