Adding information about StreamBuffer and causal convolutions to the MoViNet transfer learning tutorial.
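
For context, the stream-buffer idea that the new cells describe can be sketched outside of MoViNet. The snippet below is a minimal NumPy illustration, not the MoViNet implementation: `causal_conv1d` and its `buffer` argument are hypothetical names, and it works on a 1D signal instead of video. It checks that carrying the last `kernel_size - 1` inputs from one chunk to the next reproduces the all-at-once output, mirroring the GRU comparison in the notebook.

```python
import numpy as np


def causal_conv1d(x, kernel, buffer=None):
    """Causal 1D convolution (cross-correlation, as in Keras) over a (time,) signal.

    `buffer` is a hypothetical stand-in for the stream buffer: it holds the
    last len(kernel) - 1 inputs seen so far (zeros at the start of a stream).
    Returns (output, new_buffer).
    """
    k = len(kernel)
    if buffer is None:
        # Equivalent to the zero padding used by a causal convolution.
        buffer = np.zeros(k - 1, dtype=x.dtype)
    padded = np.concatenate([buffer, x])
    # The output at step t is a weighted sum of inputs t-k+1 .. t only.
    out = np.array([padded[t:t + k] @ kernel for t in range(len(x))])
    # Keep only the inputs the next chunk will need.
    new_buffer = padded[len(padded) - (k - 1):]
    return out, new_buffer


rng = np.random.default_rng(0)
signal = rng.normal(size=10)
kernel = rng.normal(size=3)

full, _ = causal_conv1d(signal, kernel)  # Run it all at once.
first_half, state = causal_conv1d(signal[:5], kernel)  # Run the first half, and capture the buffer.
second_half, _ = causal_conv1d(signal[5:], kernel, buffer=state)  # Continue from the buffer.

print(np.allclose(full[:5], first_half))
print(np.allclose(full[5:], second_half))
```

The buffer only ever stores `kernel_size - 1` time steps, so the cost of continuing a stream does not grow with how much of the stream has already been processed.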

shilpakancharla · copybara-github · commit d5ad3ee18d35 · 2022-12-18T06:23:05.000-08:00
PiperOrigin-RevId: 496215181
diff --git a/site/en/tutorials/video/transfer_learning_with_movinet.ipynb b/site/en/tutorials/video/transfer_learning_with_movinet.ipynb
@@ -466,21 +466,100 @@
         "print(f\"Label: {labels.shape}\")"
       ]
     },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "lxbhPqXGvc_F"
+      },
+      "source": [
+        "## What are MoViNets?\n",
+        "\n",
+        "As mentioned previously, [MoViNets](https://arxiv.org/abs/2103.11511) are video classification models used for streaming video or online inference in tasks such as action recognition. Consider using MoViNets to classify your video data for action recognition.\n",
+        "\n",
+        "A 2D frame-based classifier is efficient and simple to run over whole videos, or over a stream one frame at a time. Because such classifiers can't take temporal context into account, they have limited accuracy and may give inconsistent outputs from frame to frame.\n",
+        "\n",
+        "A simple 3D CNN uses bidirectional temporal context, which can increase accuracy and temporal consistency. These networks may require more resources, and because they look into the future, they can't be used for streaming data.\n",
+        "\n",
+        "![Standard convolution](https://www.tensorflow.org/images/tutorials/video/standard_convolution.png)\n",
+        "\n",
+        "The MoViNet architecture uses 3D convolutions that are \"causal\" along the time axis (like `layers.Conv1D` with `padding=\"causal\"`). This gives some of the advantages of both approaches; most importantly, it allows for efficient streaming.\n",
+        "\n",
+        "![Causal convolution](https://www.tensorflow.org/images/tutorials/video/causal_convolution.png)\n",
+        "\n",
+        "Causal convolution ensures that the output at time *t* is computed using only inputs up to time *t*. To demonstrate how this can make streaming more efficient, start with a simpler example you may be familiar with: an RNN. The RNN passes state forward through time:\n",
+        "\n",
+        "![RNN model](https://www.tensorflow.org/images/tutorials/video/rnn_comparison.png)"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "dMvDkgfFZC6a"
+      },
+      "outputs": [],
+      "source": [
+        "gru = layers.GRU(units=4, return_sequences=True, return_state=True)\n",
+        "\n",
+        "inputs = tf.random.normal(shape=[1, 10, 8]) # (batch, sequence, channels)\n",
+        "\n",
+        "result, state = gru(inputs) # Run it all at once"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "T7xyb5C4bTs7"
+      },
+      "source": [
+        "By setting the RNN's `return_state=True` argument, you ask it to return the state at the end of the computation. This allows you to pause and then continue where you left off, and get exactly the same result:\n",
+        "\n",
+        "![States passing in RNNs](https://www.tensorflow.org/images/tutorials/video/rnn_state_passing.png)"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "bI8FOPRRXXPa"
+      },
+      "outputs": [],
+      "source": [
+        "first_half, state = gru(inputs[:, :5, :])  # Run the first half, and capture the state.\n",
+        "second_half, _ = gru(inputs[:, 5:, :], initial_state=state)  # Use the state to continue where you left off.\n",
+        "\n",
+        "print(np.allclose(result[:, :5, :], first_half))\n",
+        "print(np.allclose(result[:, 5:, :], second_half))"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "KM3MArumY_Qk"
+      },
+      "source": [
+        "Causal convolutions can be used the same way, if handled with care. This technique was used in the [Fast Wavenet Generation Algorithm](https://arxiv.org/abs/1611.09482) by Le Paine et al. In the [MoViNet paper](https://arxiv.org/abs/2103.11511), the `state` is referred to as the \"Stream Buffer\".\n",
+        "\n",
+        "![States passed in causal convolution](https://www.tensorflow.org/images/tutorials/video/causal_conv_states.png)\n",
+        "\n",
+        "By passing this little bit of state forward, you can avoid recalculating the whole receptive field that is shown above."
+      ]
+    },
     {
       "cell_type": "markdown",
       "metadata": {
         "id": "1UsxiPs8yA2e"
       },
       "source": [
-        "## Download pre-trained MoViNet model\n",
+        "## Download a pre-trained MoViNet model\n",
         "\n",
         "In this section, you will:\n",
         "\n",
         "1. Create a MoViNet model using the open source code provided in [`official/projects/movinet`](https://github.com/tensorflow/models/tree/master/official/projects/movinet) from TensorFlow Models.\n",
         "2. Load the pretrained weights. \n",
-        "3. Freeze the convolutional base, or all other layers except the final classifier head, in order to speed up fine-tuning.\n",
+        "3. Freeze the convolutional base, or all other layers except the final classifier head, to speed up fine-tuning.\n",
         "\n",
-        "To build the model, you can start with the `a0` configuration because it is the fastest to train when benchmarked against other models. Check out the [available models](https://github.com/tensorflow/models/blob/master/official/projects/movinet/configs/movinet.py) to see what might work for your use-case."
+        "To build the model, you can start with the `a0` configuration because it is the fastest to train when benchmarked against other models. Check out the [available MoViNet models on TensorFlow Model Garden](https://github.com/tensorflow/models/blob/master/official/projects/movinet/configs/movinet.py) to find what might work for your use case."
       ]
     },
     {
@@ -520,7 +599,7 @@
         "id": "BW23HVNtCXff"
       },
       "source": [
-        "To build a classifier, create a function that takes the backbone and the number of classes in a dataset. The `build_classifier` function will take the backbone and the number of classes in a dataset in order to build the classifier. In this case, the new classifier will take a `num_classes` outputs (10 classes for this subset of UCF101)."
+        "To build a classifier, create a `build_classifier` function that takes the backbone and the number of classes in a dataset. In this case, the new classifier will have `num_classes` outputs (10 classes for this subset of UCF101)."
       ]
     },
     {
@@ -633,7 +712,7 @@
         "id": "OkFst2gsHBwD"
       },
       "source": [
-        "To visualize model performance further, use a [confusion matrix](https://www.tensorflow.org/api_docs/python/tf/math/confusion_matrix). The confusion matrix allows you to assess the performance of the classification model beyond accuracy. In order to build the confusion matrix for this multi-class classification problem, get the actual values in the test set and the predicted values."
+        "To visualize model performance further, use a [confusion matrix](https://www.tensorflow.org/api_docs/python/tf/math/confusion_matrix). The confusion matrix allows you to assess the performance of the classification model beyond accuracy. To build the confusion matrix for this multi-class classification problem, get the actual values in the test set and the predicted values."
       ]
     },
     {
@@ -718,8 +797,7 @@
       "source": [
         "## Next steps\n",
         "\n",
-        "Now that you have some familiarity with the MoViNet model and how to leverage various TensorFlow APIs (for example, for transfer learning), try using the code in this tutorial with your own dataset. The data does not have to be limited to video data. Volumetric data, such as MRI scans, can also be used with 3D CNNs. The NUSDAT and IMH datasets mentioned in [Brain MRI-based 3D Convolutional Neural Networks for\n",
-        "Classification of Schizophrenia and Controls ](https://arxiv.org/pdf/2003.08818.pdf) could be two such sources for MRI data.\n",
+        "Now that you have some familiarity with the MoViNet model and how to leverage various TensorFlow APIs (for example, for transfer learning), try using the code in this tutorial with your own dataset. The data does not have to be limited to video data. Volumetric data, such as MRI scans, can also be used with 3D CNNs. The NUSDAT and IMH datasets mentioned in [Brain MRI-based 3D Convolutional Neural Networks for Classification of Schizophrenia and Controls](https://arxiv.org/pdf/2003.08818.pdf) could be two such sources for MRI data.\n",
         "\n",
         "In particular, using the `FrameGenerator` class used in this tutorial and the other video data and classification tutorials will help you load data into your models.\n",
         "\n",