
Commit 183c35a

Lint and update the Overfit and underfit tutorial
1 parent f78339c commit 183c35a


site/en/tutorials/keras/overfit_and_underfit.ipynb

Lines changed: 31 additions & 32 deletions
@@ -102,13 +102,13 @@
102102
"source": [
103103
"As always, the code in this example will use the `tf.keras` API, which you can learn more about in the TensorFlow [Keras guide](https://www.tensorflow.org/guide/keras).\n",
104104
"\n",
105-
"In both of the previous examples—[classifying text](https://www.tensorflow.org/tutorials/keras/text_classification_with_hub) and [predicting fuel efficiency](https://www.tensorflow.org/tutorials/keras/regression) — we saw that the accuracy of our model on the validation data would peak after training for a number of epochs, and would then stagnate or start decreasing.\n",
105+
"In both of the previous examples—[classifying text](text_classification_with_hub.ipynb) and [predicting fuel efficiency](regression.ipynb)—the accuracy of models on the validation data would peak after training for a number of epochs and then stagnate or start decreasing.\n",
106106
"\n",
107-
"In other words, our model would *overfit* to the training data. Learning how to deal with overfitting is important. Although it's often possible to achieve high accuracy on the *training set*, what we really want is to develop models that generalize well to a *testing set* (or data they haven't seen before).\n",
107+
"In other words, your model would *overfit* to the training data. Learning how to deal with overfitting is important. Although it's often possible to achieve high accuracy on the *training set*, what you really want is to develop models that generalize well to a *testing set* (or data they haven't seen before).\n",
108108
"\n",
109109
"The opposite of overfitting is *underfitting*. Underfitting occurs when there is still room for improvement on the train data. This can happen for a number of reasons: If the model is not powerful enough, is over-regularized, or has simply not been trained long enough. This means the network has not learned the relevant patterns in the training data.\n",
110110
"\n",
111-
"If you train for too long though, the model will start to overfit and learn patterns from the training data that don't generalize to the test data. We need to strike a balance. Understanding how to train for an appropriate number of epochs as we'll explore below is a useful skill.\n",
111+
"If you train for too long though, the model will start to overfit and learn patterns from the training data that don't generalize to the test data. You need to strike a balance. Understanding how to train for an appropriate number of epochs as we'll explore below is a useful skill.\n",
112112
"\n",
113113
"To prevent overfitting, the best solution is to use more complete training data. The dataset should cover the full range of inputs that the model is expected to handle. Additional data may only be useful if it covers new and interesting cases.\n",
114114
"\n",
@@ -202,9 +202,9 @@
202202
"id": "1cweoTiruj8O"
203203
},
204204
"source": [
205-
"## The Higgs Dataset\n",
205+
"## The Higgs dataset\n",
206206
"\n",
207-
"The goal of this tutorial is not to do particle physics, so don't dwell on the details of the dataset. It contains 11 000 000 examples, each with 28 features, and a binary class label."
207+
"The goal of this tutorial is not to do particle physics, so don't dwell on the details of the dataset. It contains 11,000,000 examples, each with 28 features, and a binary class label."
208208
]
209209
},
210210
{
@@ -280,7 +280,7 @@
280280
"source": [
281281
"TensorFlow is most efficient when operating on large batches of data.\n",
282282
"\n",
283-
"So instead of repacking each row individually make a new `Dataset` that takes batches of 10000-examples, applies the `pack_row` function to each batch, and then splits the batches back up into individual records:"
283+
"So, instead of repacking each row individually make a new `tf.data.Dataset` that takes batches of 10,000 examples, applies the `pack_row` function to each batch, and then splits the batches back up into individual records:"
284284
]
285285
},
286286
{
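For context, a minimal sketch of the repacking step described above, assuming `ds` is the raw `CsvDataset` and `pack_row` is the row-repacking function defined earlier in the notebook:

```python
import tensorflow as tf

# Assumes `ds` (the raw CsvDataset) and `pack_row` were defined in earlier cells.
# Batch 10,000 rows at a time so pack_row operates on whole batches, then split
# the packed batches back into individual (features, label) records.
packed_ds = ds.batch(10000).map(pack_row).unbatch()
```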
@@ -300,7 +300,7 @@
300300
"id": "lUbxc5bxNSXV"
301301
},
302302
"source": [
303-
"Have a look at some of the records from this new `packed_ds`.\n",
303+
"Inspect some of the records from this new `packed_ds`.\n",
304304
"\n",
305305
"The features are not perfectly normalized, but this is sufficient for this tutorial."
306306
]
@@ -324,7 +324,7 @@
324324
"id": "ICKZRY7gN-QM"
325325
},
326326
"source": [
327-
"To keep this tutorial relatively short use just the first 1000 samples for validation, and the next 10 000 for training:"
327+
"To keep this tutorial relatively short, use just the first 1,000 samples for validation, and the next 10,000 for training:"
328328
]
329329
},
330330
{
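A sketch of that split, assuming `packed_ds` from the repacking step above; `Dataset.take` and `Dataset.skip` carve out the validation and training subsets, and `Dataset.cache` keeps these small slices in memory:

```python
N_VALIDATION = int(1e3)   # first 1,000 examples for validation
N_TRAIN = int(1e4)        # next 10,000 examples for training

# Assumes `packed_ds` holds (features, label) records from the repacking step.
validate_ds = packed_ds.take(N_VALIDATION).cache()
train_ds = packed_ds.skip(N_VALIDATION).take(N_TRAIN).cache()
```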
@@ -382,7 +382,7 @@
382382
"id": "6PMliHoVO3OL"
383383
},
384384
"source": [
385-
"These datasets return individual examples. Use the `.batch` method to create batches of an appropriate size for training. Before batching also remember to `.shuffle` and `.repeat` the training set."
385+
"These datasets return individual examples. Use the `Dataset.batch` method to create batches of an appropriate size for training. Before batching, also remember to use `Dataset.shuffle` and `Dataset.repeat` on the training set."
386386
]
387387
},
388388
{
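One way to set up that batching, assuming the `train_ds`/`validate_ds` splits from the sketch above; the batch and shuffle-buffer sizes here are illustrative:

```python
BATCH_SIZE = 500
BUFFER_SIZE = int(1e4)  # shuffle buffer, sized to hold the whole training split

# Shuffle and repeat only the training set; the validation set is just batched.
train_ds = train_ds.shuffle(BUFFER_SIZE).repeat().batch(BATCH_SIZE)
validate_ds = validate_ds.batch(BATCH_SIZE)
```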
@@ -417,7 +417,7 @@
417417
"\n",
418418
"To find an appropriate model size, it's best to start with relatively few layers and parameters, then begin increasing the size of the layers or adding new layers until you see diminishing returns on the validation loss.\n",
419419
"\n",
420-
"Start with a simple model using only `layers.Dense` as a baseline, then create larger versions, and compare them."
420+
"Start with a simple model using only densely-connected layers (`tf.keras.layers.Dense`) as a baseline, then create larger models, and compare them."
421421
]
422422
},
423423
{
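A possible shape for that baseline, assuming the 28 Higgs input features: a single small hidden layer feeding one logit output.

```python
import tensorflow as tf

FEATURES = 28  # number of input features in the Higgs dataset

tiny_model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='elu', input_shape=(FEATURES,)),
    tf.keras.layers.Dense(1)  # single logit for the binary label
])
```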
@@ -435,7 +435,7 @@
435435
"id": "pNzkSkkXSP5l"
436436
},
437437
"source": [
438-
"Many models train better if you gradually reduce the learning rate during training. Use `optimizers.schedules` to reduce the learning rate over time:"
438+
"Many models train better if you gradually reduce the learning rate during training. Use `tf.keras.optimizers.schedules` to reduce the learning rate over time:"
439439
]
440440
},
441441
{
@@ -462,7 +462,7 @@
462462
"id": "kANLx6OYTQ8B"
463463
},
464464
"source": [
465-
"The code above sets a `schedules.InverseTimeDecay` to hyperbolically decrease the learning rate to 1/2 of the base rate at 1000 epochs, 1/3 at 2000 epochs and so on."
465+
"The code above sets a `tf.keras.optimizers.schedules.InverseTimeDecay` to hyperbolically decrease the learning rate to 1/2 of the base rate at 1,000 epochs, 1/3 at 2,000 epochs, and so on."
466466
]
467467
},
468468
{
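A sketch of the schedule the two hunks above describe; the 0.001 base rate and the Adam optimizer are illustrative choices, and `STEPS_PER_EPOCH` assumes the 10,000-example training split with batches of 500:

```python
import tensorflow as tf

STEPS_PER_EPOCH = 10000 // 500  # training examples / batch size

# Hyperbolic decay: the learning rate reaches 1/2 of the base rate after
# 1,000 epochs, 1/3 after 2,000 epochs, and so on.
lr_schedule = tf.keras.optimizers.schedules.InverseTimeDecay(
    0.001,
    decay_steps=STEPS_PER_EPOCH * 1000,
    decay_rate=1,
    staircase=False)

optimizer = tf.keras.optimizers.Adam(lr_schedule)
```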
@@ -492,7 +492,7 @@
492492
"\n",
493493
"The training for this tutorial runs for many short epochs. To reduce the logging noise use the `tfdocs.EpochDots` which simply prints a `.` for each epoch, and a full set of metrics every 100 epochs.\n",
494494
"\n",
495-
"Next include `callbacks.EarlyStopping` to avoid long and unnecessary training times. Note that this callback is set to monitor the `val_binary_crossentropy`, not the `val_loss`. This difference will be important later.\n",
495+
"Next include `tf.keras.callbacks.EarlyStopping` to avoid long and unnecessary training times. Note that this callback is set to monitor the `val_binary_crossentropy`, not the `val_loss`. This difference will be important later.\n",
496496
"\n",
497497
"Use `callbacks.TensorBoard` to generate TensorBoard logs for the training.\n"
498498
]
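A hedged sketch of such a callback list; `tfdocs.modeling.EpochDots` comes from the `tensorflow_docs` package, and the `patience` value and `logdir` path are illustrative:

```python
import pathlib
import tensorflow as tf
import tensorflow_docs as tfdocs
import tensorflow_docs.modeling  # provides tfdocs.modeling.EpochDots

logdir = pathlib.Path("logs")  # illustrative log directory

def get_callbacks(name):
  return [
      tfdocs.modeling.EpochDots(),  # prints '.' per epoch, full metrics every 100
      tf.keras.callbacks.EarlyStopping(monitor='val_binary_crossentropy',
                                       patience=200),
      tf.keras.callbacks.TensorBoard(str(logdir / name)),
  ]
```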
@@ -643,7 +643,7 @@
643643
"id": "YjMb6E72f2pN"
644644
},
645645
"source": [
646-
"To see if you can beat the performance of the small model, progressively train some larger models.\n",
646+
"To check if you can beat the performance of the small model, progressively train some larger models.\n",
647647
"\n",
648648
"Try two hidden layers with 16 units each:"
649649
]
@@ -690,7 +690,7 @@
690690
"id": "SrfoVQheYSO5"
691691
},
692692
"source": [
693-
"Now try 3 hidden layers with 64 units each:"
693+
"Now try three hidden layers with 64 units each:"
694694
]
695695
},
696696
{
@@ -737,7 +737,7 @@
737737
"source": [
738738
"### Large model\n",
739739
"\n",
740-
"As an exercise, you can create an even larger model, and see how quickly it begins overfitting. Next, let's add to this benchmark a network that has much more capacity, far more than the problem would warrant:"
740+
"As an exercise, you can create an even larger model and check how quickly it begins overfitting. Next, add to this benchmark a network that has much more capacity, far more than the problem would warrant:"
741741
]
742742
},
743743
{
@@ -803,7 +803,7 @@
803803
"source": [
804804
"While building a larger model gives it more power, if this power is not constrained somehow it can easily overfit to the training set.\n",
805805
"\n",
806-
"In this example, typically, only the `\"Tiny\"` model manages to avoid overfitting altogether, and each of the larger models overfit the data more quickly. This becomes so severe for the `\"large\"` model that you need to switch the plot to a log-scale to really see what's happening.\n",
806+
"In this example, typically, only the `\"Tiny\"` model manages to avoid overfitting altogether, and each of the larger models overfit the data more quickly. This becomes so severe for the `\"large\"` model that you need to switch the plot to a log-scale to really figure out what's happening.\n",
807807
"\n",
808808
"This is apparent if you plot and compare the validation metrics to the training metrics.\n",
809809
"\n",
@@ -969,15 +969,15 @@
969969
"source": [
970970
"You may be familiar with Occam's Razor principle: given two explanations for something, the explanation most likely to be correct is the \"simplest\" one, the one that makes the least amount of assumptions. This also applies to the models learned by neural networks: given some training data and a network architecture, there are multiple sets of weights values (multiple models) that could explain the data, and simpler models are less likely to overfit than complex ones.\n",
971971
"\n",
972-
"A \"simple model\" in this context is a model where the distribution of parameter values has less entropy (or a model with fewer parameters altogether, as we saw in the section above). Thus a common way to mitigate overfitting is to put constraints on the complexity of a network by forcing its weights only to take small values, which makes the distribution of weight values more \"regular\". This is called \"weight regularization\", and it is done by adding to the loss function of the network a cost associated with having large weights. This cost comes in two flavors:\n",
972+
"A \"simple model\" in this context is a model where the distribution of parameter values has less entropy (or a model with fewer parameters altogether, as demonstrated in the section above). Thus a common way to mitigate overfitting is to put constraints on the complexity of a network by forcing its weights only to take small values, which makes the distribution of weight values more \"regular\". This is called \"weight regularization\", and it is done by adding to the loss function of the network a cost associated with having large weights. This cost comes in two flavors:\n",
973973
"\n",
974974
"* [L1 regularization](https://developers.google.com/machine-learning/glossary/#L1_regularization), where the cost added is proportional to the absolute value of the weights coefficients (i.e. to what is called the \"L1 norm\" of the weights).\n",
975975
"\n",
976976
"* [L2 regularization](https://developers.google.com/machine-learning/glossary/#L2_regularization), where the cost added is proportional to the square of the value of the weights coefficients (i.e. to what is called the squared \"L2 norm\" of the weights). L2 regularization is also called weight decay in the context of neural networks. Don't let the different name confuse you: weight decay is mathematically the exact same as L2 regularization.\n",
977977
"\n",
978-
"L1 regularization pushes weights towards exactly zero encouraging a sparse model. L2 regularization will penalize the weights parameters without making them sparse since the penalty goes to zero for small weights-one reason why L2 is more common.\n",
978+
"L1 regularization pushes weights towards exactly zero, encouraging a sparse model. L2 regularization will penalize the weights parameters without making them sparse since the penalty goes to zero for small weightsone reason why L2 is more common.\n",
979979
"\n",
980-
"In `tf.keras`, weight regularization is added by passing weight regularizer instances to layers as keyword arguments. Let's add L2 weight regularization now."
980+
"In `tf.keras`, weight regularization is added by passing weight regularizer instances to layers as keyword arguments. Add L2 weight regularization:"
981981
]
982982
},
983983
{
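For illustration, one way to attach the regularizer as keyword arguments, roughly mirroring the "Large" model's layer widths (the exact architecture here is an assumption); with a coefficient of 0.001, every weight `w` adds `0.001 * w**2` to the total loss:

```python
import tensorflow as tf

FEATURES = 28  # Higgs input features, as above

l2_model = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation='elu',
                          kernel_regularizer=tf.keras.regularizers.l2(0.001),
                          input_shape=(FEATURES,)),
    tf.keras.layers.Dense(512, activation='elu',
                          kernel_regularizer=tf.keras.regularizers.l2(0.001)),
    tf.keras.layers.Dense(1)
])
```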
@@ -1035,7 +1035,7 @@
10351035
"id": "Kx1YHMsVxWjP"
10361036
},
10371037
"source": [
1038-
"As you can see, the `\"L2\"` regularized model is now much more competitive with the the `\"Tiny\"` model. This `\"L2\"` model is also much more resistant to overfitting than the `\"Large\"` model it was based on despite having the same number of parameters."
1038+
"As demonstrated in the diagram above, the `\"L2\"` regularized model is now much more competitive with the `\"Tiny\"` model. This `\"L2\"` model is also much more resistant to overfitting than the `\"Large\"` model it was based on despite having the same number of parameters."
10391039
]
10401040
},
10411041
{
@@ -1046,9 +1046,9 @@
10461046
"source": [
10471047
"#### More info\n",
10481048
"\n",
1049-
"There are two important things to note about this sort of regularization.\n",
1049+
"There are two important things to note about this sort of regularization:\n",
10501050
"\n",
1051-
"**First:** if you are writing your own training loop, then you need to be sure to ask the model for its regularization losses."
1051+
"1. If you are writing your own training loop, then you need to be sure to ask the model for its regularization losses."
10521052
]
10531053
},
10541054
{
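A minimal sketch of what point 1 means inside a custom training step; `model`, `loss_fn`, `optimizer`, and the batch tensors are assumed to exist, and the model is assumed to have at least one regularized layer:

```python
import tensorflow as tf

@tf.function
def train_step(model, loss_fn, optimizer, features, labels):
  with tf.GradientTape() as tape:
    predictions = model(features, training=True)
    loss = loss_fn(labels, predictions)
    # The layer regularizers collect their penalties in model.losses;
    # in a custom loop they must be added to the loss explicitly.
    loss += tf.add_n(model.losses)
  gradients = tape.gradient(loss, model.trainable_variables)
  optimizer.apply_gradients(zip(gradients, model.trainable_variables))
  return loss
```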
@@ -1069,9 +1069,9 @@
10691069
"id": "MLhG6fMSjE-J"
10701070
},
10711071
"source": [
1072-
"**Second:** This implementation works by adding the weight penalties to the model's loss, and then applying a standard optimization procedure after that.\n",
1072+
"2. This implementation works by adding the weight penalties to the model's loss, and then applying a standard optimization procedure after that.\n",
10731073
"\n",
1074-
"There is a second approach that instead only runs the optimizer on the raw loss, and then while applying the calculated step the optimizer also applies some weight decay. This \"Decoupled Weight Decay\" is seen in optimizers like `optimizers.FTRL` and `optimizers.AdamW`."
1074+
"There is a second approach that instead only runs the optimizer on the raw loss, and then while applying the calculated step the optimizer also applies some weight decay. This \"decoupled weight decay\" is used in optimizers like `tf.keras.optimizers.Ftrl` and `tfa.optimizers.AdamW`."
10751075
]
10761076
},
10771077
{
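If you want to try the decoupled approach, a hedged example using the TensorFlow Addons optimizer mentioned above; the decay and learning-rate values are placeholders, not tuned settings:

```python
import tensorflow_addons as tfa

# AdamW applies weight decay directly in the update step instead of
# adding a penalty term to the loss.
optimizer = tfa.optimizers.AdamW(weight_decay=1e-4, learning_rate=1e-3)
```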
@@ -1086,14 +1086,13 @@
10861086
"\n",
10871087
"The intuitive explanation for dropout is that because individual nodes in the network cannot rely on the output of the others, each node must output features that are useful on their own.\n",
10881088
"\n",
1089-
"Dropout, applied to a layer, consists of randomly \"dropping out\" (i.e. set to zero) a number of output features of the layer during training. Let's say a given layer would normally have returned a vector [0.2, 0.5, 1.3, 0.8, 1.1] for a given input sample during training; after applying dropout, this vector will have a few zero entries distributed at random, e.g. [0, 0.5,\n",
1090-
"1.3, 0, 1.1].\n",
1089+
"Dropout, applied to a layer, consists of randomly \"dropping out\" (i.e. set to zero) a number of output features of the layer during training. For example, a given layer would normally have returned a vector `[0.2, 0.5, 1.3, 0.8, 1.1]` for a given input sample during training; after applying dropout, this vector will have a few zero entries distributed at random, e.g. `[0, 0.5, 1.3, 0, 1.1]`.\n",
10911090
"\n",
10921091
"The \"dropout rate\" is the fraction of the features that are being zeroed-out; it is usually set between 0.2 and 0.5. At test time, no units are dropped out, and instead the layer's output values are scaled down by a factor equal to the dropout rate, so as to balance for the fact that more units are active than at training time.\n",
10931092
"\n",
1094-
"In `tf.keras` you can introduce dropout in a network via the Dropout layer, which gets applied to the output of layer right before.\n",
1093+
"In Keras, you can introduce dropout in a network via the `tf.keras.layers.Dropout` layer, which gets applied to the output of layer right before.\n",
10951094
"\n",
1096-
"Let's add two Dropout layers in our network to see how well they do at reducing overfitting:"
1095+
"Add two dropout layers to your network to check how well they do at reducing overfitting:"
10971096
]
10981097
},
10991098
{
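A sketch of the dropout variant described above, again roughly mirroring the large model's layer widths (the exact architecture is an assumption); the 0.5 rate sits at the top of the usual 0.2–0.5 range:

```python
import tensorflow as tf

FEATURES = 28

dropout_model = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation='elu', input_shape=(FEATURES,)),
    tf.keras.layers.Dropout(0.5),  # zero out half the activations during training
    tf.keras.layers.Dense(512, activation='elu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1)
])
```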
@@ -1269,7 +1268,7 @@
12691268
"id": "gjfnkEeQyAFG"
12701269
},
12711270
"source": [
1272-
"To recap: here are the most common ways to prevent overfitting in neural networks:\n",
1271+
"To recap, here are the most common ways to prevent overfitting in neural networks:\n",
12731272
"\n",
12741273
"* Get more training data.\n",
12751274
"* Reduce the capacity of the network.\n",
@@ -1278,8 +1277,8 @@
12781277
"\n",
12791278
"Two important approaches not covered in this guide are:\n",
12801279
"\n",
1281-
"* data-augmentation\n",
1282-
"* batch normalization\n",
1280+
"* [Data augmentation](../images/data_augmentation.ipynb)\n",
1281+
"* Batch normalization (`tf.keras.layers.BatchNormalization`)\n",
12831282
"\n",
12841283
"Remember that each method can help on its own, but often combining them can be even more effective."
12851284
]
