
Commit c754bc1

Merge pull request #2245 from 8bitmp3:upda-warmstart-embedding
PiperOrigin-RevId: 548822493
2 parents: 37bb665 + 1ae390a

File tree: 1 file changed (+6 −7)


site/en/tutorials/text/warmstart_embedding_matrix.ipynb

Lines changed: 6 additions & 7 deletions
@@ -91,7 +91,7 @@
 "source": [
 "### Vocabulary\n",
 "\n",
-"The set of unique words is referred to as the vocabulary. To build a text model you need to choose a fixed vocabulary. Typically you you build the vocabulary from the most common words in a dataset. The vocabulary allows us to represent each piece of text by a sequence of ID's that you can lookup in the embedding matrix. Vocabulary allows us to represent each piece of text by the specific words that appear in it."
+"The set of unique words is referred to as the vocabulary. To build a text model you need to choose a fixed vocabulary. Typically you build the vocabulary from the most common words in a dataset. The vocabulary allows us to represent each piece of text by a sequence of ID's that you can lookup in the embedding matrix. Vocabulary allows us to represent each piece of text by the specific words that appear in it."
 ]
 },
 {
@@ -104,7 +104,7 @@
 "\n",
 "A model is trained with a set of embeddings that represents a given vocabulary. If the model needs to be updated or improved you can train to convergence significantly faster by reusing weights from a previous run. Using the embedding matrix from a previous run is more difficult. The problem is that any change to the vocabulary invalidates the word to id mapping.\n",
 "\n",
-"The `tf.keras.utils.warmstart_embedding_matrix` solves this problem by creating an embedding matrix for a new vocabulary from an embedding martix from a base vocabulary. Where a word exists in both vocabularies the base embedding vector is copied into the correct location in the new embedding matrix. This allows you to warm-start training after any change in the size or order of the vocabulary."
+"The `tf.keras.utils.warmstart_embedding_matrix` solves this problem by creating an embedding matrix for a new vocabulary from an embedding matrix from a base vocabulary. Where a word exists in both vocabularies the base embedding vector is copied into the correct location in the new embedding matrix. This allows you to warm-start training after any change in the size or order of the vocabulary."
 ]
 },
 {
@@ -155,7 +155,7 @@
 },
 "source": [
 "### Load the dataset\n",
-"The tutorial uses the [Large Movie Review Dataset](http://ai.stanford.edu/~amaas/data/sentiment/). You will train a sentiment classifier model on this dataset and in the process learn embeddings from scratch. Refer to [Loading text tutorial](https://www.tensorflow.org/tutorials/load_data/text) to learn more. \n",
+"The tutorial uses the [Large Movie Review Dataset](http://ai.stanford.edu/~amaas/data/sentiment/). You will train a sentiment classifier model on this dataset and in the process learn embeddings from scratch. Refer to the [Loading text tutorial](https://www.tensorflow.org/tutorials/load_data/text) to learn more. \n",
 "\n",
 "Download the dataset using Keras file utility and review the directories."
 ]
@@ -184,7 +184,7 @@
 "id": "eY6yROZNKvbd"
 },
 "source": [
-"The `train/` directory has `pos` and `neg` folders with movie reviews labelled as positive and negative respectively. You will use reviews from `pos` and `neg` folders to train a binary classification model."
+"The `train/` directory has `pos` and `neg` folders with movie reviews labeled as positive and negative respectively. You will use reviews from `pos` and `neg` folders to train a binary classification model."
 ]
 },
 {
@@ -715,7 +715,7 @@
 "source": [
 "You have successfully updated the model to accept a new vocabulary. The embedding layer is updated to map old vocabulary words to old embeddings and initialize embeddings for new vocabulary words to be learnt. The learned weights of the rest of the model will remain the same. The model is warm-started to continue to train from where it left off previously.\n",
 "\n",
-"You can now verify that the remapping worked. Get index of the vocabulary word \"the\" that is present both in base and new vocabulary and compare the embedding values. They should be equal."
+"You can now verify that the remapping worked. Get the index of the vocabulary word \"the\" that is present both in base and new vocabulary and compare the embedding values. They should be equal."
 ]
 },
 {
@@ -745,7 +745,7 @@
 "source": [
 "## Continue with warm-started training\n",
 "\n",
-"Notice how the training is warm-started. The accuracy of first epoch is around 85%. Close to the accuracy where the previous traning ended."
+"Notice how the training is warm-started. The accuracy of first epoch is around 85%. This is close to the accuracy where the previous training ended."
 ]
 },
 {
@@ -823,7 +823,6 @@
 "colab": {
 "collapsed_sections": [],
 "name": "warmstart_embedding_matrix.ipynb",
-"provenance": [],
 "toc_visible": true
 },
 "kernelspec": {
