
Commit 5d7c4c2

rchao authored and copybara-github committed
Update parameter server strategy doc to use legacy optimizer in order to use a constant learning rate.
PiperOrigin-RevId: 531325144
1 parent ba87988 commit 5d7c4c2

File tree

1 file changed: +5 -1 lines changed

site/en/tutorials/distribute/parameter_server_training.ipynb

Lines changed: 5 additions & 1 deletion
@@ -1292,7 +1292,11 @@
 "One common reason is that the parameter servers have unbalanced load and some heavily-loaded parameter servers have reached capacity. There can also be multiple root causes. Some simple methods to mitigate this issue are to:\n",
 "\n",
 "1. Shard your large model variables via specifying a `variable_partitioner` when constructing a `ParameterServerStrategy`.\n",
-"2. Avoid creating a hotspot variable that is required by all parameter servers in a single step if possible. For example, use a constant learning rate or subclass `tf.keras.optimizers.schedules.LearningRateSchedule` in optimizers since the default behavior is that the learning rate will become a variable placed on a particular parameter server and requested by all other parameter servers in each step.\n",
+"2. Avoid creating a hotspot variable that is required by all parameter servers in a single step, by both:\n",
+"\n",
+" 1) Using a constant learning rate or subclassing `tf.keras.optimizers.schedules.LearningRateSchedule` in optimizers. This is because the default behavior is that the learning rate will become a variable placed on a particular parameter server and requested by all other parameter servers in each step; and\n",
+"\n",
+" 2) Using a `tf.keras.optimizers.legacy.Optimizer` (the standard `tf.keras.optimizers.Optimizer`s could still lead to hotspot variables).\n",
 "3. Shuffle your large vocabularies before passing them to Keras preprocessing layers.\n",
 "\n",
 "Another possible reason for performance issues is the coordinator. The implementation of `schedule`/`join` is Python-based and thus may have threading overhead. Also, the latency between the coordinator and the workers can be large. If this is the case:\n",
