697 | 697 | "source": [
698 | 698 | "### Checkpointing\n",
699 | 699 | "\n",
700 |     | - "You can checkpoint a DTensor model using `dtensor.DTensorCheckpoint`. The format of a DTensor checkpoint is fully compatible with a Standard TensorFlow Checkpoint. There is ongoing work to consolidate `dtensor.DTensorCheckpoint` into `tf.train.Checkpoint`.\n",
    | 700 | + "You can checkpoint a DTensor model using `tf.train.Checkpoint` out of the box. Saving and restoring sharded DVariables performs an efficient sharded save and restore. Currently, when using `tf.train.Checkpoint.save` and `tf.train.Checkpoint.restore`, all DVariables must be on the same host mesh, and DVariables and regular variables cannot be saved together. You can learn more about checkpointing in [this guide](../../guide/checkpoint.ipynb).\n",
701 | 701 | "\n",
702 |     | - "When a DTensor checkpoint is restored, `Layout`s of variables can be different from when the checkpoint is saved. This tutorial makes use of this feature to continue the training in the Model Parallel training and Spatial Parallel training sections.\n"
    | 702 | + "When a DTensor checkpoint is restored, the `Layout`s of the variables can be different from when the checkpoint was saved. That is, saving a DTensor model is layout- and mesh-agnostic; the layout only affects the efficiency of the sharded save. You can save a DTensor model with one mesh and layout and restore it with a different mesh and layout. This tutorial makes use of this feature to continue the training in the Model Parallel training and Spatial Parallel training sections.\n"
703 | 703 | ]
704 | 704 | },
705 | 705 | {
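
The layout-agnostic restore described in the cell above can be sketched as follows. This is a minimal illustration, not code from the notebook: `mesh_2d`, the layouts, and the variable names are hypothetical, and eight (virtual) devices are assumed to be available, as configured earlier in the tutorial.

```python
import tempfile

import tensorflow as tf
from tensorflow.experimental import dtensor

# Hypothetical 2x4 mesh; assumes 8 (virtual) devices are configured.
mesh_2d = dtensor.create_mesh([("batch", 2), ("model", 4)])

# Save a DVariable that is sharded along the "model" mesh dimension.
sharded = dtensor.DVariable(
    dtensor.call_with_layout(
        tf.zeros, dtensor.Layout(["model", dtensor.UNSHARDED], mesh_2d),
        shape=[8, 16]))
ckpt_prefix = tempfile.mkdtemp() + "/ckpt"
tf.train.Checkpoint(v=sharded).write(ckpt_prefix)

# Restore into a fully replicated DVariable: the Layout at restore time
# does not need to match the Layout that was used at save time.
replicated = dtensor.DVariable(
    dtensor.call_with_layout(
        tf.zeros, dtensor.Layout.replicated(mesh_2d, rank=2), shape=[8, 16]))
tf.train.Checkpoint(v=replicated).read(ckpt_prefix)
```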

712 | 712 | "source": [
713 | 713 | "CHECKPOINT_DIR = tempfile.mkdtemp()\n",
714 | 714 | "\n",
715 |     | - "def start_checkpoint_manager(mesh, model):\n",
716 |     | - "  ckpt = dtensor.DTensorCheckpoint(mesh, root=model)\n",
    | 715 | + "def start_checkpoint_manager(model):\n",
    | 716 | + "  ckpt = tf.train.Checkpoint(root=model)\n",
717 | 717 | "  manager = tf.train.CheckpointManager(ckpt, CHECKPOINT_DIR, max_to_keep=3)\n",
718 | 718 | "\n",
719 | 719 | "  if manager.latest_checkpoint:\n",
720 | 720 | "    print(\"Restoring a checkpoint\")\n",
721 | 721 | "    ckpt.restore(manager.latest_checkpoint).assert_consumed()\n",
722 | 722 | "  else:\n",
723 |     | - "    print(\"new training\")\n",
    | 723 | + "    print(\"New training\")\n",
724 | 724 | "  return manager\n"
725 | 725 | ]
726 | 726 | },
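
The training-loop cells that follow (their bodies are elided from this diff) all begin by calling this helper. For orientation, a typical `tf.train.CheckpointManager` loop looks like the sketch below; `train_data` and `train_step` are hypothetical stand-ins, not the notebook's actual names.

```python
# Illustrative sketch only: `MLP`, `mesh`, and `start_checkpoint_manager`
# come from this notebook; `train_data` and `train_step` are hypothetical.
model = MLP([dtensor.Layout([dtensor.UNSHARDED, dtensor.UNSHARDED], mesh),
             dtensor.Layout([dtensor.UNSHARDED, dtensor.UNSHARDED], mesh)])
manager = start_checkpoint_manager(model)

for epoch in range(2):
  for x, y in train_data:    # hypothetical per-epoch batch iterator
    train_step(model, x, y)  # hypothetical DTensor training step
  manager.save()             # rotates checkpoints, honoring max_to_keep=3
```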

746 | 746 | "outputs": [],
747 | 747 | "source": [
748 | 748 | "num_epochs = 2\n",
749 |     | - "manager = start_checkpoint_manager(mesh, model)\n",
    | 749 | + "manager = start_checkpoint_manager(model)\n",
750 | 750 | "\n",
751 | 751 | "for epoch in range(num_epochs):\n",
752 | 752 | "  step = 0\n",

839 | 839 | "outputs": [],
840 | 840 | "source": [
841 | 841 | "num_epochs = 2\n",
842 |     | - "manager = start_checkpoint_manager(mesh, model)\n",
    | 842 | + "manager = start_checkpoint_manager(model)\n",
843 | 843 | "\n",
844 | 844 | "for epoch in range(num_epochs):\n",
845 | 845 | "  step = 0\n",

932 | 932 | "source": [
933 | 933 | "num_epochs = 2\n",
934 | 934 | "\n",
935 |     | - "manager = start_checkpoint_manager(mesh, model)\n",
    | 935 | + "manager = start_checkpoint_manager(model)\n",
936 | 936 | "for epoch in range(num_epochs):\n",
937 | 937 | "  step = 0\n",
938 | 938 | "  metrics = {'epoch': epoch}\n",

956 | 956 | "source": [
957 | 957 | "## SavedModel and DTensor\n",
958 | 958 | "\n",
959 |     | - "The integration of DTensor and SavedModel is still under development. This section only describes the current status quo for TensorFlow 2.9.0.\n",
    | 959 | + "The integration of DTensor and SavedModel is still under development.\n",
960 | 960 | "\n",
961 |     | - "As of TensorFlow 2.9.0, `tf.saved_model` only accepts DTensor models with fully replicated variables.\n",
962 |     | - "\n",
963 |     | - "As a workaround, you can convert a DTensor model to a fully replicated one by reloading a checkpoint. However, after a model is saved, all DTensor annotations are lost and the saved signatures can only be used with regular Tensors, not DTensors."
    | 961 | + "As of TensorFlow `2.11`, `tf.saved_model` can save sharded and replicated DTensor models, and saving performs an efficient sharded save across the devices of the mesh. However, after a model is saved, all DTensor annotations are lost and the saved signatures can only be used with regular Tensors, not DTensors."
964 | 962 | ]
965 | 963 | },
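
The save-then-serve behavior just described can be illustrated with a self-contained sketch before the notebook's own export code below. Everything here (`TinyModel`, the one-dimensional mesh, the export path) is hypothetical, and the exact behavior may vary across TensorFlow versions.

```python
import tempfile

import tensorflow as tf
from tensorflow.experimental import dtensor

class TinyModel(tf.Module):
  """Hypothetical one-weight model backed by a replicated DVariable."""

  def __init__(self, layout):
    super().__init__()
    self.w = dtensor.DVariable(
        dtensor.call_with_layout(tf.ones, layout, shape=[4, 4]))

  @tf.function(input_signature=[tf.TensorSpec([None, 4], tf.float32)])
  def __call__(self, x):
    return tf.matmul(x, self.w)

mesh_1d = dtensor.create_mesh([("world", 8)])  # assumes 8 (virtual) devices
tiny = TinyModel(dtensor.Layout.replicated(mesh_1d, rank=2))

export_dir = tempfile.mkdtemp()
tf.saved_model.save(tiny, export_dir)

# After reloading, the DTensor annotations are gone: the saved signature
# is called with (and returns) regular Tensors.
loaded = tf.saved_model.load(export_dir)
print(loaded(tf.ones([2, 4])))
```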

966 | 964 | {

975 | 973 | "mlp = MLP([dtensor.Layout([dtensor.UNSHARDED, dtensor.UNSHARDED], mesh), \n",
976 | 974 | "           dtensor.Layout([dtensor.UNSHARDED, dtensor.UNSHARDED], mesh)])\n",
977 | 975 | "\n",
978 |     | - "manager = start_checkpoint_manager(mesh, mlp)\n",
    | 976 | + "manager = start_checkpoint_manager(mlp)\n",
979 | 977 | "\n",
980 | 978 | "model_for_saving = tf.keras.Sequential([\n",
981 | 979 | "    text_vectorization,\n",