|
132 | 132 | "# @markdown 3. For serving, **[click here](https://console.cloud.google.com/iam-admin/quotas?location=us-central1&metric=aiplatform.googleapis.com%2Fcustom_model_serving_nvidia_l4_gpus)** to check if your project already has the required 1 L4 GPU in the us-central1 region. If yes, then run this notebook in the us-central1 region. If you need more L4 GPUs for your project, then you can follow [these instructions](https://cloud.google.com/docs/quotas/view-manage#viewing_your_quota_console) to request more. Alternatively, if you want to run predictions with A100 80GB or H100 GPUs, we recommend using the regions listed below. **NOTE:** Make sure you have associated quota in selected regions. Click the links to see your current quota for each GPU type: [Nvidia A100 80GB](https://console.cloud.google.com/iam-admin/quotas?metric=aiplatform.googleapis.com%2Fcustom_model_serving_nvidia_a100_80gb_gpus), [Nvidia H100 80GB](https://console.cloud.google.com/iam-admin/quotas?metric=aiplatform.googleapis.com%2Fcustom_model_serving_nvidia_h100_gpus).\n", |
133 | 133 | "\n", |
134 | 134 | "# @markdown > | Machine Type | Accelerator Type | Recommended Regions |\n", |
135 | | - "# @markdown | ----------- | ----------- | ----------- | \n", |
| 135 | + "# @markdown | ----------- | ----------- | ----------- |\n", |
136 | 136 | "# @markdown | a2-ultragpu-1g | 1 NVIDIA_A100_80GB | us-central1, us-east4, europe-west4, asia-southeast1, us-east4 |\n", |
137 | | - "# @markdown | a3-highgpu-2g | 2 NVIDIA_H100_80GB | us-west1, asia-southeast1 |\n", |
138 | | - "# @markdown | a3-highgpu-4g | 4 NVIDIA_H100_80GB | us-west1, asia-southeast1 |\n", |
139 | 137 | "# @markdown | a3-highgpu-8g | 8 NVIDIA_H100_80GB | us-central1, us-west1, europe-west4, asia-southeast1 |\n", |
140 | 138 | "\n", |
141 | 139 | "# @markdown 4. **[Optional]** [Create a Cloud Storage bucket](https://cloud.google.com/storage/docs/creating-buckets) for storing experiment outputs. Set the BUCKET_URI for the experiment environment. The specified Cloud Storage bucket (`BUCKET_URI`) should be located in the same region as where the notebook was launched. Note that a multi-region bucket (eg. \"us\") is not considered a match for a single region covered by the multi-region range (eg. \"us-central1\"). If not set, a unique GCS bucket will be created instead.\n", |
|
180 | 178 | "# Cloud Storage bucket for storing the experiment artifacts.\n", |
181 | 179 | "# A unique GCS bucket will be created for the purpose of this notebook. If you\n", |
182 | 180 | "# prefer using your own GCS bucket, change the value yourself below.\n", |
183 | | - "now = datetime.now().strftime(\"%Y%m%d%H%M%S\")\n", |
| 181 | + "now = datetime.datetime.now().strftime(\"%Y%m%d%H%M%S\")\n", |
184 | 182 | "BUCKET_NAME = \"/\".join(BUCKET_URI.split(\"/\")[:3])\n", |
185 | 183 | "\n", |
186 | 184 | "if BUCKET_URI is None or BUCKET_URI.strip() == \"\" or BUCKET_URI == \"gs://\":\n", |
|
582 | 580 | "outputs": [], |
583 | 581 | "source": [ |
584 | 582 | "# @title Run TensorBoard\n", |
585 | | - "# @markdown This section shows how to launch TensorBoard in a [Cloud Shell](https://cloud.google.com/shell/docs).\n", |
586 | | - "# @markdown 1. Click the Cloud Shell icon() on the top right to open the Cloud Shell.\n", |
587 | | - "# @markdown 2. Copy the `tensorboard` command shown below by running this cell.\n", |
588 | | - "# @markdown 3. Paste and run the command in the Cloud Shell to launch TensorBoard.\n", |
589 | | - "# @markdown 4. Once the command runs (You may have to click `Authorize` if prompted), click the link starting with `http://localhost`.\n", |
590 | | - "\n", |
| 583 | + "# @markdown This section launches TensorBoard and displays it. You can re-run the cell to display an updated information about the training job.\n", |
| 584 | + "# @markdown See the link to the training job in the above cell to see the status of the Custom Training Job.\n", |
591 | 585 | "# @markdown Note: You may need to wait around 10 minutes after the job starts in order for the TensorBoard logs to be written to the GCS bucket.\n", |
592 | | - "print(f\"Command to copy: tensorboard --logdir {base_output_dir}/logs\")\n" |
| 586 | + "\n", |
| 587 | + "now = datetime.datetime.now(tz=datetime.timezone.utc)\n", |
| 588 | + "\n", |
| 589 | + "if train_job.end_time is not None:\n", |
| 590 | + " min_since_end = int((now - train_job.end_time).total_seconds() // 60)\n", |
| 591 | + " print(f\"Training Job finished {min_since_end} minutes ago.\")\n", |
| 592 | + "\n", |
| 593 | + "if train_job.has_failed:\n", |
| 594 | + " print(\n", |
| 595 | + " \"The job has failed. See the link to the training job in the above cell to see the logs.\"\n", |
| 596 | + " )\n", |
| 597 | + "\n", |
| 598 | + "%tensorboard --logdir {base_output_dir}/logs" |
593 | 599 | ] |
594 | 600 | }, |
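The new cell above relies on the `%tensorboard` line magic, which is only available after TensorBoard's notebook extension has been loaded in the kernel. If an earlier cell has not already done so, a minimal sketch of the required setup:

```python
# Load TensorBoard's notebook extension once per kernel session;
# the %tensorboard magic is unavailable until this has run.
%load_ext tensorboard

# Then the magic from the cell above works as expected.
%tensorboard --logdir {base_output_dir}/logs
```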
595 | 601 | { |
|
819 | 825 | "# endpoint = aiplatform.Endpoint(aip_endpoint_name)\n", |
820 | 826 | "\n", |
821 | 827 | "prompt = \"What is a car?\" # @param {type: \"string\"}\n", |
822 | | - "# @markdown If you encounter the issue like `ServiceUnavailable: 503 Took too long to respond when processing`, you can reduce the maximum number of output tokens, by lowering `max_tokens`.\n", |
| 828 | + "# @markdown If you encounter an issue like `ServiceUnavailable: 503 Took too long to respond when processing`, you can reduce the maximum number of output tokens, by lowering `max_tokens`.\n", |
823 | 829 | "max_tokens = 50 # @param {type:\"integer\"}\n", |
824 | 830 | "temperature = 1.0 # @param {type:\"number\"}\n", |
825 | 831 | "top_p = 1.0 # @param {type:\"number\"}\n", |
826 | 832 | "top_k = 1 # @param {type:\"integer\"}\n", |
| 833 | + "# @markdown Set `raw_response` to `True` to obtain the raw model output. Set `raw_response` to `False` to apply additional formatting in the structure of `\"Prompt:\\n{prompt.strip()}\\nOutput:\\n{output}\"`.\n", |
827 | 834 | "raw_response = False # @param {type:\"boolean\"}\n", |
828 | 835 | "\n", |
829 | 836 | "# Overrides parameters for inferences.\n", |
|