|
45 | 45 | { |
46 | 46 | "cell_type": "markdown", |
47 | 47 | "source": [ |
48 | | - "# Generate text embeddings by using Hugging Face Hub models\n", |
| 48 | + "# Generate text embeddings by using the EmbeddingGemma model from Hugging Face\n", |
49 | 49 | "\n", |
50 | 50 | "<table align=\"left\">\n", |
51 | 51 | " <td>\n", |
|
75 | 75 | "\n", |
76 | 76 | "This notebook uses Apache Beam's `MLTransform` to generate embeddings from text data.\n", |
77 | 77 | "\n", |
| 78 | + "Using a small, highly efficient open model like EmbeddingGemma at the core of your pipeline makes the entire process self-contained, which can simplify management by eliminating the need for external network calls to other services for the embedding step. Because it's an open model, it can be hosted entirely within Dataflow, letting you securely process large-scale, private datasets. For more information about the model, see the [model card](https://huggingface.co/google/embeddinggemma-300m).\n", |
| 79 | + "\n", |
78 | 80 | "Hugging Face's [`SentenceTransformers`](https://huggingface.co/sentence-transformers) framework uses Python to generate sentence, text, and image embeddings.\n", |
79 | 81 | "\n", |
80 | 82 | "To generate text embeddings that use Hugging Face models and `MLTransform`, use the `SentenceTransformerEmbeddings` module to specify the model configuration.\n" |
|
120 | 122 | "execution_count": 29, |
121 | 123 | "outputs": [] |
122 | 124 | }, |
| 125 | + { |
| 126 | + "cell_type": "markdown", |
| 127 | + "source": [ |
| 128 | + "### Authenticate with Hugging Face\n", |
| 129 | + "\n", |
| 130 | + "To ensure that you can download the correct model, authenticate with Hugging Face by following the prompts in the next cell." |
| 131 | + ], |
| 132 | + "metadata": { |
| 133 | + "id": "kXDM8C7d3nPW" |
| 134 | + } |
| 135 | + }, |
| 136 | + { |
| 137 | + "cell_type": "code", |
| 138 | + "source": [ |
| 139 | + "!hf auth login" |
| 140 | + ], |
| 141 | + "metadata": { |
| 142 | + "id": "jVxSi2jS3M3c" |
| 143 | + }, |
| 144 | + "execution_count": null, |
| 145 | + "outputs": [] |
| 146 | + }, |
123 | 147 | { |
124 | 148 | "cell_type": "markdown", |
125 | 149 | "source": [ |
|
170 | 194 | " {'x': \"Should I sign up for Medicare Part B if I have Veterans' Benefits?\"}\n", |
171 | 195 | "]\n", |
172 | 196 | "\n", |
173 | | - "text_embedding_model_name = 'sentence-transformers/sentence-t5-large'\n", |
| 197 | + "text_embedding_model_name = 'google/embeddinggemma-300m'\n", |
174 | 198 | "\n", |
175 | 199 | "\n", |
176 | 200 | "# helper function that returns a dict containing only first\n", |
|
191 | 215 | "source": [ |
192 | 216 | "\n", |
193 | 217 | "### Generate text embeddings\n", |
194 | | - "This example uses the model `sentence-transformers/sentence-t5-large` to generate text embeddings. The model uses only the encoder from a `T5-large model`. The weights are stored in FP16. For more information about the model, see [Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models](https://arxiv.org/abs/2108.08877)." |
| 218 | + "This example uses the model `google/embeddinggemma-300m` to generate text embeddings. For more information about the model, see [the model card](https://huggingface.co/google/embeddinggemma-300m)." |
195 | 219 | ], |
196 | 220 | "metadata": { |
197 | 221 | "id": "SApMmlRLRv_e" |
|