
Commit 950bd35

Text classification notebook refactor
Minor comments in image classification
1 parent 0d15174 commit 950bd35

File tree: 2 files changed (+74, -45 lines)


notebooks/image_classification.ipynb

Lines changed: 10 additions & 3 deletions
Original file line number | Diff line number | Diff line change
@@ -622,9 +622,7 @@
622622
"cell_type": "markdown",
623623
"metadata": {},
624624
"source": [
625-
"Now, we are ready to upload production data.\n",
626-
"Production data can be uploaded asynchronously, that means that we can upload each data category whenever, it is available without waiting for the others.\n",
627-
"This is specially true for *target* data that usually are available with an amount of delay."
625+
"Now, we are ready to upload production data."
628626
]
629627
},
630628
{
@@ -643,6 +641,15 @@
643641
"client.wait_job_completion(job_id=job_id)\n",
644642
"print(f'Job {job_id} completed')"
645643
]
644+
},
645+
{
646+
"cell_type": "markdown",
647+
"metadata": {},
648+
"source": [
649+
"\n",
650+
"Notice that production data can be uploaded asynchronously, which means that we can upload each data category whenever it is available, without waiting for the others.\n",
651+
"This is especially true for *target* data, which are usually available with some delay.\n"
652+
]
646653
}
647654
],
648655
"metadata": {

notebooks/text_classification.ipynb

Lines changed: 64 additions & 42 deletions
Original file line number | Diff line number | Diff line change
@@ -6,11 +6,16 @@
66
"source": [
77
"# Text classification\n",
88
"\n",
9-
"This notebook shows how to use ML cube Platform with text data.\n",
10-
"We use a Huggingface dataset and trained model for Sentiment classification.\n",
11-
"The dataset contains train, validation and test sets, we use train as reference dataset while validation and test as production data.\n",
12-
"Of course, in a real scenario all those dataset will be part of historical/reference data and production will come from the production environment after the deployment of the algorithm.\n",
9+
"This notebook shows how to use the ML cube Platform with text data.\n",
10+
"We use a Huggingface dataset and a pre-trained model for sentiment classification. We load the validation data and split it into two parts, using the first as reference data and the second as production data.\n",
1311
"\n",
12+
"In a real-world scenario, the training and validation datasets would be considered historical/reference data, while production data would come from the production environment after the model's deployment."
13+
]
14+
},
15+
{
16+
"cell_type": "markdown",
17+
"metadata": {},
18+
"source": [
1419
"**With this example you will learn:**\n",
1520
"- how to create a text classification task\n",
1621
"- how to define a data schema\n",
@@ -21,9 +26,9 @@
2126
"\n",
2227
"**Requirements**\n",
2328
"\n",
24-
"In order to properly run this notebook the Python environment has those requirements.\n",
29+
"These are the dependencies your Python environment needs in order to properly run this notebook.\n",
2530
"```\n",
26-
"ml3-platform-sdk>=0.0.15\n",
31+
"ml3-platform-sdk>=0.0.22\n",
2732
"transformers[torch]==4.41.2\n",
2833
"torch==2.2.0\n",
2934
"datasets==2.15.0\n",
@@ -44,7 +49,16 @@
4449
},
4550
{
4651
"cell_type": "code",
47-
"execution_count": null,
52+
"execution_count": 1,
53+
"metadata": {},
54+
"outputs": [],
55+
"source": [
56+
"from tqdm import tqdm"
57+
]
58+
},
59+
{
60+
"cell_type": "code",
61+
"execution_count": 2,
4862
"metadata": {},
4963
"outputs": [],
5064
"source": [
@@ -72,7 +86,7 @@
7286
},
7387
{
7488
"cell_type": "code",
75-
"execution_count": null,
89+
"execution_count": 3,
7690
"metadata": {},
7791
"outputs": [],
7892
"source": [
@@ -101,27 +115,37 @@
101115
},
102116
{
103117
"cell_type": "code",
104-
"execution_count": null,
118+
"execution_count": 4,
105119
"metadata": {},
106120
"outputs": [],
107121
"source": [
108-
"dataset = load_dataset('cardiffnlp/tweet_eval', name='sentiment', )"
122+
"complete_dataset = load_dataset('cardiffnlp/tweet_eval', name='sentiment', split='validation[:50%]')"
109123
]
110124
},
111125
{
112126
"cell_type": "code",
113-
"execution_count": null,
127+
"execution_count": 5,
114128
"metadata": {},
115129
"outputs": [],
116130
"source": [
117-
"def sample_dataset(dataset, fraction=0.05, seed=42):\n",
131+
"def sample_dataset(dataset, reference_portion=0.5, first_production_portion=0.5, seed=42):\n",
118132
" sampled_dataset = DatasetDict()\n",
119-
" for split in dataset.keys():\n",
120-
" sampled_dataset[split] = dataset[split].train_test_split(test_size=fraction, seed=seed)['test']\n",
133+
" \n",
134+
" # Split the dataset into reference and production\n",
135+
"\n",
136+
" split = dataset.train_test_split(train_size=reference_portion, seed=seed)\n",
137+
"\n",
138+
" sampled_dataset['reference'] = split['train']\n",
139+
"\n",
140+
" split_2 = split['test'].train_test_split(train_size=first_production_portion, seed=seed)\n",
141+
"\n",
142+
" sampled_dataset['first_production'] = split_2['train']\n",
143+
" sampled_dataset['second_production'] = split_2['test']\n",
144+
"\n",
121145
" return sampled_dataset\n",
122146
"\n",
123147
"# Perform the sampling\n",
124-
"dataset = sample_dataset(dataset)"
148+
"dataset = sample_dataset(complete_dataset)"
125149
]
126150
},
127151
{
@@ -130,7 +154,7 @@
130154
"metadata": {},
131155
"outputs": [],
132156
"source": [
133-
"len(dataset['train']['text']), len(dataset['validation']['text']), len(dataset['test']['text'])"
157+
"len(dataset['reference']['text']), len(dataset['first_production']['text']), len(dataset['second_production']['text'])"
134158
]
135159
},
136160
{
@@ -160,7 +184,7 @@
160184
},
161185
{
162186
"cell_type": "code",
163-
"execution_count": null,
187+
"execution_count": 9,
164188
"metadata": {},
165189
"outputs": [],
166190
"source": [
@@ -175,7 +199,7 @@
175199
"We use local data sources to upload data, hence, we need to create local files that will be shared with ML cube Platform.\n",
176200
"With text data we can upload data in json as a list of objects containing three fields: timestamp, sample-id, text.\n",
177201
"Text data can consist of the text sequences alone or, optionally, of the text together with its embeddings.\n",
178-
"While target and predictions can be sent as csv tabular files.\n",
202+
"On the other hand, target and predictions can be sent as csv tabular files.\n",
179203
"\n",
180204
"In ML cube Platform data are uploaded separately for each category:\n",
181205
"- **inputs:** TextData object in json format\n",
@@ -185,13 +209,13 @@
185209
"\n",
186210
"When dealing with unstructured data like text it is possible to send them in three ways:\n",
187211
"1. By sending only embeddings i.e., a numerical representation of the text sample as a vector, using `EmbeddingData`;\n",
188-
"2. By sending only unstructured text, using `TextData`. In this case ML cube Platform will create the numerical representation using internal encoders;\n",
189-
"3. By sending ustructured text along with embeddings using `TextData` with `embedding_source` attribute. This more complete option has two benefits, the first is the usage of personal embedder that usually is focused on the domain instead of a general one; the other is using text to extract additional metrics and to have full capability in the web application."
212+
"2. By sending only the raw text, using `TextData`. In this case ML cube Platform will create the numerical representation using internal encoders;\n",
213+
"3. By sending the raw text along with the embeddings, using `TextData` with the `embedding_source` attribute. This more complete option has two benefits: it allows the usage of your own embedder, which is usually focused on the domain rather than a general one, and it enables the extraction of additional metrics from the text, providing more functionalities in the web application."
190214
]
191215
},
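
As a concrete illustration of option 3, the snippet below shows one common way to turn raw text into embedding vectors with the `transformers` library (mean pooling over token states). It is only a sketch: the checkpoint name is an arbitrary example, not the embedder used in this notebook, and the exact format in which ML cube Platform expects the embeddings is not shown here.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Example checkpoint only; any domain-specific encoder could be used instead.
checkpoint = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

texts = ["great product, works as expected", "arrived late and broken"]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (batch, seq_len, dim)

# Mean pooling over the non-padding tokens gives one vector per text sample.
mask = batch["attention_mask"].unsqueeze(-1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # (2, embedding_dim)
```
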
192216
{
193217
"cell_type": "code",
194-
"execution_count": null,
218+
"execution_count": 10,
195219
"metadata": {},
196220
"outputs": [],
197221
"source": [
@@ -342,14 +366,14 @@
342366
" starting_id,\n",
343367
" starting_timestamp,\n",
344368
") = build_data_objects(\n",
345-
" dataset['train'],\n",
369+
" dataset['reference'],\n",
346370
" sentiment_pip,\n",
347371
" embedder,\n",
348372
" model_name,\n",
349373
" model_version,\n",
350374
" training_initial_sample_id,\n",
351375
" training_initial_timestamp,\n",
352-
" 'train'\n",
376+
" 'reference'\n",
353377
")\n",
354378
"training_end_timestamp = starting_timestamp - 120"
355379
]
@@ -367,7 +391,7 @@
367391
" starting_id,\n",
368392
" starting_timestamp,\n",
369393
") = build_data_objects(\n",
370-
" dataset['validation'],\n",
394+
" dataset['first_production'],\n",
371395
" sentiment_pip,\n",
372396
" embedder,\n",
373397
" model_name,\n",
@@ -391,7 +415,7 @@
391415
" starting_id,\n",
392416
" starting_timestamp,\n",
393417
") = build_data_objects(\n",
394-
" dataset['test'],\n",
418+
" dataset['second_production'],\n",
395419
" sentiment_pip,\n",
396420
" embedder,\n",
397421
" model_name,\n",
@@ -410,18 +434,18 @@
410434
"\n",
411435
"The data schema specifies the types of data present in the task, together with their names.\n",
412436
"A data schema must contain:\n",
413-
"- *sample id* column that is used to uniquely identify each sample\n",
414-
"- *timestamp* column that is used to order samples\n",
415-
"- *input* column that specify the nature of the input. In this case TEXT\n",
416-
"- *input additional embedding* optional column for additional embedding of the text data\n",
417-
"- *target* column that specify the nature of the target. In this case categorical with three values\n",
437+
"- *sample id*, column that is used to uniquely identify each sample\n",
438+
"- *timestamp*, column that is used to order samples\n",
439+
"- *input*, column that specifies the nature of the input. In this case, it's a string, as we are dealing with text data.\n",
440+
"- *input additional embedding*, optional column for the embedding of the text data\n",
441+
"- *target*, column that specifies the nature of the target. In this case, categorical with three possible values\n",
418442
"\n",
419-
"Prediction column must not be specified because it will be automatically added during the model creation with the name like MODEL_NAME@MODEL_VERSION"
443+
"The prediction column must not be specified because it will be automatically added during the model creation, with a name like `MODEL_NAME@MODEL_VERSION`"
420444
]
421445
},
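
Purely as an illustration of what the schema above describes (and not the SDK's own data schema object), a quick sanity check on a dataframe with hypothetical column names might look like this:

```python
import pandas as pd

# Hypothetical column names used only for this sketch.
df = pd.DataFrame(
    {
        "sample_id": ["s0", "s1", "s2"],                    # unique id per sample
        "timestamp": [1717000000, 1717000060, 1717000120],  # used to order samples
        "text": ["good", "bad", "okay"],                    # text input
        "target": [2, 0, 1],                                # categorical, three classes
    }
)

assert df["sample_id"].is_unique
assert df["timestamp"].is_monotonic_increasing
assert set(df["target"]) <= {0, 1, 2}
```
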
422446
{
423447
"cell_type": "code",
424-
"execution_count": null,
448+
"execution_count": 14,
425449
"metadata": {},
426450
"outputs": [],
427451
"source": [
@@ -491,13 +515,13 @@
491515
"- **optional_target:** if the target can be missing for production data.\n",
492516
"  We assume that reference data, being the training data, always have the target.\n",
493517
"  However, it is possible that for other historical data or for production data the target is not available; enabling this option, ML cube Platform does not force its presence and the jobs will not break.\n",
494-
"- **text_language:** in case of Text data structure it is mandatory to specify the language used in the task.\n",
495-
"- **cost_info**, this optional field allows to specify the costs of the error of the model and will be used during the retraining report computation."
518+
"- **text_language:** in case of text data, it is mandatory to specify the language used in the task."
496520
]
497521
},
498522
{
499523
"cell_type": "code",
500-
"execution_count": null,
524+
"execution_count": 16,
501525
"metadata": {},
502526
"outputs": [],
503527
"source": [
@@ -514,7 +538,7 @@
514538
},
515539
{
516540
"cell_type": "code",
517-
"execution_count": null,
541+
"execution_count": 17,
518542
"metadata": {},
519543
"outputs": [],
520544
"source": [
@@ -525,14 +549,12 @@
525549
"cell_type": "markdown",
526550
"metadata": {},
527551
"source": [
528-
"After we added the data schema, we are able to create our model.\n",
529-
"\n",
530552
"A model is uniquely identified by `name` and `model version`."
531553
]
532554
},
533555
{
534556
"cell_type": "code",
535-
"execution_count": null,
557+
"execution_count": 18,
536558
"metadata": {},
537559
"outputs": [],
538560
"source": [
@@ -553,7 +575,7 @@
553575
"Training data are uploaded as historical data i.e., any data that do not come from production.\n",
554576
"Then we indicate them as reference data of our model in order to set up the detection algorithms.\n",
555577
"\n",
556-
"Note that it is possible to add other historical data in any time."
578+
"Note that it is possible to add other historical data at any time."
557579
]
558580
},
559581
{
@@ -593,8 +615,8 @@
593615
"metadata": {},
594616
"source": [
595617
"Now, we are ready to upload production data.\n",
596-
"Production data can be uploaded asynchronously, that means that we can upload each data category whenever, it is available without waiting for the others.\n",
597-
"This is specially true for *target* data that usually are available with an amount of delay."
618+
"Notice that production data can be uploaded asynchronously, which means that we can upload each data category whenever it is available, without waiting for the others.\n",
619+
"This is especially true for *target* data, which are usually available with some delay."
598620
]
599621
},
600622
{
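
To make the asynchronous flow tangible, here is a small sketch assuming the field names described earlier (timestamp, sample-id, text for the inputs JSON and a csv file for the targets); the exact layout expected by ML cube Platform and the SDK upload calls are not shown. Inputs can be written and uploaded at serving time, while the target file is produced later, once the ground-truth labels arrive, matched back to the same sample ids.

```python
import csv
import json
import time

texts = ["great product, works as expected", "arrived late and broken"]
now = int(time.time())

# Inputs are available immediately and can be uploaded right away.
inputs = [
    {"timestamp": now + i, "sample-id": f"prod_{i}", "text": t}
    for i, t in enumerate(texts)
]
with open("production_inputs.json", "w") as f:
    json.dump(inputs, f, indent=2)

# Ground-truth labels typically arrive later; when they do, the target category
# is uploaded on its own, keyed by the same sample ids.
late_labels = {"prod_0": 2, "prod_1": 0}
with open("production_targets.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["sample-id", "target"])
    for sample_id, label in late_labels.items():
        writer.writerow([sample_id, label])
```
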
