|
15 | 15 | "- context: the context retrieved by the retrieval system\n", |
16 | 16 | "- answer: the output of the model that tries to answer the question\n", |
17 | 17 | "\n", |
18 | | - "The presence of these 3 elements allows us to simulate a full RAG system without actually setting up the system.\n" |
| 18 | + "The presence of these 3 elements allows us to simulate a full RAG system without actually setting up the system." |
19 | 19 | ] |
20 | 20 | }, |
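> Editor's note: to make the three fields concrete, here is a minimal sketch (not part of the notebook itself) that pulls a single record from the same Hugging Face dataset and prints its `question`, `context` and `answer`; the tiny split slice is only there to keep the download small.

```python
from datasets import load_dataset

# Peek at one record of the dataset used later in the notebook to see the
# three fields the RAG simulation relies on: question, context and answer.
sample = load_dataset("neural-bridge/rag-dataset-12000", split="train[:1]")[0]

print("question:", sample["question"])
print("context: ", sample["context"][:200], "...")
print("answer:  ", sample["answer"])
```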
21 | 21 | { |
|
34 | 34 | "\n", |
35 | 35 | "These are the dependencies your Python environment is required to have in order to properly run this notebook.\n", |
36 | 36 | "```\n", |
37 | | - "ml3-platform-sdk>=0.0.17\n", |
| 37 | + "ml3-platform-sdk>=0.0.22\n", |
38 | 38 | "torch==2.2.0\n", |
39 | 39 | "datasets==2.15.0\n", |
40 | 40 | "sentence-transformers==3.0.1\n", |
|
54 | 54 | }, |
55 | 55 | { |
56 | 56 | "cell_type": "code", |
57 | | - "execution_count": null, |
| 57 | + "execution_count": 1, |
58 | 58 | "metadata": {}, |
59 | 59 | "outputs": [], |
60 | 60 | "source": [ |
|
77 | 77 | "User Inputs" |
78 | 78 | ] |
79 | 79 | }, |
| 80 | + { |
| 81 | + "cell_type": "code", |
| 82 | + "execution_count": 2, |
| 83 | + "metadata": {}, |
| 84 | + "outputs": [], |
| 85 | + "source": [ |
| 86 | + "URL = 'https://api.platform.mlcube.com'\n", |
| 87 | + "API_KEY = \"\"\n", |
| 88 | + "PROJECT_ID = ''\n", |
| 89 | + "model_name = 'mymodel'\n", |
| 90 | + "model_version = 'v0.0.1'" |
| 91 | + ] |
| 92 | + }, |
80 | 93 | { |
81 | 94 | "cell_type": "markdown", |
82 | 95 | "metadata": {}, |
83 | 96 | "source": [ |
84 | 97 | "## Dataset, model and predictions\n", |
85 | | - "Download dataset and model using Huggingface api.\n", |
86 | | - "After the dataset and the model are downloaded we run the model to get predictions." |
| 98 | + "Download dataset and model using Huggingface api." |
87 | 99 | ] |
88 | 100 | }, |
89 | 101 | { |
|
95 | 107 | }, |
96 | 108 | { |
97 | 109 | "cell_type": "code", |
98 | | - "execution_count": null, |
| 110 | + "execution_count": 3, |
99 | 111 | "metadata": {}, |
100 | 112 | "outputs": [], |
101 | 113 | "source": [ |
102 | | - "complete_dataset = load_dataset(\"neural-bridge/rag-dataset-12000\")\n", |
103 | | - "\n", |
104 | 114 | "USER_INPUT_COL_NAME = 'question'\n", |
105 | 115 | "CONTEXT_COL_NAME = 'context'\n", |
106 | | - "ANSWER_COL_NAME = 'answer'" |
| 116 | + "ANSWER_COL_NAME = 'answer'\n", |
| 117 | + "\n", |
| 118 | + "complete_dataset = load_dataset(\"neural-bridge/rag-dataset-12000\", split=\"train[:10%]\").filter(lambda x: all(x[col] is not None for col in [USER_INPUT_COL_NAME, CONTEXT_COL_NAME, ANSWER_COL_NAME]))" |
107 | 119 | ] |
108 | 120 | }, |
109 | 121 | { |
110 | 122 | "cell_type": "code", |
111 | | - "execution_count": null, |
| 123 | + "execution_count": 4, |
112 | 124 | "metadata": {}, |
113 | 125 | "outputs": [], |
114 | 126 | "source": [ |
115 | | - "def sample_dataset(dataset, fraction=0.1, seed=42):\n", |
| 127 | + "def sample_dataset(dataset, reference_portion=0.5, first_production_portion=0.5, seed=42):\n", |
116 | 128 | " sampled_dataset = DatasetDict()\n", |
117 | 129 | " \n", |
118 | | - " # Split train data\n", |
119 | | - " train_split = dataset['train'].train_test_split(test_size=0.5, seed=seed)\n", |
120 | | - " \n", |
121 | | - " sampled_dataset['train'] = train_split['train'].train_test_split(test_size=fraction, seed=seed)['test']\n", |
122 | | - " sampled_dataset['validation'] = train_split['test'].train_test_split(test_size=fraction, seed=seed)['test']\n", |
123 | | - " \n", |
124 | | - " # Split test data\n", |
125 | | - " sampled_dataset['test'] = dataset['test'].train_test_split(test_size=fraction, seed=seed)['test']\n", |
| 130 | + " # Split the dataset into reference and production\n", |
| 131 | + "\n", |
| 132 | + " split = dataset.train_test_split(test_size=reference_portion, seed=seed)\n", |
| 133 | + "\n", |
| 134 | + " sampled_dataset['reference'] = split['train']\n", |
| 135 | + "\n", |
| 136 | + " split_2 = split['test'].train_test_split(test_size=first_production_portion, seed=seed)\n", |
| 137 | + "\n", |
| 138 | + " sampled_dataset['first_production'] = split_2['train']\n", |
| 139 | + " sampled_dataset['second_production'] = split_2['test']\n", |
| 140 | + "\n", |
126 | 141 | " return sampled_dataset\n", |
127 | 142 | "\n", |
128 | 143 | "# Perform the sampling\n", |
|
135 | 150 | "metadata": {}, |
136 | 151 | "outputs": [], |
137 | 152 | "source": [ |
138 | | - "len(dataset['train']['context']), len(dataset['validation']['context']), len(dataset['test']['context'])" |
| 153 | + "len(dataset[\"reference\"]), len(dataset[\"first_production\"]), len(dataset[\"second_production\"])" |
139 | 154 | ] |
140 | 155 | }, |
141 | 156 | { |
|
156 | 171 | "\n", |
157 | 172 | "Uploading data coming from a RAG system is equivalent to uploading text data (refer to this [notebook](https://colab.research.google.com/github/ml-cube/ml3-platform-docs/blob/main/notebooks/text_classification.ipynb) for further information). \n", |
158 | 173 | "\n", |
159 | | - "Data needs to be stored in a json file as a list of objects. Each object must contain two mandatory fields, namely the timestamp and the sample-id, along with other the other fields that represent the data (e.g. question and context for input data, answer for predcition data).\n", |
| 174 | + "Data needs to be stored in a json file as a list of objects. Each object must contain two mandatory fields, namely the timestamp and the sample-id, along with other the other fields that represent the data (e.g. question and context for input data, answer for prediction data).\n", |
160 | 175 | "\n", |
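> Editor's note: a purely illustrative sketch of that layout (key names such as `timestamp` and `sample_id` and all values below are placeholders, not taken from the notebook; the actual names must match the task configuration). Such a file could be written like this:

```python
import json

# One object per sample: the two mandatory fields plus the fields that
# carry the data itself. Keys and values are illustrative placeholders.
records = [
    {
        "timestamp": 1718000000.0,                      # mandatory
        "sample_id": "reference_0",                     # mandatory
        "question": "Who wrote the quarterly report?",  # user input
        "context": "The quarterly report was written by the data team.",  # retrieved context
        "answer": "The data team wrote it.",            # model prediction
    }
]

with open("rag_samples.json", "w") as fp:
    json.dump(records, fp, indent=2)
```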
161 | 176 | "When dealing with unstructured data like text it is possible to send them in three ways:\n", |
162 | 177 | "1. By sending only embeddings i.e., a numerical representation of the text sample as a vector, using `EmbeddingData`;\n", |
|
177 | 192 | }, |
178 | 193 | { |
179 | 194 | "cell_type": "code", |
180 | | - "execution_count": null, |
| 195 | + "execution_count": 7, |
181 | 196 | "metadata": {}, |
182 | 197 | "outputs": [], |
183 | 198 | "source": [ |
|
333 | 348 | " starting_id,\n", |
334 | 349 | " starting_timestamp,\n", |
335 | 350 | ") = build_data_objects(\n", |
336 | | - " dataset['train'],\n", |
| 351 | + " dataset['reference'],\n", |
337 | 352 | " embedder,\n", |
338 | 353 | " model_name,\n", |
339 | 354 | " model_version,\n", |
340 | 355 | " historical_initial_sample_id,\n", |
341 | 356 | " historical_initial_timestamp,\n", |
342 | | - " prefix='train',\n", |
| 357 | + " prefix='reference',\n", |
343 | 358 | " with_prediction=True\n", |
344 | 359 | ")\n", |
345 | 360 | "historical_end_timestamp = starting_timestamp - 120" |
|
357 | 372 | " starting_id,\n", |
358 | 373 | " starting_timestamp,\n", |
359 | 374 | ") = build_data_objects(\n", |
360 | | - " dataset['validation'],\n", |
| 375 | + " dataset['first_production'],\n", |
361 | 376 | " embedder,\n", |
362 | 377 | " model_name,\n", |
363 | 378 | " model_version,\n", |
|
379 | 394 | " starting_id,\n", |
380 | 395 | " starting_timestamp,\n", |
381 | 396 | ") = build_data_objects(\n", |
382 | | - " dataset['test'],\n", |
| 397 | + " dataset['second_production'],\n", |
383 | 398 | " embedder,\n", |
384 | 399 | " model_name,\n", |
385 | 400 | " model_version,\n", |
|
407 | 422 | }, |
408 | 423 | { |
409 | 424 | "cell_type": "code", |
410 | | - "execution_count": null, |
| 425 | + "execution_count": 11, |
411 | 426 | "metadata": {}, |
412 | 427 | "outputs": [], |
413 | 428 | "source": [ |
|
485 | 500 | "- **task_type:** artificial intelligence task type. In this case it is a RAG task.\n", |
486 | 501 | "- **data_structure:** the type of input data. Since we are dealing with text data, we set it to TEXT.\n", |
487 | 502 | "- **optional_target:** rag tasks don't have a target, hence it must be set to True\n", |
488 | | - "- **text_language:** it is mandatory to specify the language used in the task." |
| 503 | + "- **text_language:** it is mandatory to specify the language used in the task.\n", |
| 504 | + "- **rag_context_separator**: the string used to separate different contexts. In this case it is None as the context is composed of a single sentence." |
489 | 505 | ] |
490 | 506 | }, |
491 | 507 | { |
492 | 508 | "cell_type": "code", |
493 | | - "execution_count": null, |
| 509 | + "execution_count": 13, |
494 | 510 | "metadata": {}, |
495 | 511 | "outputs": [], |
496 | 512 | "source": [ |
|
501 | 517 | " task_type=ml3_enums.TaskType.RAG,\n", |
502 | 518 | " data_structure=ml3_enums.DataStructure.TEXT,\n", |
503 | 519 | " optional_target=True, # Must be True in RAG tasks\n", |
504 | | - " text_language=ml3_enums.TextLanguage.ENGLISH\n", |
| 520 | + " text_language=ml3_enums.TextLanguage.ENGLISH,\n", |
| 521 | + " rag_contexts_separator=None,\n", |
505 | 522 | ")" |
506 | 523 | ] |
507 | 524 | }, |
508 | 525 | { |
509 | 526 | "cell_type": "code", |
510 | | - "execution_count": null, |
| 527 | + "execution_count": 14, |
511 | 528 | "metadata": {}, |
512 | 529 | "outputs": [], |
513 | 530 | "source": [ |
|
525 | 542 | }, |
526 | 543 | { |
527 | 544 | "cell_type": "code", |
528 | | - "execution_count": null, |
| 545 | + "execution_count": 15, |
529 | 546 | "metadata": {}, |
530 | 547 | "outputs": [], |
531 | 548 | "source": [ |
|
543 | 560 | "cell_type": "markdown", |
544 | 561 | "metadata": {}, |
545 | 562 | "source": [ |
546 | | - "Historical data are available data that don't come from the production environment. We need to add them in order to set up the reference data of our model, which is needed by internal algorithms.\n", |
| 563 | + "Reference data are uploaded as historical data i.e., any data that do not come from production.\n", |
| 564 | + "Reference data are defined by their initial and final timestamps and they are used to configure the detection algorithms.\n", |
547 | 565 | "\n", |
548 | 566 | "Note that it is possible to add other historical data in any time." |
549 | 567 | ] |
|
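> Editor's note: purely as an illustration of the timestamp-based definition above (no SDK call is made; the variable names and values are made up for the sketch), the reference window is nothing more than an initial/final pair of timestamps that brackets the historical batch:

```python
import time

# Hypothetical reference window: historical data uploaded for the model is
# bracketed by an initial and a final UNIX timestamp.
reference_start = time.time() - 7 * 24 * 3600   # e.g. one week ago
reference_end = time.time() - 24 * 3600          # e.g. one day ago

def in_reference_window(ts: float) -> bool:
    """True if a sample timestamp falls inside the reference window."""
    return reference_start <= ts <= reference_end

print(in_reference_window(time.time()))                  # False: production sample
print(in_reference_window(time.time() - 3 * 24 * 3600))  # True: historical sample
```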