gretelai · Marjan-emd · Aug 12, 2025 · Aug 13, 2025 · Aug 13, 2025
diff --git a/.github/workflows/validate_configs.yml b/.github/workflows/validate_configs.yml
@@ -27,7 +27,7 @@ jobs:
         if: ${{ github.base_ref != 'main' }}
         env:
           GRETEL_API_KEY: ${{ secrets.GRETEL_DEV_API_KEY }}
-          GRETEL_CLOUD_URL: "https://api-dev.gretel.cloud"
+          GRETEL_CLOUD_URL: "https://api.dev.gretel.ai"
 
       # Run tests in PROD
       - name: Unit tests

diff --git a/docs/notebooks/safe-synthetics/data-fidelity-101.ipynb b/docs/notebooks/safe-synthetics/data-fidelity-101.ipynb
@@ -0,0 +1,390 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "fOLFudd5CnhK"
+      },
+      "source": [
+        "<a target=\"_parent\" href=\"https://colab.research.google.com/github/gretelai/gretel-blueprints/blob/main/docs/notebooks/safe-synthetics/data-fidelity-101.ipynb\">\n",
+        "  <img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/>\n",
+        "</a>\n",
+        "\n",
+        "\n",
+        "# 🎯 Using Data Fidelity for Safe Synthetics:\n",
+        "\n",
+        "In this notebook, we showcase the M1 requirements of **data fidelity**: a set of rules ensuring that your synthetic data **follows specific constraints** as expected.\n",
+        "\n",
+        "The **TabFT model** works well out of the box, but sometimes your data needs to be **extra precise**. That’s where **data fidelity** comes in!\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "XizoDHIqCnhM"
+      },
+      "source": [
+        "**The Four Golden Rules of Data Fidelity (M1 requirements)**\n",
+        "\n",
+        "1️⃣ **Rule #1:** Values in **Column A** should be embedded somewhere in **Column B**. For example, an email address should include the person's first and last name.\n",
+        "\n",
+        "2️⃣ **Rule #2:** **Column A <= Column B**. For example, admit date should always be <= discharge date.  \n",
+        "\n",
+        "3️⃣ **Rule #3:** **DateTime minimum < Column A (DateTime) < DateTime maximum**. Enforce minimum and maximum constraints on datetime values. For example, transactions should occur between certain dates, such as 1/1/2024 and 10/25/2024.  \n",
+        "\n",
+        "4️⃣ **Rule #4:** Generate **highly unique values**, such as those in an `ID` column.\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "Wb_yS_gwCnhM"
+      },
+      "source": [
+        "## 💾 Install Gretel SDK"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "APan6WVaCnhM",
+        "outputId": "06747c83-9c89-49d0-a1ec-ee22e7e9935a"
+      },
+      "outputs": [],
+      "source": [
+        "%%capture\n",
+        "\n",
+        "%pip install -U gretel-client"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## 🌐 Configure your Gretel Session"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "from gretel_client.navigator_client import Gretel\n",
+        "\n",
+        "gretel = Gretel(api_key=\"prompt\")"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "s8GPEIW-CnhN"
+      },
+      "source": [
+        "## 🔬 Preview input data\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 236
+        },
+        "id": "u8PoIqkVCnhN",
+        "outputId": "2d17f068-1f82-4ba6-fef0-3f81621d2bbf"
+      },
+      "outputs": [],
+      "source": [
+        "import pandas as pd\n",
+        "\n",
+        "data_source = \"https://gretel-datasets.s3.us-west-2.amazonaws.com/car_accident_5k.csv\" # cited papers: [Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath. “A Countrywide Traffic Accident Dataset.”, 2019. & Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, Radu Teodorescu, and Rajiv Ramnath. \"Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights.\" In proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, 2019.]\n",
+        "df = pd.read_csv(data_source)\n",
+        "\n",
+        "print(f\"Number of rows: {len(df)}\")\n",
+        "df.head()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "t83RFJB8CnhO"
+      },
+      "source": [
+        "## Understanding the Constraints in the Car Accident Data:\n",
+        "The following constraints exist in our real-world dataset, and we want to make sure they are replicated in the synthetic data:\n",
+        "\n",
+        "- Rule #1: The last digit in `ID` corresponds to the `Severity` value of the accident.\n",
+        "- Rule #2: `End_Time` > `Start_Time` (an accident's end time must always be after its start time).\n",
+        "- Rule #3: `Driver_Birth_Date` must be between `1931-01-03` and `1997-12-27` so that the drivers involved in the accidents have realistic ages.\n",
+        "- Rule #4: `ID` values follow a strict pattern and are highly unique. They start with `A-`, followed by a number in the range 3100001 to 7999994."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "USbE9MrjCnhO"
+      },
+      "source": [
+        "## 🏃 Run Safe Synthetics with data fidelity:\n",
+        "Now, let's spin up a TabFT job that follows the above rules."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "ac1lvIEsCnhO",
+        "outputId": "209449e5-fccb-43c0-9636-7926e6563c78"
+      },
+      "outputs": [],
+      "source": [
+        "workflow_run = gretel.safe_synthetic_dataset\\\n",
+        ".from_data_source(df)\\\n",
+        ".synthesize(\n",
+        "    \"tabular_ft\",\n",
+        "\n",
+        "    config={\n",
+        "\n",
+        "        \"train\": {\n",
+        "            \"data_config\": {\n",
+        "                \"columns\": [ # Any DateTime columns should be specified with the type datetime and desired format.\n",
+        "                    {\n",
+        "                        \"name\": \"Start_Time\",\n",
+        "                        \"type\": \"datetime\",\n",
+        "                        \"format\": \"%Y-%m-%d %H:%M:%S\",\n",
+        "                    },\n",
+        "                    {\n",
+        "                        \"name\": \"End_Time\",\n",
+        "                        \"type\": \"datetime\",\n",
+        "                        \"format\": \"%Y-%m-%d %H:%M:%S\",\n",
+        "                    },\n",
+        "                    {\n",
+        "                        \"name\": \"Driver_Birth_Date\",\n",
+        "                        \"type\": \"datetime\",\n",
+        "                        \"format\": \"%Y-%m-%d\",\n",
+        "                    },\n",
+        "                ],\n",
+        "\n",
+        "                \"actions\": [\n",
+        "\n",
+        "                    {   \"type\": \"date_constraint\", # Make sure the start time is before the end time. Rule #2\n",
+        "                        \"colA\": \"Start_Time\",\n",
+        "                        \"colB\": \"End_Time\",\n",
+        "                        \"operator\": \"lt\"\n",
+        "                    },\n",
+        "\n",
+        "                    {  \"type\": \"expression_drop\", # Make sure the driver's birth date is in the same range as the training data. Rule #3\n",
+        "                        \"conditions\": [\"(row.Driver_Birth_Date | date_parse) > ('1997-12-27' | date_parse) or ((row.Driver_Birth_Date | date_parse) < ('1931-01-03'| date_parse))\"]\n",
+        "                    },\n",
+        "\n",
+        "                    { \"type\": \"replace_datasource\", # Rule #1 and #4.\n",
+        "                        \"col\": \"ID\",\n",
+        "                        \"data_source\" : {\n",
+        "                            \"type\": \"expression\",\n",
+        "                            \"expression\": '\"A-\" + (random.randint(310000,799999) | string)+ (row.Severity | string )',\n",
+        "                        } ,\n",
+        "\n",
+        "                    },\n",
+        "                ],\n",
+        "            }\n",
+        "        },\n",
+        "        \"generate\": {\n",
+        "            \"num_records\": 1000,\n",
+        "        },\n",
+        "    },\n",
+        ")\\\n",
+        ".create()\n",
+        "\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "workflow_run.wait_until_done()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## 📊 Preview report"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "workflow_run.report.table"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## 🔬 Preview output data"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 321
+        },
+        "id": "SD1Bl2BsCnhO",
+        "outputId": "18d3020d-b244-4929-d709-9b11e14073c8"
+      },
+      "outputs": [],
+      "source": [
+        "# You can check the generated datasets using either of the following methods:\n",
+        "#1: Using the completed job from this notebook:\n",
+        "generated_df = workflow_run.dataset.df\n",
+        "#2: Get the workflow run ID from the console (helpful if the notebook was interrupted and you no longer have access to `workflow_run`):\n",
+        "# workflow = gretel.workflows.get_workflow_run(\"your_workflow_run_ID\")\n",
+        "\n",
+        "generated_df.head()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "cfdB4wphKOVe"
+      },
+      "source": [
+        "# 🔎 Validate Rules\n",
+        "Let's ensure our synthetic dataset adheres to the defined constraints and that each rule evaluates to True:"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "GMjSb1hdKL_J",
+        "outputId": "1f9463fe-f802-464f-c3e5-9b498eaad541"
+      },
+      "outputs": [],
+      "source": [
+        "\n",
+        "# Rule #1\n",
+        "(generated_df['ID'].apply(lambda x: int(str(x)[-1])) == generated_df.Severity).all()"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "d7qFaP8mFCY4",
+        "outputId": "1fccfc1e-8945-4f3b-96d6-7a82ee4b64d9"
+      },
+      "outputs": [],
+      "source": [
+        "\n",
+        "# Rule #2:\n",
+        "(generated_df.Start_Time < generated_df.End_Time).all()\n",
+        "\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "m-I4S-JHFHab",
+        "outputId": "c42d3b97-4fba-4c6c-8555-a3d8f2db41d7"
+      },
+      "outputs": [],
+      "source": [
+        "# Rule #3:\n",
+        "generated_df['Driver_Birth_Date'] = pd.to_datetime(generated_df['Driver_Birth_Date'], errors='coerce')\n",
+        "((generated_df.Driver_Birth_Date <= pd.to_datetime(\"1997-12-27\")) & (generated_df.Driver_Birth_Date >= pd.to_datetime(\"1931-01-03\"))).all()\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "7qQF-m3eFI77",
+        "outputId": "ecdb8bb6-bef0-4c2b-bb4f-c0561c73b888"
+      },
+      "outputs": [],
+      "source": [
+        "\n",
+        "#Rule #4:\n",
+        "# Part 1: Highly unique values (>95%) of ID:\n",
+        "(generated_df[\"ID\"].nunique()/len(generated_df)) > 0.95"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "jSk8EYtQFJ_4",
+        "outputId": "88b314bb-6e63-4ba4-9eae-ccc4ffbb5e95"
+      },
+      "outputs": [],
+      "source": [
+        "# Rule #4:\n",
+        "# Part 2: IDs follow a specific pattern starting with \"A-\":\n",
+        "starts_with_prefix = generated_df[\"ID\"].apply(lambda x: x.startswith(\"A-\")).all()\n",
+        "\n",
+        "# Numeric section of the ID should be within a certain range:\n",
+        "min_range = 3100001\n",
+        "max_range = 7999994\n",
+        "in_range = generated_df[\"ID\"].apply(lambda x: int(x.split(\"-\")[1])).between(min_range, max_range).all()\n",
+        "\n",
+        "# Display True only if both checks pass (a cell shows only its last expression):\n",
+        "bool(starts_with_prefix and in_range)\n"
+      ]
+    }
+  ],
+  "metadata": {
+    "colab": {
+      "provenance": []
+    },
+    "kernelspec": {
+      "display_name": ".venv",
+      "language": "python",
+      "name": "python3"
+    },
+    "language_info": {
+      "codemirror_mode": {
+        "name": "ipython",
+        "version": 3
+      },
+      "file_extension": ".py",
+      "mimetype": "text/x-python",
+      "name": "python",
+      "nbconvert_exporter": "python",
+      "pygments_lexer": "ipython3",
+      "version": "3.11.9"
+    }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 0
+}