diff --git a/.github/workflows/validate_configs.yml b/.github/workflows/validate_configs.yml
index 41f9f135..4047c94d 100644
--- a/.github/workflows/validate_configs.yml
+++ b/.github/workflows/validate_configs.yml
@@ -27,7 +27,7 @@ jobs:
         if: ${{ github.base_ref != 'main' }}
         env:
           GRETEL_API_KEY: ${{ secrets.GRETEL_DEV_API_KEY }}
-          GRETEL_CLOUD_URL: "https://api-dev.gretel.cloud"
+          GRETEL_CLOUD_URL: "https://api.dev.gretel.ai"
 
       # Run tests in PROD
       - name: Unit tests
diff --git a/docs/notebooks/safe-synthetics/data-fidelity-101.ipynb b/docs/notebooks/safe-synthetics/data-fidelity-101.ipynb
new file mode 100644
index 00000000..aacb6ee2
--- /dev/null
+++ b/docs/notebooks/safe-synthetics/data-fidelity-101.ipynb
@@ -0,0 +1,390 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "fOLFudd5CnhK"
+   },
+   "source": [
+    "\n",
+    "\"Open\n",
+    "\n",
+    "\n",
+    "# 🎯 Using Data Fidelity for Safe Synthetics\n",
+    "\n",
+    "In this notebook, we showcase the M1 requirements of **data fidelity**: a set of rules ensuring your synthetic data **follows specific constraints** as expected.\n",
+    "\n",
+    "The **TabFT model** works well out of the box, but sometimes your data needs to be **extra precise**. That's where **data fidelity** comes in!\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "XizoDHIqCnhM"
+   },
+   "source": [
+    "**The Four Golden Rules of Data Fidelity (M1 requirements)**\n",
+    "\n",
+    "1️⃣ **Rule #1:** Values in **Column A** should be embedded somewhere in **Column B**. For example, an email should include the first and last name.\n",
+    "\n",
+    "2️⃣ **Rule #2:** **Column A < Column B**. For example, an admit date should always be <= the discharge date.\n",
+    "\n",
+    "3️⃣ **Rule #3:** **DateTime minimum < Column A (DateTime) < DateTime maximum**. Enforce minimum and maximum constraints on datetime values. For example, transactions should occur between certain dates, like 1/1/2024 and 10/25/2024.\n",
+    "\n",
+    "4️⃣ **Rule #4:** Generate **highly unique values**, such as an 'ID' column.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "Wb_yS_gwCnhM"
+   },
+   "source": [
+    "## 💾 Install Gretel SDK"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/"
+    },
+    "id": "APan6WVaCnhM",
+    "outputId": "06747c83-9c89-49d0-a1ec-ee22e7e9935a"
+   },
+   "outputs": [],
+   "source": [
+    "%%capture\n",
+    "\n",
+    "%pip install -U gretel-client"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 🌐 Configure your Gretel Session"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from gretel_client.navigator_client import Gretel\n",
+    "\n",
+    "gretel = Gretel(api_key=\"prompt\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "s8GPEIW-CnhN"
+   },
+   "source": [
+    "## 🔬 Preview input data\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/",
+     "height": 236
+    },
+    "id": "u8PoIqkVCnhN",
+    "outputId": "2d17f068-1f82-4ba6-fef0-3f81621d2bbf"
+   },
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "\n",
+    "# Cited papers: Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath.\n",
+    "# \"A Countrywide Traffic Accident Dataset.\", 2019; and Moosavi, Sobhan, Mohammad Hossein Samavatian,\n",
+    "# Srinivasan Parthasarathy, Radu Teodorescu, and Rajiv Ramnath. \"Accident Risk Prediction based on\n",
+    "# Heterogeneous Sparse Data: New Dataset and Insights.\" In Proceedings of the 27th ACM SIGSPATIAL\n",
+    "# International Conference on Advances in Geographic Information Systems, ACM, 2019.\n",
+    "data_source = \"https://gretel-datasets.s3.us-west-2.amazonaws.com/car_accident_5k.csv\"\n",
+    "df = pd.read_csv(data_source)\n",
+    "\n",
+    "print(f\"Number of rows: {len(df)}\")\n",
+    "df.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "t83RFJB8CnhO"
+   },
+   "source": [
+    "## Understanding the Constraints in the Car Accident Data\n",
+    "The following constraints exist in our real-world dataset, and we want to make sure they are replicated in the synthetic data:\n",
+    "\n",
+    "- Rule #1: The last digit of `ID` corresponds to the `Severity` value of the accident.\n",
+    "- Rule #2: `End_Time` > `Start_Time` (an accident's end time must always be after its start time).\n",
+    "- Rule #3: `Driver_Birth_Date` must be between `1931-01-03` and `1997-12-27` to ensure realistic driver personas.\n",
+    "- Rule #4: `ID` values follow a strict pattern and are highly unique. They start with `A-`, followed by a numeric value in the range (3100001, 7999994).\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "USbE9MrjCnhO"
+   },
+   "source": [
+    "## 🏃 Run Safe Synthetics with data fidelity\n",
+    "Now, let's spin up a TabFT job that follows the above rules."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/"
+    },
+    "id": "ac1lvIEsCnhO",
+    "outputId": "209449e5-fccb-43c0-9636-7926e6563c78"
+   },
+   "outputs": [],
+   "source": [
+    "workflow_run = gretel.safe_synthetic_dataset\\\n",
+    ".from_data_source(df)\\\n",
+    ".synthesize(\n",
+    "    \"tabular_ft\",\n",
+    "    config={\n",
+    "        \"train\": {\n",
+    "            \"data_config\": {\n",
+    "                \"columns\": [  # Any DateTime columns should be specified with the type datetime and the desired format.\n",
+    "                    {\n",
+    "                        \"name\": \"Start_Time\",\n",
+    "                        \"type\": \"datetime\",\n",
+    "                        \"format\": \"%Y-%m-%d %H:%M:%S\",\n",
+    "                    },\n",
+    "                    {\n",
+    "                        \"name\": \"End_Time\",\n",
+    "                        \"type\": \"datetime\",\n",
+    "                        \"format\": \"%Y-%m-%d %H:%M:%S\",\n",
+    "                    },\n",
+    "                    {\n",
+    "                        \"name\": \"Driver_Birth_Date\",\n",
+    "                        \"type\": \"datetime\",\n",
+    "                        \"format\": \"%Y-%m-%d\",\n",
+    "                    },\n",
+    "                ],\n",
+    "                \"actions\": [\n",
+    "                    {   # Rule #2: make sure the start time is before the end time.\n",
+    "                        \"type\": \"date_constraint\",\n",
+    "                        \"colA\": \"Start_Time\",\n",
+    "                        \"colB\": \"End_Time\",\n",
+    "                        \"operator\": \"lt\",\n",
+    "                    },\n",
+    "                    {   # Rule #3: keep the driver's birth date in the same range as the training data.\n",
+    "                        \"type\": \"expression_drop\",\n",
+    "                        \"conditions\": [\"(row.Driver_Birth_Date | date_parse) > ('1997-12-27' | date_parse) or ((row.Driver_Birth_Date | date_parse) < ('1931-01-03' | date_parse))\"],\n",
+    "                    },\n",
+    "                    {   # Rules #1 and #4: rebuild the ID column.\n",
+    "                        \"type\": \"replace_datasource\",\n",
+    "                        \"col\": \"ID\",\n",
+    "                        \"data_source\": {\n",
+    "                            \"type\": \"expression\",\n",
+    "                            \"expression\": '\"A-\" + (random.randint(310000, 799999) | string) + (row.Severity | string)',\n",
+    "                        },\n",
+    "                    },\n",
+    "                ],\n",
+    "            }\n",
+    "        },\n",
+    "        \"generate\": {\n",
+    "            \"num_records\": 1000,\n",
+    "        },\n",
+    "    },\n",
+    ")\\\n",
+    ".create()\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "workflow_run.wait_until_done()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 📊 Preview report"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "workflow_run.report.table"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 🔬 Preview output data"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/",
+     "height": 321
+    },
+    "id": "SD1Bl2BsCnhO",
+    "outputId": "18d3020d-b244-4929-d709-9b11e14073c8"
+   },
+   "outputs": [],
+   "source": [
+    "# You can retrieve the generated dataset using either of the following methods:\n",
+    "# 1: Use the completed job from this notebook:\n",
+    "generated_df = workflow_run.dataset.df\n",
+    "# 2: Get the workflow run ID from the Console (usually helpful if the notebook is interrupted\n",
+    "#    and you no longer have access to `workflow_run`):\n",
+    "# workflow_run = gretel.workflows.get_workflow_run(\"your_workflow_run_ID\")\n",
+    "\n",
+    "generated_df.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "cfdB4wphKOVe"
+   },
+   "source": [
+    "# 🔎 Validate Rules\n",
+    "Let's ensure our synthetic dataset adheres to the defined constraints and that each rule evaluates to True:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/"
+    },
+    "id": "GMjSb1hdKL_J",
+    "outputId": "1f9463fe-f802-464f-c3e5-9b498eaad541"
+   },
+   "outputs": [],
+   "source": [
+    "# Rule #1: the last digit of ID matches Severity.\n",
+    "(generated_df['ID'].apply(lambda x: int(str(x)[-1])) == generated_df.Severity).all()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/"
+    },
+    "id": "d7qFaP8mFCY4",
+    "outputId": "1fccfc1e-8945-4f3b-96d6-7a82ee4b64d9"
+   },
+   "outputs": [],
+   "source": [
+    "# Rule #2: every accident ends after it starts.\n",
+    "(generated_df.Start_Time < generated_df.End_Time).all()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/"
+    },
+    "id": "m-I4S-JHFHab",
+    "outputId": "c42d3b97-4fba-4c6c-8555-a3d8f2db41d7"
+   },
+   "outputs": [],
+   "source": [
+    "# Rule #3: birth dates stay within the training range.\n",
+    "generated_df['Driver_Birth_Date'] = pd.to_datetime(generated_df['Driver_Birth_Date'], errors='coerce')\n",
+    "((generated_df.Driver_Birth_Date <= pd.to_datetime(\"1997-12-27\")) & (generated_df.Driver_Birth_Date >= pd.to_datetime(\"1931-01-03\"))).all()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/"
+    },
+    "id": "7qQF-m3eFI77",
+    "outputId": "ecdb8bb6-bef0-4c2b-bb4f-c0561c73b888"
+   },
+   "outputs": [],
+   "source": [
+    "# Rule #4, part 1: highly unique (>95%) ID values:\n",
+    "(generated_df[\"ID\"].nunique() / len(generated_df)) > 0.95"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/"
+    },
+    "id": "jSk8EYtQFJ_4",
+    "outputId": "88b314bb-6e63-4ba4-9eae-ccc4ffbb5e95"
+   },
+   "outputs": [],
+   "source": [
+    "# Rule #4, part 2: IDs follow a specific pattern starting with \"A-\":\n",
+    "print(generated_df[\"ID\"].apply(lambda x: x.startswith(\"A-\")).all())\n",
+    "\n",
+    "# The numeric section of the ID should be within a certain range:\n",
+    "min_range = 3100001\n",
+    "max_range = 7999994\n",
+    "generated_df[\"ID\"].apply(lambda x: int(x.split(\"-\")[1])).between(min_range, max_range).all()"
+   ]
+  }
+ ],
+ "metadata": {
+  "colab": {
+   "provenance": []
+  },
+  "kernelspec": {
+   "display_name": ".venv",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.9"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}
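Reviewer note: the notebook's four validation cells can be folded into a single standalone helper, which may be handy for CI or for checking a downloaded dataset outside the notebook. This is only a sketch: `validate_fidelity` and the sample frame below are illustrative, with column names, date bounds, and the ID range taken from the notebook itself.

```python
import pandas as pd

def validate_fidelity(df: pd.DataFrame) -> dict:
    """Re-check the four data-fidelity rules from the notebook on a generated frame."""
    ids = df["ID"].astype(str)
    # Numeric section after the "A-" prefix, e.g. "A-3100013" -> 3100013.
    numeric_part = ids.str.split("-").str[1].astype(int)
    return {
        # Rule #1: the last digit of ID equals Severity.
        "rule1": bool((ids.str[-1].astype(int) == df["Severity"]).all()),
        # Rule #2: every accident ends after it starts.
        "rule2": bool((pd.to_datetime(df["Start_Time"]) < pd.to_datetime(df["End_Time"])).all()),
        # Rule #3: birth dates stay inside the training range.
        "rule3": bool(pd.to_datetime(df["Driver_Birth_Date"]).between(
            pd.Timestamp("1931-01-03"), pd.Timestamp("1997-12-27")).all()),
        # Rule #4: "A-" prefix and numeric part within [3100001, 7999994].
        "rule4": bool(ids.str.startswith("A-").all()
                      and numeric_part.between(3100001, 7999994).all()),
    }
```

Running it against `workflow_run.dataset.df` should return `True` for all four keys when the configured actions have been applied.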