This repository was archived by the owner on Feb 18, 2026. It is now read-only.
2 changes: 1 addition & 1 deletion .github/workflows/validate_configs.yml
@@ -27,7 +27,7 @@ jobs:
if: ${{ github.base_ref != 'main' }}
env:
GRETEL_API_KEY: ${{ secrets.GRETEL_DEV_API_KEY }}
GRETEL_CLOUD_URL: "https://api-dev.gretel.cloud"
GRETEL_CLOUD_URL: "https://api.dev.gretel.ai"

# Run tests in PROD
- name: Unit tests
390 changes: 390 additions & 0 deletions docs/notebooks/safe-synthetics/data-fidelity-101.ipynb
@@ -0,0 +1,390 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "fOLFudd5CnhK"
},
"source": [
"<a target=\"_parent\" href=\"https://colab.research.google.com/github/gretelai/gretel-blueprints/blob/main/docs/notebooks/safe-synthetics/data-fidelity-101.ipynb\">\n",
" <img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/>\n",
"</a>\n",
"\n",
"\n",
"# 🎯 Using Data Fidelity for Safe Synthetics:\n",
"\n",
"In this notebook, we showcase the M1 requirements for **data fidelity**: a set of rules ensuring that your synthetic data **follows specific constraints** as expected.\n",
"\n",
"The **TabFT model** works well out-of-the-box, but sometimes, your data needs to be **extra precise**. That’s where **data fidelity** comes in! \n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "XizoDHIqCnhM"
},
"source": [
"**The Four Golden Rules of Data Fidelity (M1 requirements)**\n",
"\n",
"1️⃣ **Rule #1:** Values in **Column A** should be embedded somewhere in **Column B**. For example, an email address should include the first and last name.\n",
"\n",
"2️⃣ **Rule #2:** An ordering constraint between two columns, such as **Column A < Column B**. For example, an admit date should never come after the discharge date.\n",
"\n",
"3️⃣ **Rule #3:** **DateTime minimum < Column A (DateTime) < DateTime maximum**. Enforce minimum and maximum constraints on datetime values. For example, transactions should occur between specific dates such as 1/1/2024 and 10/25/2024.\n",
"\n",
"4️⃣ **Rule #4:** Generate **highly unique values**, such as an `ID` column.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Wb_yS_gwCnhM"
},
"source": [
"## 💾 Install Gretel SDK"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "APan6WVaCnhM",
"outputId": "06747c83-9c89-49d0-a1ec-ee22e7e9935a"
},
"outputs": [],
"source": [
"%%capture\n",
"\n",
"%pip install -U gretel-client"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 🌐 Configure your Gretel Session"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from gretel_client.navigator_client import Gretel\n",
"\n",
"gretel = Gretel(api_key=\"prompt\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "s8GPEIW-CnhN"
},
"source": [
"## 🔬 Preview input data\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 236
},
"id": "u8PoIqkVCnhN",
"outputId": "2d17f068-1f82-4ba6-fef0-3f81621d2bbf"
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Dataset citations:\n",
"# - Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath. “A Countrywide Traffic Accident Dataset.”, 2019.\n",
"# - Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, Radu Teodorescu, and Rajiv Ramnath. \"Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights.\" In Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, 2019.\n",
"data_source = \"https://gretel-datasets.s3.us-west-2.amazonaws.com/car_accident_5k.csv\"\n",
"df = pd.read_csv(data_source)\n",
"\n",
"print(f\"Number of rows: {len(df)}\")\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "t83RFJB8CnhO"
},
"source": [
"## Understanding the Constraints in the Car Accident Data:\n",
"The following constraints exist in our real-world dataset, and we want to make sure they are replicated in the synthetic data:\n",
"\n",
"- Rule #1: The last digit in `ID` corresponds to the `Severity` value of the accident.\n",
"- Rule #2: `End_Time` > `Start_Time` (an accident's end time must always be after its start time).\n",
"- Rule #3: `Driver_Birth_Date` must be between `1931-01-03` and `1997-12-27` to ensure realistic personas in the car accident.\n",
"- Rule #4: `ID` values follow a strict pattern and are highly unique. They start with `A-`, and the numeric part falls in the range 3100001 to 7999994."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "USbE9MrjCnhO"
},
"source": [
"## 🏃 Run Safe Synthetics with data fidelity:\n",
"Now, let's spin up a TabFT job that follows the above rules."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "ac1lvIEsCnhO",
"outputId": "209449e5-fccb-43c0-9636-7926e6563c78"
},
"outputs": [],
"source": [
"workflow_run = gretel.safe_synthetic_dataset\\\n",
".from_data_source(df)\\\n",
".synthesize(\n",
" \"tabular_ft\",\n",
"\n",
" config={\n",
"\n",
" \"train\": {\n",
" \"data_config\": {\n",
" \"columns\": [ # Any DateTime columns should be specified with the type datetime and desired format.\n",
" {\n",
" \"name\": \"Start_Time\",\n",
" \"type\": \"datetime\",\n",
" \"format\": \"%Y-%m-%d %H:%M:%S\",\n",
" },\n",
" {\n",
" \"name\": \"End_Time\",\n",
" \"type\": \"datetime\",\n",
" \"format\": \"%Y-%m-%d %H:%M:%S\",\n",
" },\n",
" {\n",
" \"name\": \"Driver_Birth_Date\",\n",
" \"type\": \"datetime\",\n",
" \"format\": \"%Y-%m-%d\",\n",
" },\n",
" ],\n",
"\n",
" \"actions\": [\n",
"\n",
" { \"type\": \"date_constraint\", # Make sure the start time is before the end time. Rule #2\n",
" \"colA\": \"Start_Time\",\n",
" \"colB\": \"End_Time\",\n",
" \"operator\": \"lt\"\n",
" },\n",
"\n",
" { \"type\": \"expression_drop\", # Make sure the driver's birth date is in the same range as the training data. Rule #3\n",
" \"conditions\": [\"(row.Driver_Birth_Date | date_parse) > ('1997-12-27' | date_parse) or ((row.Driver_Birth_Date | date_parse) < ('1931-01-03'| date_parse))\"]\n",
" },\n",
"\n",
" { \"type\": \"replace_datasource\", # Rules #1 and #4: rebuild ID as \"A-\" + a random number + the row's Severity digit.\n",
" \"col\": \"ID\",\n",
" \"data_source\" : {\n",
" \"type\": \"expression\",\n",
" \"expression\": '\"A-\" + (random.randint(310000,799999) | string)+ (row.Severity | string )',\n",
" } ,\n",
"\n",
" },\n",
" ],\n",
" }\n",
" },\n",
" \"generate\": {\n",
" \"num_records\": 1000,\n",
" },\n",
" },\n",
")\\\n",
".create()\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"workflow_run.wait_until_done()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 📊 Preview report"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"workflow_run.report.table"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 🔬 Preview output data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 321
},
"id": "SD1Bl2BsCnhO",
"outputId": "18d3020d-b244-4929-d709-9b11e14073c8"
},
"outputs": [],
"source": [
"# You can check the generated dataset using either of the following methods:\n",
"# 1: Use the completed job from this notebook:\n",
"generated_df = workflow_run.dataset.df\n",
"# 2: Get the workflow run ID from the console (helpful if the notebook is interrupted and you no longer have access to `workflow_run`):\n",
"# workflow = gretel.workflows.get_workflow_run(\"your_workflow_run_ID\")\n",
"\n",
"generated_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "cfdB4wphKOVe"
},
"source": [
"## 🔎 Validate Rules\n",
"Let's ensure our synthetic dataset adheres to the defined constraints and that each rule evaluates to True:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "GMjSb1hdKL_J",
"outputId": "1f9463fe-f802-464f-c3e5-9b498eaad541"
},
"outputs": [],
"source": [
"\n",
"# Rule #1\n",
"(generated_df['ID'].apply(lambda x: int(str(x)[-1])) == generated_df.Severity).all()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "d7qFaP8mFCY4",
"outputId": "1fccfc1e-8945-4f3b-96d6-7a82ee4b64d9"
},
"outputs": [],
"source": [
"# Rule #2: compare parsed datetimes rather than raw strings.\n",
"(pd.to_datetime(generated_df.Start_Time) < pd.to_datetime(generated_df.End_Time)).all()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "m-I4S-JHFHab",
"outputId": "c42d3b97-4fba-4c6c-8555-a3d8f2db41d7"
},
"outputs": [],
"source": [
"# Rule #3:\n",
"generated_df['Driver_Birth_Date'] = pd.to_datetime(generated_df['Driver_Birth_Date'], errors='coerce')\n",
"((generated_df.Driver_Birth_Date <= pd.to_datetime(\"1997-12-27\")) & (generated_df.Driver_Birth_Date >= pd.to_datetime(\"1931-01-03\"))).all()\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "7qQF-m3eFI77",
"outputId": "ecdb8bb6-bef0-4c2b-bb4f-c0561c73b888"
},
"outputs": [],
"source": [
"# Rule #4:\n",
"# Part 1: Highly unique values (>95%) of ID:\n",
"(generated_df[\"ID\"].nunique() / len(generated_df)) > 0.95"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "jSk8EYtQFJ_4",
"outputId": "88b314bb-6e63-4ba4-9eae-ccc4ffbb5e95"
},
"outputs": [],
"source": [
"# Rule #4:\n",
"# Part 2: IDs follow a specific pattern starting with \"A-\":\n",
"starts_with_prefix = generated_df[\"ID\"].str.startswith(\"A-\").all()\n",
"\n",
"# The numeric section of the ID should be within a certain range:\n",
"min_range = 3100001\n",
"max_range = 7999994\n",
"in_range = generated_df[\"ID\"].apply(lambda x: int(x.split(\"-\")[1])).between(min_range, max_range).all()\n",
"\n",
"# A cell only displays its last expression, so show both results together.\n",
"starts_with_prefix, in_range"
]
}
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
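
For reviewers who want to sanity-check the rules without running a Gretel job, the ID expression and the four checks from the notebook can be approximated in plain Python. This is only a sketch under stated assumptions: `make_id` and `row_passes` are hypothetical helper names, `Severity` is assumed to take values 1 to 4 (which is what the stated ID range of 3100001 to 7999994 implies), and the `random.randint` call in the config expression is assumed to be inclusive on both ends, like Python's.

```python
import random
from datetime import datetime

# Bounds copied from the notebook's Rule #3 and Rule #4.
BIRTH_MIN = datetime(1931, 1, 3)
BIRTH_MAX = datetime(1997, 12, 27)
ID_MIN, ID_MAX = 3100001, 7999994


def make_id(severity: int, rng: random.Random) -> str:
    """Hypothetical plain-Python analogue of the notebook's ID expression."""
    return "A-" + str(rng.randint(310000, 799999)) + str(severity)


def row_passes(row: dict) -> bool:
    """Check Rules #1-#3 and the pattern part of Rule #4 for one record."""
    start = datetime.strptime(row["Start_Time"], "%Y-%m-%d %H:%M:%S")
    end = datetime.strptime(row["End_Time"], "%Y-%m-%d %H:%M:%S")
    birth = datetime.strptime(row["Driver_Birth_Date"], "%Y-%m-%d")
    id_number = int(row["ID"].split("-")[1])
    return (
        row["ID"].startswith("A-")                 # Rule #4: prefix
        and ID_MIN <= id_number <= ID_MAX          # Rule #4: numeric range
        and row["ID"][-1] == str(row["Severity"])  # Rule #1: last digit
        and start < end                            # Rule #2: ordering
        and BIRTH_MIN <= birth <= BIRTH_MAX        # Rule #3: date bounds
    )


rng = random.Random(0)  # seeded only so the sketch is repeatable
rows = [
    {
        "ID": make_id(severity, rng),
        "Severity": severity,
        "Start_Time": "2021-01-01 08:00:00",
        "End_Time": "2021-01-01 09:00:00",
        "Driver_Birth_Date": "1980-05-17",
    }
    for severity in (1, 2, 3, 4)  # assumed Severity domain
]
print(all(row_passes(row) for row in rows))
```

Appending the severity digit to a six-digit random number is what ties Rule #1 to Rule #4: the smallest possible value is 310000 followed by 1 and the largest is 799999 followed by 4, which matches the range checked in the validation cells. The uniqueness ratio from Rule #4 is not covered here, since it is a property of the whole dataset rather than of a single row.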