Textual edits to tool and changes to notebook

jfparie · jfparie · commit 617494f530e4 · 2025-03-17T16:30:33.000+01:00
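
Note for reviewers: the notebook text touched by this commit describes the KS complement and TV complement in words only. The stdlib-only sketch below shows one way to compute them, following the definitions in the replaced text (1 minus the two-sample KS statistic, and 1 minus the total variation distance). The function names `ks_complement` and `tv_complement` are illustrative assumptions, not functions taken from python-synthpop or the web app.

```python
from bisect import bisect_right
from collections import Counter


def ks_complement(real, synth):
    """Return 1 - D, where D is the two-sample Kolmogorov-Smirnov statistic."""
    real_sorted, synth_sorted = sorted(real), sorted(synth)
    # Empirical CDFs only change at observed values, so checking those suffices.
    d_stat = max(
        abs(
            bisect_right(real_sorted, x) / len(real_sorted)
            - bisect_right(synth_sorted, x) / len(synth_sorted)
        )
        for x in set(real_sorted) | set(synth_sorted)
    )
    return 1.0 - d_stat


def tv_complement(real, synth):
    """Return 1 - TVD, the complement of the total variation distance."""
    real_freq, synth_freq = Counter(real), Counter(synth)
    # Counter returns 0 for categories that are missing from one of the columns.
    tvd = 0.5 * sum(
        abs(real_freq[c] / len(real) - synth_freq[c] / len(synth))
        for c in set(real_freq) | set(synth_freq)
    )
    return 1.0 - tvd
```

Both functions return a value between 0.0 and 1.0, where 1.0 means the real and synthetic columns have identical marginal distributions. This matches the "all values need to be close to 1.0" guidance in UI text #6.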
diff --git a/notebooks/CART_LawSchoolAdmissionBar.ipynb b/notebooks/CART_LawSchoolAdmissionBar.ipynb
@@ -333,7 +333,7 @@
     "\n",
     "[table]\n",
     "\n",
-    "The CART method is used to generate the synthetic data. CART generally produces higher quality synthetic datasets, but might not run on datasets with categorical variables with 20+ categories. Use Gaussian Copula in those cases.\n",
+    "The CART method is used to generate the synthetic data. CART generally produces high-quality synthetic data, but might not work well on datasets with categorical variables with 20+ categories. Use Gaussian Copula in those cases.\n",
     "\n",
     "*The original paper can be found [here](https://files.eric.ed.gov/fulltext/ED469370.pdf)."
    ]
@@ -557,6 +557,19 @@
     "### 1. Data types detection"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**UI text #2**\n",
+    "\n",
+    "The following data types are detected:\n",
+    "\n",
+    "[output]\n",
+    "\n",
+    "If the detected data types are incorrect, please correct them locally in the source dataset before attaching it to the web app."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 8,
@@ -578,49 +591,27 @@
     "print(\"Column Data Types:\", column_dtypes)"
    ]
   },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "**UI text #2**\n",
-    "\n",
-    "If the detected data types are incorrect, please change this locally in the source dataset before attaching it to the app."
-   ]
-  },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "### 2. Missing data handler"
    ]
   },
-  {
-   "cell_type": "code",
-   "execution_count": 9,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Detected Missingness Type: {'sex': 'MAR', 'race1': 'MAR'}\n"
-     ]
-    }
-   ],
-   "source": [
-    "# Detect missingness\n",
-    "missingness_dict = md_handler.detect_missingness(df)\n",
-    "print(\"Detected Missingness Type:\", missingness_dict)"
-   ]
-  },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "**UI text #3**\n",
     "\n",
+    "The following type of missing data is detected:\n",
+    "\n",
+    "[output]\n",
+    "\n",
     "For Missing At Random (MAR) and Missing Not At Random (MNAR) data, we recommend to impute the missing data. For Missing Completely At Random (MCAR), we recommend to remove the missing data. See the info box for more information.\n",
     "\n",
+    "In this demo, the missing data is imputed.\n",
+    "\n",
     "_info box:_\n",
     "\n",
     "MCAR, MAR, and MNAR are terms used to describe different mechanisms of missing data:\n",
@@ -644,6 +635,25 @@
     "- Recommendation: impute missing data."
    ]
   },
+  {
+   "cell_type": "code",
+   "execution_count": 26,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Detected Missingness Type: {'sex': 'MAR', 'race1': 'MAR'}\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Detect missingness\n",
+    "missingness_dict = md_handler.detect_missingness(df)\n",
+    "print(\"Detected Missingness Type:\", missingness_dict)"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 10,
@@ -671,23 +681,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "### 3. Pre-processing data"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "**UI text #4**\n",
-    "\n",
-    "In the next step the data is pre-processed. The dataframe is transformed into numerical space. The following steps are performed:\n",
-    "\n",
-    "1. Validates the input data;\n",
-    "2. Stores the original column order;\n",
-    "3. Encoding and scaling:\n",
-    "* Encodes categorical columns using LabelEncoder or OneHotEncoder;\n",
-    "* Scales numerical columns using StandardScaler;\n",
-    "* Converts boolean columns to integers."
+    "### [no section] Pre-processing data"
    ]
   },
   {
@@ -797,7 +791,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "### 4. Synthetic data generation: {CART/GC}"
+    "### 3. Synthesizer: CART"
    ]
   },
   {
@@ -914,18 +908,18 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "**UI text #5**\n",
+    "**UI text #4**\n",
     "\n",
     "{n_synth_data} synthetic data points are generated using CART. \n",
     "\n",
-    "The CART (Classification and Regression Trees) method generates synthetic data by learning patterns from real data through a decision tree that splits data into homogeneous groups based on feature values. It predicts averages for numerical data and assigns the most common category for categorical data, using these predictions to create new synthetic points. Then, the the synthetic data back to the original format (postprocessing)."
+    "The CART (Classification and Regression Trees) method generates synthetic data by learning patterns from real data through a decision tree that splits data into homogeneous groups based on feature values. It predicts averages for numerical data and assigns the most common category for categorical data, using these predictions to create new synthetic points."
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "### 5. Generated synthetic data"
+    "### [no section] Generated synthetic data"
    ]
   },
   {
@@ -1031,7 +1025,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "### 6. Evaluation of generated data"
+    "### 4. Evaluation of generated data"
    ]
   },
   {
@@ -1214,9 +1208,9 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "**UI text #6**\n",
+    "**UI text #5**\n",
     "\n",
-    "{n_synth_data} synthetic data points are generated using CART. The figures below display the differences in value frequency for each variable. The synthetic data is of high quality when all bars are of equal height."
+    "{n_synth_data} synthetic data points are generated using CART. The figures below display the value frequency for each variable. The synthetic data is of high quality when the frequencies in the synthetic data are approximately the same as in the real data."
    ]
   },
   {
@@ -1294,19 +1288,19 @@
    "source": [
     "**UI text #6**\n",
     "\n",
-    "The report computes the following diagnostic results for each column:\n",
-    "- For numerical (or datetime) columns:\n",
-    "    * *Missing value similarity:* Similarity in the proportion of missing values.\n",
-    "    * *Range coverage:* Proportion of the real data's range covered by the synthetic data.\n",
-    "    * *Boundary adherence:* Fraction of synthetic values within the real data's min/max.\n",
-    "    * *Kolmogorov–Smirnov (KS) complement:* Uses the two-sample Kolmogorov–Smirnov test to compare the distributions of the two continuous columns using the empirical CDF. It returns 1 minus the KS Test D statistic, which indicates the maximum distance between the expected CDF and the observed CDF values.\n",
-    "    * *Statistic similarity:* Similarity of mean, std, and median.\n",
-    "- For categorical (or boolean) columns:\n",
-    "    * *Missing value similarity:* Similarity in the proportion of missing values.\n",
-    "    * *Total variation (TV) complement:* Compute the complement of the total variation distance of two discrete columns.\n",
-    "    * *Category coverage:* Proportion of real categories found in synthetic data.\n",
-    "    * *Category adherence:* Fraction of synthetic values that are valid real categories.\n",
+    "For each column, diagnostic results are computed for the quality of the generated synthetic data. The computed metrics depend on the type of data. \n",
     "\n",
+    "- For numerical (or datetime) columns the following metrics are computed:\n",
+    "    * Missing value similarity *Infobox*: Compares whether the synthetic data has the same proportion of missing values as the real data for a given column;\n",
+    "    * Range coverage *Infobox*: Measures whether a synthetic column covers the full range of values that are present in a real column;\n",
+    "    * Boundary adherence *Infobox*: Measures whether a synthetic column respects the minimum and maximum values of the real column. It returns the percentage of synthetic rows that adhere to the real boundaries;\n",
+    "    * Statistic similarity *Infobox*: Measures the similarity between a real column and a synthetic column by comparing the mean, standard deviation and median;\n",
+    "    * Kolmogorov–Smirnov (KS) complement *Infobox*: Computes the similarity of a real and synthetic numerical column in terms of the column shapes, i.e., the marginal distribution or 1D histogram of the column.\n",
+    "- For categorical (or boolean) columns the following metrics are computed:\n",
+    "    * Missing value similarity *Infobox*: Compares whether the synthetic data has the same proportion of missing values as the real data for a given column;\n",
+    "    * Category coverage *Infobox*: Measures whether a synthetic column covers all the possible categories that are present in a real column;\n",
+    "    * Category adherence *Infobox*: Measures whether a synthetic column adheres to the same category values as the real data;\n",
+    "    * Total variation (TV) complement *Infobox*: Computes the similarity of a real and synthetic categorical column in terms of the column shapes, i.e., the marginal distribution or 1D histogram of the column.\n",
     "\n",
     "💯 All values need to be close to 1.0 "
    ]
@@ -1793,7 +1787,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "**UI text #7**\n",
+    "**UI text #9**\n",
     "\n",
     "Do you want to learn more about synthetic data?\n",
     "- Source code of this tool:\n",
@@ -1805,6 +1799,11 @@
     "- [CART: synthpop resources](https://synthpop.org.uk/resources.html)\n",
     "- [Gaussian Copula - Synthetic Data Vault](https://docs.sdv.dev/sdv)\n"
    ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": []
   }
  ],
  "metadata": {
diff --git a/src/locales/en.json b/src/locales/en.json
@@ -62,8 +62,8 @@
     "syntheticData": {
         "demo": {
             "heading": "Information about demo dataset",
-            "description": "A subset of the [Law School Admission Bar](https://www.kaggle.com/datasets/danofer/law-school-admissions-bar-passage)* dataset is used as a demo. Synthetic data will be generated for the following columns:\n \n&nbsp;&nbsp;\n",
-            "post.description": "The CART method is used to generate the synthetic data. CART generally produces higher quality synthetic datasets, but might not work well on datasets with categorical variables with 20+ categories. Use Gaussian Copula in those cases.\n  \n&nbsp;&nbsp;\n\n*The original paper can be found [here](https://files.eric.ed.gov/fulltext/ED469370.pdf)\n \n&nbsp;&nbsp;\n",
+            "description": "A subset of the [Law School Admission Bar](https://www.kaggle.com/datasets/danofer/law-school-admissions-bar-passage)* dataset is used as a demo. Synthetic data will be generated for the following variables:\n \n&nbsp;&nbsp;\n",
+            "post.description": "The CART method is used to generate the synthetic data. CART generally produces high-quality synthetic data, but might not work well on datasets with categorical variables with 20+ categories. Use Gaussian Copula in those cases.\n  \n&nbsp;&nbsp;\n\n*The original paper can be found [here](https://files.eric.ed.gov/fulltext/ED469370.pdf)\n \n&nbsp;&nbsp;\n",
             "data.column.Variable_name": "Variable name",
             "data.sex": "sex",
             "data.race1": "race1",
@@ -94,14 +94,14 @@
                 "columnsCountError": "File may contain a maximum of 8 columns."
             },
             "fieldset": {
-                "sourceDataset": "Source dataset",
+                "sourceDataset": "Source data",
                 "sdgMethod": {
                     "title": "Method",
                     "cart": "CART",
                     "gc": "Gaussian Copula",
-                    "tooltip": "The CART method is used to generate the synthetic data.\n  \n  \n  \nCART generally produces higher quality synthetic datasets, but might not work well on datasets with categorical variables with 20+ categories.\n  \n  \n  \nUse Gaussian Copula in those cases."
+                    "tooltip": "By default, the CART method is used to generate synthetic data. CART generally produces higher quality synthetic data, but might not work well on datasets with categorical variables with 20+ categories. Use Gaussian Copula in those cases."
                 },
-                "samples": "Number of samples"
+                "samples": "Number of synthetic data points"
             },
             "actions": {
                 "tryItOut": "Try it out",
@@ -112,9 +112,9 @@
         },
         "demoCard": {
             "title": "Try it out!",
-            "description": "Do you not have a dataset at hand? No worries use our demo dataset."
+            "description": "No dataset at hand? Use our demo dataset."
         },
-        "columnsInDatasetInfo": "If the detected data types are incorrect, please change this locally in the source dataset before attaching it to the app.",
+        "columnsInDatasetInfo": "If the detected data types are incorrect, please correct them locally in the source dataset before attaching it to the web app.",
         "univariateCharts": "Univariate distributions",
         "bivariateDistributionRealData": "Bivariate distribution",
         "univariateDistributionSyntheticData": "Univariate distribution",
@@ -133,7 +133,7 @@
         "outputDataTitle": "4. Generated synthetic data",
         "diagnosticsTitle": "Diagnostic Results",
         "correlationDifference": "Correlation difference: {{correlationDifference}}",
-        "univariateText": "{{samples}} synthetic data points are generated using CART. The figures below display the differences in value frequency for each variable. The synthetic data is of high quality when all bars are of equal height.",
+        "univariateText": "{{samples}} synthetic data points are generated using CART. The figures below display the value frequency for each variable. The synthetic data is of high quality when the frequencies in the synthetic data are approximately the same as in the real data.",
         "bivariateText": "The figures below display the differences in value frequency for a combination of variables. For comparing two categorical variables, bar charts are plotted. For comparing a numerical and a categorical variables, a so called [violin plot](https://en.wikipedia.org/wiki/Violin_plot) is shown. For comparing two numercial variables, a [LOESS plot](https://en.wikipedia.org/wiki/Local_regression) is created. For all plots holds: the synthetic data is of high quality when the shape of the distributions in the synthetic data equal the distributions in the real data.",
         "moreInfo": "&nbsp;&nbsp;\n  \n  \n  \nDo you want to learn more about synthetic data?\n  \n  \n  \n- [python-synthpop on Github](https://github.com/NGO-Algorithm-Audit/python-synthpop)\n- [local-first web app on Github](https://github.com/NGO-Algorithm-Audit/local-first-web-tool/tree/main)\n- [Synthetic Data: what, why and how?](https://royalsociety.org/-/media/policy/projects/privacy-enhancing-technologies/Synthetic_Data_Survey-24.pdf)\n- [Knowledge Network Synthetic Data](https://online.rijksinnovatiecommunity.nl/groups/399-kennisnetwerk-synthetischedata/welcome) (for Dutch public organizations)\n- [Synthetic data portal of Dutch Executive Agency for Education](https://duo.nl/open_onderwijsdata/footer/synthetische-data.jsp) (DUO)\n- [CART: synthpop resources](https://synthpop.org.uk/resources.html)\n- [Gaussian Copula - Synthetic Data Vault](https://docs.sdv.dev/sdv)"
     },
diff --git a/src/locales/nl.json b/src/locales/nl.json
@@ -61,8 +61,8 @@
     "syntheticData": {
         "demo": {
             "heading": "Informatie over demodataset",
-            "description": "Een subset van de [Law School Admission Bar](https://www.kaggle.com/datasets/danofer/law-school-admissions-bar-passage)* dataset wordt gebruikt als demo. Synthetische data worden gegenereerd voor de volgende kolommen:\n  \n&nbsp;&nbsp;\n\n",
-            "post.description": "De CART-methode zal worden gebruikt om de verschillen in distributie en correlatie tussen de echte en synthetische gegevens te evalueren.\n  \n&nbsp;&nbsp;\n\n*Het oorspronkelijke artikel is [hier](https://files.eric.ed.gov/fulltext/ED469370.pdf) te vinden.",
+            "description": "Een subset van de [Law School Admission Bar](https://www.kaggle.com/datasets/danofer/law-school-admissions-bar-passage)* dataset wordt gebruikt als demo. Synthetische data worden gegenereerd voor de volgende variabelen:\n  \n&nbsp;&nbsp;\n\n",
+            "post.description": "De CART-methode wordt gebruikt om synthetische data te genereren. CART produceert doorgaans synthetische data van hoge kwaliteit, maar werkt minder goed voor datasets met categorische variabelen met meer dan 20 categorieën. Gebruik in dat geval Gaussian Copula.\n  \n&nbsp;&nbsp;\n\n*Het oorspronkelijke artikel is [hier](https://files.eric.ed.gov/fulltext/ED469370.pdf) te vinden.",
             "data.column.Variable_name": "Variabele name",
             "data.sex": "sex",
             "data.race1": "race1",
@@ -79,7 +79,7 @@
             "data.column.Values": "Waardes",
 
             "data.column.Values.sex": "1 (man), 2 (vrouw)",
-            "data.column.Values.race": "aziatisch, zwart, hispanic, wit, anders",
+            "data.column.Values.race": "aziatisch, afrikaans, latino, westers, anders",
             "data.column.Values.ugpa": "1-4",
             "data.column.Values.bar": "geslaagd 1e keer, geslaagd 2e keer, gezakt, niet-afgestudeerd"
         },
@@ -93,26 +93,26 @@
                 "columnsCountError": "File mag maximaal 8 kolommen bevatten."
             },
             "fieldset": {
-                "sourceDataset": "Brondataset",
+                "sourceDataset": "Brondata",
                 "sdgMethod": {
                     "title": "Methode",
                     "cart": "CART",
                     "gc": "Gaussian Copula",
-                    "tooltip": "De CART-methode wordt gebruikt om de synthetische gegevens te genereren.\n  \n  \n  \nCART levert over het algemeen synthetische datasets van hogere kwaliteit op, maar werkt mogelijk niet goed bij datasets met categorische variabelen met meer dan 20 categorieën.\n  \n  \n  \nGebruik in die gevallen de Gaussian Copula."
+                    "tooltip": "In principe wordt de CART-methode gebruikt om synthetische data te genereren. CART levert over het algemeen synthetische data van hoge kwaliteit, maar werkt mogelijk niet goed bij datasets met categorische variabelen met meer dan 20 categorieën. Gebruik in die gevallen de Gaussian Copula."
                 },
-                "samples": "Aantal samples"
+                "samples": "Aantal synthetische datapunten"
             },
             "actions": {
                 "tryItOut": "Uitproberen",
                 "runGeneration": "Start synthetische data generatie",
                 "analyzing": "Analyseren...",
                 "initializing": "Initialiseren..."
             },
-            "univariateText": "{{samples}} synthetic data punten zijn gegeneert met CART. De grafieken tonen de verschillen in waarde frequentie voor elle variabele. De synthetische data is van hoge kwaliteit als alle balken van gelijke hoogte zijn."
+            "univariateText": "{{samples}} synthetische datapunten zijn gegenereerd met de CART-methode. De grafieken tonen de frequentie waarmee een variabele een bepaalde waarde aanneemt. De synthetische data is van hoge kwaliteit als de frequenties ongeveer gelijk zijn."
         },
         "demoCard": {
             "title": "Probeer het uit!",
-            "description": "Heeft u geen dataset bij de hand? Geen zorgen, gebruik onze demodataset."
+            "description": "Geen dataset bij de hand? Gebruik onze demodata."
         },
         "columnsInDatasetInfo": "Als de gedetecteerd data types niet correct zijn, pas dit dan lokaal aan in de dataset voordat u deze opnieuw aan de app koppelt.",
         "univariateCharts": "Univariate distributies",